Introduction to the VFP coprocessor
ARM DDI 0301H
Copyright © 2004-2009 ARM Limited. All rights reserved.
18-16
ID012310
Non-Confidential, Unrestricted Access
18.9 Writing optimal VFP11 code
The following guidelines can significantly improve the performance of VFP11 code:
• Unless there is a read-after-write hazard, program most scalar operations to immediately
follow each other. After any of the following instructions, issue a single ARM11
instruction or a VFP11 load or store instruction rather than a VFP11 FMAC instruction:
  — a scalar double-precision multiply
  — a multiply and accumulate
  — a short vector instruction of length greater than one iteration.
• Avoid short vector divides and square roots. The VFP11 FMAC and DS pipelines are
unavailable until the final iteration of the short vector DS operation issues from the
Execute 1 stage. If the short vector DS operation can be separated, other VFP11
instructions can be issued in the cycles immediately following the divide or square root.
See Parallel execution on page 21-20.
• The best performance for data-intensive applications requires double-buffering looped
short vector instructions. The register banks can be divided to provide multiple
independent working areas. To take advantage of the simultaneous execution of data
transfer and short vector arithmetic instructions, follow the arithmetic instructions on one
bank with load or store instructions on the other bank.
• Moves to and from control registers are serializing. Avoid placing these in loops or
time-critical code.
• If fully compliant IEEE 754 standard comparisons are not required, avoid using FCMPE
and FCMPEZ. Using an FMRS instruction with an ARM11 CMP or CMN can be faster
for simple comparisons. See Comparisons on page 20-5.
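The last guideline can be illustrated in portable C. The following is a minimal sketch, not VFP11 code: it assumes that float is a 32-bit IEEE 754 single-precision value, and it models the FMRS transfer as a copy of the raw bit pattern into an integer, followed by an ordinary integer comparison in place of CMP. The helper names float_bits, is_sign_set, is_zero, and less_than_nonneg are hypothetical and exist only for this illustration. The comments note where the results differ from a fully compliant IEEE 754 comparison.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: float is IEEE 754 binary32, as in a VFP single-precision register. */
_Static_assert(sizeof(float) == sizeof(int32_t), "float must be 32 bits");

/* Hypothetical helper: copy the raw bit pattern of a float into a signed
   integer. This models what an FMRS transfer makes available to CMP. */
static int32_t float_bits(float x)
{
    int32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Sign-bit test using an integer compare against zero.
   Differs from an IEEE 754 "x < 0" compare: it is also true for -0.0
   and for NaNs that have the sign bit set. */
static int is_sign_set(float x)
{
    return float_bits(x) < 0;
}

/* Zero test that shifts out the sign bit, so +0.0 and -0.0 both match. */
static int is_zero(float x)
{
    return ((uint32_t)float_bits(x) << 1) == 0;
}

/* Ordering test valid only when both inputs are non-negative and not NaN:
   in that range the bit patterns order the same way as the values. */
static int less_than_nonneg(float a, float b)
{
    return float_bits(a) < float_bits(b);
}
```

These integer tests ignore NaN handling and exception flags, which is why they are suitable only when fully compliant IEEE 754 comparisons are not required.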