Intel® PXA27x Processor Family
Optimization Guide
Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization
WMACS wR2, wR1, wR0
SUBS r3, r3, #4
BNE Loop_Begin
The parallelism of the filter may be exposed further by unrolling the loop to provide for eight taps
per iteration. In the following code sequence, the loop has been unrolled once, which allows several
load-to-use stalls to be eliminated. The loop overhead is also amortized further, falling
from two cycles for every four taps to two cycles for every eight taps. A single load-to-use stall
remains between the second WLDRD instruction and the second WMACS instruction
within the inner loop.
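; Pointers r0 -> val, r1 -> pResult, r2 -> pTapsQ15, r3 -> tapsLen
; r4 -> second operand stream for WMACS (not named above)
        WLDRD   wR0, [r2], #8       ; preload 64 bits (four Q15 taps) via r2
        WZERO   wR15                ; clear the accumulator
        WLDRD   wR1, [r4], #8       ; preload 64 bits via r4
Loop_Begin:
        WLDRD   wR2, [r2], #8       ; second group via r2
        SUBS    r3, r3, #8          ; eight taps per iteration; sets the flags
        WLDRD   wR3, [r4], #8       ; second group via r4
        WMACS   wR15, wR1, wR0      ; accumulate first group
        WLDRDNE wR0, [r2], #8       ; load for the next iteration only if taps remain
        WMACS   wR15, wR2, wR3      ; accumulate second group
        WLDRDNE wR1, [r4], #8       ; load for the next iteration only if taps remain
        BNE     Loop_Begin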
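As a reference for what the unrolled loop computes, the following is a minimal C sketch, not the guide's own code. It assumes that WMACS multiplies four signed 16-bit lanes pairwise and adds the sum of the products to a 64-bit accumulator, that both operand arrays are traversed forward (as the post-incremented pointers in the assembly suggest), and that the tap count is a nonzero multiple of eight. The names `mac4_s16` and `fir_dot8` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one WMACS: four signed 16-bit products
 * summed into a 64-bit accumulator (assumed semantics). */
static int64_t mac4_s16(int64_t acc, const int16_t *a, const int16_t *b)
{
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* Eight taps per iteration: the counter update and branch are paid
 * once for two multiply-accumulate groups instead of once per group.
 * taps_len is assumed to be a nonzero multiple of 8. */
int64_t fir_dot8(const int16_t *taps, const int16_t *samples, size_t taps_len)
{
    int64_t acc = 0;                                   /* WZERO wR15 */
    for (size_t i = 0; i < taps_len; i += 8) {
        acc = mac4_s16(acc, &taps[i], &samples[i]);          /* first group  */
        acc = mac4_s16(acc, &taps[i + 4], &samples[i + 4]);  /* second group */
    }
    return acc;
}
```

The C version leaves load scheduling to the compiler; the assembly above additionally hoists the loads for the next iteration ahead of the multiply-accumulates to hide the load-to-use latency.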
4.4.1.1 General Remarks on Software Pipelining
In the example for the real block FIR filter, two copies of the basic sequence of code were
interleaved, eliminating all but one of the stalls. The throughput for the sequence went from
9 cycles for every four taps to 9 cycles for every eight taps. This corresponds to a throughput of
1.125 cycles per tap, which represents a 2X throughput improvement.
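The throughput figures follow from dividing the cycles per loop iteration by the taps handled per iteration. A small C sketch of that arithmetic, where `cycles_per_tap` is a hypothetical helper name and the cycle counts are the ones quoted in the text:

```c
/* Cycles spent per filter tap for a loop that costs `cycles`
 * per iteration and processes `taps` taps per iteration. */
static double cycles_per_tap(unsigned cycles, unsigned taps)
{
    return (double)cycles / (double)taps;
}
```

With the numbers from the text, `cycles_per_tap(9, 4)` is 2.25 and `cycles_per_tap(9, 8)` is 1.125, so the ratio of the two is 2.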
It is useful to define a metric to describe the number of copies of a basic sequence of instructions
which need to be interleaved in order to remove all stalls. We can call this the interleave factor, k.
The real block FIR filter requires k=2 to eliminate all possible stalls primarily because it is a small
sequence which must take into account the long load-to-use latency. In practice, k=2 is sufficient
for most loops encountered in real applications. This is fortunate because each interleaved copy requires
its own set of temporary registers, and with some algorithms interleaving with k=3 is not possible.
A good rule of thumb is to try k=2 first, as it is usually the right choice.
4.4.2 Multi-Sample Technique
The multi-sample optimization technique calculates multiple outputs with each loop
iteration, similar to loop unrolling. The disadvantages of applying this technique include increased
code size for critical loops and restrictions on the minimum number of taps or samples and on the
multiples in which they must be supplied. The obvious advantage is reduced cycle consumption.
• Memory bandwidth is reduced by data re-use.
• Load-to-use stalls may be easily eliminated with scheduling.
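The data re-use behind the multi-sample technique can be sketched in C as follows. This is an illustrative sketch under stated assumptions, not the guide's implementation: it computes two adjacent outputs per pass, treats the filter as a forward dot product (y0 uses x[k], y1 uses x[k+1]), and requires `x` to hold at least `taps_len + 1` readable samples. The function name `fir_two_outputs` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Two outputs per pass over the taps:
 *   y0 = sum h[k] * x[k]
 *   y1 = sum h[k] * x[k + 1]
 * x must provide taps_len + 1 samples (assumption of this sketch). */
void fir_two_outputs(const int16_t *h, const int16_t *x, size_t taps_len,
                     int64_t *y0, int64_t *y1)
{
    int64_t acc0 = 0, acc1 = 0;
    int16_t x_cur = x[0];                 /* sample shared between outputs */

    for (size_t k = 0; k < taps_len; k++) {
        int16_t tap = h[k];               /* each tap is loaded once       */
        int16_t x_next = x[k + 1];        /* each sample is loaded once    */
        acc0 += (int32_t)tap * x_cur;     /* contribution to output n      */
        acc1 += (int32_t)tap * x_next;    /* contribution to output n + 1  */
        x_cur = x_next;                   /* re-use the sample next time   */
    }
    *y0 = acc0;
    *y1 = acc1;
}
```

In this form each iteration issues two loads for two multiply-accumulates, whereas computing the outputs one at a time would issue two loads per multiply-accumulate. The extra independent accumulator also gives the scheduler more work to place between a load and its first use.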
Intel PXA27x Processor Family Optimization Guide, Order Number 280004 001, April 2004