Intel® PXA27x Processor Family
Optimization Guide
4-25
Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization
WALIGNI wR3, wR0, wR1, #2
WALIGNI wR4, wR0, wR1, #4
WMAC    wR15, wR9, wR1       ; y(n) +=
WALIGNI wR5, wR0, wR1, #6
WMAC    wR14, wR9, wR3       ; y(n+1) +=
WLDRD   wR1, [R1], #8        ; even groups of 4 inputs
WMAC    wR13, wR9, wR4       ; y(n+2) +=
WLDRD   wR8, [R2], #8        ; even groups of 4 coeff.
WMAC    wR12, wR8, wR5       ; y(n+3) +=
BNE     Inner_Loop
; ** Outer loop code calculates the last four taps for
; y(n), y(n+1), y(n+2), y(n+3)**
; ** Store results
BNE Outer_Loop
4.4.2.1 General Remarks on Multi-Sample Technique
In the real block FIR filter example, four outputs are computed simultaneously in the same inner
loop. The coefficients and sample data loaded into registers for the computation of the first
output are therefore re-used for the computation of the next three outputs. The interleave factor
is set at k=2, which eliminates load-to-use stalls. The throughput of the sequence is 20 cycles
for every 32 taps, or 0.625 cycles per tap, which represents near-ideal saturation of the
execution resources.
The multi-sample technique may be applied whenever the same data is being utilized for multiple
calculations. The large register file on Intel® Wireless MMX™ Technology facilitates this
approach and a number of variations are possible.
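The data re-use behind the multi-sample technique can be illustrated with a portable scalar model in C. This is only a sketch, not the guide's Wireless MMX code: the function names `fir_single` and `fir_multi4`, the filter form y(n) = sum of c(k) * x(n+k), and the requirement that the output count be a multiple of four are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference version: one output per pass over the coefficients.
 * x must hold at least n_out + taps - 1 samples. */
void fir_single(const int16_t *x, const int16_t *c, int32_t *y,
                size_t n_out, size_t taps)
{
    for (size_t n = 0; n < n_out; n++) {
        int32_t acc = 0;
        for (size_t k = 0; k < taps; k++)
            acc += (int32_t)c[k] * x[n + k];
        y[n] = acc;
    }
}

/* Multi-sample version (hypothetical name): four outputs share one pass.
 * Each coefficient and each new sample is loaded once and used for all
 * four accumulators. Assumes n_out is a multiple of 4. */
void fir_multi4(const int16_t *x, const int16_t *c, int32_t *y,
                size_t n_out, size_t taps)
{
    for (size_t n = 0; n + 4 <= n_out; n += 4) {
        int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
        /* Sliding window of the three samples already loaded. */
        int16_t x0 = x[n], x1 = x[n + 1], x2 = x[n + 2];
        for (size_t k = 0; k < taps; k++) {
            int16_t ck = c[k];          /* one coefficient load */
            int16_t x3 = x[n + k + 3];  /* one new sample load  */
            a0 += (int32_t)ck * x0;     /* y(n)   += */
            a1 += (int32_t)ck * x1;     /* y(n+1) += */
            a2 += (int32_t)ck * x2;     /* y(n+2) += */
            a3 += (int32_t)ck * x3;     /* y(n+3) += */
            x0 = x1; x1 = x2; x2 = x3;  /* slide the window */
        }
        y[n] = a0; y[n + 1] = a1; y[n + 2] = a2; y[n + 3] = a3;
    }
}
```

In this model the inner loop performs two loads for four multiply-accumulates, where the reference version performs two loads for each one. The four independent accumulators play the role that separate accumulator registers play in a large register file.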
4.4.3 Data Alignment Techniques
Multimedia algorithms exploit data parallelism by executing the same operation on different
elements in parallel. This is accomplished by packing several data elements into a single
register and using the packed data instructions provided by the Intel® Wireless MMX™ Technology.
An important guideline for achieving optimum performance is always to align memory references.
This means that an N-byte memory read or write should always be on an N-byte boundary. In some
cases it is easy to align data so that all of the reads and writes are aligned. In other cases it
is more difficult because an algorithm naturally reads data in a misaligned fashion. Two examples
are the single-sample FIR and video motion estimation.
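The N-byte boundary rule above can be expressed as a simple address test. The sketch below is illustrative only: the helper name `addr_is_aligned` is hypothetical, and it assumes N is a power of two (1, 2, 4, or 8 for the access sizes discussed here).

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if addr lies on an n-byte boundary, 0 otherwise.
 * Assumption: n is a power of two, so n - 1 is a mask of the low bits. */
int addr_is_aligned(uintptr_t addr, size_t n)
{
    return (addr & (uintptr_t)(n - 1)) == 0;
}
```

For example, an 8-byte read of 16-bit samples that starts at sample index n is on an 8-byte boundary only when n is a multiple of four (given an 8-byte-aligned buffer). This is why a filter that advances one sample at a time naturally produces misaligned reads.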
The Intel® Wireless MMX™ Technology provides a mechanism for reducing the overhead
associated with the classes of algorithms which require data to be accessed on 32-bit, 16-bit, or
8-bit boundaries. The WALIGNI instruction is useful when the sequence of alignments is known
beforehand, as with the single-sample FIR filter. The WALIGNR instruction is useful when the
sequence of alignments is calculated as the algorithm executes, as with the fast motion search
algorithms used in video compression. Both of these instructions operate on register pairs, which
can be effectively ping-ponged with alternate loads, significantly reducing the alignment overhead.
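The effect of an alignment instruction on a register pair can be modeled in portable C. This is a sketch under stated assumptions, not a definitive description of the instruction semantics. It assumes little-endian lane order and that the result is the 64-bit window starting `offset` bytes into the concatenation of two 64-bit registers; the function names `walign_model` and `window_at` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of extracting 8 bytes from a register pair at a byte offset (0..7).
 * lo holds the earlier 8 bytes in memory order, hi the following 8 bytes. */
uint64_t walign_model(uint64_t lo, uint64_t hi, unsigned offset)
{
    offset &= 7u;
    if (offset == 0)
        return lo;                       /* avoid an undefined 64-bit shift */
    return (lo >> (8u * offset)) | (hi << (64u - 8u * offset));
}

/* Reads a misaligned 8-byte window using only aligned 64-bit words.
 * The offset is computed at run time, as in the motion-search case.
 * Assumption: words[byte_off / 8 + 1] exists. */
uint64_t window_at(const uint64_t *words, size_t byte_off)
{
    size_t i = byte_off / 8;
    return walign_model(words[i], words[i + 1], (unsigned)(byte_off % 8));
}
```

With 16-bit samples 1 through 8 packed into two words, offsets 2, 4, and 6 yield the windows starting at samples 2, 3, and 4. This mirrors the `#2`, `#4`, `#6` immediates in the FIR inner loop shown earlier. Ping-ponging corresponds to loading the next aligned word into whichever register of the pair is no longer needed, so every memory access stays aligned.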