Intel® PXA27x Processor Family
Optimization Guide
Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization
4.3.2.4 Scheduling the WMUL and WMADD Instructions
The WMUL and WMADD instructions have an issue latency of one cycle, and a result latency and a
resource latency of two cycles each. In the following example, the second WMUL instruction stalls
for one cycle because of the two-cycle resource latency.
WMUL wR0, wR1, wR2
WMUL wR3, wR4, wR5
The WADD instruction in the following example stalls for one cycle because it reads wR0, the
result of the preceding WMUL, which is subject to the two-cycle result latency.
WMUL wR0, wR1, wR2
WADD wR1, wR0, wR2
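The two stalls above can be reproduced with a small issue-cycle model. The following C sketch is illustrative only: the `Insn` structure, the `count_stalls` function, and the simplified in-order, one-instruction-per-cycle pipeline are assumptions made for this example, not a description of the actual hardware. It uses the two-cycle multiplier resource latency and two-cycle result latency stated above; any other latency (for example, a one-cycle result latency for WADD) is supplied by the caller and should be treated as an assumption.

```c
#include <stddef.h>

/* Hypothetical, simplified model of an in-order pipeline that issues at most
 * one instruction per cycle. Register numbers index wR0..wR15. */
typedef struct {
    int dst;              /* destination register number */
    int src1, src2;       /* source register numbers */
    int uses_multiplier;  /* 1 for WMUL/WMADD, 0 otherwise */
    int result_latency;   /* cycles from issue until the result can be read */
} Insn;

enum { NUM_WREGS = 16, MUL_RESOURCE_LATENCY = 2 };

/* Returns the total number of stall cycles for the instruction sequence. */
int count_stalls(const Insn *seq, size_t n)
{
    int ready[NUM_WREGS] = {0};  /* earliest cycle each register may be read */
    int mul_free = 0;            /* earliest cycle the multiplier accepts work */
    int cycle = 0;               /* cycle in which the next insn would ideally issue */
    int stalls = 0;

    for (size_t i = 0; i < n; i++) {
        int issue = cycle;

        /* Result latency: wait until both sources have been produced. */
        if (ready[seq[i].src1] > issue) issue = ready[seq[i].src1];
        if (ready[seq[i].src2] > issue) issue = ready[seq[i].src2];

        /* Resource latency: wait until the multiplier is free again. */
        if (seq[i].uses_multiplier && mul_free > issue) issue = mul_free;

        stalls += issue - cycle;
        ready[seq[i].dst] = issue + seq[i].result_latency;
        if (seq[i].uses_multiplier) mul_free = issue + MUL_RESOURCE_LATENCY;
        cycle = issue + 1;
    }
    return stalls;
}
```

In this model each of the two sequences above costs one stall cycle. Placing one independent instruction between the WMUL and the WADD that consumes its result removes the stall, which is the scheduling opportunity the latencies imply.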
4.4 SIMD Optimization Techniques
The Single Instruction, Multiple Data (SIMD) architecture provided by Intel® Wireless
MMX™ Technology enables us to exploit the inherent parallelism found in a wide range of
multimedia and communication applications. The most time-consuming code sequences in these
applications have certain characteristics in common:
• Operations are performed on small native data types (8-bit pixels, 16-bit voice, 32-bit audio)
• Regular and recurring memory access patterns, usually data independent
• Localized, recurring computations performed on the data
• Compute-intensive processing
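The first characteristic, operations on small native data types, is what allows several elements to share one 64-bit register and be processed together. As a rough illustration, the following scalar C sketch models a packed addition of four 16-bit lanes held in a 64-bit word. The function name `packed_add16` and the choice of wraparound (modulo 2^16) arithmetic are assumptions made for this example; it is a model of the idea, not an intrinsic or an exact definition of any particular instruction.

```c
#include <stdint.h>

/* Adds four independent 16-bit lanes packed into 64-bit words.
 * Each lane wraps modulo 2^16; no carry crosses a lane boundary. */
uint64_t packed_add16(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t x = (uint16_t)(a >> (16 * lane));
        uint16_t y = (uint16_t)(b >> (16 * lane));
        uint16_t s = (uint16_t)(x + y);      /* truncate to the lane width */
        r |= (uint64_t)s << (16 * lane);
    }
    return r;
}
```

A SIMD instruction performs the work of the whole loop body at once, which is why regular, data-independent access patterns matter: the four elements must be adjacent in memory to be loaded as one 64-bit quantity.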
In the following sections we illustrate how the rules for writing fast sequences of Intel® MMX™
Technology instructions on Intel® Wireless MMX™ Technology can be applied to the
optimization of short loops of Intel® MMX™ Technology code.
4.4.1 Software Pipelining
Software pipelining, or loop unrolling, is a well-known optimization technique in which multiple
calculations are executed in each loop iteration. The disadvantages of applying this technique
include increased code size for critical loops and restrictions on the minimum number, and the
permitted multiples, of taps or samples.
The obvious advantage is reduced cycle consumption. Overhead from loop-exit testing may be
reduced, load-use stalls may be minimized and in some cases eliminated completely, and
instruction scheduling opportunities may be created and exploited.
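The trade-off can be seen in a minimal C sketch of a 16-bit multiply-accumulate loop, written once in its straightforward form and once unrolled by four. The function names `dot_rolled` and `dot_unrolled4`, the unroll factor, and the use of four separate accumulators are assumptions chosen for illustration; the sketch also assumes that the sum fits in 32 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward loop: one multiply-accumulate and one exit test per sample. */
int32_t dot_rolled(const int16_t *c, const int16_t *x, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)c[i] * x[i];
    return acc;
}

/* Unrolled by four: one exit test per four samples, and four independent
 * accumulators that give the scheduler unrelated work to interleave.
 * Restriction: n must be a multiple of 4. */
int32_t dot_unrolled4(const int16_t *c, const int16_t *x, size_t n)
{
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        acc0 += (int32_t)c[i]     * x[i];
        acc1 += (int32_t)c[i + 1] * x[i + 1];
        acc2 += (int32_t)c[i + 2] * x[i + 2];
        acc3 += (int32_t)c[i + 3] * x[i + 3];
    }
    return acc0 + acc1 + acc2 + acc3;
}
```

The unrolled version shows both costs named above: the loop body is four times larger, and the sample count is restricted to a multiple of four unless extra code handles the remainder.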
To illustrate the need for software pipelining, let's consider a key kernel of Intel® MMX™
Technology code that is central to many signal-processing algorithms: the real block
Finite-Impulse-Response (FIR) filter. A real block FIR filter operates on two real vectors,
c(i) and x(i), and produces an output vector y(n). For Intel® MMX™ Technology programming,
the vectors are represented as arrays of 16-bit integers of some length N. The real FIR filter
is represented by the equation:
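A conventional statement of this filter, assumed here rather than taken from this guide, is y(n) = sum over i = 0 .. T-1 of c(i) * x(n - i), for 0 <= n < N, where T is the number of coefficients. The following C sketch computes that form directly as a reference. The function name `fir_block`, the tap count `T`, the 32-bit output type (which avoids choosing a scaling or saturation policy), and the treatment of samples before x(0) as zero are all assumptions of this sketch; the guide's own equation may use different index ranges.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference block FIR filter in direct form:
 *   y[n] = sum_{i=0}^{T-1} c[i] * x[n - i],  with x[k] taken as 0 for k < 0.
 * c: T coefficients, x: N input samples, y: N outputs (32-bit accumulators). */
void fir_block(const int16_t *c, size_t T,
               const int16_t *x, int32_t *y, size_t N)
{
    for (size_t n = 0; n < N; n++) {
        int32_t acc = 0;
        /* Stop at i == n so that x[n - i] never indexes before x[0]. */
        for (size_t i = 0; i < T && i <= n; i++)
            acc += (int32_t)c[i] * x[n - i];
        y[n] = acc;
    }
}
```

The inner loop is a short multiply-accumulate loop over 16-bit data with a load for every operand, which is the kind of kernel where loop overhead and load-use stalls dominate if the loop is left in this form.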