Intel® PXA27x Processor Family
Optimization Guide
Microarchitecture Overview
These are important characteristics about the MAC:
• The MAC is not a true pipeline. The processing of a single instruction requires use of the same
data-path resources for several cycles before a new instruction is accepted. The type of
instruction and source arguments determine the number of required cycles.
• No more than two instructions can concurrently occupy the MAC pipeline.
• When the MAC is processing an instruction, another instruction cannot enter M1 unless the
original instruction completes in the next cycle.
• The MAC unit can operate on 16-bit packed signed data. This reduces register pressure and
memory traffic size. Two 16-bit data items can be loaded into a register with one LDR.
• The MAC can achieve throughput of one multiply per cycle when performing a 16-by-32-bit
multiply.
• ACC registers in the Intel XScale® Microarchitecture can be up to 64 bits in future
implementations. Code should not be written to depend on the 40-bit nature of the current
implementation.
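
The packed 16-bit point above can be illustrated in portable C. This is only a sketch, not code from the guide: the helper names (pack16, lo16, hi16, dot_packed16) are hypothetical, and whether a compiler maps the multiplies onto the MAC's 16-bit multiply-accumulate instructions depends on the toolchain and its options. Each 32-bit word holds two signed 16-bit samples, so one word load brings in two data items. The accumulator is declared as a 64-bit integer so that the result does not rely on any particular hardware accumulator width.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: split one 32-bit word into its two signed 16-bit halves. */
static inline int16_t lo16(uint32_t w) { return (int16_t)(w & 0xFFFFu); }
static inline int16_t hi16(uint32_t w) { return (int16_t)(w >> 16); }

/* Hypothetical helper: pack two signed 16-bit values into one word, low half first. */
static inline uint32_t pack16(int16_t lo, int16_t hi) {
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Dot product over n_words packed words, i.e. 2 * n_words samples per input.
 * One 32-bit load per input fetches two samples. */
int64_t dot_packed16(const uint32_t *a, const uint32_t *b, size_t n_words) {
    int64_t acc = 0;  /* wide accumulator: no dependence on a 40-bit wrap */
    for (size_t i = 0; i < n_words; i++) {
        uint32_t wa = a[i];  /* two 16-bit samples of a */
        uint32_t wb = b[i];  /* two 16-bit samples of b */
        acc += (int32_t)lo16(wa) * lo16(wb);
        acc += (int32_t)hi16(wa) * hi16(wb);
    }
    return acc;
}
```

The conversions to int16_t assume the usual two's-complement behavior found on ARM compilers.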
2.2.5.1 Behavioral Description
The execution of the MAC unit starts at the beginning of the M1 pipestage. At this point, the MAC
unit receives two 32-bit source operands. Results are completed N cycles later (where N is
dependent on the operand size) and returned to the register file. For more information on MAC
instruction latencies, refer to
Section 4.8, “Instruction Latencies for Intel XScale®
An instruction occupying the M1 or M2 pipestage also occupies the X1 or X2 pipestage, respectively.
Each cycle, a MAC operation progresses from M1 toward M5. A MAC operation may complete anywhere
from M2 to M5.
2.2.5.2 Perils of Superpipelining
The longer pipeline has several consequences worth considering:
• Larger branch misprediction penalty (four cycles in the Intel XScale® Microarchitecture
instead of one in StrongARM* Architecture).
• Larger load-use delay (LUD). LUDs arise from load-use dependencies. A load-use
dependency gives rise to a LUD if the result of the load instruction cannot be made available
by the pipeline in time for the subsequent instruction. To avoid these penalties, an optimizing
compiler should take advantage of the core’s multiple outstanding load capability (also called
hit-under-miss) and find independent instructions to fill the slot following the load.
• Certain instructions (LDM, STM) incur a few extra cycles of delay with the Intel XScale®
Microarchitecture as compared to StrongARM* processors.
• Decode and register file lookups are spread out over two cycles with the Intel XScale®
Microarchitecture, instead of one cycle in predecessors.
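
The load-use delay point can be sketched at the source level. The two functions below are illustrative only; their names are hypothetical and do not come from the guide. The first consumes each loaded value immediately. The second issues the load for the next element before using the current one, so an independent instruction sits between each load and its use. A compiler's scheduler may perform or undo this reordering itself, so treat it as a picture of the idea, not a guaranteed speedup.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward version: each loaded value is used by the very next operation. */
int32_t sum_scaled_naive(const int32_t *src, size_t n, int32_t k) {
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t v = src[i];  /* load */
        sum += v * k;        /* immediate use: a load-use dependency */
    }
    return sum;
}

/* Load-ahead version: the load for element i is issued before element i-1 is used. */
int32_t sum_scaled_preload(const int32_t *src, size_t n, int32_t k) {
    if (n == 0) {
        return 0;
    }
    int32_t sum = 0;
    int32_t cur = src[0];        /* first load, hoisted out of the loop */
    for (size_t i = 1; i < n; i++) {
        int32_t next = src[i];   /* independent load fills the slot */
        sum += cur * k;          /* uses a value loaded one iteration earlier */
        cur = next;
    }
    sum += cur * k;              /* consume the last loaded value */
    return sum;
}
```

Both functions return the same result for the same input; only the order of loads relative to their uses differs.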
Intel PXA27x Processor Family Optimization Guide, Order Number 280004 001, April 2004