Intel® PXA27x Processor Family
Optimization Guide
5-3
High Level Language Optimization
estimate of N_evict is the number of bytes written per loop iteration divided by a half cache line size
(16 bytes). Cache overflow can be estimated by the number of cache lines transferred each iteration
and the number of expected loop iterations. N_evict and CPI can be estimated by profiling the code
using the performance monitor “cache write-back” event count.
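The arithmetic above can be sketched in C as follows. This is a minimal illustration, not part of the original guide: the helper names are hypothetical, rounding partial half lines up is an assumption, and the cache capacity is left as a caller-supplied parameter because this section does not state it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Half of a cache line, as stated in the text (16 bytes). */
#define HALF_CACHE_LINE_BYTES 16u

/* Hypothetical helper: first-order N_evict estimate for one loop iteration.
 * Assumption: a partially written half line counts as one eviction. */
static size_t estimate_n_evict(size_t bytes_written_per_iter)
{
    return (bytes_written_per_iter + HALF_CACHE_LINE_BYTES - 1u)
           / HALF_CACHE_LINE_BYTES;
}

/* Hypothetical helper: returns 1 if the lines transferred over the whole loop
 * exceed a capacity (in cache lines) chosen by the caller, otherwise 0. */
static int lines_exceed_capacity(size_t lines_per_iter, size_t iterations,
                                 size_t capacity_lines)
{
    /* Guard against overflow in the multiplication below. */
    assert(iterations == 0 || lines_per_iter <= SIZE_MAX / iterations);
    return lines_per_iter * iterations > capacity_lines;
}
```

For example, a loop that writes 32 bytes per iteration gives an estimate of 2 under these assumptions. Profiling with the “cache write-back” event count remains the way to confirm the figure.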
If the preload address is not aligned with a cache-line boundary, CWF latency should be considered in
the equation. CWF can incur higher latency for the cache-line load than a non-CWF read; it is
therefore recommended to use cache-line-aligned addresses for preload.
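One way to follow this recommendation from C is to round the address down to the start of its cache line before issuing the preload. The sketch below assumes a 32-byte line (two 16-byte half lines) and a GCC-compatible compiler that provides __builtin_prefetch. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed line size: two 16-byte half lines. */
#define CACHE_LINE_BYTES 32u

/* Hypothetical helper: round an address down to its cache-line boundary. */
static const void *cache_line_align_down(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    return (const void *)(a & ~(uintptr_t)(CACHE_LINE_BYTES - 1u));
}

/* Hypothetical helper: preload the line containing p using its aligned
 * address. __builtin_prefetch is a GCC/Clang builtin; the second argument
 * (0) requests a read and the third (3) is a locality hint. */
static void preload_line(const void *p)
{
    assert(p != NULL);
    __builtin_prefetch(cache_line_align_down(p), 0, 3);
}
```

Whether the compiler turns the builtin into a preload instruction depends on the target and compiler options, so the generated code should be inspected.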
Using the memory latency for the PXA27x processor at 200-MHz run mode and 100-MHz
SDRAM, the preload distance comes out to three cache lines. This is a rule of thumb for the
PXA27x processor for a loop which consumes and produces one cache line.
Correctly scheduling preloads within a loop can help large memory-based operations such as
memory-to-memory copy, page copy, and video-related operations that load particular image blocks.
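A memory-to-memory copy that applies the three-cache-line rule of thumb might look like the sketch below. It assumes a 32-byte cache line, buffers that hold a whole number of lines, and a GCC-compatible compiler providing __builtin_prefetch. The function name and constants are illustrative, and the best distance for a given system should be confirmed by measurement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u                                  /* assumed line size */
#define WORDS_PER_LINE   (CACHE_LINE_BYTES / sizeof(uint32_t))
#define PRELOAD_DISTANCE 3u   /* cache lines ahead, per the rule of thumb */

/* Hypothetical helper: copy n_lines cache lines from src to dst. Each
 * iteration consumes one source line and produces one destination line. */
static void copy_lines_with_preload(uint32_t *dst, const uint32_t *src,
                                    size_t n_lines)
{
    assert(dst != NULL && src != NULL);
    for (size_t line = 0; line < n_lines; ++line) {
        /* Request the source line needed three iterations from now.
         * The guard avoids preloading past the end of the buffer. */
        if (line + PRELOAD_DISTANCE < n_lines)
            __builtin_prefetch(src + (line + PRELOAD_DISTANCE) * WORDS_PER_LINE,
                               0, 0);

        for (size_t w = 0; w < WORDS_PER_LINE; ++w)
            dst[line * WORDS_PER_LINE + w] = src[line * WORDS_PER_LINE + w];
    }
}
```

The first three source lines are not preloaded in this sketch; a short preamble that preloads them before the loop is a common refinement.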
5.1.1.2 Preload Loop Limitations
It is not always advantageous to add preloads to a loop. Loop characteristics that limit the value of
adding preloads are discussed below.
5.1.1.2.1 Preload Limitations: Throughput bound vs. Latency bound
The worst case is a loop that is bounded by memory throughput. Such a loop does not benefit from
preloading because all the system resources for transferring data are quickly allocated, so no
preload instruction can be executed without impacting the (non-preload) memory loads.
However, if the application is bounded by memory latency, preloading can effectively hide that
latency. Applications that manipulate large amounts of data, such as graphics and video
applications, can benefit greatly from preloading.
5.1.1.2.2 Preload Limitations: Low Number of Iterations
A low number of loop iterations may completely negate the advantage of preloading. A
loop with a small fixed number of iterations may be faster if it is completely unrolled rather
than scheduled with preload instructions.
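As an illustration of complete unrolling, the sketch below shows a four-element dot product written both as a loop and fully unrolled. The function names and the choice of a dot product are hypothetical; the point is only that a small fixed trip count leaves no room for a preload to pay off, while unrolling removes the loop overhead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rolled form: four iterations, each with loop-control overhead. */
static int32_t dot4_loop(const int16_t a[4], const int16_t b[4])
{
    assert(a != NULL && b != NULL);
    int32_t sum = 0;
    for (int i = 0; i < 4; ++i)
        sum += (int32_t)a[i] * b[i];
    return sum;
}

/* Fully unrolled form: no loop control and no preload scheduling. */
static int32_t dot4_unrolled(const int16_t a[4], const int16_t b[4])
{
    assert(a != NULL && b != NULL);
    return (int32_t)a[0] * b[0]
         + (int32_t)a[1] * b[1]
         + (int32_t)a[2] * b[2]
         + (int32_t)a[3] * b[3];
}
```

Many compilers perform this unrolling automatically at higher optimization levels, so the generated code is worth checking before unrolling by hand.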
5.1.1.2.3 Preload Limitations: Bandwidth Consumption
Overuse of preloads can usurp resources and degrade performance. This happens because once the
bus traffic requests exceed the system resource capacity, the processor stalls. Intel XScale®
Microarchitecture data transfer resources are:
• 4 fill buffers
• 4 pending buffers
• 8 half-cache-line write buffers
SDRAM resources are typically:
• 1 page buffer per bank referencing a 4K address range
• 4 memory banks
• 4 transfer request buffers
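One common source of preload overuse is issuing a preload for every element when several elements share a cache line. The sketch below issues a single preload per line instead, which keeps the number of outstanding requests low relative to the buffer counts listed above. It is an illustration under stated assumptions, not a method from this guide: a 32-byte line, a distance of three lines, a GCC-compatible __builtin_prefetch, and hypothetical names throughout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u                                  /* assumed line size */
#define WORDS_PER_LINE   (CACHE_LINE_BYTES / sizeof(uint32_t))
#define PRELOAD_DISTANCE 3u                                   /* lines ahead */

/* Hypothetical helper: sum an array, issuing one preload per cache line
 * rather than one per word, so that redundant requests do not compete
 * for fill and pending buffers. */
static uint32_t sum_one_preload_per_line(const uint32_t *data, size_t n_words)
{
    assert(data != NULL || n_words == 0);
    uint32_t sum = 0;
    size_t i = 0;

    /* Process whole cache lines. */
    for (; i + WORDS_PER_LINE <= n_words; i += WORDS_PER_LINE) {
        size_t ahead = i + PRELOAD_DISTANCE * WORDS_PER_LINE;
        if (ahead < n_words)              /* stay inside the array */
            __builtin_prefetch(data + ahead, 0, 0);

        for (size_t w = 0; w < WORDS_PER_LINE; ++w)
            sum += data[i + w];
    }

    /* Remaining words that do not fill a whole line. */
    for (; i < n_words; ++i)
        sum += data[i];

    return sum;
}
```

Whether this is enough to stay within the available buffers depends on the other memory traffic in the loop, so the performance monitor should be used to confirm that stalls are actually reduced.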
Order Number 280004 001, Intel PXA27x Processor Family Optimization Guide, April 2004