5-6
Intel® PXA27x Processor Family
Optimization Guide
High Level Language Optimization
5.1.1.3.3 Coding Technique: Preload to Reduce Register Pressure
Preloading can reduce register pressure. When data is needed for an operation, the load should be
scheduled far enough in advance to hide the load latency. However, the load ties up the receiving
register until the data can be used. For example:
ldr r2, [r0]
; Process code {not yet cached latency > 60 core clocks}
add r1, r1, r2
In the above case, R2 is unavailable for other processing until the add instruction executes. Preloading
the data frees the register for use. The example code becomes:
pld [r0] ;preload the data keeping r2 available for use
; Process code
ldr r2, [r0]
; Process code {ldr result latency is 3 core clocks}
add r1, r1, r2
With the added preload, register R2 can be used for other operations until just before it is needed.
Apart from code optimization for preload, there are many other techniques to use while writing C
and C++ code; these are discussed in later chapters.
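The preload-then-load pattern above can also be sketched from C. The following is a minimal illustration, not taken from this guide: it assumes a GCC- or Clang-compatible compiler, whose `__builtin_prefetch` is generally lowered to a preload instruction such as `pld` on ARM targets that support it. The function name `sum_with_preload` and the distance `PRELOAD_AHEAD` are hypothetical, and the distance is an untuned placeholder.

```c
#include <stddef.h>

/* Assumption: illustrative preload distance in elements, not a tuned value. */
#define PRELOAD_AHEAD 16

/* Hypothetical helper: sums n ints while preloading data that will be
 * needed a few iterations later. */
long sum_with_preload(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        /* Hint only: no register receives the data, so none is tied up
         * while the cache line is being fetched. */
        if (i + PRELOAD_AHEAD < n)
            __builtin_prefetch(&data[i + PRELOAD_AHEAD]);
        total += data[i];  /* the real load, issued just before use */
    }
    return total;
}
```

A suitable preload distance depends on the memory latency and on how much work each iteration does, so it would need to be measured on the target.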
5.1.2 Array Merging
Stride (the way data structures are walked through) can affect the temporal locality of the data and
reduce or increase cache conflicts. The Intel XScale® Microarchitecture data cache and mini-data
cache each have 32 sets with 32-byte cache lines. This means that each cache line in a set is on a
modular 1 KB address boundary. It is important to choose data structure sizes and strides that do not
overwhelm a given set, causing conflicts and increased register pressure. Register pressure can
increase because additional registers are required to track preload addresses. Conflicts can be reduced
by rearranging data structure components to use more parallel access to search and compare
elements. Similarly, rearranging data structures so that the sections that are often written fit in the
same half cache line¹ can reduce cache eviction write-backs. On a global scale, techniques such as
array merging can enhance the spatial locality of the data.
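The 1 KB boundary follows from the cache geometry: 32 sets times 32 bytes per line is 1024 bytes, so addresses that differ by a multiple of 1 KB compete for the same set. The sketch below works through that arithmetic. It assumes the set is selected by the address bits directly above the line offset, and the helper names `cache_set_index` and `same_set` are hypothetical.

```c
#include <stdint.h>

/* Geometry taken from the text: 32 sets, 32-byte cache lines. */
#define LINE_BYTES 32u
#define NUM_SETS   32u

/* Hypothetical helper: the set a byte address falls in, assuming the
 * set index is the line number modulo the number of sets. */
unsigned cache_set_index(uint32_t addr)
{
    return (addr / LINE_BYTES) % NUM_SETS;
}

/* Hypothetical helper: nonzero if two addresses map to the same set. */
int same_set(uint32_t a, uint32_t b)
{
    return cache_set_index(a) == cache_set_index(b);
}
```

For instance, `same_set(0x1000, 0x1400)` is true because the two addresses are exactly 1 KB apart, so a structure walked with a 1 KB stride would place every access in one set.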
As an example of array merging, refer to this code:
int a[NMAX];
int b[NMAX];
int i, ix;
for (i = 0; i < NMAX; i++)
{
    ix = b[i];
    if (a[i] != 0)
        ix = a[i];
    /* do other calculations */
}
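One possible merged form of the loop above is sketched here. The merged version is not part of this excerpt, so the layout is an assumption: the names `merged_elem` and `process_merged` are hypothetical, and the running sum only stands in for the unspecified "other calculations". The point is that `a` and `b` for the same index sit next to each other in memory, so both are likely to arrive in the same cache line.

```c
#include <stddef.h>

/* Hypothetical merged layout: the a and b values for one index are adjacent. */
struct merged_elem {
    int a;
    int b;
};

/* Same selection as the original loop: use a unless it is zero, else b. */
long process_merged(const struct merged_elem *c, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; i++) {
        int ix = c[i].b;
        if (c[i].a != 0)
            ix = c[i].a;
        acc += ix;  /* stand-in for "do other calculations" */
    }
    return acc;
}
```

With separate arrays, each iteration touches two cache lines that may be far apart. With the merged layout, and assuming 4-byte `int`, one 32-byte line holds four complete pairs.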
1. A half cache line is 16 bytes for the Intel XScale® Microarchitecture.