5-4
Intel® PXA27x Processor Family
Optimization Guide
High Level Language Optimization
The following describes how these resources interact. A fill buffer is allocated for each cache read
miss. A fill buffer and a pending buffer are allocated for each cache write miss if the memory space
is marked as write allocate. A subsequent read to the same cache line does not require a new fill
buffer but does require a pending buffer, and a subsequent write also requires a new pending
buffer. A fill buffer is also allocated for each read from non-cached memory, and a write buffer is
needed for each write to non-cached memory that is non-coalescing. Consequently, an
STM instruction listing eight registers and referencing non-cached memory uses eight write buffers
if the writes do not coalesce and two write buffers if they do. A cache eviction requires a
write buffer for each dirty bit set in the cache line. The preload instruction requires a fill buffer for
each cache line and 0, 1, or 2 write buffers for an eviction.
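The STM arithmetic above can be checked with a small model. This is only a sketch: it assumes that each write buffer entry coalesces one aligned 16-byte region (an assumption, not stated on this page), and the function name stm_write_buffers is hypothetical.

```c
#include <stdint.h>

/* Assumption: one write buffer entry covers one aligned 16-byte region. */
#define WB_ENTRY_BYTES 16u
#define WORD_BYTES      4u

/*
 * Estimate how many write buffer entries an STM of `nregs` registers
 * starting at byte address `addr` would occupy in non-cached memory.
 * Non-coalescing memory: one entry per stored word.
 * Coalescing memory: one entry per 16-byte region touched.
 */
unsigned stm_write_buffers(uint32_t addr, unsigned nregs, int coalescing)
{
    if (nregs == 0)
        return 0;
    if (!coalescing)
        return nregs;

    uint32_t first = addr / WB_ENTRY_BYTES;
    uint32_t last  = (addr + nregs * WORD_BYTES - 1u) / WB_ENTRY_BYTES;
    return (unsigned)(last - first + 1u);
}
```

Under this model, an eight-register STM to a 16-byte-aligned address gives the eight and two entries quoted above. A start address that is not 16-byte aligned would touch a third region, which suggests aligning store-multiple targets where possible.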
When adding preload instructions, take care that the combination of preloads and
instruction fetches does not exceed the available system resource capacity described above;
otherwise performance is degraded instead of improved. Intersperse preload operations
throughout calculations so that memory bus traffic flows freely, and keep the number of
preloads to the minimum necessary.
discusses code optimization for preloading.
5.1.1.3 Coding Technique with Preload
Because preload is a powerful optimization technique, preloading opportunities can be exploited
during code development in the high-level language. The preload instruction can be
implemented as a C-callable function and used at different places throughout the C code.
The developer may implement two types of routine: one that preloads a single cache
line and one that preloads multiple cache lines. The data usage pattern (that is,
linear array striding or 2-D array striding) can influence the choice of preload scheme.
Keep in mind, however, that only four outstanding preloads are allowed, so excessive
use must be avoided.
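The two routine types described above might look like the following sketch. It assumes a GCC- or Clang-compatible compiler, where the __builtin_prefetch builtin is expected to emit a preload on ARM targets, and a 32-byte data cache line. The names preload_line and preload_lines and both constants are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES         32u  /* assumption: 32-byte data cache line */
#define MAX_OUTSTANDING_PRELOADS  4u  /* limit stated in the text above */

/* Preload the single cache line that contains p. */
static inline void preload_line(const void *p)
{
    __builtin_prefetch(p);  /* GCC/Clang builtin; a hint only, never faults */
}

/*
 * Preload the cache lines covering [p, p + nbytes), but issue at most
 * MAX_OUTSTANDING_PRELOADS requests. Returns the number of preloads issued.
 */
unsigned preload_lines(const void *p, size_t nbytes)
{
    if (nbytes == 0)
        return 0;

    const uintptr_t mask = ~(uintptr_t)(CACHE_LINE_BYTES - 1u);
    uintptr_t line = (uintptr_t)p & mask;                 /* first line */
    uintptr_t last = ((uintptr_t)p + nbytes - 1u) & mask; /* last line  */
    unsigned issued = 0;

    while (issued < MAX_OUTSTANDING_PRELOADS) {
        preload_line((const void *)line);
        issued++;
        if (line == last)
            break;
        line += CACHE_LINE_BYTES;
    }
    return issued;
}
```

A caller walking a linear array might call preload_line once per cache line of upcoming data, while a 2-D striding pattern might call preload_lines at the start of each row. Capping the count inside the helper is one way to respect the four-preload limit; the remaining lines would need to be requested later, interleaved with computation.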
5.1.1.3.1 Coding Technique: Unrolling With Preload
When iterating through a loop, data transfer latency can be hidden by preloading one or more
iterations ahead. This has an unwanted side effect: the final iterations of the loop load
useless data into the cache, polluting the cache, increasing bus traffic, and possibly evicting
valuable temporal data. This problem can be resolved by preload unrolling. For example, consider
this loop:
for (i = 0; i < NMAX; i++)
{
    prefetch(data[i+2]);
    sum += data[i];
}
The final two iterations (i = NMAX-2 and i = NMAX-1) preload superfluous data. The problem can be
avoided by unrolling the end of the loop:
for (i = 0; i < NMAX-2; i++)
{
    prefetch(data[i+2]);
    sum += data[i];
}
sum += data[NMAX-2];
sum += data[NMAX-1];
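The fragment above can be turned into a self-contained function. This is a sketch under stated assumptions: prefetch is defined here as a hypothetical macro over the GCC/Clang __builtin_prefetch builtin, and the function name sum_with_preload is illustrative. The two trailing statements are generalized into a short tail loop so that arrays with fewer than two elements are also handled.

```c
#include <stddef.h>

/* Assumption: prefetch(x) preloads the cache line holding x (GCC/Clang builtin). */
#define prefetch(x) __builtin_prefetch(&(x))

/* Sum n elements, preloading two iterations ahead without reading past the end. */
long sum_with_preload(const int *data, size_t n)
{
    long sum = 0;
    size_t i = 0;

    /* Main loop: data[i + 2] is always a valid element here. */
    for (; i + 2 < n; i++) {
        prefetch(data[i + 2]);
        sum += data[i];
    }

    /* Unrolled tail: the last (up to two) elements, with no further preloads. */
    for (; i < n; i++)
        sum += data[i];

    return sum;
}
```

The result is identical to that of a plain summation loop; only the memory access hints differ, so any benefit would need to be confirmed by measurement on the target.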
Intel PXA27x Processor Family Optimization Guide, Order Number 280004 001, April 2004