Intel® IXP45X and Intel® IXP46X Product Line of Network Processors—Intel XScale® Processor
Intel® IXP45X and Intel® IXP46X Product Line of Network Processors Developer’s Manual
August 2006
Order Number: 306262-004US
for(i=0; i<NMAX; i++)
{
    prefetch(A[i+1], b[i+1], c[i+1]);
    A[i] = b[i] + c[i];
}
for(i=0; i<NMAX; i++)
{
    prefetch(D[i+1], A[i+1], c[i+1]);
    D[i] = A[i] + c[i];
}

The second loop reuses the data elements A[i] and c[i]. Fusing the loops together
produces:

for(i=0; i<NMAX; i++)
{
    prefetch(D[i+1], A[i+1], c[i+1], b[i+1]);
    ai = b[i] + c[i];
    A[i] = ai;
    D[i] = ai + c[i];
}

3.10.4.4.11 Prefetch to Reduce Register Pressure

Prefetch can also be used to reduce register pressure. When data is needed for an
operation, the load is scheduled far enough in advance to hide the load latency.
However, the load ties up the receiving register until the data can be used. For
example:

ldr r2, [r0]
; Process code { not yet cached latency > 60 core clocks }
add r1, r1, r2

In the above case, r2 is unavailable for processing until the add statement.
Prefetching the data frees the register for other use. The example code becomes:

pld [r0] ;prefetch the data, keeping r2 available for use
; Process code
ldr r2, [r0]
; Process code { ldr result latency is 3 core clocks }
add r1, r1, r2

With the added prefetch, register r2 can be used for other operations until just
before it is needed.

3.10.5 Instruction Scheduling

This section discusses instruction-scheduling optimizations. Instruction scheduling
refers to the rearrangement of a sequence of instructions for the purpose of
minimizing pipeline stalls. Reducing the number of pipeline stalls improves
application performance. While making this rearrangement, care should be taken to
ensure that the rearranged sequence of instructions has the same effect as the
original sequence.

3.10.5.1 Scheduling Loads

On the IXP45X/IXP46X network processors, an LDR instruction has a result latency of
three cycles, assuming the data being loaded is in the data cache. If the
instruction immediately after the LDR needs to use the result of the load, it stalls
for two cycles. Where possible, the instructions surrounding the LDR instruction
should be rearranged so that independent work fills those cycles.
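
The load-scheduling rule in Section 3.10.5.1 can be sketched as follows; the register choices and the filler instructions are arbitrary illustrations, not taken from the manual. With a three-cycle LDR result latency, an immediate use stalls for two cycles:

ldr r0, [r5]
add r1, r0, #1 ; stalls 2 cycles waiting for the load result
mov r2, #3
mov r3, #4

Moving independent instructions into the load's shadow removes the stall:

ldr r0, [r5]
mov r2, #3     ; independent work fills the load latency
mov r3, #4
add r1, r0, #1 ; result is ready, no stall

Both sequences compute the same values; only the issue order changes, which is exactly the constraint stated in Section 3.10.5.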