Intel XScale® Core Developer’s Manual
January, 2004
Optimization Guide
The result latency of an LDR instruction is significantly higher if the data being loaded is not in
the data cache. To minimize pipeline stalls in that situation, move the LDR instruction as far as
possible from the instruction that uses the result of the load. Note that this increases register
pressure and may at times force register values to be spilled to memory. In such cases, use a
preload instruction or a preload hint so that the data access in the LDR instruction hits the cache
when it executes. Use a preload hint when it is not certain that the load instruction will be
executed. Use a preload instruction when it is certain that the load instruction will be executed.
Consider the following code sample:
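; all other registers are in use
sub r1, r6, r7       ; reads the current value of r6
mul r3, r6, r2       ; reads r6 and the unshifted r2
mov r2, r2, LSL #2
orr r9, r9, #0xf
add r0, r4, r5       ; compute the load address
ldr r6, [r0]         ; overwrites r6
add r8, r6, r8       ; first use of the loaded value
add r8, r8, #4
orr r8, r8, #0xf
; The value in register r6 is not used after this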
In the code sample above, the ADD and LDR instructions can be moved before the MOV instruction.
This prevents pipeline stalls if the load hits the data cache. However, if the load is likely to
miss the data cache, the LDR instruction should execute as early as possible, ideally before the
SUB instruction. Moving the LDR instruction before the SUB instruction would change the program
semantics, because SUB and MUL read the value of r6 that the LDR overwrites. It is possible to move
the ADD and the LDR instructions before the SUB instruction if we allow the contents of the
register r6 to be spilled to and restored from the stack as shown below:
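; all other registers are in use
str r6, [sp, #-4]!   ; spill the old r6 to the stack
add r0, r4, r5
ldr r6, [r0]         ; issue the load as early as possible
orr r9, r9, #0xf
add r8, r6, r8       ; first use of the loaded value
ldr r6, [sp], #4     ; restore the old r6
add r8, r8, #4
orr r8, r8, #0xf
sub r1, r6, r7
mul r3, r6, r2       ; must read r2 before it is shifted
mov r2, r2, LSL #2
; The value in register r6 is not used after this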
As can be seen above, the contents of the register r6 are spilled to the stack and subsequently
reloaded into r6 to retain the program semantics. Another way to optimize the original code sample
is to use the preload instruction, which needs no spill, as shown below:
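; all other registers are in use
add r0, r4, r5       ; compute the load address early
pld [r0]             ; start bringing the data into the cache
sub r1, r6, r7
mul r3, r6, r2
mov r2, r2, LSL #2
orr r9, r9, #0xf
ldr r6, [r0]         ; hits the cache if the preload has completed
add r8, r6, r8
add r8, r8, #4
orr r8, r8, #0xf
; The value in register r6 is not used after this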
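The same idea can be sketched in C for code that is not written in assembly. This is a minimal sketch, assuming a GCC- or Clang-compatible compiler that provides `__builtin_prefetch`. Whether that builtin is actually emitted as a PLD instruction depends on the compiler and the target options, which is an assumption here. The `struct regs` type and the `scheduled_with_preload` function are hypothetical names introduced only to mirror the registers in the listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register file; the field names follow the assembly listing. */
struct regs {
    uint32_t r1, r2, r3, r6, r7, r8, r9;
};

/* Hypothetical helper mirroring the PLD listing.
   table + index plays the role of r4 + r5. */
static void scheduled_with_preload(struct regs *r, const uint32_t *table,
                                   size_t index)
{
    assert(r != NULL && table != NULL);

    const uint32_t *addr = table + index;   /* add r0, r4, r5 */

    /* pld [r0]: a hint only, with no program-visible effect.
       Arguments: address, 0 = read access, 3 = keep in cache.
       Assumption: the compiler lowers this to a preload on the target. */
    __builtin_prefetch(addr, 0, 3);

    /* Independent work that can overlap with the cache fill. */
    r->r1 = r->r6 - r->r7;                  /* sub r1, r6, r7 */
    r->r3 = r->r6 * r->r2;                  /* mul r3, r6, r2 */
    r->r2 = r->r2 << 2;                     /* mov r2, r2, LSL #2 */
    r->r9 |= 0xfu;                          /* orr r9, r9, #0xf */

    /* The load and its dependent chain come last. */
    r->r6 = *addr;                          /* ldr r6, [r0] */
    r->r8 = r->r6 + r->r8;                  /* add r8, r6, r8 */
    r->r8 += 4u;                            /* add r8, r8, #4 */
    r->r8 |= 0xfu;                          /* orr r8, r8, #0xf */
}
```

A C compiler is free to reschedule these statements, so the sketch expresses intent rather than a guaranteed instruction order. The preload only affects timing; the computed values are the same with or without it.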