January, 2004
Intel XScale® Core
Developer’s Manual
Optimization Guide
A.5.2 Scheduling Data Processing Instructions
Most core data processing instructions have a result latency of 1 cycle, so an instruction can use
the result of the immediately preceding data processing instruction without stalling. However, the
result latency is 2 cycles when the current instruction uses that result as the operand of a shift
by immediate. As a result, the mov instruction in the following code segment would incur a
1 cycle stall:
sub r6, r7, r8
add r1, r2, r3
mov r4, r1, LSL #2    ; stalls 1 cycle: r1 was written by the add just before
The code above can be rearranged as follows to remove the 1 cycle stall:
add r1, r2, r3
sub r6, r7, r8
mov r4, r1, LSL #2
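
The effect of the rearrangement can be checked with a small model. The Python sketch below is not part of the manual and is not a cycle-accurate simulator. It assumes only the two rules stated above: a result is usable 1 cycle after its producer issues, or 2 cycles after when the consumer shifts it by an immediate. It also assumes one instruction issues per cycle. The names Instr, stall_cycles, ORIGINAL and REARRANGED are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                           # assembly text, for display only
    dest: str                           # destination register
    srcs: Tuple[str, ...]               # source registers used unshifted
    shifted_src: Optional[str] = None   # source register shifted by an immediate


def stall_cycles(program):
    """Return the stall, in cycles, of each instruction under the assumed model."""
    issued_at = {}   # register -> cycle in which its producer issued
    cycle = 0        # earliest cycle in which the next instruction could issue
    stalls = []
    for ins in program:
        start = cycle
        for reg in ins.srcs:
            if reg in issued_at:
                # Assumed rule: result latency of 1 cycle for a plain operand.
                start = max(start, issued_at[reg] + 1)
        if ins.shifted_src in issued_at:
            # Assumed rule: result latency of 2 cycles for a shift by immediate.
            start = max(start, issued_at[ins.shifted_src] + 2)
        stalls.append(start - cycle)
        issued_at[ins.dest] = start
        cycle = start + 1
    return stalls


ORIGINAL = [
    Instr("sub r6, r7, r8", "r6", ("r7", "r8")),
    Instr("add r1, r2, r3", "r1", ("r2", "r3")),
    Instr("mov r4, r1, LSL #2", "r4", (), "r1"),
]
# Same instructions with the independent sub moved between add and mov.
REARRANGED = [ORIGINAL[1], ORIGINAL[0], ORIGINAL[2]]

print(stall_cycles(ORIGINAL))     # [0, 0, 1]
print(stall_cycles(REARRANGED))   # [0, 0, 0]
```

Under these assumptions the model reports a 1 cycle stall on the mov in the first ordering and none in the second, because the independent sub fills the cycle the mov would otherwise spend waiting for r1.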
All data processing instructions have a 2 cycle issue latency and a 2 cycle result latency when the
shifter operand is a shift/rotate by a register or is RRX. Because the 2 cycle issue latency always
delays the next instruction by 1 cycle, whether or not that instruction depends on the result, the
stall cannot be removed by reordering. The only way to avoid it is to rewrite the assembler
instruction. Consider the following segment of code:
mov r3, #10
mul r4, r2, r3
add r5, r6, r2, LSL r3    ; shift by register: 2 cycle issue latency
sub r7, r8, r2
The subtract instruction would incur a 1 cycle stall due to the issue latency of the add instruction,
because the shifter operand of the add is a shift by a register. Since r3 holds the constant 10, the
stall can be avoided by encoding the shift amount as an immediate (the mov is kept because the
mul still reads r3):
mov r3, #10
mul r4, r2, r3
add r5, r6, r2, LSL #10
sub r7, r8, r2
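
The issue latency rule can also be written as a small check. The Python sketch below is illustrative only and is not taken from the manual. It classifies a data processing instruction by its shifter operand and reports how long the following instruction is held up. It assumes an issue latency of 1 cycle for every other data processing instruction, recognizes only registers written as rN, and does not model the mul, whose latencies this section does not cover. The names issue_latency and next_instruction_stall are hypothetical.

```python
import re

# Matches a shift/rotate by a register (for example "LSL r3") or RRX.
# Only registers written as rN are recognized in this sketch.
_SLOW_SHIFTER = re.compile(
    r"\b(?:(?:LSL|LSR|ASR|ROR)\s+r\d+|RRX)\b", re.IGNORECASE
)


def issue_latency(instr: str) -> int:
    """Return the assumed issue latency, in cycles, of a data processing instruction."""
    # 2 cycles for shift/rotate by register or RRX (from the text above);
    # 1 cycle otherwise (an assumption of this sketch).
    return 2 if _SLOW_SHIFTER.search(instr) else 1


def next_instruction_stall(instr: str) -> int:
    """Return how many cycles the instruction after `instr` is delayed."""
    return issue_latency(instr) - 1


print(next_instruction_stall("add r5, r6, r2, LSL r3"))    # 1
print(next_instruction_stall("add r5, r6, r2, LSL #10"))   # 0
```

The delay depends only on the form of the shifter operand and not on which registers the next instruction reads, which is why reordering does not help here and the instruction itself has to change.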