Intel® IXP45X and Intel® IXP46X Product Line of Network Processors: Intel XScale® Processor
Developer's Manual, August 2006, Order Number: 306262-004US
3.10.5.1.2 Scheduling Load and Store Multiple (LDM/STM)
LDM and STM instructions have an issue latency of 2 to 20 cycles, depending on the
number of registers being loaded or stored. Assuming a data cache hit, the issue latency
is typically two cycles plus one additional cycle for each register loaded or stored.
The instruction following an LDM stalls whether or not it depends on the results of the
load. An LDRD or STRD instruction does not suffer from this drawback (except when
followed by a memory operation) and should be used where possible.

Consider the task of adding two 64-bit integer values. Assume that the addresses of
these values are aligned on an 8-byte boundary. This can be achieved using the LDM
instructions as shown below:

    ; r0 contains the address of the first 64-bit value
    ; r1 contains the address of the second 64-bit value
    ldm   r0, {r2, r3}
    ldm   r1, {r4, r5}
    adds  r0, r2, r4
    adc   r1, r3, r5

If the code were written as shown above, assuming all the accesses hit the cache, the
code would take 11 cycles to complete. Rewriting the code as shown below using the LDRD
instruction would take only seven cycles to complete. The performance would increase
further if other instructions can be scheduled after the LDRD instructions to reduce the
stalls due to their result latencies.

    ; r0 contains the address of the first 64-bit value
    ; r1 contains the address of the second 64-bit value
    ldrd  r2, [r0]
    ldrd  r4, [r1]
    adds  r0, r2, r4
    adc   r1, r3, r5

Similarly, the code sequence shown below takes five cycles to complete.

    stm   r0, {r2, r3}
    add   r1, r1, #1

The alternative version shown below would take only three cycles to complete.

    strd  r2, [r0]
    add   r1, r1, #1

3.10.5.2 Scheduling Data Processing Instructions

Most data processing instructions for the IXP45X/IXP46X network processors have a
result latency of one cycle. This means that the current instruction is able to use the
result from the previous data processing instruction. However, the result latency is two
cycles if the current instruction needs to use the result of the previous data processing
instruction for a shift by immediate. As a result, the following code segment would
incur a one-cycle stall for the MOV instruction:

    sub   r6, r7, r8
    add   r1, r2, r3
    mov   r4, r1, LSL #2

The code above can be rearranged as follows to remove the one-cycle stall:

    add   r1, r2, r3
    sub   r6, r7, r8
    mov   r4, r1, LSL #2
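
The shift-by-immediate stall rule described in this section can be checked with a small model. The Python sketch below is only an illustration under simplifying assumptions: it treats the core as in-order and single-issue, and it models only the two data processing result latencies stated in the text (one cycle for an ordinary use, two cycles when the result is consumed as the operand of a shift by immediate). The names `Instr` and `count_stalls` are hypothetical helpers introduced here; this is not a cycle-accurate simulator of the processor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """One data processing instruction (hypothetical helper type)."""
    text: str            # assembly text, for readability only
    dest: str            # destination register
    sources: tuple = ()  # registers read as ordinary operands
    shifted: tuple = ()  # registers read as the operand of a shift by immediate


def count_stalls(program):
    """Count stall cycles for an in-order, single-issue instruction sequence.

    Assumption (from the text): a result can be used by the next instruction
    (latency 1), unless it feeds a shift by immediate (latency 2).
    """
    ready_plain = {}  # register -> first cycle an ordinary use may issue
    ready_shift = {}  # register -> first cycle a shift-by-immediate use may issue
    cycle = 0         # earliest cycle at which the next instruction could issue
    stalls = 0
    for ins in program:
        earliest = cycle
        for reg in ins.sources:
            earliest = max(earliest, ready_plain.get(reg, 0))
        for reg in ins.shifted:
            earliest = max(earliest, ready_shift.get(reg, 0))
        stalls += earliest - cycle
        ready_plain[ins.dest] = earliest + 1
        ready_shift[ins.dest] = earliest + 2
        cycle = earliest + 1
    return stalls


# The two orderings discussed in this section.
original = [
    Instr("sub r6, r7, r8", "r6", sources=("r7", "r8")),
    Instr("add r1, r2, r3", "r1", sources=("r2", "r3")),
    Instr("mov r4, r1, LSL #2", "r4", shifted=("r1",)),
]
rearranged = [
    Instr("add r1, r2, r3", "r1", sources=("r2", "r3")),
    Instr("sub r6, r7, r8", "r6", sources=("r7", "r8")),
    Instr("mov r4, r1, LSL #2", "r4", shifted=("r1",)),
]

print(count_stalls(original))    # 1
print(count_stalls(rearranged))  # 0
```

In this model the independent SUB fills the slot between the ADD and the shifted use of r1, which is why the rearranged sequence reports no stall. Load and store latencies are deliberately left out, so the sketch should not be used to reproduce the LDM/LDRD cycle counts.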