Intel® IXP45X and Intel® IXP46X Product Line of Network Processors
Developer's Manual, August 2006
Order Number: 306262-004US
Intel XScale® Processor: Intel® IXP45X and Intel® IXP46X Product Line of Network Processors
loads are outstanding and are being fetched from memory. Code should therefore
ensure that no more than four loads are outstanding at the same time; for example,
no more than four loads should be issued back to back. A preload instruction may
also use a fill buffer, so outstanding preloads must be counted toward the number
of outstanding loads.
Similarly, the number of write buffers limits the number of successive writes that
can be issued before the processor stalls: no more than eight stores can be issued
in succession. Note also that if the data cache uses the write-allocate, write-back
policy, a load may itself cause stores to external memory when the read evicts a
dirty (modified) cache line. This can further reduce the number of sequential
stores that can be issued without stalling.
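The two limits above can be turned into a rough check on an instruction sequence. The following is a minimal sketch, not a tool from this manual: the helper names and the mnemonic sets are hypothetical, and it treats a run of back-to-back loads or stores as a conservative stand-in for "outstanding" operations. It does not model stores caused by dirty-line evictions.

```python
# Hypothetical lint-style helper. The limits (4 loads, 8 stores) come from
# the text; the mnemonic sets and function names are assumptions.
LOAD_LIKE = {"ldr", "ldrb", "ldrh", "ldrd", "pld"}  # pld may also use a fill buffer
STORE_LIKE = {"str", "strb", "strh", "strd"}
MAX_LOADS = 4
MAX_STORES = 8


def longest_run(mnemonics, kind):
    """Return the length of the longest run of consecutive mnemonics in `kind`."""
    best = current = 0
    for m in mnemonics:
        current = current + 1 if m.lower() in kind else 0
        best = max(best, current)
    return best


def check_sequence(mnemonics):
    """Return warnings for runs that exceed the fill/write buffer counts."""
    warnings = []
    if longest_run(mnemonics, LOAD_LIKE) > MAX_LOADS:
        warnings.append("more than 4 back-to-back loads/preloads")
    if longest_run(mnemonics, STORE_LIKE) > MAX_STORES:
        warnings.append("more than 8 back-to-back stores")
    return warnings
```

For instance, `check_sequence(["ldr", "ldr", "add", "ldr", "ldr", "pld"])` returns an empty list, because the longest load/preload run is three.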
3.10.5.1.1 Scheduling Load and Store Double (LDRD/STRD)
The IXP45X/IXP46X network processors introduce two new double-word instructions:
LDRD and STRD. LDRD loads 64 bits of data from an effective address into two
consecutive registers; conversely, STRD stores 64 bits from two consecutive registers
to an effective address. There are two important restrictions on how these instructions
may be used:
• The effective address must be aligned on an 8-byte boundary.
• The specified register must be even (r0, r2, etc.).
When both restrictions are satisfied, using LDRD/STRD instead of LDM/STM to do the
same thing is more efficient, because LDRD and STRD issue in only one and two clock
cycles respectively, whereas LDM/STM issues in four clock cycles. Avoid LDRDs
targeting R12; this incurs an extra cycle of issue latency.
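The restrictions and the R12 advice can be expressed as a small operand check. This is a minimal sketch with a hypothetical function name; it takes the first register as a number and the effective address as an integer, and only restates the rules given in the text.

```python
def check_ldrd_operands(rd, address):
    """Return a list of problems for an LDRD whose first register is r<rd>.

    rd is the register number (0 for r0, 2 for r2, ...); address is the
    effective address as an integer. The first two checks also apply to STRD.
    """
    problems = []
    if address % 8 != 0:
        problems.append("effective address is not 8-byte aligned")
    if rd % 2 != 0:
        problems.append("first register is not even")
    if rd == 12:
        # Not an error, but the text advises against it for LDRD.
        problems.append("r12 target costs an extra cycle of issue latency")
    return problems
```

An empty result means the operands satisfy both restrictions and avoid the R12 penalty.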
The LDRD instruction has a result latency of 3 or 4 cycles depending on the destination
register being accessed (assuming the data being loaded is in the data cache).

    add r6, r7, r8
    sub r5, r6, r9
    ; The following ldrd instruction would load values
    ; into registers r0 and r1
    ldrd r0, [r3]
    orr r8, r1, #0xf
    mul r7, r0, r7

In the code example above, the ORR instruction would stall for three cycles because of
the four cycle result latency for the second destination register of an LDRD instruction.
The code shown above can be rearranged to remove the pipeline stalls:

    ; The following ldrd instruction would load values
    ; into registers r0 and r1
    ldrd r0, [r3]
    add r6, r7, r8
    sub r5, r6, r9
    mul r7, r0, r7
    orr r8, r1, #0xf

Any memory operation following a LDRD instruction (LDR, LDRD, STR and so on)
would stall for 1 cycle.

    ; The str instruction below would stall for 1 cycle
    ldrd r0, [r3]
    str r4, [r5]
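The three-cycle ORR stall, and its removal by reordering, can be reproduced with a very small in-order scoreboard model. This is an illustrative sketch only. The LDRD result latencies (3 for the first register, 4 for the second) come from the text. The one-cycle result latency used for ADD, SUB, ORR and MUL is an assumption that does not affect these two sequences. The model ignores the one-cycle stall for a memory operation that follows an LDRD.

```python
def count_stalls(program):
    """Issue one instruction per cycle, in order, stalling until sources are ready.

    Each instruction is (mnemonic, {dest_reg: result_latency}, [source_regs]).
    Returns the number of stall cycles incurred by each instruction.
    """
    ready = {}   # register -> first cycle at which its value can be consumed
    cycle = 0    # earliest cycle at which the next instruction could issue
    stalls = []
    for _name, dests, sources in program:
        issue = max([cycle] + [ready.get(reg, 0) for reg in sources])
        stalls.append(issue - cycle)
        for reg, latency in dests.items():
            ready[reg] = issue + latency
        cycle = issue + 1
    return stalls


# LDRD latencies are from the text; the 1-cycle latencies are assumed.
LDRD = ("ldrd", {"r0": 3, "r1": 4}, ["r3"])
ADD = ("add", {"r6": 1}, ["r7", "r8"])
SUB = ("sub", {"r5": 1}, ["r6", "r9"])
ORR = ("orr", {"r8": 1}, ["r1"])
MUL = ("mul", {"r7": 1}, ["r0", "r7"])

original = [ADD, SUB, LDRD, ORR, MUL]
rearranged = [LDRD, ADD, SUB, MUL, ORR]

print(count_stalls(original))    # [0, 0, 0, 3, 0]
print(count_stalls(rearranged))  # [0, 0, 0, 0, 0]
```

In the original order, ORR needs r1 one cycle after the LDRD issues, so it waits three cycles. In the rearranged order, three independent instructions fill the gap before r1 is consumed.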