GR740-UM-DS, Nov 2017, Version 1.7
6.2.2 Instruction pipeline
The LEON4 integer unit uses a single instruction issue pipeline with 7 stages:
1. FE (Instruction Fetch): If the instruction cache is enabled, the instruction is fetched from the instruction cache; otherwise, the fetch is forwarded to the memory controller. The instruction is valid at the end of this stage and is latched inside the IU.
2. DE (Decode): The instruction is decoded and the CALL/branch target addresses are generated.
3. RA (Register Access): Operands are read from the register file or from internal data bypasses.
4. EX (Execute): ALU, logical, and shift operations are performed. For memory operations (e.g., LD) and for JMPL/RETT, the address is generated.
5. ME (Memory): The data cache is accessed. Store data read out in the execute stage is written to the data cache at this time.
6. XC (Exception): Traps and interrupts are resolved. For cache reads, the data is aligned.
7. WR (Write): The results of ALU and cache operations are written back to the register file.
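The single-issue, in-order flow of these seven stages can be sketched as a toy occupancy model (this is an illustrative Python model, not part of the datasheet): with no stalls, each instruction enters FE one cycle after its predecessor, so instruction i occupies stage s in cycle i + s.

```python
# Toy model of the LEON4 single-issue, in-order 7-stage pipeline.
# Assumes an ideal pipeline: cache hits, no interlocks, no hold cycles.

STAGES = ["FE", "DE", "RA", "EX", "ME", "XC", "WR"]

def stage_occupancy(n_instructions):
    """Return, per cycle, which instruction (by index) sits in each stage."""
    total_cycles = n_instructions + len(STAGES) - 1
    table = []
    for cycle in range(total_cycles):
        row = {}
        for s, name in enumerate(STAGES):
            instr = cycle - s  # instruction issued s cycles ago is in stage s
            if 0 <= instr < n_instructions:
                row[name] = instr
        table.append(row)
    return table

occ = stage_occupancy(3)
# In cycle 0 only instruction 0 has been fetched; by cycle 2 instruction 0
# has reached RA while instruction 2 is in FE.
```

With three instructions the model needs 3 + 6 = 9 cycles before the last instruction retires in WR, matching the usual fill-plus-drain behavior of a 7-stage pipeline.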
Table 35 lists the cycles per instruction (assuming cache hit and no icc or load interlock):
* Multiplication cycle count is 1 clock (1-clock issue rate, 2-clock data latency) for the 32x32 multiplier.
Additional conditions that can extend an instruction's duration in the pipeline are listed in the table and text below.
Branch interlock: When a conditional branch or trap is performed 1-2 cycles after an instruction which modifies the condition codes, 1-2 cycles of delay are added to allow the condition to be computed. If static branch prediction is enabled, this extra delay is incurred only if the branch is not taken.
Load delay: When an instruction uses the loaded data shortly after the load instruction, the second instruction is delayed to satisfy the pipeline's load delay.
Mul latency: Pipelined multiplier implementations have 1 cycle of extra data latency; accessing the result immediately after a MUL therefore adds one cycle of pipeline delay.
Hold cycles: During cache miss processing or when blocking on the store buffer, the pipeline will be held still until the data is ready, effectively extending the execution time of the instruction causing the miss by the corresponding number of cycles. Note that since the whole pipeline is held still, hold cycles will not mask load delay or interlock delays. For instance, on a load cache miss followed by a data-dependent instruction, both hold cycles and load delay will be incurred.
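The additive behavior of hold cycles and load delay can be illustrated with a small sketch (the miss penalty below is an illustrative value, not a figure from this datasheet):

```python
# Illustrative only: hold cycles and the load-use interlock compound rather
# than overlap, because the whole pipeline is held during miss processing.
MISS_PENALTY = 10  # hypothetical hold cycles for a data-cache miss
LOAD_DELAY = 1     # one-cycle load-use interlock

def load_use_cost(cache_hit):
    """Cycles charged to a load immediately followed by a dependent instruction."""
    base = 1
    hold = 0 if cache_hit else MISS_PENALTY
    return base + hold + LOAD_DELAY  # hold cycles do not mask the interlock
```

A dependent-use load thus costs 2 cycles on a hit and, with the assumed penalty, 12 cycles on a miss: the interlock cycle is paid on top of the hold cycles.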
FPU: The floating-point unit or coprocessor may need to hold the pipeline or extend a specific instruction.
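The conditions above can be combined into a rough per-sequence cycle estimator. This is a simplified sketch, not a tool from the datasheet: the opcode names and the flat one-cycle interlock penalty are assumptions (real delays depend on the exact pipeline distance between instructions), and cache hits with no hold cycles are assumed throughout.

```python
# Simplified estimator: base costs from Table 35 plus a one-cycle penalty
# when an instruction immediately consumes the result of a load or multiply.

BASE_CYCLES = {
    "jmpl": 3, "rett": 3, "smul": 1, "umul": 1,
    "sdiv": 35, "udiv": 35, "other": 1,
}

def estimate_cycles(seq):
    """seq: list of (opcode, defs, uses) tuples, where defs/uses are
    sets of register names the instruction writes/reads."""
    total = 0
    prev_op, prev_defs = None, set()
    for op, defs, uses in seq:
        total += BASE_CYCLES.get(op, BASE_CYCLES["other"])
        if prev_op in ("ld", "smul", "umul") and prev_defs & set(uses):
            total += 1  # load-delay or mul-latency interlock
        prev_op, prev_defs = op, set(defs)
    return total

# A load of r1 immediately followed by an add that reads r1:
# 1 (load) + 1 (add) + 1 (load-use interlock) = 3 cycles.
seq = [("ld", {"r1"}, set()), ("add", {"r2"}, {"r1"})]
```

Separating the two instructions by an unrelated one would remove the interlock cycle, which is the usual scheduling remedy for both the load-delay and mul-latency cases.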
Table 35. Instruction timing

  Instruction               Cycles (MMU disabled)
  JMPL, RETT                3
  SMUL/UMUL                 1*
  SDIV/UDIV                 35
  Taken trap                5
  Atomic load/store         5
  All other instructions    1