Execution Timing
PowerPC e500 Core Family Reference Manual, Rev. 1
Freescale Semiconductor
This example shows the pipeline for two divw instructions interspersed among mulli instructions
(although any non-divide instructions that use the MU could have been used in place of the mulli
instructions). The stages occupied by divw instructions are highlighted in grey. In clock cycle 0,
the first divw is issued to the first stage of the MU. In clock cycle 1, the divw moves out of the
MU main pipeline into an iterative stage in the two-stage bypass path while the first mulli is issued
to MU stage 1.
The divw iterates in the first stage of the bypass path while a series of mulli instructions passes
through the main MU pipeline. At the end of clock 4, the first of the mulli instructions finishes
and leaves the MU pipeline. Although the mulli can finish out of order with respect to the divw, it
cannot complete ahead of it.
In clock cycle 6, a signal is passed to the issue logic to indicate that divw 1 will reenter the main
MU pipeline in 4 cycles. This creates a bubble that passes down the pipeline, making a space for
the divw instruction to reenter the main pipeline in clock cycle 10.
A second divw enters the first MU stage in clock cycle 10. Had divw 2 been issued earlier, it would
have stalled in the reservation station until divw 1 vacated the second stage of the bypass path. In
other words, the MU can hold as many as two divide instructions only if one is in the MU fourth
stage (as is the case in clock cycle 10).
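The cycle arithmetic in this example can be sketched as follows. This is a minimal illustrative model, not a description of the hardware: the function names are hypothetical, and it assumes that instructions advance one stage per clock with no stalls and that the divide reenters the main pipeline at the fourth MU stage, as the text indicates for clock cycle 10.

```python
MU_STAGES = 4  # depth of the main MU pipeline, per the text


def finish_cycle(issue_cycle, stages=MU_STAGES):
    """Cycle at whose end a non-divide MU instruction leaves the pipeline.

    Assumes the instruction enters stage 1 in issue_cycle and never stalls.
    """
    return issue_cycle + stages - 1


def bubble_issue_cycle(reentry_cycle, reentry_stage=MU_STAGES):
    """Issue slot that must stay empty so that reentry_stage is free in
    reentry_cycle (assumption: the bubble moves one stage per clock)."""
    return reentry_cycle - (reentry_stage - 1)


print(finish_cycle(1))         # first mulli, issued in cycle 1 -> 4
print(bubble_issue_cycle(10))  # divw 1 reenters in cycle 10 -> 7
```

Under these assumptions the first mulli finishes at the end of clock 4, matching the text, and the empty issue slot falls in cycle 7, the cycle after the signal reaches the issue logic in cycle 6.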
Table 4-6 lists SU and MU execution latencies. As Table 4-6 shows, most instructions executed in
the SU have a single-cycle execution latency.
4.4.3.2 MU Floating-Point Execution
The MU executes all floating-point arithmetic operations except efststx, efdtstx and evfststx.
Embedded floating-point operations largely comply with the IEEE 754 floating-point standard.
Software exception handling is required to achieve full IEEE 754 compliance because the IEEE
floating-point exception model is not fully implemented in hardware.
Floating-point arithmetic instructions, except for divide, execute with 4-cycle latency and 1-cycle
throughput. Single-precision floating-point multiply, add, and subtract instructions execute in the
four-stage MU pipeline.
If rA or rB is zero, a floating-point divide takes 4 cycles. All other cases take 29 cycles.
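The latency figures above can be turned into a rough cycle estimate. The sketch below is illustrative only: the function names are hypothetical, the operands are independent and issued back to back with no stalls, and a Python comparison with zero stands in for the hardware's zero-operand check (whether the hardware treats negative zero the same way is not stated here).

```python
FP_LATENCY = 4             # non-divide FP arithmetic latency, from the text
FP_DIV_ZERO_LATENCY = 4    # FP divide when rA or rB is zero, from the text
FP_DIV_LATENCY = 29        # FP divide in all other cases, from the text


def fp_divide_latency(ra, rb):
    """Latency in cycles of one floating-point divide with operands ra, rb."""
    return FP_DIV_ZERO_LATENCY if ra == 0 or rb == 0 else FP_DIV_LATENCY


def pipelined_cycles(n, latency=FP_LATENCY):
    """Cycles to finish n independent non-divide FP operations.

    With 1-cycle throughput, the first result appears after `latency`
    cycles and each later one follows a cycle behind its predecessor.
    """
    return 0 if n == 0 else latency + (n - 1)


print(fp_divide_latency(0.0, 2.5))  # 4
print(fp_divide_latency(1.5, 2.5))  # 29
print(pipelined_cycles(10))         # 13
```

Ten independent adds therefore cost about 13 cycles under this model, compared with 29 cycles for a single divide with nonzero operands.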
Table 4-8 shows floating-point instruction execution timing.
4.4.4 Load/Store Execution
The LSU executes instructions that move data between the GPRs and the memory unit of the core
(made up of the L1 caches and the core interface unit buffers).
Figure 4-10 shows the block diagram for the LSU.