User’s Manual
PPC440x5 CPU Core
Preliminary
Page 570 of 589
optimize.fm.
September 12, 2002
If the CR-update is a MAC or a 16×32 multiply, 1 to 3 instructions should be scheduled between the CR-update and the branch (0 or 1 instruction, depending on whether the CR-update pairs with the instruction before or after, or 1 to 2 instructions to issue between the issue of the CR-update and the issue of the branch, depending on whether there is a single-issue or dual-issue opportunity for the instruction(s) which are scheduled between the CR-update and the branch).
Similarly, if the CR-update is a 32×32 multiply, divide, tlbsx., or stwcx., schedule 3 to 5 instructions between the CR-update and the branch (two issue cycles of 2 to 4 instructions between, plus the 0 to 1 issuing with the CR-update).
Finally, if the CR-update is mtcrf, schedule 5 to 7 instructions between (3 cycles of issue between them).
5. Avoid the use of string/multiple instructions (with some exceptions).
The exceptions have to do with cache effects (more cache misses due to more instructions if you use separate loads/stores instead of a string/multiple), and the specialized behavior of a string, where the bytes are inserted into the more-significant portion of the GPR, in preparation for a “string compare” operation to determine which string is “greater” than another. If the string/multiple is for a relatively small number of registers (or the expansion into discrete loads/stores is known to not have an overall detrimental cache impact), and if a string is being used only for a copy operation and the size is known, performance can be improved by using discrete loads/stores. Essentially, due to hazard determination within the processor, string/multiples impose a couple of cycles of extra, “false” penalty on both the front-end and the back-end. On the other hand, if this penalty is amortized over a large number of registers (say 16 or so), the impact of the extra stalls is probably negligible.
6. Insert 10 or so instructions within a bdnz loop (loop unrolling).
7. Put 4 to 8 instructions between mtlr/mtctr and blr/bctr.
8. Put 1 to 3 instructions between a 16×32 multiply and the use of the result.
9. Put 2 to 5 instructions between a 32×32 multiply and the use of the result.
10. Use the “without allocate” attribute appropriately on block copy operations, such as calls to the library
memcpy function, or implicit structure copies.
11. Block move operations. If moving a block of memory using a series of load/store operations, perform the
load/store operations in the following order: L1-L2-L3-S1-S2-S3, and repeat. Having the second and third
loads between the first load and the first store fills the two-cycle load-use penalty.
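
The load/store ordering in item 11 can be sketched in C as follows. This is a minimal illustration, not code from the manual: the function name copy_words_l3s3 is hypothetical, the buffers are assumed not to overlap, and the compiler remains free to reschedule the accesses, so the generated assembly should be checked to confirm that the L1-L2-L3-S1-S2-S3 order survives.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the manual): copies n 32-bit words from
 * src to dst, issuing three loads ahead of three stores in each pass
 * (L1-L2-L3-S1-S2-S3). Assumes src and dst do not overlap. */
void copy_words_l3s3(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;

    /* Main loop: the second and third loads sit between the first load
     * and the first store, which is the ordering item 11 recommends. */
    for (; i + 3 <= n; i += 3) {
        uint32_t a = src[i];        /* L1 */
        uint32_t b = src[i + 1];    /* L2 */
        uint32_t c = src[i + 2];    /* L3 */
        dst[i]     = a;             /* S1 */
        dst[i + 1] = b;             /* S2 */
        dst[i + 2] = c;             /* S3 */
    }

    /* Tail: copy the remaining 0 to 2 words one at a time. */
    for (; i < n; i++) {
        dst[i] = src[i];
    }
}
```

Each pass moves only three words, so a real copy loop would likely repeat the L1-L2-L3-S1-S2-S3 group several times per iteration to approach the loop-body size suggested in item 6.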