User’s Manual

PPC440x5 CPU Core

Preliminary

Page 570 of 589

September 12, 2002

If the CR-update is a MAC or a 16 × 32 multiply, 1 to 3 instructions should be scheduled between the
CR-update and the branch (0 or 1 instruction, depending on whether the CR-update pairs with the instruction
before or after, plus 1 to 2 instructions to issue between the issue of the CR-update and the issue of the
branch, depending on whether there is a single-issue or dual-issue opportunity for the instruction(s)
that are scheduled between the CR-update and the branch).

Similarly, if the CR-update is a 32 × 32 multiply, divide, tlbsx., or stwcx., schedule 3 to 5 instructions
between the CR-update and the branch (two issue cycles of 2 to 4 instructions between, plus the 0 to 1
issuing with the CR-update).

Finally, if the CR-update is mtcrf, schedule 5 to 7 instructions between (3 cycles of issue between them).

5. Avoid the use of string/multiple instructions (with some exceptions).

The exceptions have to do with cache effects (more cache misses due to more instructions if you use
separate loads/stores instead of a string/multiple), and the specialized behavior of a string, where the
bytes are inserted into the more-significant portion of the GPR, in preparation for a “string compare”
operation to determine which string is “greater” than another. If the string/multiple is for a relatively small
number of registers (or the expansion into discrete loads/stores is known to not have an overall detrimental
cache impact), and if a string is being used only for a copy operation and the size is known, performance
can be improved by using discrete loads/stores. Essentially, due to hazard determination within the
processor, string/multiples impose a couple of cycles of extra, “false” penalty on both the front-end and the
back-end. On the other hand, if this penalty is amortized over a large number of registers (say 16 or so),
the impact of the extra stalls is probably negligible.

6. Insert 10 or so instructions within a bdnz loop (loop unrolling).

7. Put 4 to 8 instructions between mtlr/mtctr and blr/bctr.

8. Put 1 to 3 instructions between a 16 × 32 multiply and the use of the result.

9. Put 2 to 5 instructions between a 32 × 32 multiply and the use of the result.

10. Use the “without allocate” attribute appropriately on block copy operations, such as calls to the library
memcpy function, or implicit structure copies.

11. Block move operations. If moving a block of memory using a series of load/store operations, perform the
load/store operations in the following order: L1-L2-L3-S1-S2-S3, and repeat. Having the second and third
loads between the first load and the first store fills the two-cycle load-use penalty.
