MIPS MIPS32 74K Series Скачать руководство пользователя страница 18

Страница: 18 / 156

1.4 A brief guide to the 74K

™

core implementation

Programming the MIPS32® 74K™ Core Family, Revision 02.14

The IFU’s branch predictor guesses whether conditional branches will be taken or not - it’s not magic, it uses a
BHT (a “Branch History Table”) of what happened to branches in the past, indexed by the low bits of the loca-
tion of the branch instruction. This particular hardware is an example of Combined branch prediction (majority
voting between three different algorithms, one of which is gshare; if you want to know, there’s a good wikipedia
article whose topic name is “Branch Predictor”). The branch predictor is taking a good guess. It can seem sur-
prising that the predictor makes no attempt to discover whether the history stored in a BHT slot is really that of
the current branch, or another one which happened to share the same low address bits; we’re going to be wrong
sometimes. It guesses correctly most of the time.

In this way the IFU can predict the next-instruction address and continue to run ahead.

•

When the IFU guesses wrong, it doesn’t know (the dog just rushes ahead until its owner reaches the fork). The
branch mispredict will be noticed once the branch instruction has been issued and carried through to the AGEN
“EC” stage, and is executed in its full context (“resolved”). On detecting a mispredict, the CPU must discard the
instructions based on the bad guess (which will not have graduated yet, so will not have changed any vital

machine state) and start fetching instructions from the correct target

. The exact penalty paid by a program which

suffers a mispredict depends on how busy the execution unit is, and how early it resolves the branch; the mini-
mum penalty is 12 cycles.

•

Even when we guess right, the branch target calculation in the IFU takes a little while to operate. A rapid
sequence of correctly-predicted branches can empty the queues, causing a program to run slower.

•

Jump-register instruction targets are unpredictable: the IFU has no knowledge of register data and can’t in gen-
eral anticipate it. But jump-register instructions are relatively rare, except for subroutine returns. In the MIPS
ISA you return from subroutines using a jump-register instruction,

jr $31

(register 31 is, by a strong conven-

tion, used to hold the return address). So on every call instruction, the IFU pushes the return address onto a small

stack; and on every

jr $31

it pops the value of the stack and uses that as its guess for the branch target

We have no way of knowing the target of a

instruction which uses a register other than

$31

. When we find

one of those, instruction fetch stops until the correct address is computed up in the AGEN pipeline, 12 or more
clocks later.

1.4.3 Loads and load-to-use delays

Even short-pipeline MIPS CPUs can’t deliver load data to the immediately following instruction without a delay,
even on a cache hit. Simple MIPS pipelines typically deliver the data one clock later: a one clock “load-to-use delay”.
Compilers and programmers try to put some useful and non-dependent operation between the load and its first use.

The 74K core’s long pipeline means that a full D-cache hit takes four clocks to return the data, not two: that would be
a three-clock “load-to-use delay”. A pair of loads dependent on each other (one fetches the other’s base address) must
be issued at least four cycles apart (that’s optimistic, hoping-for-a-hit timing).

But the AGEN and ALU pipelines are “skewed”, with ALU results delivered a cycle later than AGEN results. That
means that when an ALU operation is dependent on a load, it can be issued only three cycles after the load. There’s a
price to pay: a load/store whose base address is computed by a preceding ALU instruction must be issued a clock

In “branch-likely” variants of conditional branch instructions a mispredict means we also did the wrong thing with the
instruction in the branch delay slot. To fix that up, we need to refetch the branch itself, so the penalty is at least one cycle
higher.

The return-stack guess will be wrong for subroutines containing nested calls deeper than the size of the return stack; but sub-
routines high up the call tree are much more rarely executed, so this isn’t so bad.