Introduction

Programming the MIPS32® 74K™ Core Family, Revision 02.14

cycles between the point where the exception is processed in the graduation unit and the time when the first
instruction of the exception handler graduates.

• Loads and Stores: the L1 cache lookup happens inside the out-of-order execution pipeline. But only loads which
hit in the L1 cache are complete when they graduate. Other loads and stores graduate and then start actions in the
memory pipeline. It’s probably fairly obvious how a store can be “stored” — so long as the hardware keeps a note
of the address and data of the store, the cache/memory update can be done later. On the 74K core, even a write
into the L1 cache is deferred until after graduation. While the write is pending, the cache hardware has to keep a
note in case some later instruction wants to load the same value before we’ve completed the write; but that’s
familiar technology.

It’s less obvious that we can allow load instructions which L1-miss to graduate. But on the 74K core, loads are
non-blocking — a load executes, and results in data being loaded into a GP register at some time in the future.
Any later instruction which reads the register value must wait until the load data has arrived. So load instructions
are allowed to graduate regardless of how far away their data is. Once the instruction graduates its CB entry must
be given back, so data arriving for a graduated load is sent directly to the register file.

There’s another key reason why we did this: with only L1 accesses done out-of-order, loads and stores only
become visible outside the CPU after they graduate, so there’s no worry about other parts of the system seeing
unexpected effects from speculative instructions.

An instruction which depends on a load which misses will (unless it was a long, long way behind in instruction
sequence) have to wait. Most often the consuming instruction will become a candidate for issue before we know
whether the load hit in the L1 cache. In this case the dependent instruction is issued: we’re optimists, hoping for
a hit. If a consuming instruction reaches graduation and finds the load missed, we must do a “redirect” (re-fetching
the consuming instruction and everything later in program order). Next time the consuming instruction is an
issue candidate, we’ll know the load has missed, and the consumer will not get issued until the load data has
arrived. The redirect for the consuming instruction is quite expensive (19 or more cycles), but in most cases that
overhead will be hidden in the time taken to return data for the cache miss.

Stores are less complicated. But since even the cache must not be updated until the store instruction graduates,
the memory pipeline is used for writing the L1 cache too: even store L1-hits result in action in the memory
pipeline.
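
The deferred-write bookkeeping described above (keep a note of each graduated store's address and data, and let a later load pick the value up from that note before the cache has been written) can be sketched as a toy software model. This is only an illustration under stated assumptions: the names `store_buffer`, `sb_note_store`, `sb_forward`, `sb_drain_one` and `load_word`, the capacity `PENDING_MAX`, and the word-aligned, word-sized accesses are all hypothetical and say nothing about how the 74K hardware is actually organised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PENDING_MAX 4   /* hypothetical capacity, not a 74K figure */

typedef struct {
    uint32_t addr;      /* word-aligned byte address (assumption) */
    uint32_t data;
} pending_store;

typedef struct {
    pending_store slot[PENDING_MAX];
    size_t count;       /* slot[0] is the oldest pending store */
} store_buffer;

/* A graduated store is only noted here; the cache is not touched yet. */
bool sb_note_store(store_buffer *sb, uint32_t addr, uint32_t data)
{
    if (sb->count == PENDING_MAX)
        return false;   /* no room: in this model the store must wait */
    sb->slot[sb->count].addr = addr;
    sb->slot[sb->count].data = data;
    sb->count++;
    return true;
}

/* Search pending stores newest-first so a load sees the latest value. */
bool sb_forward(const store_buffer *sb, uint32_t addr, uint32_t *out)
{
    for (size_t i = sb->count; i-- > 0; ) {
        if (sb->slot[i].addr == addr) {
            *out = sb->slot[i].data;
            return true;
        }
    }
    return false;
}

/* Complete the oldest pending write into a toy word-indexed cache. */
void sb_drain_one(store_buffer *sb, uint32_t *cache, size_t cache_words)
{
    if (sb->count == 0)
        return;
    size_t idx = sb->slot[0].addr / 4;
    assert(idx < cache_words);
    cache[idx] = sb->slot[0].data;
    for (size_t i = 1; i < sb->count; i++)
        sb->slot[i - 1] = sb->slot[i];
    sb->count--;
}

/* A load consults the pending stores first, then the cache. */
uint32_t load_word(const store_buffer *sb, const uint32_t *cache,
                   size_t cache_words, uint32_t addr)
{
    uint32_t value;
    if (sb_forward(sb, addr, &value))
        return value;
    assert(addr / 4 < cache_words);
    return cache[addr / 4];
}
```

In this model a load issued between `sb_note_store` and `sb_drain_one` returns the new value even though the cache array still holds the old one, which is the property the text relies on.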

1.4.2 Branches and branch delays

The MIPS architecture defines that the instruction following a branch (the “branch delay slot” instruction) is always
executed. That means that the CPU has one instruction it knows will be executed while it’s figuring out where a
branch is going. But with the 74K core’s long pipeline we don’t finally know whether a conditional branch should be
taken, and won’t have computed the target address for a jump-register, until about 8 stages down the pipeline. It’s
better to guess (and pay the price when we’re wrong) than to wait to be certain. Several different tricks are used:

• The decoupled IFU (the electronic dog) runs ahead of the rest of the CPU by fetching four instructions per clock.

• Branch instructions are identified very early (in fact, they’re marked when instructions are fetched into the
I-cache). MIPS branch and jump instructions (at least those not dependent on register values) are easy to decode,
and the IFU decodes them locally to calculate the target address.
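
To see why branches that do not depend on register values are “easy to decode”, here is a minimal sketch of computing a target from nothing but the instruction word and its address, using the standard MIPS32 encodings. The function name `predecode` and the `branch_kind` enum are hypothetical, and the sketch deliberately covers only J, JAL, BEQ, BNE, BLEZ, BGTZ, JR and JALR; REGIMM, branch-likely and coprocessor branches are treated as “not a branch” here for brevity. It is not a description of the IFU's actual logic.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    NOT_BRANCH,          /* or a branch form this sketch does not cover */
    TARGET_KNOWN,        /* target computable from the instruction alone */
    TARGET_IN_REGISTER   /* JR/JALR: target needs a register value */
} branch_kind;

branch_kind predecode(uint32_t pc, uint32_t instr, uint32_t *target)
{
    uint32_t opcode = instr >> 26;
    uint32_t delay_slot_pc = pc + 4;   /* targets are relative to the delay slot */

    switch (opcode) {
    case 0x02:   /* J */
    case 0x03:   /* JAL */
        /* 26-bit word index within the 256 MB region of the delay slot */
        *target = (delay_slot_pc & 0xF0000000u) | ((instr & 0x03FFFFFFu) << 2);
        return TARGET_KNOWN;

    case 0x04:   /* BEQ */
    case 0x05:   /* BNE */
    case 0x06:   /* BLEZ */
    case 0x07: { /* BGTZ */
        /* portable sign extension of the 16-bit word offset */
        int32_t words = (int32_t)((instr & 0xFFFFu) ^ 0x8000u) - 0x8000;
        *target = delay_slot_pc + (uint32_t)(words * 4);
        return TARGET_KNOWN;
    }

    case 0x00: { /* SPECIAL: check the function field */
        uint32_t funct = instr & 0x3Fu;
        if (funct == 0x08 || funct == 0x09)   /* JR, JALR */
            return TARGET_IN_REGISTER;
        return NOT_BRANCH;
    }

    default:
        return NOT_BRANCH;
    }
}
```

For example, the word 0x1000FFFF (`beq $0, $0` with offset −1) at address 0x80001000 yields the target 0x80001000, whereas 0x03E00008 (`jr $ra`) is reported as `TARGET_IN_REGISTER`, because its destination cannot be known at fetch time.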

Footnote on “always executed”: that’s not quite accurate. There are special forms of conditional branches called “branch likely” which are defined to execute
the branch delay slot instruction only when the branch is taken. Note that the “likely” part of the name has nothing to do with
branch prediction; the 74K core’s branch prediction system treats the “likelies” just like any other branches. The dependency
between a branch condition and the branch delay slot instruction is annoying to keep track of in an out-of-order machine, and
MIPS would prefer you not to use branch-likely instructions.
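
The architectural difference between an ordinary branch and a branch-likely can be written down as a tiny model of which instruction addresses execute after the branch itself. The function name `architectural_path` and its parameters are hypothetical; the sketch captures only the rule stated above and says nothing about prediction or timing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Writes to out[] the addresses of the instructions that architecturally
 * execute right after a branch at `pc`, and returns how many were written.
 *   ordinary branch: the delay slot (pc + 4) always executes
 *   branch-likely:   the delay slot executes only when the branch is taken
 */
size_t architectural_path(uint32_t pc, uint32_t target,
                          bool is_likely, bool taken, uint32_t out[2])
{
    size_t n = 0;
    if (!is_likely || taken)
        out[n++] = pc + 4;              /* delay slot instruction */
    out[n++] = taken ? target : pc + 8; /* branch target or fall-through */
    return n;
}
```

A not-taken branch-likely is the only case that returns a single address: the delay slot is skipped and execution continues at `pc + 8`.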