4.7 Branch misprediction delays

Programming the MIPS32® 74K™ Core Family, Revision 02.14

In a long-pipeline design like this, branches would be expensive if you waited until the branch was executed before
fetching any more instructions. See Section 1.4 “A brief guide to the 74K™ core implementation” for what is done
about this. The upshot is that where the fetch logic can’t compute the target address, or guesses wrong, that’s going
to cost 12 or more lost cycles. Since we hope to average substantially more than one instruction per clock when
we’re not blocked on a cache miss, that’s worse than it sounds. The cost does depend on the sort of branch: the
conditional branch which closes a tight loop will almost always be predicted correctly after the first time around.
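
To put a number on “worse than it sounds”, here is a minimal sketch of the arithmetic in C. The 12-cycle figure is the lower bound quoted above; the function names and the sustained instructions-per-clock value passed in are illustrative assumptions, not measurements of any 74K core.

```c
#include <stdint.h>

/* Lower bound from the text: a mispredict costs "12 or more" cycles. */
enum { MISPREDICT_MIN_CYCLES = 12 };

/* Minimum cycles lost to a given number of mispredicted branches. */
uint64_t mispredict_min_cycles(uint64_t mispredicts)
{
    return mispredicts * MISPREDICT_MIN_CYCLES;
}

/*
 * Instruction slots those cycles could otherwise have filled.
 * ipc_x100 is the sustained instructions-per-clock times 100
 * (150 means 1.5 IPC); the value is the caller's assumption.
 */
uint64_t mispredict_min_lost_instructions(uint64_t mispredicts,
                                          uint32_t ipc_x100)
{
    return mispredict_min_cycles(mispredicts) * ipc_x100 / 100;
}
```

With an assumed 1.5 instructions per clock, 1000 mispredicts cost at least 12000 cycles, which is room for about 18000 instructions.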

However, too many branches in too short a period of time can overwhelm the ability of the instruction fetch logic to
keep ahead with its predictions, even if the predictions are almost always right. Three empty cycles occur between the
delivery of the branch delay slot instruction and the first instruction(s) from the branch target location. Where branchy
code can be replaced by conditional moves or tight loops “unrolled” a little to get at least 6-8 instructions between
branches, you’ll get significant benefits.
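
As a rough illustration of both suggestions, the C sketch below sums an array with each element clamped to a limit. The first function branches on every element; the second writes the clamp as a select expression, which a compiler may (but is not obliged to) turn into conditional moves such as MOVN/MOVZ, and unrolls the loop by four so the loop-closing branch comes around less often. The function names and the unroll factor are illustrative choices, and what the compiler actually emits should be checked in the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy form: a data-dependent branch on every element. */
int32_t sum_clamped_branchy(const int32_t *v, size_t n, int32_t limit)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > limit)
            sum += limit;
        else
            sum += v[i];
    }
    return sum;
}

/*
 * Select form, unrolled by four. Each clamp is a simple select that
 * the compiler is free to implement without a branch, and the only
 * branch left in the main loop is the one that closes it.
 */
int32_t sum_clamped_unrolled(const int32_t *v, size_t n, int32_t limit)
{
    int32_t sum = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        int32_t a = v[i]     > limit ? limit : v[i];
        int32_t b = v[i + 1] > limit ? limit : v[i + 1];
        int32_t c = v[i + 2] > limit ? limit : v[i + 2];
        int32_t d = v[i + 3] > limit ? limit : v[i + 3];
        sum += a + b + c + d;
    }

    /* Tidy up the last zero to three elements. */
    for (; i < n; i++)
        sum += v[i] > limit ? limit : v[i];

    return sum;
}
```

Both functions return the same result for the same input; only the shape of the control flow differs.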

The branch-likely instructions, deprecated by the MIPS32 architecture document, are predicted just like any other
branch. Misprediction of a branch-likely costs an extra cycle or two, because the branch and the delay slot instruction
need to be re-executed after a mispredict. Branch-likely instructions sometimes improve the performance of small
loops on 74K family cores, but they pose problems for the designers of complex CPUs, and may one day disappear
from the standard. Good compilers for the MIPS32 architecture should provide an option to avoid these instructions.

4.8 Load delayed by (unrelated) recent store

Load instructions are handled within the execution unit (the AGEN pipeline) with “standard” timing, just so long as
they hit in the cache. When a load misses (or, handled the same way, turns out to be uncached), a dependent
instruction which has already been dispatched will have to be replayed. That generates long delays, but you already
know about that. If the dependent instruction has not been dispatched at all, it will wait in the DDQ until the load
data becomes available.

However, store instructions are graduated before they are completed — which sounds problematic, but in fact you
can’t afford to let instructions write the cache (or commit a write to real memory) until they graduate and cease to be
speculative.

This presents a problem. A programmer may write code which stores a value in memory, then immediately loads the
same value. The CPU pipeline detects when instructions depend on each other through register values, but cannot
do the same for addresses. Instead, the load can get the right data from an incomplete store as a side-effect of
checking whether the data we want might be in the FSB (the “fill/store buffer”) attached to the D-cache: see
Section 3.3.1 “Read/write ordering and cache/memory data queues in the 74K™ core” for more information. The store
data can also be in intermediate stages/queues before being written into the FSB. A load whose address matches a
store in such an intermediate queue will also have the data bypassed back to the pipeline, as if the load hit in the
cache.
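
The idea of a load picking up data from a store that has not yet reached the cache can be sketched as a toy model in C. This is not the 74K hardware: the queue depth, the structure and function names, and the restriction to whole-word accesses at identical addresses are all assumptions made for illustration. It shows only the principle that a load checks pending stores, youngest first, before it looks anywhere else.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PENDING_STORES 4 /* hypothetical depth, not the real FSB size */

struct pending_store {
    uint32_t addr;
    uint32_t data;
    bool     valid;
};

/* Zero-initialise before use; the oldest entry is overwritten when full. */
struct store_queue {
    struct pending_store slot[PENDING_STORES];
    size_t next; /* total number of stores pushed so far */
};

/* Record a graduated store that has not yet been written to the cache. */
void sq_push(struct store_queue *q, uint32_t addr, uint32_t data)
{
    q->slot[q->next % PENDING_STORES] =
        (struct pending_store){ addr, data, true };
    q->next++;
}

/*
 * Look for a pending store to the same address, youngest first, so the
 * load sees the most recent value. Returns true and fills *out on a
 * match; false means the load has to go to the cache instead.
 */
bool sq_forward(const struct store_queue *q, uint32_t addr, uint32_t *out)
{
    for (size_t k = 0; k < PENDING_STORES; k++) {
        size_t idx = (q->next + PENDING_STORES - 1 - k) % PENDING_STORES;
        if (q->slot[idx].valid && q->slot[idx].addr == addr) {
            *out = q->slot[idx].data;
            return true;
        }
    }
    return false;
}
```

Real hardware compares addresses in parallel and has to cope with partial-word and overlapping accesses, which this model ignores.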

4.9 Minimum load-miss penalty

74K family cores will typically run at high frequencies, so any load which misses in the L1 D-cache is likely to be
substantially delayed, waiting for the memory data to come back. However, if you ever use the core with a very fast
memory, it’s worth observing that even a fast-serviced miss is still a serious event. If an instruction which consumes
the loaded data issues before we’re sure the load missed (and most of the time the consumer will only be a few places
behind in instruction sequence, and will have issued), then that instruction will have to be re-executed by stopping
execution and starting again on the consuming instruction. That means it has to be re-fetched from the I-cache, and
involves a delay of 15 cycles or so.