A Detailed Look Inside the Intel
®
NetBurst
™
Micro-Architecture of the Intel Pentium
®
4 Processor
Page 13
The Static Predictor
. Once the branch instruction is decoded, the direction of the branch (forward or backward) is
known. If there was no valid entry in the BTB for the branch, the static predictor makes a prediction based on the
direction of the branch. The static prediction mechanism predicts backward conditional branches (those with
negative displacement), such as loop-closing branches, as taken. Forward branches are predicted not taken.
Branch Target Buffer
. Once branch history is available, the Pentium 4 processor can predict the branch outcome
before the branch instruction is even decoded, based on a history of previously-encountered branches. It uses a
branch history table and a branch target buffer (collectively called the BTB) to predict the direction and target of
branches based on an instruction’s linear address. Once the branch is retired, the BTB is updated with the target
address.
Return Stack.
Returns are always taken, but since a procedure may be invoked from several call sites, a single
predicted target will not suffice. The Pentium 4 processor has a Return Stack that can predict return addresses for a
series of procedure calls. This increases the benefit of unrolling loops containing function calls. It also mitigates the
need to put certain procedures inline since the return penalty portion of the procedure call overhead is reduced.
Even if the direction and target address of the branch are correctly predicted well in advance, a taken branch may
reduce available parallelism in a typical processor, since the decode bandwidth is wasted for instructions which
immediately follow the branch and precede the target, if the branch does not end the line and target does not begin
the line. The branch predictor allows a branch and its target to coexist in a single trace cache line, maximizing
instruction delivery from the front end.
Branch Hints
The Pentium 4 processor provides a feature that permits software to provide hints to the branch prediction and trace
formation hardware to enhance their performance. These hints take the form of prefixes to conditional branch
instructions. These prefixes have no effect for pre-Pentium 4 processor implementations. Branch hints are not
guaranteed to have any effect, and their function may vary across implementations. However, since branch hints are
architecturally visible, and the same code could be run on multiple implementations, they should be inserted only in
cases which are likely to be helpful across all implementations.
Branch hints are interpreted by the translation engine, and are used to assist branch prediction and trace construction
hardware. They are only used at trace build time, and have no effect within already-built traces. Directional hints
override the static (forward-taken, backward-not taken) prediction in the event that a BTB prediction is not
available. Because branch hints increase code size slightly, the preferred approach to providing directional hints is
by the arrangement of code so that
(i) forward branches that are more probable should be in the not-taken path, and
(ii) backward branches that are more probable should be in the taken path. Since the branch prediction information
that is available when the trace is built is used to predict which path or trace through the code will be taken,
directional branch hints can help traces be built along the most likely path.
Execution Core Detail
The execution core is designed to optimize overall performance by handling the most common cases most
efficiently. The hardware is designed to execute the most frequent operations in the most common context as fast as
possible, at the expense of less-frequent operations in rare context. Some parts of the core may speculate that a
common condition holds to allow faster execution. If it does not, the machine may stall. An example of this pertains
to store forwarding. If a load is predicted to be dependent on a store, it gets its data from that store and tentatively
proceeds. If the load turned out not to depend on the store, the load is delayed until the real data has been loaded
from memory, then it proceeds.
Instruction Latency and Throughput
The superscalar, out-of-order core contains multiple execution hardware resources that can execute multiple
µ
ops in
parallel. The core’s ability to make use of available parallelism can be enhanced by: