Program Flow Prediction
ARM DDI 0301H
Copyright © 2004-2009 ARM Limited. All rights reserved.
5-4
ID012310
Non-Confidential, Unrestricted Access
5.2 Branch prediction
In ARM processors that have no Prefetch Unit (PU), the target of a branch is not known until the end of the
Execute stage. At the Execute stage it is known whether or not the branch is taken. The best
performance is obtained by predicting all branches as not taken and filling the pipeline with the
instructions that follow the branch in the current sequential path. In ARM processors without a
PU, an untaken branch requires one cycle and a taken branch requires three or more cycles.
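The cycle counts above can be turned into a rough average cost per branch. The following is a minimal sketch, not a model of any particular core: the function name and the idea of passing the taken fraction as a parameter are illustrative assumptions, and the taken cost is a parameter because the text only says "three or more".

```c
#include <assert.h>

/* Average cycles per branch when every branch is predicted not taken.
 * Costs from the text: 1 cycle if the branch is untaken, and
 * taken_cycles (3 or more) if it is taken.
 * taken_fraction is the fraction of branches that are taken (0.0 to 1.0).
 * The function name and parameters are hypothetical. */
static double avg_branch_cycles(double taken_fraction, double taken_cycles)
{
    assert(taken_fraction >= 0.0 && taken_fraction <= 1.0);
    assert(taken_cycles >= 3.0);
    return (1.0 - taken_fraction) * 1.0 + taken_fraction * taken_cycles;
}
```

For example, if half of all branches are taken and each taken branch costs three cycles, the average is two cycles per branch.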
Branch prediction enables the detection of branch instructions before they enter the integer core.
This permits the use of a branch prediction scheme that closely models actual conditional branch
behavior.
The increased pipeline length of the ARM1176JZF-S processor makes the performance penalty
of any changes in program flow, such as branches or other updates to the PC, more significant
than was the case on the ARM9TDMI or ARM1020T processors. Therefore, a significant
amount of hardware is dedicated to prediction of these changes. Two major classes of program
flow are addressed in the ARM1176JZF-S prediction scheme:
1. Branches, including BL and BLX immediate, where the target address is a fixed offset
from the program counter. The prediction amounts to an examination of the probability
that a branch passes its condition codes. These branches are handled in the Branch
Predictors.
2. Loads, Moves, and ALU operations writing to the PC, that can be identified as being likely
to be a return from a procedure call. Two identifiable cases are Loads to the PC from an
address derived from R13, the stack pointer, and Moves or ALU operations to the PC
derived from R14, the Link Register. In these cases, if the calling operation can also be
identified, the likely return address can be stored in a hardware implemented stack, termed a
Return Stack (RS). Typical calling operations are BL and BLX instructions. In addition
Moves or ALU operations to the Link Register from the PC are often preludes to a branch
that serves as a calling operation. The Link Register value derived is the value required for
the RS. This sequence was most common on ARMv4T, before the BLX <register>
instruction was introduced in ARMv5T.
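The return stack behavior described in item 2 can be sketched in software as follows. This is an illustrative model only: the depth `RS_DEPTH`, the type and function names, and the choice to overwrite the oldest entry when the stack is full are all assumptions, not details taken from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RS_DEPTH 3u  /* hypothetical depth, not taken from the text */

typedef struct {
    uint32_t entry[RS_DEPTH];
    unsigned top;    /* index of the next free slot, modulo RS_DEPTH */
    unsigned count;  /* number of valid entries, at most RS_DEPTH */
} return_stack;

static void rs_init(return_stack *rs)
{
    rs->top = 0;
    rs->count = 0;
}

/* Calling operation identified (for example BL or BLX):
 * record the likely return address. */
static void rs_push(return_stack *rs, uint32_t return_addr)
{
    assert(rs);
    rs->entry[rs->top] = return_addr;
    rs->top = (rs->top + 1u) % RS_DEPTH;
    if (rs->count < RS_DEPTH)
        rs->count++;  /* when full, the oldest entry is overwritten (assumption) */
}

/* Likely return identified (for example a load to the PC from an
 * R13-derived address): predict the target from the most recent entry.
 * Returns false when the stack holds no prediction. */
static bool rs_pop(return_stack *rs, uint32_t *predicted)
{
    assert(rs && predicted);
    if (rs->count == 0)
        return false;
    rs->top = (rs->top + RS_DEPTH - 1u) % RS_DEPTH;
    *predicted = rs->entry[rs->top];
    rs->count--;
    return true;
}
```

Nested calls pop in the reverse order of their pushes, which is why a stack matches procedure call and return behavior.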
Branch prediction is required in the design to reduce the integer core CPI loss that arises from
the longer pipeline. To improve the branch prediction accuracy, a combination of static and
dynamic techniques is employed. It is possible to disable each of the predictors separately.
5.2.1 Enabling program flow prediction
Program flow prediction is enabled by the Z bit, bit 11, of CP15 Register c1. This bit is set to
0 on Reset, so prediction is disabled until software sets it. See c1, Control Register on
page 3-44. The return stack, dynamic predictor, and static predictor can also be individually
controlled using the Auxiliary Control Register. See c1, Auxiliary Control Register on
page 3-48.
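As a minimal sketch of the bit manipulation involved, the helpers below set, clear, and test the Z bit in a copy of the c1 Control Register value. Only the bit position (bit 11) comes from the text. The macro and function names are hypothetical, and reading or writing the register itself, which on the target is done with privileged CP15 accesses, is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

#define CP15_C1_Z_BIT (UINT32_C(1) << 11)  /* Z bit, bit 11, per the text */

/* Return a copy of the c1 Control Register value with program flow
 * prediction enabled (enable != 0) or disabled (enable == 0).
 * The helper names are hypothetical; only the bit position is from the text. */
static uint32_t c1_set_flow_prediction(uint32_t c1, int enable)
{
    uint32_t out = enable ? (c1 | CP15_C1_Z_BIT) : (c1 & ~CP15_C1_Z_BIT);
    /* Sanity check: no bit other than Z may change. */
    assert(((out ^ c1) & ~CP15_C1_Z_BIT) == 0);
    return out;
}

/* Nonzero if the Z bit is set in the given register value. */
static int c1_flow_prediction_enabled(uint32_t c1)
{
    return (c1 & CP15_C1_Z_BIT) != 0;
}
```

Working on a copy of the value keeps the read-modify-write pattern explicit, so the other control bits in the register are preserved.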
5.2.2 Dynamic branch predictor
The first line of branch prediction in the processor is dynamic, through a simple Branch
Target Address Cache (BTAC). It is virtually addressed and holds virtual target addresses. In
addition, a two-bit value holds the prediction history of the branch. If the address mappings
change, this cache must be flushed. A dynamic branch predictor flush is included in the CP15
coprocessor control instructions. Direct flushes of the dynamic branch predictor from the main
TLB and the integer core are also included.
A BTAC records that a branch exists at a particular location in memory. The branch target
address and a prediction of whether or not the branch might be taken are also stored.
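The following sketch models a BTAC of the kind described: each entry records that a branch exists at a virtual address, its virtual target, and a two-bit history. Several details are assumptions made for illustration and are not taken from the text: the entry count `BTAC_ENTRIES`, the direct-mapped indexing, allocation only on taken branches, and the conventional saturating-counter encoding in which values 2 and 3 predict taken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BTAC_ENTRIES 128u  /* hypothetical size, not taken from the text */

typedef struct {
    bool     valid;
    uint32_t branch_va;  /* virtual address of the branch */
    uint32_t target_va;  /* virtual target address */
    uint8_t  history;    /* two-bit value, 0..3; 2 and 3 predict taken (assumed) */
} btac_entry;

typedef struct {
    btac_entry e[BTAC_ENTRIES];
} btac;

/* Hypothetical direct-mapped index derived from the word address. */
static unsigned btac_index(uint32_t va)
{
    return (unsigned)((va >> 2) % BTAC_ENTRIES);
}

/* Invalidate every entry, as required when address mappings change. */
static void btac_flush(btac *b)
{
    for (unsigned i = 0; i < BTAC_ENTRIES; i++)
        b->e[i].valid = false;
}

/* Returns true and writes *target when a branch is known at va and
 * its history predicts taken. */
static bool btac_predict(const btac *b, uint32_t va, uint32_t *target)
{
    const btac_entry *e = &b->e[btac_index(va)];
    if (!e->valid || e->branch_va != va || e->history < 2u)
        return false;
    *target = e->target_va;
    return true;
}

/* Record the resolved outcome of the branch at va. */
static void btac_update(btac *b, uint32_t va, uint32_t target, bool taken)
{
    btac_entry *e = &b->e[btac_index(va)];
    if (!e->valid || e->branch_va != va) {
        if (!taken)
            return;          /* allocate only on taken branches (assumption) */
        e->valid = true;
        e->branch_va = va;
        e->target_va = target;
        e->history = 2u;     /* start at the weakest "taken" state */
        return;
    }
    if (taken) {
        if (e->history < 3u)
            e->history++;
        e->target_va = target;
    } else if (e->history > 0u) {
        e->history--;
    }
    assert(e->history <= 3u);
}
```

Because the entries are tagged with virtual addresses, a change of address mapping can leave stale predictions behind, which is what `btac_flush` stands in for here.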