MicroBlaze Processor Reference Guide
UG081 (v14.7)
Chapter 2: MicroBlaze Architecture
Delay Slots
When executing a taken branch with delay slot, only the fetch pipeline stage in MicroBlaze is
flushed. The instruction in the decode stage (branch delay slot) is allowed to complete. This
technique effectively reduces the branch penalty from two clock cycles to one. Branch instructions
with delay slots have a D appended to the instruction mnemonic. For example, the BNE instruction
does not execute the subsequent instruction (does not have a delay slot), whereas BNED executes
the next instruction before control is transferred to the branch location.
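The difference between BNE and BNED can be illustrated with a toy trace. The following is a minimal sketch, not a model of the real pipeline: the function `trace`, the tuple layout, and the opcodes `"br"` and `"brd"` are hypothetical names introduced here, and every branch is assumed to be taken.

```python
def trace(program):
    """Return the labels of the instructions that complete, in order.

    Each entry is (label, op, target). op is "op" for an ordinary
    instruction, "br" for a taken branch without a delay slot, and "brd"
    for a taken branch with a delay slot. target is an index into program.
    """
    executed = []
    pc = 0
    while pc < len(program):
        label, op, target = program[pc]
        executed.append(label)
        if op == "br":
            # No delay slot: the following instruction is skipped.
            pc = target
        elif op == "brd":
            # Delay slot: the following instruction completes first.
            executed.append(program[pc + 1][0])
            pc = target
        else:
            pc += 1
    return executed


without_slot = [("bne", "br", 2), ("addik", "op", None), ("target", "op", None)]
with_slot = [("bned", "brd", 2), ("addik", "op", None), ("target", "op", None)]

print(trace(without_slot))  # ['bne', 'target']
print(trace(with_slot))     # ['bned', 'addik', 'target']
```

In the second trace the instruction after the branch completes before control reaches the target, which is the behavior described for BNED above.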
A delay slot must not contain the following instructions: IMM, branch, or break. Interrupts and
external hardware breaks are deferred until after the delay slot branch has been completed.
Instructions that could cause recoverable exceptions (e.g. unaligned word or halfword load and
store) are allowed in the delay slot. If an exception occurs in a delay slot, the ESR[DS] bit is set,
and the exception handler is responsible for returning execution to the branch target (stored in
the special purpose register BTR). If the ESR[DS] bit is set, register R17 is not valid (otherwise it
contains the address following the instruction causing the exception).
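The handler's choice of resume address can be sketched as follows. This is only an illustration of the rule above: the function name is hypothetical, and the mask value assumes that ESR[DS] is bit 19 in MicroBlaze's MSB-first bit numbering, which should be checked against the ESR register description.

```python
# Assumption: ESR[DS] is bit 19 with bit 0 as the most significant bit,
# which corresponds to 1 << (31 - 19). Verify before relying on it.
ESR_DS_MASK = 1 << 12


def exception_resume_address(esr, r17, btr):
    """Pick the address an exception handler should return to.

    If the exception was raised in a delay slot, R17 is not valid and the
    branch target held in BTR is used instead.
    """
    if esr & ESR_DS_MASK:
        return btr
    return r17


print(hex(exception_resume_address(ESR_DS_MASK, 0x100, 0x200)))  # 0x200
print(hex(exception_resume_address(0, 0x100, 0x200)))            # 0x100
```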
Branch Target Cache
To improve branch performance, MicroBlaze provides a Branch Target Cache (BTC) coupled with
a branch prediction scheme. With the BTC enabled, a correctly predicted immediate branch or
return instruction incurs no overhead.
The BTC operates by saving the target address of each immediate branch and return instruction the
first time the instruction is encountered. The next time the instruction is encountered, it is
usually found in the Branch Target Cache, and, if the branch is to be taken, the Instruction Fetch
Program Counter is simply changed to the saved target address. Unconditional branches and return
instructions are always taken, whereas conditional branches use branch prediction to avoid taking a
branch that should not be taken, and vice versa.
The BTC is cleared when a memory barrier (MBAR 0) or synchronizing branch (BRI 4) is executed.
This also occurs when the memory barrier or synchronizing branch follows immediately after a
branch instruction, even if that branch is taken. To avoid inadvertently clearing the BTC, the
memory barrier or synchronizing branch should not be placed immediately after a branch
instruction.
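One way to catch this placement in an assembly listing is a small check over consecutive instructions. The sketch below is an assumption-laden illustration: the function names are hypothetical, the branch mnemonic set is generated for illustration and is not claimed to be exhaustive (return instructions are not included), and each list entry is assumed to be one non-empty instruction without labels or comments.

```python
# Hypothetical, non-exhaustive mnemonic sets for illustration only.
CONDITIONAL = {f"b{c}{s}" for c in ("eq", "ne", "lt", "le", "gt", "ge")
               for s in ("", "d", "i", "id")}
UNCONDITIONAL = {f"br{a}{s}" for a in ("", "a")
                 for s in ("", "d", "i", "id", "ld", "lid")}
BRANCHES = CONDITIONAL | UNCONDITIONAL


def split_instruction(line):
    """Split 'bnei r3, 8' into ('bnei', ('r3', '8'))."""
    parts = line.replace(",", " ").split()
    return parts[0].lower(), tuple(parts[1:])


def clears_btc(mnemonic, operands):
    """True for the two BTC-clearing instructions named in the text."""
    return (mnemonic, operands) in {("mbar", ("0",)), ("bri", ("4",))}


def sync_right_after_branch(lines):
    """Return indices of MBAR 0 / BRI 4 lines that directly follow a branch."""
    flagged = []
    previous_was_branch = False
    for index, line in enumerate(lines):
        mnemonic, operands = split_instruction(line)
        if previous_was_branch and clears_btc(mnemonic, operands):
            flagged.append(index)
        previous_was_branch = mnemonic in BRANCHES
    return flagged


listing = ["bnei r3, 8", "mbar 0", "addk r4, r4, r5", "bri 4"]
print(sync_right_after_branch(listing))  # [1]
```

Only the `mbar 0` at index 1 is flagged, because it directly follows a branch; the `bri 4` at index 3 follows an ordinary instruction.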
There are three cases in which branch prediction can cause a mispredict:
• A conditional branch that should not have been taken is actually taken.
• A conditional branch that should have been taken is not taken.
• The target address of a return instruction is incorrect, which may occur when returning from a
function called from different places in the code.
All of these cases are detected and corrected when the branch or return instruction reaches the
execute stage, and the branch prediction bits or target address are updated in the BTC, to reflect the
actual instruction behavior. This correction incurs a penalty of two clock cycles.
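The three mispredict cases and the two-cycle correction can be reproduced with a toy model. This is a sketch under stated assumptions, not the hardware algorithm: the class and method names are hypothetical, the cache is unbounded, and the prediction scheme is assumed to be "predict the last outcome", since the text does not specify the actual scheme.

```python
class BranchTargetCache:
    """Toy BTC: maps an instruction address to (saved target, predict taken)."""

    MISPREDICT_PENALTY = 2  # clock cycles, as stated in the text

    def __init__(self):
        self.entries = {}

    def clear(self):
        """Model the effect of MBAR 0 or BRI 4 on the BTC."""
        self.entries.clear()

    def execute(self, pc, actual_target, taken=True):
        """Record one execution of the branch or return at pc.

        Returns None if pc was not yet in the BTC, otherwise the
        misprediction penalty in clock cycles (0 or 2).
        """
        entry = self.entries.get(pc)
        # Update the entry to reflect the actual behavior.
        self.entries[pc] = (actual_target, taken)
        if entry is None:
            return None
        saved_target, predicted_taken = entry
        if predicted_taken != taken:
            return self.MISPREDICT_PENALTY  # wrong direction
        if taken and saved_target != actual_target:
            return self.MISPREDICT_PENALTY  # wrong return target
        return 0


btc = BranchTargetCache()
# A return at 0x200, reached from two different call sites.
print(btc.execute(0x200, 0x104))  # None (first encounter)
print(btc.execute(0x200, 0x104))  # 0
print(btc.execute(0x200, 0x304))  # 2
```

The last call shows the third case from the list above: the saved return target belongs to the previous caller, so the prediction is corrected at a cost of two cycles.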
The size of the BTC can be selected with C_BRANCH_TARGET_CACHE_SIZE. The default
recommended setting uses one block RAM, and provides either 512 entries (for Virtex-5, Virtex-6,
and 7 Series) or 256 entries (for all other families). When selecting 64 entries or below, distributed
RAM is used to implement the BTC; otherwise block RAM is used.
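The sizing rules above can be restated as two small helpers. This is a sketch only: the function names and the family keys are hypothetical, and the functions take an entry count directly rather than the encoded value of C_BRANCH_TARGET_CACHE_SIZE, whose encoding is not given in this section.

```python
# Hypothetical family keys for the devices with the larger default.
LARGE_DEFAULT_FAMILIES = {"virtex5", "virtex6", "7series"}


def default_btc_entries(family):
    """Entries provided by the default setting (one block RAM)."""
    return 512 if family in LARGE_DEFAULT_FAMILIES else 256


def btc_ram_type(entries):
    """RAM resource used to implement a BTC with the given entry count."""
    return "distributed" if entries <= 64 else "block"


print(default_btc_entries("virtex6"))   # 512
print(default_btc_entries("spartan6"))  # 256
print(btc_ram_type(64))                 # distributed
print(btc_ram_type(512))                # block
```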
When the BTC uses block RAM, and C_FAULT_TOLERANT is set to 1, block RAMs are protected
by parity. In case of a parity error, the branch is not predicted. To avoid accumulating errors in this
case, the BTC should be cleared periodically by a synchronizing branch.
The Branch Target Cache is available when C_USE_BRANCH_TARGET_CACHE is set to 1 and
C_AREA_OPTIMIZED is set to 0.
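The availability and parity conditions can be checked against a parameter set as sketched below. The helper names and the dictionary representation of the configuration are assumptions made for illustration; only the parameter names and the conditions themselves come from the text.

```python
def btc_available(params):
    """True when the configuration enables the Branch Target Cache."""
    return (params["C_USE_BRANCH_TARGET_CACHE"] == 1
            and params["C_AREA_OPTIMIZED"] == 0)


def btc_parity_protected(params, uses_block_ram):
    """True when the BTC is present, in block RAM, and fault tolerance is on."""
    return (btc_available(params)
            and uses_block_ram
            and params["C_FAULT_TOLERANT"] == 1)


config = {
    "C_USE_BRANCH_TARGET_CACHE": 1,
    "C_AREA_OPTIMIZED": 0,
    "C_FAULT_TOLERANT": 1,
}
print(btc_available(config))                               # True
print(btc_parity_protected(config, uses_block_ram=True))   # True
print(btc_parity_protected(config, uses_block_ram=False))  # False
```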