A Detailed Look Inside the Intel® NetBurst Micro-Architecture of the Intel Pentium® 4 Processor


The Static Predictor. Once the branch instruction is decoded, the direction of the branch (forward or backward) is
known. If there was no valid entry in the BTB for the branch, the static predictor makes a prediction based on the
direction of the branch. The static prediction mechanism predicts backward conditional branches (those with
negative displacement), such as loop-closing branches, as taken. Forward branches are predicted not taken.
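The static rule can be written down in a few lines. The following Python sketch is only an illustration of the rule as described in this section; the function name and the representation of the displacement as a signed integer are assumptions made for the example, not part of the processor documentation.

```python
def static_predict_taken(displacement: int) -> bool:
    """Return True if the static rule predicts the branch as taken.

    displacement is the signed offset from the branch to its target;
    a negative value means the target lies earlier in the code
    (a backward branch, such as a loop-closing branch).
    """
    return displacement < 0


# A loop-closing branch jumps backward, so it is predicted taken.
loop_branch = static_predict_taken(-24)
# A forward branch (for example, one that skips an error path) is predicted not taken.
skip_branch = static_predict_taken(16)
```

The rule uses only the sign of the displacement, which is why it can be applied as soon as the branch is decoded and without any history.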

Branch Target Buffer. Once branch history is available, the Pentium 4 processor can predict the branch outcome
before the branch instruction is even decoded, based on a history of previously-encountered branches. It uses a
branch history table and a branch target buffer (collectively called the BTB) to predict the direction and target of
branches based on an instruction’s linear address. Once the branch is retired, the BTB is updated with the target
address.
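As a rough illustration of lookup by linear address and update at retirement, the toy model below keeps one entry per branch. The class name, the method names, and the choice to remember only the most recent outcome are all assumptions made for this sketch; the actual BTB organization, history format, and capacity are not described here.

```python
from typing import Dict, Optional, Tuple


class ToyBTB:
    """Toy model: maps a branch's linear address to (taken, target)."""

    def __init__(self) -> None:
        self._entries: Dict[int, Tuple[bool, int]] = {}

    def predict(self, branch_addr: int) -> Optional[Tuple[bool, int]]:
        # None means "no valid entry"; the static predictor would be used instead.
        return self._entries.get(branch_addr)

    def retire(self, branch_addr: int, taken: bool, target: int) -> None:
        # Assumption: remember only the most recent outcome. Real hardware
        # keeps richer history and has a limited number of entries.
        self._entries[branch_addr] = (taken, target)


btb = ToyBTB()
first = btb.predict(0x401000)                      # no history yet
btb.retire(0x401000, taken=True, target=0x400FE0)  # branch retires
second = btb.predict(0x401000)                     # predicted from history
```

Because the lookup key is the linear address of the branch, the prediction does not depend on the instruction having been decoded.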

Return Stack. Returns are always taken, but since a procedure may be invoked from several call sites, a single
predicted target will not suffice. The Pentium 4 processor has a Return Stack that can predict return addresses for a
series of procedure calls. This increases the benefit of unrolling loops containing function calls. It also mitigates the
need to put certain procedures inline since the return penalty portion of the procedure call overhead is reduced.
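The reason a stack works where a single predicted target does not can be seen in a small model. The sketch below is hypothetical: the class name, the method names, and the default depth are chosen for illustration only and do not describe the hardware structure.

```python
from collections import deque
from typing import Deque, Optional


class ToyReturnStack:
    """Toy model of a return-address predictor."""

    def __init__(self, depth: int = 16) -> None:
        # depth is a hypothetical value; with maxlen set, the oldest
        # entry is dropped when the stack overflows.
        self._stack: Deque[int] = deque(maxlen=depth)

    def on_call(self, return_addr: int) -> None:
        # A call pushes the address of the instruction that follows it.
        self._stack.append(return_addr)

    def on_return(self) -> Optional[int]:
        # A return pops the most recent entry as its predicted target.
        return self._stack.pop() if self._stack else None


rs = ToyReturnStack()

# The same procedure is called from two different call sites.
rs.on_call(0x1005)
from_site_a = rs.on_return()
rs.on_call(0x2005)
from_site_b = rs.on_return()

# Nested calls unwind in reverse order.
rs.on_call(0x1005)
rs.on_call(0x3005)
inner = rs.on_return()
outer = rs.on_return()
```

In this model each return is predicted from its own call site, so calling one procedure from many places does not cause the predictions to interfere with each other.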

Even if the direction and target address of the branch are correctly predicted well in advance, a taken branch may
reduce available parallelism in a typical processor, since the decode bandwidth is wasted for instructions which
immediately follow the branch and precede the target, if the branch does not end the line and the target does not
begin the line. The branch predictor allows a branch and its target to coexist in a single trace cache line, maximizing
instruction delivery from the front end.

Branch Hints

The Pentium 4 processor provides a feature that permits software to provide hints to the branch prediction and trace
formation hardware to enhance their performance. These hints take the form of prefixes to conditional branch
instructions. These prefixes have no effect for pre-Pentium 4 processor implementations. Branch hints are not
guaranteed to have any effect, and their function may vary across implementations. However, since branch hints are
architecturally visible, and the same code could be run on multiple implementations, they should be inserted only in
cases which are likely to be helpful across all implementations.

Branch hints are interpreted by the translation engine, and are used to assist branch prediction and trace construction
hardware. They are only used at trace build time, and have no effect within already-built traces. Directional hints
override the static (forward-not taken, backward-taken) prediction in the event that a BTB prediction is not
available. Because branch hints increase code size slightly, the preferred approach to providing directional hints is
by the arrangement of code so that

(i) the more probable outcome of a forward branch is the not-taken (fall-through) path, and

(ii) the more probable outcome of a backward branch is the taken path.

Since the branch prediction information that is available when the trace is built is used to predict which path or
trace through the code will be taken, directional branch hints can help traces be built along the most likely path.
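The order of precedence described in this section (BTB prediction first, then a directional hint, then the static rule) can be sketched as follows. The function and parameter names are hypothetical, and the model ignores the facts that hints are consulted only at trace build time and are not guaranteed to have any effect on a given implementation.

```python
from typing import Optional


def predict_taken(displacement: int,
                  btb_prediction: Optional[bool] = None,
                  hint: Optional[bool] = None) -> bool:
    """Pick a direction for a conditional branch.

    btb_prediction: True/False if the BTB has a valid entry, else None.
    hint: True (taken) or False (not taken) if the branch carries a
          directional hint prefix, else None.
    """
    if btb_prediction is not None:
        return btb_prediction   # history, when available, is used first
    if hint is not None:
        return hint             # a hint overrides the static rule
    return displacement < 0     # static: backward taken, forward not taken


# Forward branch, no history, no hint: static rule says not taken.
plain = predict_taken(32)
# Same branch with a "taken" hint: the hint overrides the static rule.
hinted = predict_taken(32, hint=True)
# Once the BTB has an entry, the hint no longer decides the outcome.
with_history = predict_taken(32, btb_prediction=False, hint=True)
```

The first case also shows why code arrangement is preferred: if the likely path of a forward branch is already the fall-through path, the static rule gives the right answer and no prefix byte is needed.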

Execution Core Detail

The execution core is designed to optimize overall performance by handling the most common cases most
efficiently. The hardware is designed to execute the most frequent operations in the most common contexts as fast as
possible, at the expense of less-frequent operations in rare contexts. Some parts of the core may speculate that a
common condition holds to allow faster execution. If it does not, the machine may stall. An example of this pertains
to store forwarding. If a load is predicted to be dependent on a store, it gets its data from that store and tentatively
proceeds. If the load turns out not to depend on the store, the load is delayed until the real data has been loaded
from memory, and then it proceeds.
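The store-forwarding example can be pictured with a small model. The sketch below covers only the case described above, in which the load has already been predicted to depend on an earlier store; the function name, the dictionary used as memory, and the returned "delayed" flag are assumptions made for illustration.

```python
from typing import Dict, Tuple


def forward_or_replay(load_addr: int,
                      store_addr: int,
                      store_data: int,
                      memory: Dict[int, int]) -> Tuple[int, bool]:
    """Model a load that was predicted to depend on an earlier store.

    Returns (value, delayed). delayed is True when the speculation was
    wrong and the load had to wait for the real data from memory.
    """
    if load_addr == store_addr:
        # Speculation held: the forwarded store data is the load's result.
        return store_data, False
    # Speculation failed: discard the forwarded data and read memory.
    return memory[load_addr], True


memory = {0x10: 7, 0x20: 9}

# The load reads the address just written: forwarding succeeds.
hit = forward_or_replay(0x10, 0x10, 42, memory)
# The load reads a different address: it is delayed and reads memory.
miss = forward_or_replay(0x20, 0x10, 42, memory)
```

The common case (the first call) completes without waiting for memory, while the rare case pays the extra delay, which matches the design trade-off stated at the start of this section.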

Instruction Latency and Throughput

The superscalar, out-of-order core contains multiple execution hardware resources that can execute multiple µops in
parallel. The core’s ability to make use of available parallelism can be enhanced by:
