
• selecting IA-32 instructions that can be decoded into fewer than 4 µops and/or have short latencies (see the sketch after this list)

• ordering IA-32 instructions to preserve available parallelism by minimizing long dependence chains and covering long instruction latencies

• ordering instructions so that their operands are ready and their corresponding issue ports and execution units are free when they reach the scheduler.
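As a hedged illustration of the first point (an example constructed for this write-up, not one taken from the optimization manual), a shift by one whose flags result is not needed can be replaced by an add: both produce the same value, but the add is eligible for the double-speed ALUs shown in Figure 4 rather than the normal-speed shift unit on port 1.

    ; Illustrative instruction selection only.
    ; Instead of:
    shl   eax, 1        ; dispatches to the normal-speed shift/rotate unit on port 1
    ; prefer:
    add   eax, eax      ; same value; ADD/SUB µops can use a double-speed ALU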

This subsection describes port restrictions, result latencies, and issue latencies (also referred to as throughput) that
form the basis for that ordering. Scheduling affects the way that instructions are presented to the core of the
processor, but it is the execution core that reacts to an ever-changing machine state, reordering instructions for faster
execution or delaying them because of dependence and resource constraints. Thus the ordering of instructions is
more of a suggestion to the hardware.
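For example (a sketch under the assumption that imul has a multi-cycle result latency; the actual numbers are given in the optimization manual, not here), interleaving two independent dependence chains lets the result latency of one multiply be covered by work from the other chain:

    ; Two independent chains interleaved so that each multiply's result
    ; latency is covered by µops from the other chain.
    imul  eax, ecx        ; chain A
    imul  ebx, edx        ; chain B dispatches while A's result is still pending
    add   eax, 1          ; chain A resumes when its multiply completes
    add   ebx, 1          ; chain B resumes independently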

The Intel® Pentium® 4 Processor Optimization Reference Manual lists the IA-32 instructions with their latency, their issue throughput, and, in relevant cases, the associated execution units. Some execution units are not pipelined, such that µops cannot be dispatched in consecutive cycles and the throughput is less than one per cycle.
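As a hedged sketch of that distinction (the latency and throughput values of addss are deliberately left unspecified here), a dependent chain is paced by result latency, while independent operations on a pipelined unit are paced only by issue throughput:

    ; Dependent chain: each addss needs the previous result, so the
    ; sequence is paced by the result latency of the FP add unit.
    addss xmm0, xmm1
    addss xmm0, xmm1
    addss xmm0, xmm1

    ; Independent operations: on a pipelined unit these can be dispatched
    ; in consecutive cycles, so the sequence is paced by issue throughput.
    addss xmm2, xmm1
    addss xmm3, xmm1
    addss xmm4, xmm1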

The number of µops associated with each instruction provides a basis for selecting which instructions to generate. In particular, µops that are executed out of the microcode ROM involve extra overhead. For the Pentium II and Pentium III processors, optimizing the performance of the decoder, which included paying attention to the 4-1-1 sequence (an instruction with four µops followed by two instructions each with one µop) and taking into account the number of µops for each IA-32 instruction, was very important. On the Pentium 4 processor, the decoder template is not an issue, so it is no longer necessary to work from a detailed list of exact µop counts for IA-32 instructions. Commonly used IA-32 instructions that consist of four or fewer µops are listed in the Intel® Pentium® 4 Processor Optimization Reference Manual to aid instruction selection.
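As one commonly cited illustration (an assumption of this write-up, not a statement from this paper), a complex instruction such as enter is decoded from the microcode ROM, while an equivalent sequence of simple instructions decodes directly into a few µops:

    ; Instead of the microcoded frame-setup instruction:
    ;   enter 16, 0
    ; the same stack frame can be built from simple, few-µop instructions:
    push  ebp
    mov   ebp, esp
    sub   esp, 16        ; 16 bytes of local storage (illustrative size)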

Execution Units and Issue Ports

Each cycle, the core may dispatch µops to one or more of the four issue ports. At the micro-architectural level, store operations are further divided into two parts: store data and store address operations. The four ports through which µops are dispatched to various execution units and to perform load and store operations are shown in Figure 4. Some ports can dispatch two µops per clock because the execution unit for that µop executes at twice the speed, and those execution units are marked "Double speed."
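A small sketch of what this means for dispatch (the actual grouping of µops in any given cycle is decided by the hardware): two independent simple ALU µops can issue through the same double-speed port in a single cycle, one in each half of the cycle.

    ; Two independent ADD µops; a double-speed port can accept one
    ; in the first half of the cycle and one in the second half.
    add   eax, 1
    add   ebx, 1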

Port 0. In the first half of the cycle, port 0 can dispatch either one floating-point move µop (including floating-point stack move, floating-point exchange or floating-point store data) or one arithmetic logical unit (ALU) µop (including arithmetic, logic or store data). In the second half of the cycle, it can dispatch one similar ALU µop.

Port 1. In the first half of the cycle, port 1 can dispatch either one floating-point execution µop (all floating-point operations except moves, and all SIMD operations), one normal-speed integer µop (multiply, shift and rotate), or one ALU µop (arithmetic, logic or branch). In the second half of the cycle, it can dispatch one similar ALU µop.

Port 2. Port 2 supports the dispatch of one load operation per cycle.

Port 3. Port 3 supports the dispatch of one store address operation per cycle (the store data half of a store is dispatched through port 0, as noted above).
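Putting the ports together (a rough illustration; how the hardware actually groups µops in any given cycle is not guaranteed), the following independent operations map naturally onto different issue ports:

    mov   eax, [esi]      ; load: dispatched through port 2
    add   ebx, ecx        ; simple ALU µop: double-speed ALU on port 0 or port 1
    ror   edx, 3          ; rotate: normal-speed integer unit on port 1
    mov   [edi], ebp      ; store: store address µop on port 3, store data µop on port 0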

Note:

FP_ADD refers to x87 FP, and SIMD FP add and subtract operations
FP_MUL refers to x87 FP, and SIMD FP multiply operations
FP_DIV refers to x87 FP, and SIMD FP divide and square-root operations
MMX_ALU refers to SIMD integer arithmetic and logic operations
MMX_SHFT handles Shift, Rotate, Shuffle, Pack and Unpack operations
MMX_MISC handles SIMD reciprocal and some integer operations

Figure 4. Execution Units and Ports of the Out-of-order Core

Port 0:  ALU 0 (double speed): ADD/SUB, Logic, Store Data, Branches
         FP Move: FP Move, FP Store Data, FXCH
Port 1:  ALU 1 (double speed): ADD/SUB
         Integer Operation (normal speed): Shift/Rotate
         FP Execute: FP_ADD, FP_MUL, FP_DIV, FP_MISC, MMX_SHFT, MMX_ALU, MMX_MISC
Port 2:  Memory Load: All Loads, LEA, Prefetch
Port 3:  Memory Store: Store Address