
• selecting IA-32 instructions that can be decoded into fewer than 4 µops and/or have short latencies (see the sketch after this list)

• ordering IA-32 instructions to preserve available parallelism by minimizing long dependence chains and covering long instruction latencies

• ordering instructions so that their operands are ready and their corresponding issue ports and execution units are free when they reach the scheduler.
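As a hedged illustration of the first point (an example constructed for this write-up, not one taken from the optimization manual), a shift by one whose flags result is not needed can be replaced by an add: both produce the same value, but the add is eligible for the double-speed ALUs shown in Figure 4 rather than the normal-speed shift unit on port 1.

    ; Illustrative instruction selection only.
    ; Instead of:
    shl   eax, 1        ; dispatches to the normal-speed shift/rotate unit on port 1
    ; prefer:
    add   eax, eax      ; same value; ADD/SUB µops can use a double-speed ALU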

This subsection describes port restrictions, result latencies, and issue latencies (also referred to as throughput) that
form the basis for that ordering. Scheduling affects the way that instructions are presented to the core of the
processor, but it is the execution core that reacts to an ever-changing machine state, reordering instructions for faster
execution or delaying them because of dependence and resource constraints. Thus the ordering of instructions is
more of a suggestion to the hardware.
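For example (a sketch under the assumption that imul has a multi-cycle result latency; the actual numbers are given in the optimization manual, not here), interleaving two independent dependence chains lets the result latency of one multiply be covered by work from the other chain:

    ; Two independent chains interleaved so that each multiply's result
    ; latency is covered by µops from the other chain.
    imul  eax, ecx        ; chain A
    imul  ebx, edx        ; chain B dispatches while A's result is still pending
    add   eax, 1          ; chain A resumes when its multiply completes
    add   ebx, 1          ; chain B resumes independently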

The Intel® Pentium® 4 Processor Optimization Reference Manual lists the IA-32 instructions with their latency, their issue throughput, and, in relevant cases, the associated execution units. Some execution units are not pipelined, such that µops cannot be dispatched in consecutive cycles and the throughput is less than one per cycle.
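As a hedged sketch of that distinction (the latency and throughput values of addss are deliberately left unspecified here), a dependent chain is paced by result latency, while independent operations on a pipelined unit are paced only by issue throughput:

    ; Dependent chain: each addss needs the previous result, so the
    ; sequence is paced by the result latency of the FP add unit.
    addss xmm0, xmm1
    addss xmm0, xmm1
    addss xmm0, xmm1

    ; Independent operations: on a pipelined unit these can be dispatched
    ; in consecutive cycles, so the sequence is paced by issue throughput.
    addss xmm2, xmm1
    addss xmm3, xmm1
    addss xmm4, xmm1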

The number of µops associated with each instruction provides a basis for selecting which instructions to generate. In particular, µops that are executed out of the microcode ROM involve extra overhead. For the Pentium II and Pentium III processors, optimizing the performance of the decoder, which included paying attention to the 4-1-1 sequence (an instruction with four µops followed by two instructions each with one µop) and taking into account the number of µops for each IA-32 instruction, was very important. On the Pentium 4 processor, the decoder template is not an issue, so it is no longer necessary to work from a detailed list of exact µop counts for IA-32 instructions. Commonly used IA-32 instructions that consist of four or fewer µops are listed in the Intel® Pentium® 4 Processor Optimization Reference Manual to aid instruction selection.
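As one commonly cited illustration (an assumption of this write-up, not a statement from this paper), a complex instruction such as enter is decoded from the microcode ROM, while an equivalent sequence of simple instructions decodes directly into a few µops:

    ; Instead of the microcoded frame-setup instruction:
    ;   enter 16, 0
    ; the same stack frame can be built from simple, few-µop instructions:
    push  ebp
    mov   ebp, esp
    sub   esp, 16        ; 16 bytes of local storage (illustrative size)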

Execution Units and Issue Ports

Each cycle, the core may dispatch µops to one or more of the four issue ports. At the micro-architectural level, store operations are further divided into two parts: store data and store address operations. The four ports through which µops are dispatched to various execution units and to perform load and store operations are shown in Figure 4. Some ports can dispatch two µops per clock because the execution unit for that µop executes at twice the speed, and those execution units are marked "Double speed."
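A small sketch of what this means for dispatch (the actual grouping of µops in any given cycle is decided by the hardware): two independent simple ALU µops can issue through the same double-speed port in a single cycle, one in each half of the cycle.

    ; Two independent ADD µops; a double-speed port can accept one
    ; in the first half of the cycle and one in the second half.
    add   eax, 1
    add   ebx, 1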

Port 0. In the first half of the cycle, port 0 can dispatch either one floating-point move µop (including floating-point stack move, floating-point exchange or floating-point store data) or one arithmetic logical unit (ALU) µop (including arithmetic, logic or store data). In the second half of the cycle, it can dispatch one similar ALU µop.

Port 1. In the first half of the cycle, port 1 can dispatch either one floating-point execution µop (all floating-point operations except moves, and all SIMD operations), one normal-speed integer µop (multiply, shift and rotate), or one ALU µop (arithmetic, logic or branch). In the second half of the cycle, it can dispatch one similar ALU µop.

Port 2. Port 2 supports the dispatch of one load operation per cycle.

Port 3. Port 3 supports the dispatch of one store address operation per cycle (the store data half of a store is dispatched through port 0, as noted above).
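Putting the ports together (a rough illustration; how the hardware actually groups µops in any given cycle is not guaranteed), the following independent operations map naturally onto different issue ports:

    mov   eax, [esi]      ; load: dispatched through port 2
    add   ebx, ecx        ; simple ALU µop: double-speed ALU on port 0 or port 1
    ror   edx, 3          ; rotate: normal-speed integer unit on port 1
    mov   [edi], ebp      ; store: store address µop on port 3, store data µop on port 0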

Note:

FP_ADD refers to x87 FP, and SIMD FP add and subtract operations
FP_MUL refers to x87 FP, and SIMD FP multiply operations
FP_DIV refers to x87 FP, and SIMD FP divide and square-root operations
MMX_ALU refers to SIMD integer arithmetic and logic operations
MMX_SHFT handles Shift, Rotate, Shuffle, Pack and Unpack operations
MMX_MISC handles SIMD reciprocal and some integer operations

Figure 4. Execution Units and Ports of the Out-of-order Core

Port 0:  ALU 0 (double speed): ADD/SUB, Logic, Store Data, Branches
         FP Move: FP Move, FP Store Data, FXCH
Port 1:  ALU 1 (double speed): ADD/SUB
         Integer Operation (normal speed): Shift/Rotate
         FP Execute: FP_ADD, FP_MUL, FP_DIV, FP_MISC, MMX_SHFT, MMX_ALU, MMX_MISC
Port 2:  Memory Load: All Loads, LEA, Prefetch
Port 3:  Memory Store: Store Address