Code Optimization
4-3
18524C/0—Nov1996
AMD-K5 Processor Technical Reference Manual
■
Loops—Unroll loops to expose more parallelism and to reduce
loop overhead, which remains a cost even with branch
prediction. Inline small routines to avoid procedure-call
overhead. In both cases, however, weigh the benefit against
the possible increase in register usage, which might add
load/store instructions for register spilling.
■
Indexed Addressing—There is no penalty for base + index
addressing in the AMD-K5 processor. However, future
implementations may have such a penalty to achieve a
higher overall clock rate.
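
The loop-unrolling advice above can be sketched in C. This is an illustrative example, not code from the manual: the function names sum_simple and sum_unrolled4 and the unroll factor of four are assumptions chosen for the sketch, and whether a given compiler keeps all four accumulators in registers is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline: one add per iteration, each add depending on the previous
 * sum, plus one compare-and-branch per element. */
uint32_t sum_simple(const uint32_t *a, size_t n)
{
    assert(a != NULL || n == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];            /* a[i] is a base + index access */
    return sum;
}

/* Unrolled by four (hypothetical factor) with separate accumulators:
 * the four adds in the loop body are independent of each other, and
 * the loop branch executes once per four elements. */
uint32_t sum_unrolled4(const uint32_t *a, size_t n)
{
    assert(a != NULL || n == 0);
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; n - i >= 4; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    uint32_t sum = s0 + s1 + s2 + s3;
    for (; i < n; i++)          /* remainder: 0 to 3 leftover elements */
        sum += a[i];
    return sum;
}
```

The unrolled version needs four accumulators in addition to the pointer and the index. With only eight general-purpose registers on x86, a larger unroll factor could force the spilling that the text warns about, so the factor is worth measuring rather than assuming.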
4.1.2 Techniques Specific to the AMD-K5 Processor
■
Jumps and Loops—JCXZ requires 1 cycle when correctly
predicted and is therefore faster than a TEST/JZ pair, in
contrast to the Pentium processor, on which JCXZ requires 5
or 6 cycles. All forms of LOOP take 2 cycles when correctly
predicted, which is also faster than the Pentium processor's
7 or 8 cycles.
■
Multiplies—Independent IMULs can be pipelined at one
per cycle with 4-cycle latency, in contrast to the Pentium
processor's serialized 9-cycle time. (MUL has the same
latency, although the implicit AX usage of MUL prevents
independent, parallel MUL operations.)
■
Dispatch Conflicts—Load-balancing (that is, selecting
instructions for parallel decode) is still important, but to a
lesser extent than on the Pentium processor. In particular,
arrange instructions to avoid execution-unit dispatching
conflicts. (See Section 4.2 on page 4-5.)
■
Instruction Prefixes—There is no penalty for instruction
prefixes, including combinations such as segment-size and
operand-size prefixes. This is particularly important for
16-bit code. However, future implementations may have
penalties for the use of these prefixes.
■
Byte Operations—For byte operations, the high and low
bytes of AX, BX, CX, and DX are effectively independent
registers that can be operated on in parallel. For example,
reading AL does not have a dependency on an outstanding
write to AH.
■
Move and Convert—MOVZX, MOVSX, CBW, CWDE, CWD, and CDQ
each take 1 cycle (2 cycles for memory-based input), in
contrast to the Pentium processor's 2 or 3 cycles.
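
The point about independent multiplies can be illustrated with a small C sketch. It is not taken from the manual: the function names product_chain and product_two_chains are hypothetical, and it assumes the compiler translates each 32-bit multiply into an IMUL, which is common but not guaranteed. With a 4-cycle latency and one multiply issued per cycle, a single dependent chain leaves most issue slots idle, while two or more independent chains can overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Single chain: every multiply needs the previous product, so each
 * one waits out the full multiply latency. */
uint32_t product_chain(const uint32_t *a, size_t n)
{
    assert(a != NULL || n == 0);
    uint32_t p = 1;
    for (size_t i = 0; i < n; i++)
        p *= a[i];
    return p;
}

/* Two chains: p0 and p1 do not depend on each other, so their
 * multiplies can be in flight at the same time. Unsigned arithmetic
 * wraps modulo 2^32, so regrouping the factors gives the same result. */
uint32_t product_two_chains(const uint32_t *a, size_t n)
{
    assert(a != NULL || n == 0);
    uint32_t p0 = 1, p1 = 1;
    size_t i = 0;
    for (; n - i >= 2; i += 2) {
        p0 *= a[i];
        p1 *= a[i + 1];
    }
    if (i < n)                  /* odd element count: one factor left */
        p0 *= a[i];
    return p0 * p1;
}
```

Two chains are used here only to keep the example short. Given the 4-cycle latency stated above, up to four independent chains could in principle be overlapped, subject to the register-pressure caveat from the general guidelines.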