4-4
Performance
AMD-K5 Processor Technical Reference Manual
18524C/0—Nov1996
■
Bit Scan—BSF and BSR take 1 cycle (2 cycles for memory-
based input), in contrast to the Pentium processor's data-
dependent 6 to 34 cycles.
■
Bit Test—BT, BTS, BTR, and BTC take 1 cycle for register-
based operands, and 2 or 3 cycles for memory-based oper-
ands with immediate bit-offset, in contrast to the Pentium
processor's 4 to 9 cycles. Register-based bit-offset forms on
the AMD-K5 processor take 5 cycles. If the semantics of the
register-based bit-offset form are desired (where the bit off-
set can cover a very large bit string in memory), it is better
to emulate this with simpler instructions that can be inter-
leaved with independent instructions for greater parallel-
ism.
■
Floating-Point Top-of-Stack Bottleneck—The AMD-K5 proces-
sor has a pipelined floating-point unit. Greater parallelism
can be achieved by using FXCH in parallel with floating-
point operations to alleviate the top-of-stack bottleneck, as
in the Pentium processor. The AMD-K5 processor also per-
mits integer operations (ALU, branch, load/store) in paral-
lel with floating-point operations.
■
Locating Branch Targets—Performance can be sensitive to
code alignment, especially in tight loops. Locating branch
targets to the first 17 bytes of the 32-byte cache line maxi-
mizes the opportunity for parallel execution at the target.
NOPs can be added to adjust this alignment. The AMD-K5
processor executes NOPs (opcode 90h) at the rate of two per
cycle. Adding NOPs is even more effective if they execute
in parallel with existing code. Other instructions of greater
length, such as a register-immediate TEST instruction, can
be used as NOPs to minimize the overhead of such padding.
■
Branch Prediction—There are two branch prediction bits in
a 32-byte instruction cache line. One bit applies to the first
16 bytes of the line and the second bit applies to the second
16 bytes of the line. For effective branch prediction, code
should be generated with one branch per 16-byte line half.
The prediction is associated with the half-line containing
the last byte of the branch instruction.
■
Address-Generation Interlocks (AGIs)—The AMD-K5 proces-
sor does not suffer from the single-cycle penalty that the
486 and Pentium processors have when a result from execu-
tion or from a data-cache access is used to form a cache
address, so it is not necessary to avoid these situations.
Summary of Contents for AMD-K5
Page 1: ...AMD K5 Processor Technical Reference Manual TM...
Page 10: ...x AMD K5 Processor Technical Reference Manual 18524C 0 Nov1996...
Page 24: ...1 4 Overview AMD K5 Processor Technical Reference Manual 18524C 0 Nov1996...
Page 54: ...2 30 Internal Architecture AMD K5 Processor Technical Reference Manual 18524C 0 Nov1996...
Page 116: ...4 26 Performance AMD K5 Processor Technical Reference Manual 18524C 0 Nov1996...
Page 356: ...6 44 System Design AMD K5 Processor Technical Reference Manual 18524C 0 Nov1996...
Page 380: ...7 24 Test and Debug AMD K5 Processor Technical Reference Manual 18524C 0 Nov1996...
Page 396: ...A 16 AMD K5 Processor Technical Reference Manual 18524C 0 Nov1996...
Page 406: ...I 10 Index AMD K5 Processor Technical Reference Manual 18524C 0 Nov1996...