Unrolling Loops
22007E/0—November 1999
AMD Athlon™ Processor x86 Code Optimization
Without Loop Unrolling:
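    MOV  ECX, MAX_LENGTH      ; loop counter = number of elements
    MOV  EAX, OFFSET A        ; EAX -> first element of A
    MOV  EBX, OFFSET B        ; EBX -> first element of B
$add_loop:
    FLD  QWORD PTR [EAX]      ; load A[i]
    FADD QWORD PTR [EBX]      ; add B[i]
    FSTP QWORD PTR [EAX]      ; store the sum back to A[i]
    ADD  EAX, 8               ; advance to the next double in A
    ADD  EBX, 8               ; advance to the next double in B
    DEC  ECX                  ; one fewer element left
    JNZ  $add_loop            ; repeat until the counter reaches zero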
The loop consists of seven instructions. The AMD Athlon
processor can decode/retire three instructions per cycle, so it
cannot execute faster than three iterations in seven cycles, or
3/7 floating-point adds per cycle. However, the pipelined
floating-point adder allows one add every cycle, so the loop is
limited by decode/retire bandwidth rather than by the adder. In
the following code, the loop is partially unrolled by a factor of
two, which creates potential end cases that must be handled
outside the loop:
With Partial Loop Unrolling:
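    MOV  ECX, MAX_LENGTH      ; number of elements (assumed >= 2 here)
    MOV  EAX, OFFSET A        ; EAX -> first element of A
    MOV  EBX, OFFSET B        ; EBX -> first element of B
    SHR  ECX, 1               ; ECX = MAX_LENGTH / 2, CF = 1 if count was odd
    JNC  $add_loop            ; even count: no end case to handle
    FLD  QWORD PTR [EAX]      ; odd count: process one element up front
    FADD QWORD PTR [EBX]
    FSTP QWORD PTR [EAX]
    ADD  EAX, 8
    ADD  EBX, 8
$add_loop:                    ; two elements per iteration
    FLD  QWORD PTR [EAX]
    FADD QWORD PTR [EBX]
    FSTP QWORD PTR [EAX]
    FLD  QWORD PTR [EAX+8]
    FADD QWORD PTR [EBX+8]
    FSTP QWORD PTR [EAX+8]
    ADD  EAX, 16              ; advance both pointers by two doubles
    ADD  EBX, 16
    DEC  ECX                  ; one fewer pair left
    JNZ  $add_loop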
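For readers more comfortable with C than with assembly, the same transformation can be sketched roughly as follows. This is an illustrative sketch, not code from the guide: the function names add_arrays, add_arrays_unrolled2, and decode_bound are hypothetical, and the compiler is free to generate different instructions than the listings above. Unlike the assembly listing, the sketch also tolerates element counts of 0 and 1. The small helper restates the decode/retire estimate (adds per iteration times decode width, divided by instructions per iteration), which gives 3/7 for the rolled loop.

```c
#include <stddef.h>

/* Reference version: one element per iteration, like the rolled loop. */
void add_arrays(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}

/* Unrolled by a factor of two. An odd leftover element is handled
   before the loop, mirroring the SHR/JNC end-case test in the listing. */
void add_arrays_unrolled2(double *a, const double *b, size_t n)
{
    size_t i = 0;

    if (n & 1) {                /* odd count: peel off one element */
        a[0] += b[0];
        i = 1;
    }
    for (; i < n; i += 2) {     /* remaining count is even */
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
    }
}

/* Upper bound on floating-point adds per cycle imposed by the
   decode/retire width alone (ignores every other pipeline limit). */
double decode_bound(int adds_per_iter, int instrs_per_iter, int width)
{
    return (double)adds_per_iter * width / instrs_per_iter;
}
```

Calling add_arrays_unrolled2(a, b, n) should leave a with the same contents as add_arrays(a, b, n) for any n, and decode_bound(1, 7, 3) reproduces the 3/7 figure quoted for the rolled loop.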
Now the loop consists of 10 instructions. Based on the
decode/retire bandwidth of three OPs per cycle, this loop goes