Volume 1, Part 2: Software Pipelining and Loop Support
This loop maintains five independent sums in registers f33-f37. The fma
instruction in iteration X produces a result that is used by the fma
instruction in iteration X+5. Iterations X through X+4 are independent,
allowing an II of one to be achieved. The code for a pipelined version of the
loop assuming two memory ports and a nine cycle latency for a floating-point
load is shown below:
      mov   lc = 199                // LC = loop count - 1
      mov   ec = 10                 // EC = epilog stages + 1
      mov   pr.rot = 1<<16          // PR16 = 1, rest = 0
      mov   f33 = 0                 // initialize sums
      mov   f34 = 0
      mov   f35 = 0
      mov   f36 = 0
      mov   f37 = 0
L1:
(p16) ldfs  f50 = [r5],4            // Cycle 0
(p16) ldfs  f60 = [r8],4            // Cycle 0
(p25) fma   f41 = f59,f69,f46       // Cycle 0
      br.ctop.sptk L1 ;;            // Cycle 0
      fadd  f10 = f42,f43           // add sums
      fadd  f11 = f44,f45 ;;
      fadd  f12 = f10,f11 ;;
      fadd  f7 = f12,f46
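The register flow in the listing above can be traced with a small model. The following Python sketch is an illustration under stated assumptions, not part of the manual: it assumes the loop accumulates products of two arrays (as the two loads and the fma suggest), models register rotation by shifting a list one position per kernel pass, and implements only the br.ctop cases this loop reaches. The function name run_pipelined_dot and the list-based register files are hypothetical.

```python
def run_pipelined_dot(a, b):
    """Model the kernel above for sum(a[i] * b[i]); returns (total, passes, fmas)."""
    n = len(a)
    assert n == len(b) and n >= 1
    fr = [0.0] * 96          # fr[k] models f(32+k); f33..f37 (the sums) start at 0
    pr = [False] * 48        # pr[k] models p(16+k)
    pr[0] = True             # mov pr.rot = 1<<16
    lc, ec = n - 1, 10       # LC = loop count - 1, EC = epilog stages + 1
    i = passes = fmas = 0
    taken = True
    while taken:
        if pr[16 - 16]:      # (p16) ldfs f50 = [r5],4 and ldfs f60 = [r8],4
            fr[50 - 32], fr[60 - 32] = a[i], b[i]
            i += 1
        if pr[25 - 16]:      # (p25) fma f41 = f59,f69,f46
            fr[41 - 32] = fr[59 - 32] * fr[69 - 32] + fr[46 - 32]
            fmas += 1
        # br.ctop: count down LC first, then EC; the value written to p63
        # becomes the next p16 after rotation.
        if lc != 0:
            lc -= 1
            next_p16 = True
        else:                # EC >= 1 here; the EC == 0 case is never reached
            ec -= 1
            next_p16 = False
            taken = ec > 0
        # Rotation: the value in register N appears in register N+1.
        fr = [fr[-1]] + fr[:-1]
        pr = [next_p16] + pr[:-1]
        passes += 1
    f = lambda num: fr[num - 32]
    # fadd tree after the loop: the five partial sums are in f42..f46.
    total = ((f(42) + f(43)) + (f(44) + f(45))) + f(46)
    return total, passes, fmas
```

Under this model, a 200-element input runs the kernel 209 times (200 passes with p16 set plus 9 epilog passes) and executes the fma 200 times, which is consistent with LC = 199 and EC = 10 in the listing.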
5.5.8 Explicit Prolog and Epilog
In some cases, an explicit prolog is necessary for code correctness. This can occur in
cases where a speculative instruction generates a value that is live across source
iterations. Consider the following loop:
      ld4   r3 = [r5] ;;
L1:   ld4   r6 = [r8],4             // Cycle 0
      ld4   r5 = [r9],4 ;;          // Cycle 0
      add   r7 = r3,r6 ;;           // Cycle 2
      ld4   r3 = [r5]               // Cycle 3
      and   r10 = 3,r7 ;;           // Cycle 3
      cmp.ne p1,p0 = r10,r11        // Cycle 4
(p1)  br.cond L1 ;;                 // Cycle 4
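To make the cross-iteration value explicit, the source loop can be restated at a higher level. The Python sketch below is only an illustration: memory is modeled as a hypothetical dict from address to value, register wraparound and ld4 zero-extension are ignored, and the function name run_source_loop is not from the manual. It shows that the r3 loaded near the end of one trip is consumed by the add in the next trip.

```python
def run_source_loop(mem, r5, r8, r9, r11):
    """Model the source loop above; returns (trip count, final r7)."""
    r3 = mem[r5]                 # ld4 r3 = [r5], executed once before L1
    trips = 0
    while True:
        r6 = mem[r8]             # ld4 r6 = [r8],4
        r8 += 4
        r5 = mem[r9]             # ld4 r5 = [r9],4 (r5 is a loaded pointer)
        r9 += 4
        r7 = r3 + r6             # add uses the r3 loaded in the previous trip
        r3 = mem[r5]             # ld4 r3 = [r5], consumed by the next trip
        r10 = 3 & r7             # and r10 = 3,r7
        trips += 1
        if r10 == r11:           # (p1) br.cond L1 is taken only when r10 != r11
            return trips, r7
```

Because r3 is defined in one trip and used in the next, the first trip depends on the load that precedes L1, and any pipelined version has to preserve that initial value.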
The following is a possible pipeline for the loop:
stage 1:       ld4.s  r6 = [r8],4        // II = 2
               ld4.s  r5 = [r9],4 ;;
               ---                       // empty cycle
stage 2:       ---                       // empty cycle
               ld4.s  r36 = [r5]
               add    r7 = r37,r6 ;;
stage 3: (p18) and    r10 = 3,r7 ;;
         (p18) cmp.ne p1,p0 = r10,r11
         (p1)  br.wtop L1 ;;