1:200
Volume 1, Part 2: Software Pipelining and Loop Support
If the loop trip count is even, two epilog stages are executed and the kernel loop is
exited at the br.ctop. If the trip count is odd, the first two epilog stages are executed
and then the br.cexit branch is taken. Because the target of the br.cexit branch is
the next sequential bundle (L4), a third epilog stage is executed before the kernel loop
is exited at the br.ctop. This optimization saves one stage at the end of the loop when
the trip count is even, and is beneficial for short trip count loops.
Although unrolling can be beneficial, there are a few considerations before trying to
unroll and software pipeline. Unrolling reduces the trip count of the loop that is given to
the pipeliner, and thus may make pipelining of the loop undesirable since low trip count
loops sometimes run faster unpipelined. Unrolling also increases the code size, which
may adversely affect instruction cache performance. Unrolling is most beneficial for
small loops because the potential performance degradation due to underutilized
resources is greater and the effect of unrolling on the instruction cache performance is
smaller compared to large loops.
5.5.7 Implementing Reductions
In the following example, a sum of products is accumulated in register f7:
        mov     f7 = 0;;                // initialize sum
L1:     ldfs    f4 = [r5],4
        ldfs    f9 = [r8],4;;
        fma     f7 = f4,f9,f7;;         // accumulate
        br.cloop L1 ;;
The performance is bound by the latency of the fma instruction, which we assume is 5
cycles for these examples. A pipelined version of this loop must have an II of at least
five because the fma latency is five. By making use of register rotation, the loop can
be transformed into the one below.

Note that the loop has not yet been pipelined. The register rotation and special loop
branches are being used to enable an optimization prior to software pipelining.
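The lower bound on II quoted above can be checked with a small calculation. The sketch below assumes the usual recurrence rule from modulo scheduling, which this section does not spell out: a value that is consumed some number of iterations after it is produced limits II to at least the producer latency divided by that iteration distance, rounded up. The function name `rec_mii` and its parameters are hypothetical and serve only as an illustration.

```python
def rec_mii(latency_cycles, distance_iterations):
    """Smallest II allowed by one loop-carried dependence (assumed standard rule).

    latency_cycles: cycles before the produced value can be consumed.
    distance_iterations: number of iterations between producer and consumer.
    """
    # Integer ceiling division: ceil(latency / distance).
    return -(-latency_cycles // distance_iterations)


# Original loop: each fma consumes the f7 written by the previous iteration.
print(rec_mii(5, 1))

# Hypothetical case: the consumer is five iterations after the producer.
print(rec_mii(5, 5))
```

With a distance of one iteration the 5-cycle fma latency gives a bound of five, matching the statement above. If the dependence spanned five iterations instead, the same rule would give a bound of one, which is the motivation for keeping several independent sums.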
        mov     lc = 199                // LC = loop count - 1
        mov     ec = 1                  // Not pipelined, so no epilog
        mov     f33 = 0                 // initialize 5 sums
        mov     f34 = 0
        mov     f35 = 0
        mov     f36 = 0
        mov     f37 = 0;;
L1:     ldfs    f4 = [r5],4
        ldfs    f9 = [r8],4;;
        fma     f32 = f4,f9,f37;;       // accumulate
        br.ctop L1 ;;
        fadd    f10 = f33,f34           // add sums
        fadd    f11 = f35,f36;;
        fadd    f12 = f10,f11;;
        fadd    f7 = f12,f37
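The effect of register rotation on the five sums can be modeled in ordinary software. The following Python sketch is an illustration only and is not Itanium code: the function name `rotating_dot` and its `lanes` parameter are hypothetical. The model assumes that every br.ctop in this loop renames f32 to f33, f33 to f34, and so on, so that f37 always names the sum written five iterations earlier.

```python
def rotating_dot(xs, ys, lanes=5):
    """Model the transformed loop; returns the partial sums left in f33..f37."""
    # regs[0] models f32; regs[1..lanes] model f33..f37, initialized to zero.
    regs = [0.0] * (lanes + 1)
    for x, y in zip(xs, ys):
        # fma f32 = f4,f9,f37: add the product to the sum from `lanes` iterations ago.
        regs[0] = x * y + regs[lanes]
        # br.ctop rotates the registers: the value written to f32 is named f33 in
        # the next iteration, f33 becomes f34, and so on. The old f37 leaves the
        # modeled window, and the new f32 is overwritten before it is read.
        regs = [0.0] + regs[:lanes]
    return regs[1:]


xs = [float(i) for i in range(1, 201)]   # 200 iterations, as with lc = 199
ys = [2.0] * 200
partials = rotating_dot(xs, ys)
total = sum(partials)                    # corresponds to the fadd sequence
```

Adding the returned values corresponds to the four fadd instructions after the loop. Because floating-point addition is not associative, the total may differ in the last bits from the result of the single-accumulator loop.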