Volume 1, Part 2: Floating-point Applications
If the minimum floating-point load latency is 9 clocks and 2 memory operations can be
issued per clock, the above loop has to be unrolled at least six times when register
rotation is not used:
        add     r8 = r7, 8
L1:
(p18)   stf     [r7] = f25, 16      // Cycle 17,26...
(p18)   stf     [r8] = f26, 16      // Cycle 17,26...
(p17)   fadd    f25 = f5, f15       // Cycle 8,17,26...
(p16)   ldf     f5 = [r5], 8        // Cycle 0,9,18...
(p16)   ldf     f15 = [r6], 8       // Cycle 0,9,18...
(p17)   fadd    f26 = f6, f16;;     // Cycle 9,18,27 ...
(p16)   ldf     f6 = [r5], 8        // Cycle 1,10,19 ...
(p16)   ldf     f16 = [r6], 8       // Cycle 1,10,19 ...
(p18)   stf     [r7] = f27, 16      // Cycle 20,29 ...
(p18)   stf     [r8] = f28, 16      // Cycle 20,29 ...
(p17)   fadd    f27 = f7, f17 ;;    // Cycle 11,20 ...
(p16)   ldf     f7 = [r5], 8        // Cycle 3,12,21 ...
(p16)   ldf     f17 = [r6], 8       // Cycle 3,12,21 ...
(p17)   fadd    f28 = f8, f18 ;;    // Cycle 12,21 ...
(p16)   ldf     f8 = [r5], 8        // Cycle 4,13,22 ...
(p16)   ldf     f18 = [r6], 8       // Cycle 4,13,22 ...
(p18)   stf     [r7] = f29, 16      // Cycle 23,32 ...
(p18)   stf     [r8] = f30, 16      // Cycle 23,32 ...
(p16)   fadd    f29 = f9, f19 ;;    // Cycle 14,23 ...
(p16)   ldf     f9 = [r5], 8        // Cycle 6,15,24 ...
(p16)   ldf     f19 = [r6], 8       // Cycle 6,15,24 ...
(p16)   fadd    f30 = f10, f20 ;;   // Cycle 15,24 ...
(p16)   ldf     f10 = [r5], 8       // Cycle 7,16,25 ...
(p16)   ldf     f20 = [r6], 8       // Cycle 7,16,25 ...
        br.ctop L1 ;;
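The unroll factor of six can be checked with a little arithmetic. The sketch below is only an illustration: it assumes that each source iteration issues three memory operations (two ldf and one stf, as in the listing), that memory issue bandwidth is the only resource limit, and that without rotation every load still in flight needs its own destination register for the full load latency. The function name and its parameters are hypothetical.

```python
from fractions import Fraction
from math import ceil


def min_unroll_without_rotation(load_latency, mem_ops_per_iter, mem_ports):
    """Smallest unroll factor that keeps the memory ports busy while each
    load's result register stays live for the full load latency.

    Assumption: memory issue bandwidth is the only resource limit.
    """
    # Best-case clocks per source iteration, limited by the memory ports.
    clocks_per_iter = Fraction(mem_ops_per_iter, mem_ports)
    # Iterations that start before the first load returns; each needs its
    # own registers, and therefore its own copy of the loop body.
    return ceil(Fraction(load_latency) / clocks_per_iter)


print(min_unroll_without_rotation(9, 3, 2))  # prints 6
```

Six unrolled iterations issue 18 memory operations, which occupy 9 clocks at 2 per clock. This matches the cycle comments in the listing, where the first load repeats at cycles 0, 9, 18.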
However, with register rotation, the same loop can be scheduled with an initiation
interval of just 2 clocks without unrolling (and 1.5 clocks if unrolled by 2):
L1:
(p24)   stf     [r7] = f57, 8       // Cycle 15,17...
(p21)   fadd    f57 = f37, f47      // Cycle 9,11,13...
(p16)   ldf     f32 = [r5], 8       // Cycle 0,2,4,6...
(p16)   ldf     f42 = [r6], 8       // Cycle 0,2,4,6...
        br.ctop L1;;
It is thus often advantageous to modulo schedule and then unroll (if required). Please
see Chapter 5, “Software Pipelining and Loop Support” for details on how to rewrite
loops using this transformation.
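As a rough check on the numbers quoted for the rotating version, the following sketch computes the memory-port bound on the initiation interval, and the register name under which a loaded value reappears at the stage that consumes it. It assumes three memory operations per source iteration, two memory ports, and a rotating register file that is renamed by one position on each br.ctop. The helper names are hypothetical.

```python
from fractions import Fraction
from math import ceil


def initiation_interval(mem_ops_per_iter, mem_ports, unroll=1):
    """Clocks per source iteration when memory ports are the only limit."""
    # A kernel covering `unroll` iterations needs a whole number of clocks.
    kernel_clocks = ceil(Fraction(mem_ops_per_iter * unroll, mem_ports))
    return Fraction(kernel_clocks, unroll)


def rotated_name(written_reg, written_stage_pred, read_stage_pred):
    """Register number under which a value written in an earlier stage
    is read in a later stage.

    Assumes one rotation per loop branch, so the name advances by the
    number of stages between the writer and the reader.
    """
    return written_reg + (read_stage_pred - written_stage_pred)


print(initiation_interval(3, 2))            # prints 2
print(initiation_interval(3, 2, unroll=2))  # prints 3/2
print(rotated_name(32, 16, 21))             # prints 37
print(rotated_name(42, 16, 21))             # prints 47
```

Under these assumptions the bound is 2 clocks per iteration without unrolling and 1.5 clocks when unrolled by 2. The values loaded into f32 and f42 under p16 are the ones the fadd reads as f37 and f47 under p21, five stages later.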
6.3.1.1 Notes on FP Precision
The floating-point registers are 82 bits wide with 17 bits for exponent range, 64 bits for
significand precision and 1 sign bit. During computation, the result range and precision
are determined by the computational model chosen by the user. The computational
model is indicated either statically in the instruction encoding, or dynamically via the
precision control (PC) and widest-range-exponent (WRE) bits in the floating-point
status register. Using an appropriate computational model, the user can minimize the
error accumulation in the computation. In the above matrix multiply example, if the
multiply and add computations are performed in full register file range and precision,
the results (in accumulators) can hold 64 bits of precision and up to 17 bits of range for