1:210
Volume 1, Part 2: Floating-point Applications
inputs that might be single precision numbers. With the rounding performed at the 64th
precision bit (instead of the 24th for single precision), a smaller error is accumulated
with each multiply and add. Furthermore, with 17 bits of range (instead of 8 bits for
single precision), large positive and negative products can be added to the accumulator
without overflow or underflow. In addition to providing more accurate results, the extra
range and precision can often enhance the performance of iterative computations that
must be repeated until convergence (as indicated by an error bound) is reached.
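The effect of a wider accumulator can be illustrated with a minimal Python sketch. It is only an approximation of the situation described above: Python floats are IEEE double precision (53 precision bits), used here as a stand-in for the 64-bit register-format significand, and the helper names (to_f32, dot_narrow, dot_wide, dot_exact) are hypothetical names chosen for this illustration.

```python
import struct
from fractions import Fraction

def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot_narrow(xs, ys):
    """Dot product with every multiply and add rounded to single precision."""
    acc = 0.0
    for x, y in zip(xs, ys):
        prod = to_f32(x * y)        # product rounded at the 24th bit
        acc = to_f32(acc + prod)    # sum rounded at the 24th bit
    return acc

def dot_wide(xs, ys):
    """Same inputs, but the accumulator keeps double precision (assumed
    stand-in for the wider register format)."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y                # product of two singles is exact in double
    return acc

def dot_exact(xs, ys):
    """Exact rational reference value, free of rounding."""
    return sum(Fraction(x) * Fraction(y) for x, y in zip(xs, ys))

if __name__ == "__main__":
    xs = [to_f32(0.1)] * 1000       # single precision inputs
    exact = dot_exact(xs, xs)
    print("narrow error:", float(abs(Fraction(dot_narrow(xs, xs)) - exact)))
    print("wide error:  ", float(abs(Fraction(dot_wide(xs, xs)) - exact)))
```

With single precision inputs, the error of the wide accumulator should be several orders of magnitude smaller than that of the narrow one, which is the behavior the paragraph above describes.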
6.3.2 Multiply-Add Instruction
The Itanium architecture defines the fused multiply-add (fma) as the basic
floating-point computation, since it forms the core of many computations (linear
algebra, series expansion, etc.) and its latency in hardware is typically less than the
sum of the latencies of an individual multiply operation (with rounding) and an
individual add operation (with rounding).
In computational loops that have a loop-carried dependency, and whose speed is often
determined by the latency of the floating-point computation rather than the peak
computational rate, the multiply-add operation can often be used advantageously.
Consider the Livermore FORTRAN Kernel 9 – General Linear Recurrence Equations:
      DO 191 k= 1,n
         B5(k+KB5I) = SA(k) + STB5 * SB(k)
         STB5 = B5(k+KB5I) - STB5
  191 CONTINUE
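For readers less familiar with FORTRAN, the recurrence in the kernel above can be restated as a short Python sketch. The function name kernel9 is hypothetical, and the index offset KB5I is dropped for simplicity (the results are returned as a fresh list rather than stored into a larger B5 array).

```python
def kernel9(sa, sb, stb5):
    """Run the two-statement recurrence over paired inputs sa and sb.

    Returns the list of B5 values and the final STB5. Each iteration
    needs the STB5 produced by the previous one, which is the
    loop-carried dependency.
    """
    b5 = []
    for sa_k, sb_k in zip(sa, sb):
        b5_k = sa_k + stb5 * sb_k   # B5(k+KB5I) = SA(k) + STB5 * SB(k)
        stb5 = b5_k - stb5          # STB5 = B5(k+KB5I) - STB5
        b5.append(b5_k)
    return b5, stb5

if __name__ == "__main__":
    print(kernel9([1.0, 2.0], [3.0, 4.0], 0.5))
```

The second statement cannot start until the first has produced its result, and the first statement of the next iteration cannot start until the second has finished, so the operations form one serial chain.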
Since there is a true data dependency between the two statements on variable
B5(k+KB5I) and a loop-carried dependency on variable STB5, the number of clocks per
loop iteration is entirely determined by the latency of the floating-point
operations. In the absence of an fma type operation, and assuming that the individual
multiply and add latencies are 5 clocks each and the loads take 8 clocks, the loop would
be:
L1:
(p16) ldf    f32 = [r5], 8        // Load SA(k)
(p16) ldf    f42 = [r6], 8        // Load SB(k)
(p17) fmul   f5 = f7, f43 ;;      // tmp,Clk 0,15 ...
(p17) fadd   f6 = f33, f5 ;;      // B5,Clk 5,20 ...
(p17) stf    [r7] = f6, 8         // Store B5
(p17) fsub   f7 = f6, f7          // STB5,Clk 10,25 ..
      br.ctop L1 ;;
With an fma, the overall latency of the chain of operations decreases and, assuming a
5-cycle fma, the loop iteration speed is now 10 clocks (as opposed to 15 clocks above).
L1:
(p16) ldf    f32 = [r5], 8        // Load SA(k)
(p16) ldf    f42 = [r6], 8        // Load SB(k)
(p17) fma    f6 = f7, f43, f33 ;; // B5,Clk 0,10 ...
(p17) stf    [r7] = f6, 8         // Store B5
(p17) fsub   f7 = f6, f7          // STB5,Clk 5,15 ..
      br.ctop L1 ;;
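The clock annotations in the two listings follow from adding up latencies along the dependency chain. A small Python model, given here only as a sketch, reproduces them. It assumes the loads are fully hidden by software pipelining and ignores issue and resource limits; the names schedule, clocks_per_iteration, UNFUSED and FUSED are hypothetical.

```python
def schedule(chain, iterations):
    """Start clock of each operation in a strictly serial dependency chain.

    chain is a list of (name, latency) pairs in dependency order; the last
    operation of one iteration feeds the first operation of the next.
    """
    starts = {name: [] for name, _ in chain}
    clock = 0
    for _ in range(iterations):
        for name, latency in chain:
            starts[name].append(clock)
            clock += latency        # next operation waits for this result
    return starts

def clocks_per_iteration(chain):
    """Length of the loop-carried chain, in clocks."""
    return sum(latency for _, latency in chain)

# Latencies assumed in the text: 5 clocks for each floating-point operation.
UNFUSED = [("fmul", 5), ("fadd", 5), ("fsub", 5)]
FUSED = [("fma", 5), ("fsub", 5)]

if __name__ == "__main__":
    print(clocks_per_iteration(UNFUSED), schedule(UNFUSED, 2))
    print(clocks_per_iteration(FUSED), schedule(FUSED, 2))
```

Under these assumptions the model gives 15 clocks per iteration without fma and 10 with it, and the start clocks match the Clk comments in the listings.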
The fused multiply-add operation also offers the advantage of a single rounding error
for the pair of computations, which is valuable when trying to compute small differences
of large numbers.
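The benefit of a single rounding can be shown with a short Python sketch. It does not call a hardware fma; instead, the hypothetical helper fma_emulated computes a*b + c exactly with rational arithmetic and rounds once to double precision, which is the defining property of a fused multiply-add. It is intended for finite inputs only.

```python
from fractions import Fraction

def fma_emulated(a: float, b: float, c: float) -> float:
    """Emulate a fused multiply-add: form a*b + c exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)             # single, correctly rounded conversion

def mul_then_add(a: float, b: float, c: float) -> float:
    """Separate multiply and add: the product is rounded before the add."""
    return a * b + c

if __name__ == "__main__":
    # The exact product is 1 - 2**-60, which rounds to 1.0 in double precision.
    a = 1.0 + 2.0**-30
    b = 1.0 - 2.0**-30
    print("separate:", mul_then_add(a, b, -1.0))
    print("fused:   ", fma_emulated(a, b, -1.0))
```

In this example the separately rounded product loses the small difference from 1.0 entirely, while the fused form preserves it, because the product is never rounded on its own.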