Volume 1, Part 2: Floating-point Applications
inputs that might be single precision numbers. With the rounding performed at the 64th
precision bit (instead of the 24th for single precision) a smaller error is accumulated
with each multiply and add. Furthermore, with 17 bits of range (instead of 8 bits for
single precision) large positive and negative products can be added to the accumulator
without overflow or underflow. In addition to providing more accurate results, the extra
range and precision can often enhance the performance of iterative computations that
must be repeated until convergence (as indicated by an error bound) is reached.
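The benefit of a wider accumulator can be illustrated with a minimal C sketch. It assumes an IEEE 754 platform where `float` arithmetic is evaluated in single precision, and it uses `double` only as a stand-in for the wider register format described above (it is not that format). The function names `dot_narrow` and `dot_wide` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: dot product with a single-precision accumulator.
   Every accumulation step rounds at the 24th significand bit. */
float dot_narrow(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Hypothetical helper: same single-precision inputs, wider accumulator.
   double (53-bit significand, 11-bit exponent) is an illustrative
   stand-in for a wider register format, chosen as an assumption. */
double dot_wide(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += (double)a[i] * (double)b[i];
    return acc;
}
```

Under these assumptions, summing 1.0 followed by four products of 2^-25 leaves the narrow accumulator at exactly 1.0, because each small term is rounded away, while the wide accumulator retains 1 + 2^-23. Likewise, the product of two inputs near 1e30 overflows the narrow accumulator but stays finite in the wide one.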
6.3.2 Multiply-Add Instruction
The Itanium architecture defines the fused multiply-add (fma) as the basic
floating-point computation, since it forms the core of many computations (linear
algebra, series expansion, etc.) and its latency in hardware is typically less than the
sum of the latencies of a separate multiply operation (with rounding) and a separate
add operation (with rounding).
In computational loops that have a loop-carried dependency, and whose speed is often
determined by the latency of the floating-point computation rather than the peak
computational rate, the multiply-add operation can often be used advantageously.
Consider the Livermore FORTRAN Kernel 9 – General Linear Recurrence Equations:
      DO 191 k = 1,n
        B5(k+KB5I) = SA(k) + STB5 * SB(k)
        STB5 = B5(k+KB5I) - STB5
  191 CONTINUE
Since there is a true data dependency between the two statements on variable
B5(k+KB5I) and a loop-carried dependency on variable STB5, the number of clocks per
loop iteration is entirely determined by the latency of the floating-point operations.
In the absence of an fma type operation, and assuming that the individual multiply and
add latencies are 5 clocks each and the loads are 8 clocks, the loop would be:
L1:
(p16) ldf   f32 = [r5], 8      // Load SA(k)
(p16) ldf   f42 = [r6], 8      // Load SB(k)
(p17) fmul  f5 = f7, f43 ;;    // tmp, Clk 0, 15 ...
(p17) fadd  f6 = f33, f5 ;;    // B5, Clk 5, 20 ...
(p17) stf   [r7] = f6, 8       // Store B5
(p17) fsub  f7 = f6, f7        // STB5, Clk 10, 25 ...
      br.ctop L1 ;;
With an fma, the overall latency of the chain of operations decreases. Assuming a
5-cycle fma, each loop iteration now takes 10 clocks (5 for the fma plus 5 for the
fsub), as opposed to 15 clocks above:
L1:
(p16) ldf   f32 = [r5], 8          // Load SA(k)
(p16) ldf   f42 = [r6], 8          // Load SB(k)
(p17) fma   f6 = f7, f43, f33 ;;   // B5, Clk 0, 10 ...
(p17) stf   [r7] = f6, 8           // Store B5
(p17) fsub  f7 = f6, f7            // STB5, Clk 5, 15 ...
      br.ctop L1 ;;
The fused multiply-add operation also offers the advantage of a single rounding error
for the pair of computations, which is valuable when trying to compute small differences
of large numbers.