Intel ITANIUM ARCHITECTURE - SOFTWARE DEVELOPERS VOLUME 3 REV 2.3 Manual Download Page 221

Volume 1, Part 2: Floating-point Applications

inputs that might be single-precision numbers. With the rounding performed at the 64th
precision bit (instead of the 24th for single precision), a smaller error is accumulated
with each multiply and add. Furthermore, with 17 bits of range (instead of 8 bits for
single precision), large positive and negative products can be added to the accumulator
without overflow or underflow. In addition to providing more accurate results, the extra
range and precision can often enhance the performance of iterative computations that
must run until convergence (as indicated by an error bound) is reached.
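
The effect of accumulating at a wider precision can be illustrated with a small Python sketch. It is not Itanium code and rests on stated assumptions: binary64 (53-bit precision) stands in for the 64-bit register format, binary32 rounding is emulated with the `struct` module, and the helper names (`to_single`, `dot_narrow`, `dot_wide`) and the input values are hypothetical choices for illustration only.

```python
import struct
from fractions import Fraction

def to_single(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot_narrow(a, b):
    """Dot product with every product and every sum rounded to 24 bits."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_single(x * y)
        acc = to_single(acc + product)
    return acc

def dot_wide(a, b):
    """Dot product accumulated in binary64 (stand-in for the wider format)."""
    acc = 0.0
    for x, y in zip(a, b):
        # The product of two binary32 values is exact in binary64,
        # so only the addition rounds, and it rounds at the 53rd bit.
        acc += x * y
    return acc

n = 10000
a = [to_single(0.1)] * n   # single-precision inputs
b = [to_single(0.3)] * n

# Exact reference value computed with rational arithmetic.
exact = sum(Fraction(x) * Fraction(y) for x, y in zip(a, b))
narrow_error = abs(Fraction(dot_narrow(a, b)) - exact)
wide_error = abs(Fraction(dot_wide(a, b)) - exact)

print("error, 24-bit accumulation:", float(narrow_error))
print("error, wide accumulation:  ", float(wide_error))
```

Under these assumptions the wide accumulator ends much closer to the exact sum, because each step discards far fewer low-order bits.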

6.3.2 Multiply-Add Instruction

The Itanium architecture defines the fused multiply-add (fma) as the basic
floating-point computation, since it forms the core of many computations (linear
algebra, series expansion, etc.) and its latency in hardware is typically less than the
sum of the latencies of an individual multiply operation (with rounding) and an
individual add operation (with rounding).

In computational loops that have a loop-carried dependency and whose speed is often
determined by the latency of the floating-point computation rather than the peak
computational rate, the multiply-add operation can often be used advantageously.
Consider the Livermore FORTRAN Kernel 9 – General Linear Recurrence Equations:

      DO 191 k = 1,n
         B5(k+KB5I) = SA(k) + STB5 * SB(k)
         STB5 = B5(k+KB5I) - STB5
  191 CONTINUE
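
For readers less familiar with FORTRAN, the same recurrence can be written as a minimal Python sketch. The function name `linear_recurrence`, the zero-based indexing, and the default offset `kb5i=0` are assumptions made for illustration; they are not part of the kernel.

```python
def linear_recurrence(sa, sb, stb5, kb5i=0):
    """Run the two-statement recurrence over all elements of sa and sb.

    Returns the filled b5 list and the final value of stb5.
    """
    b5 = [0.0] * (len(sa) + kb5i)
    for k in range(len(sa)):
        # Multiply-add: needs stb5 from the previous iteration.
        b5[k + kb5i] = sa[k] + stb5 * sb[k]
        # Subtract: needs the b5 value just computed and the old stb5.
        stb5 = b5[k + kb5i] - stb5
    return b5, stb5

b5, last = linear_recurrence([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], 1.0)
print(b5, last)
```

Each iteration reads the `stb5` value written by the previous one, so iterations cannot overlap along this chain of floating-point operations.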

Since there is a true data dependency between the two statements on variable
B5(k+KB5I) and a loop-carried dependency on variable STB5, the number of clocks per
loop iteration is entirely determined by the latency of the floating-point operations.
In the absence of an fma type operation, and assuming that the individual multiply and
add latencies are 5 clocks each and the load latency is 8 clocks, the loop would be:

L1:
(p16) ldf   f32 = [r5], 8       // Load SA(k)
(p16) ldf   f42 = [r6], 8       // Load SB(k)
(p17) fmul  f5 = f7, f43 ;;     // tmp,  Clk 0,15 ...
(p17) fadd  f6 = f33, f5 ;;     // B5,   Clk 5,20 ...
(p17) stf   [r7] = f6, 8        // Store B5
(p17) fsub  f7 = f6, f7         // STB5, Clk 10,25 ...
      br.ctop L1 ;;

With an fma, the overall latency of the chain of operations decreases. Assuming a
5-cycle fma, each loop iteration now takes 10 clocks (as opposed to 15 clocks above):

L1:
(p16) ldf   f32 = [r5], 8         // Load SA(k)
(p16) ldf   f42 = [r6], 8         // Load SB(k)
(p17) fma   f6 = f7, f43, f33 ;;  // B5,   Clk 0,10 ...
(p17) stf   [r7] = f6, 8          // Store B5
(p17) fsub  f7 = f6, f7           // STB5, Clk 5,15 ...
      br.ctop L1 ;;
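
The clock annotations in the two loops can be reproduced with a short Python sketch. It assumes a deliberately simplified model in which only the dependent floating-point operations limit the loop and each one issues as soon as its predecessor in the chain completes; the function name `issue_clocks` is hypothetical, and the 5-clock latencies are the ones assumed in the text.

```python
def issue_clocks(chain, iterations=2):
    """Return the issue clock of each operation in a dependency cycle.

    chain: list of (name, latency) pairs, in dependency order.
    """
    schedule = {name: [] for name, _ in chain}
    clock = 0
    for _ in range(iterations):
        for name, latency in chain:
            schedule[name].append(clock)
            clock += latency   # the next operation waits for this result
    return schedule

separate = issue_clocks([("fmul", 5), ("fadd", 5), ("fsub", 5)])
fused = issue_clocks([("fma", 5), ("fsub", 5)])

print(separate)
print(fused)
```

In this model the iteration period equals the sum of the latencies in the cycle: three 5-clock operations give 15 clocks, while fma followed by fsub gives 10.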

The fused multiply-add operation also offers the advantage of a single rounding error
for the pair of computations, which is valuable when computing small differences of
large numbers.
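
The single-rounding property can be illustrated in Python under one loud assumption: the fused operation is emulated here with exact rational arithmetic followed by one rounding to binary64, rather than executed by a hardware fma. The helper name `fused_multiply_add` and the operand values are hypothetical.

```python
from fractions import Fraction

def fused_multiply_add(a, b, c):
    """Emulate a*b + c with a single rounding at the end.

    The product and sum are formed exactly as rationals; float() then
    rounds the exact result once to the nearest binary64 value.
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0 ** -30     # exactly representable in binary64
p = a * a                # rounded product: the 2**-60 term is lost

separate = a * a - p                    # multiply rounds, then subtract
fused = fused_multiply_add(a, a, -p)    # one rounding of the exact value

print(separate)
print(fused)
```

With separate operations the difference collapses to zero, while the fused form recovers the low-order part of the product that the rounded multiply discarded.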
