Intel ITANIUM ARCHITECTURE - SOFTWARE DEVELOPERS VOLUME 3 REV 2.3 Manual Download

Volume 1, Part 2: Floating-point Applications

If we suppose the minimum floating-point load latency is 9 clocks, and 2 memory
operations can be issued per clock, the above loop has to be unrolled by at least six if
there is no register rotation.

        add    r8 = r7, 8
L1:
(p18)   stf    [r7] = f25, 16      // Cycle 17,26...
(p18)   stf    [r8] = f26, 16      // Cycle 17,26...
(p17)   fadd   f25 = f5, f15       // Cycle 8,17,26...
(p16)   ldf    f5 = [r5], 8        // Cycle 0,9,18...
(p16)   ldf    f15 = [r6], 8       // Cycle 0,9,18...
(p17)   fadd   f26 = f6, f16 ;;    // Cycle 9,18,27...
(p16)   ldf    f6 = [r5], 8        // Cycle 1,10,19...
(p16)   ldf    f16 = [r6], 8       // Cycle 1,10,19...
(p18)   stf    [r7] = f27, 16      // Cycle 20,29...
(p18)   stf    [r8] = f28, 16      // Cycle 20,29...
(p17)   fadd   f27 = f7, f17 ;;    // Cycle 11,20...
(p16)   ldf    f7 = [r5], 8        // Cycle 3,12,21...
(p16)   ldf    f17 = [r6], 8       // Cycle 3,12,21...
(p17)   fadd   f28 = f8, f18 ;;    // Cycle 12,21...
(p16)   ldf    f8 = [r5], 8        // Cycle 4,13,22...
(p16)   ldf    f18 = [r6], 8       // Cycle 4,13,22...
(p18)   stf    [r7] = f29, 16      // Cycle 23,32...
(p18)   stf    [r8] = f30, 16      // Cycle 23,32...
(p16)   fadd   f29 = f9, f19 ;;    // Cycle 14,23...
(p16)   ldf    f9 = [r5], 8        // Cycle 6,15,24...
(p16)   ldf    f19 = [r6], 8       // Cycle 6,15,24...
(p16)   fadd   f30 = f10, f20 ;;   // Cycle 15,24...
(p16)   ldf    f10 = [r5], 8       // Cycle 7,16,25...
(p16)   ldf    f20 = [r6], 8       // Cycle 7,16,25...
        br.ctop L1 ;;
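One way to arrive at the unroll factor of six is a back-of-the-envelope calculation, sketched here in Python. It assumes each source iteration issues three memory operations (two loads and one store, as in the listing) and that the unrolled body must be at least as long as the load latency, so that a loaded register is consumed before the next trip through the loop overwrites it. The function name min_unroll_without_rotation is hypothetical and the rule is a simplification, not the manual's own derivation.

```python
import math
from fractions import Fraction


def min_unroll_without_rotation(load_latency, mem_ops_per_iter, mem_ports):
    """Smallest unroll factor whose body spans at least the load latency.

    Assumption: memory issue slots are the only bottleneck, so one source
    iteration costs mem_ops_per_iter / mem_ports clocks.
    """
    clocks_per_iter = Fraction(mem_ops_per_iter, mem_ports)
    # Without rotating registers, each copy of the iteration needs its own
    # register names, and the body must cover the full load latency.
    return math.ceil(Fraction(load_latency) / clocks_per_iter)


# 9-clock load latency, 3 memory operations per iteration, 2 issued per clock.
unroll = min_unroll_without_rotation(load_latency=9, mem_ops_per_iter=3, mem_ports=2)
print(unroll)  # 6
```

With these inputs each iteration costs 1.5 clocks of memory issue bandwidth, so six copies fill a 9-clock body, which agrees with the 9-clock period in the cycle comments of the listing.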

However, with register rotation, the same loop can be scheduled with an initiation
interval of just 2 clocks without unrolling (and 1.5 clocks if unrolled by 2):

L1:
(p24)   stf    [r7] = f57, 8       // Cycle 15,17...
(p21)   fadd   f57 = f37, f47      // Cycle 9,11,13...
(p16)   ldf    f32 = [r5], 8       // Cycle 0,2,4,6...
(p16)   ldf    f42 = [r6], 8       // Cycle 0,2,4,6...
        br.ctop L1 ;;

It is thus often advantageous to modulo schedule and then unroll (if required). Please
see Chapter 5, “Software Pipelining and Loop Support” for details on how to rewrite
loops using this transformation.
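The initiation intervals quoted for the rotating-register version (2 clocks without unrolling, 1.5 clocks per iteration when unrolled by 2) are consistent with a simple resource bound on memory issue slots. The Python sketch below illustrates that bound under the assumption that memory operations are the only constraint; resource_ii is a hypothetical helper name, not something defined by the manual.

```python
import math
from fractions import Fraction


def resource_ii(mem_ops_per_iter, mem_ports, unroll=1):
    """Return (clocks per kernel trip, clocks per source iteration).

    The kernel must occupy a whole number of clocks, so the memory
    operations of all unrolled copies are rounded up together.
    """
    kernel_clocks = math.ceil(Fraction(mem_ops_per_iter * unroll, mem_ports))
    return kernel_clocks, Fraction(kernel_clocks, unroll)


# 3 memory operations per iteration (2 loads, 1 store), 2 issued per clock.
for u in (1, 2):
    kernel, per_iter = resource_ii(3, 2, u)
    print(u, kernel, float(per_iter))
# prints:
# 1 2 2.0
# 2 3 1.5
```

Unrolling by 2 after modulo scheduling recovers the half clock lost to rounding: six memory operations fit exactly into three clocks.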

6.3.1.1 Notes on FP Precision

The floating-point registers are 82 bits wide with 17 bits for exponent range, 64 bits for
significand precision and 1 sign bit. During computation, the result range and precision
are determined by the computational model chosen by the user. The computational
model is indicated either statically in the instruction encoding, or dynamically via the
precision control (PC) and widest-range-exponent (WRE) bits in the floating-point
status register. Using an appropriate computational model, the user can minimize the
error accumulation in the computation. In the above matrix multiply example, if the
multiply and add computations are performed in full register file range and precision,
the results (in accumulators) can hold 64 bits of precision and up to 17 bits of range for
