Volume 1, Part 2: Floating-point Applications
The inner loop consists of two loads (for A and B) and a multiply-add (to accumulate the
product on C). Because each fma needs the value of C produced by the previous one, the loop
can run no faster than one iteration per fma latency. In order to break the recurrence on C,
the loop is typically unrolled and multiple partial accumulators are used.
      DO 1 I = 1, N, 8
        C1 = C1 + A[I] * B[I]
        C2 = C2 + A[I+1] * B[I+1]
        C3 = C3 + A[I+2] * B[I+2]
        C4 = C4 + A[I+3] * B[I+3]
        C5 = C5 + A[I+4] * B[I+4]
        C6 = C6 + A[I+5] * B[I+5]
        C7 = C7 + A[I+6] * B[I+6]
    1   C8 = C8 + A[I+7] * B[I+7]
      C = C1 + C2 + C3 + C4 + C5 + C6 + C7 + C8
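For readers more used to C, the same unrolling can be sketched as follows. This is a minimal illustration, not code from the manual: the function name dot_unrolled8 is made up, and the remainder loop is an added assumption, since the loop above presumes N is a multiple of 8.

```c
#include <stddef.h>

/* Dot product with eight partial accumulators (illustrative sketch).
   Each accumulator forms its own dependency chain, so eight
   multiply-adds can be in flight instead of one. */
double dot_unrolled8(const double *a, const double *b, size_t n)
{
    double c1 = 0.0, c2 = 0.0, c3 = 0.0, c4 = 0.0;
    double c5 = 0.0, c6 = 0.0, c7 = 0.0, c8 = 0.0;
    size_t i = 0;

    /* Main unrolled loop: eight independent multiply-adds per trip. */
    for (; i + 8 <= n; i += 8) {
        c1 += a[i]     * b[i];
        c2 += a[i + 1] * b[i + 1];
        c3 += a[i + 2] * b[i + 2];
        c4 += a[i + 3] * b[i + 3];
        c5 += a[i + 4] * b[i + 4];
        c6 += a[i + 5] * b[i + 5];
        c7 += a[i + 6] * b[i + 6];
        c8 += a[i + 7] * b[i + 7];
    }

    /* Remainder (assumption: not in the manual's loop, which takes
       N to be a multiple of 8). */
    for (; i < n; i++)
        c1 += a[i] * b[i];

    /* Final reduction of the partial sums. */
    return c1 + c2 + c3 + c4 + c5 + c6 + c7 + c8;
}
```

Splitting the sum across accumulators changes the order of the floating-point additions, so the result can differ in the last bits from a single-accumulator loop.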
If normal (non-double pair) loads are used, the inner loop would consist of 16 loads and
8 fmas. If we assume the machine has two memory ports, this loop would be limited by
the availability of M slots: 16 loads on two ports take 8 clocks per trip through the
unrolled loop, a peak rate of 1 clock per source iteration. However, if this loop is
rewritten using 8 load-pairs (for A[I], A[I+1] and B[I], B[I+1] and A[I+2], A[I+3] and
B[I+2], B[I+3] and so on) and 8 fmas, the 8 memory operations take only 4 clocks per trip,
so this loop could run at a peak rate of 2 iterations per clock (or just 0.5 clocks per
iteration) with just two M-units.
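The M-slot arithmetic behind these two rates can be written out as a small helper. This is only a sketch of the bound imposed by the memory ports; the function and parameter names are illustrative, and it ignores every other resource (F-units, issue width, cache misses).

```c
/* Lower bound, from memory ports alone, on clocks per source iteration.
   mem_ops   : memory operations in one trip of the unrolled loop
   mem_ports : memory operations the machine can issue per clock
   unroll    : source iterations covered by one trip
   The caller must pass mem_ports > 0 and unroll > 0. */
double mem_bound_clocks_per_iter(unsigned mem_ops, unsigned mem_ports,
                                 unsigned unroll)
{
    /* Clocks needed to issue all memory operations of one trip,
       rounded up to a whole clock. */
    unsigned clocks_per_trip = (mem_ops + mem_ports - 1) / mem_ports;

    return (double)clocks_per_trip / (double)unroll;
}
```

With two ports and an unroll factor of 8, 16 single loads give 1.0 clock per iteration and 8 load-pairs give 0.5.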
6.3.7.2 Data Prefetch
lfetch allows the advance prefetching of a line (defined as 32 bytes or more) of data
into the cache from memory. Allocation hints can be used to indicate the nature of the
locality of the subsequent accesses on that data and to indicate which level of cache
that data needs to be promoted to.
While regular loads can also be used to achieve the effect of data prefetching (if the
load target is never used), lfetch can more effectively reduce the memory latency
without using floating-point registers as targets of the data being prefetched.
Furthermore, lfetch allows prefetching the data to different levels of cache.
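From C, a comparable effect is usually requested through a compiler builtin rather than by writing lfetch directly. The sketch below uses __builtin_prefetch, which GCC and Clang provide; whether a given compiler lowers it to lfetch on an Itanium target is an assumption here, and PREFETCH_DISTANCE and sum_with_prefetch are hypothetical names with an untuned value.

```c
#include <stddef.h>

/* Hypothetical prefetch distance, in elements; a real value would be
   tuned to the memory latency and the work done per element. */
#define PREFETCH_DISTANCE 64

/* Sum an array while requesting data ahead of its use. */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        /* Ask for the element PREFETCH_DISTANCE ahead. No register
           receives the data; the guard keeps the address in bounds.
           Arguments: address, 0 = read access, 3 = high locality. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);

        sum += a[i];
    }
    return sum;
}
```

Issuing a prefetch on every element requests the same line several times; a tuned loop would more likely issue one prefetch per cache line.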
6.3.7.3 Allocation Control
Since data accesses have different locality attributes (temporal/non-temporal,
spatial/non-spatial), the Itanium architecture allows annotating the data accesses
(loads/stores) to reflect these attributes. Based on these annotations, the
implementation can better manage the storage of the data in the caches.
Temporal and non-temporal hints are defined. These attributes are applicable to the
various cache levels (only two cache levels are architecturally identified). The
non-temporal hint is best used for data that typically has no reuse with respect to that
level of cache. The temporal hint is used for all other data (that is, data that has reuse).
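As a loose analogy only, the locality argument of the GCC/Clang builtin __builtin_prefetch expresses a similar temporal versus non-temporal distinction from C. The sketch below is not the Itanium load/store annotation mechanism itself; scale_stream, AHEAD, and the choice of which data counts as reused are all illustrative assumptions.

```c
#include <stddef.h>

#define AHEAD 64   /* hypothetical prefetch distance, in elements */

/* dst[i] = src[i] * table[i % table_len].
   src is read once (no reuse); table is read repeatedly (reuse).
   The caller must pass table_len > 0. */
void scale_stream(double *dst, const double *src, size_t n,
                  const double *table, size_t table_len)
{
    /* Reused data: locality 3 asks to keep it in the caches.
       Only the first line is hinted, for brevity. */
    __builtin_prefetch(&table[0], 0, 3);

    for (size_t i = 0; i < n; i++) {
        /* Streamed data: locality 0 marks it as having no temporal
           locality, so it need not displace reused data. */
        if (i + AHEAD < n)
            __builtin_prefetch(&src[i + AHEAD], 0, 0);

        dst[i] = src[i] * table[i % table_len];
    }
}
```

The hints never change the result; they only advise the implementation on where to keep the data.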
6.4 Summary
This chapter describes the limiting factors for many scientific and floating-point
applications: memory latency and bandwidth, functional unit latency, and number of
available functional units. It also describes the important features of floating-point