Volume 1, Part 2: Floating-point Applications
The inner loop consists of two loads (for A and B) and a multiply-add (to accumulate the
product on C). The loop would run at the latency of the fma due to the recurrence on C.
In order to break the recurrence on C, the loop is typically unrolled and multiple partial
accumulators are used.
      DO 1 I = 1, N, 8
        C1 = C1 + A[I] * B[I]
        C2 = C2 + A[I+1] * B[I+1]
        C3 = C3 + A[I+2] * B[I+2]
        C4 = C4 + A[I+3] * B[I+3]
        C5 = C5 + A[I+4] * B[I+4]
        C6 = C6 + A[I+5] * B[I+5]
        C7 = C7 + A[I+6] * B[I+6]
    1   C8 = C8 + A[I+7] * B[I+7]
      C = C1 + C2 + C3 + C4 + C5 + C6 + C7 + C8
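For readers who want to try the transformation, the following is a minimal C sketch of the same unrolled loop. The function name dot_unrolled8 and the cleanup loop for lengths that are not a multiple of 8 are assumptions added for illustration; they are not part of the original example.

```c
#include <stddef.h>

/* Dot product with eight independent partial accumulators.
 * Each accumulator carries its own dependence chain, so one
 * multiply-add does not have to wait for the previous one.
 * The name and the cleanup loop are illustrative assumptions. */
double dot_unrolled8(const double *a, const double *b, size_t n)
{
    double c1 = 0.0, c2 = 0.0, c3 = 0.0, c4 = 0.0;
    double c5 = 0.0, c6 = 0.0, c7 = 0.0, c8 = 0.0;
    size_t i = 0;

    /* Main unrolled body: 8 original iterations per pass. */
    for (; i + 8 <= n; i += 8) {
        c1 += a[i]     * b[i];
        c2 += a[i + 1] * b[i + 1];
        c3 += a[i + 2] * b[i + 2];
        c4 += a[i + 3] * b[i + 3];
        c5 += a[i + 4] * b[i + 4];
        c6 += a[i + 5] * b[i + 5];
        c7 += a[i + 6] * b[i + 6];
        c8 += a[i + 7] * b[i + 7];
    }

    /* Combine the partial sums once, outside the loop. */
    double c = c1 + c2 + c3 + c4 + c5 + c6 + c7 + c8;

    /* Cleanup for the remaining 0..7 elements. */
    for (; i < n; i++)
        c += a[i] * b[i];

    return c;
}
```

Because floating-point addition is not associative, the partial-accumulator form can round differently from the original single-accumulator loop.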
If normal (non-double pair) loads are used, the inner loop would consist of 16 loads and
8 fmas. If we assume the machine has two memory ports, this loop would be limited by
the availability of M slots and run at a peak rate of 1 clock per iteration (16 loads on
two ports take 8 clocks for the 8 original iterations in the unrolled body). However, if this
loop is rewritten using 8 load-pairs (for A[I], A[I+1] and B[I], B[I+1] and A[I+2], A[I+3]
and B[I+2], B[I+3] and so on) and 8 fmas, this loop could run at a peak rate of
2 iterations per clock (or just 0.5 clocks per iteration) with just two M-units.
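The arithmetic behind these two rates can be checked with a small C helper. This is only a sketch of the M-slot bound discussed above; the function name and parameters are hypothetical, and it ignores every other limit (F units, latency, issue width).

```c
/* Lower bound on clocks per original (non-unrolled) iteration when
 * the loop body is limited only by memory (M) issue slots.
 *   mem_ops : memory instructions in one unrolled loop body
 *   m_ports : memory instructions that can issue per clock
 *   unroll  : original iterations covered by one unrolled body
 * Hypothetical helper for illustration only. */
double clocks_per_iteration(int mem_ops, int m_ports, int unroll)
{
    /* Clocks needed to issue all memory ops, rounded up. */
    int clocks_per_body = (mem_ops + m_ports - 1) / m_ports;
    return (double)clocks_per_body / unroll;
}
```

With 16 plain loads, two ports, and an unroll factor of 8, the helper returns 1.0; with 8 load-pairs it returns 0.5, matching the figures in the text.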
6.3.7.2 Data Prefetch
lfetch allows the advance prefetching of a line (defined as 32 bytes or more) of data
into the cache from memory. Allocation hints can be used to indicate the nature of the
locality of the subsequent accesses on that data and to indicate which level of cache
that data needs to be promoted to.
While regular loads can also be used to achieve the effect of data prefetching (if the
load target is never used), lfetch can more effectively reduce the memory latency
without using floating-point registers as targets of the data being prefetched.
Furthermore, lfetch allows prefetching the data to different levels of cache.
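From C, software prefetching is usually reached through a compiler builtin rather than by writing lfetch directly. The sketch below uses __builtin_prefetch, which GCC and Clang provide; which instruction it lowers to depends on the compiler and target. The function name and the prefetch distance PREFETCH_AHEAD are assumptions chosen for illustration, not tuned values.

```c
#include <stddef.h>

/* Hypothetical prefetch distance in elements; a suitable value
 * depends on memory latency and loop cost and must be measured. */
#define PREFETCH_AHEAD 64

/* Sum an array while requesting data ahead of its use.
 * The prefetch brings a line toward the cache without occupying
 * a floating-point register as a load target would. */
double sum_with_prefetch(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Keep the address expression inside the array. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&x[i + PREFETCH_AHEAD]);
        s += x[i];
    }
    return s;
}
```

For brevity this issues one prefetch per element; since a line holds at least four doubles, a practical loop would more likely issue one prefetch per line.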
6.3.7.3 Allocation Control
Since data accesses have different locality attributes (temporal/non-temporal,
spatial/non-spatial), the Itanium architecture allows annotating the data accesses
(loads/stores) to reflect these attributes. Based on these annotations, the
implementation can better manage the storage of the data in the caches.
Temporal and non-temporal hints are defined. These attributes are applicable to the
various cache levels. (Only two cache levels are architecturally identified.) The
non-temporal hint is best used for data that typically has no reuse with respect to that
level of cache. The temporal hint is used for all other data (that has reuse).
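The temporal/non-temporal distinction can be approximated in C through the third argument of the GCC/Clang builtin __builtin_prefetch, where 0 means no temporal locality and 3 means high temporal locality. This sketch covers prefetch hints only; C offers no portable way to annotate ordinary loads and stores, and how the hint is lowered is target-dependent. The names histogram, NUM_BINS, and PREFETCH_AHEAD are hypothetical.

```c
#include <stddef.h>

#define PREFETCH_AHEAD 64   /* hypothetical distance, in elements */
#define NUM_BINS 16         /* hypothetical table size */

/* Count samples into a small table.
 * samples: streamed once, so it is prefetched as non-temporal.
 * bins:    updated on every iteration, so it is marked temporal. */
void histogram(const unsigned char *samples, size_t n,
               unsigned bins[NUM_BINS])
{
    /* Read-write access (1), high temporal locality (3). */
    __builtin_prefetch(bins, 1, 3);

    for (size_t i = 0; i < n; i++) {
        /* Read access (0), no temporal locality (0). */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&samples[i + PREFETCH_AHEAD], 0, 0);
        bins[samples[i] % NUM_BINS]++;
    }
}
```

The hints affect only cache placement; the counts produced are the same with or without them.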
6.4 Summary
This chapter describes the limiting factors for many scientific and floating-point
applications: memory latency and bandwidth, functional unit latency, and number of
available functional units. It also describes the important features of floating-point