Volume 1, Part 2: Floating-point Applications
6 Floating-point Applications
6.1 Overview
The Itanium floating-point architecture is fully ANSI/IEEE-754 standard compliant and
provides performance enhancing features such as the fused multiply accumulate
instruction, the large floating-point register file (with static and rotating sections), the
extended range register file data representation, the multiple independent
floating-point status fields, and the high bandwidth memory access instructions that
enable the creation of compact, high performance, floating-point application code.
The beginning of this chapter reviews some specific performance limitations that are
common in floating-point intensive application codes. Later, architectural features that
address these limitations are presented with illustrative code examples. The remainder
of this chapter highlights the optimization of some commonly used kernels using these
features.
6.2 FP Application Performance Limiters
Floating-point applications are characterized by a predominance of loops. Some loops
perform complex calculations on regularly structured data, others simply copy data
from one place to another, while others perform gather/scatter-type operations that
simultaneously compute and rearrange data. The following sections describe code
characteristics that limit performance and how they affect these different kinds of
loops.
6.2.1 Execution Latency
Loops often contain recurrence relationships. Consider the tri-diagonal elimination
kernel from the Livermore Fortran Kernel suite.
DO 5 i = 2, N
5 X[i] = Z[i] * (Y[i] - X[i-1])
The dependency between X[i] and X[i-1] limits the iteration time of the loop to
the sum of the latencies of the subtract and the multiply. The available parallelism can be
increased by unrolling the loop and can be exploited by replicating computation;
however, the fundamental limitation of the data dependency remains.
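The recurrence can be sketched in C as follows. This is a minimal illustration, not code from the manual: it assumes 0-based arrays (so the Fortran range i = 2..N becomes i = 1..n-1), and the function names tridiag_simple and tridiag_unrolled are hypothetical. The unrolled variant exposes more independent work per trip (loads of y and z, address arithmetic), yet each x[i+1] still has to wait for x[i], so the subtract-then-multiply chain continues to set the minimum time per element.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward form: every x[i] consumes the x[i-1] just produced,
 * so iterations cannot overlap on the subtract/multiply chain. */
void tridiag_simple(double *x, const double *y, const double *z, size_t n)
{
    assert(x && y && z);
    for (size_t i = 1; i < n; i++)
        x[i] = z[i] * (y[i] - x[i - 1]);
}

/* Unrolled by two (illustrative). Loads and index updates for both
 * elements can proceed in parallel, but x[i + 1] still depends on x[i]:
 * the loop-carried dependency is unchanged. */
void tridiag_unrolled(double *x, const double *y, const double *z, size_t n)
{
    assert(x && y && z);
    size_t i = 1;
    for (; i + 1 < n; i += 2) {
        x[i]     = z[i]     * (y[i]     - x[i - 1]);
        x[i + 1] = z[i + 1] * (y[i + 1] - x[i]);
    }
    /* Remainder iteration when the trip count is odd. */
    for (; i < n; i++)
        x[i] = z[i] * (y[i] - x[i - 1]);
}
```

Both functions perform the same operations in the same order for each element, so they should produce identical results; only the scheduling freedom offered to the compiler and hardware differs.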
Sometimes, even if the loop is vectorizable and can be software pipelined, the iteration
time of the loop is limited by the execution latency of the hardware that executes the
code. A simple vector divide (shown below) is a typical example:
DO 1 i = 1, N
1 X[i] = Y[i] / Z[i]
Since typical modern microprocessors contain a non-pipelined floating-point unit, the
iteration time of the loop is the latency of the divide, which can be tens of clocks.
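The effect of a non-pipelined unit on this loop can be illustrated with a rough cycle estimate. The C sketch below is an idealized model, not a description of any particular processor: it assumes divides are the only cost, that a non-pipelined unit cannot start a divide until the previous one has finished, and that a fully pipelined unit could start one per clock. The function names and the latency value used in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* The vector divide from the text, written with 0-based indexing.
 * The iterations are independent of one another. */
void vec_div(double *x, const double *y, const double *z, size_t n)
{
    assert(x && y && z);
    for (size_t i = 0; i < n; i++)
        x[i] = y[i] / z[i];
}

/* Idealized model: a non-pipelined unit is busy for the full latency
 * of each divide, so n divides take n * latency clocks. */
unsigned long cycles_nonpipelined(unsigned long n, unsigned long latency)
{
    return n * latency;
}

/* Idealized model: a fully pipelined unit accepts a new operation each
 * clock, so n operations take latency + (n - 1) clocks. */
unsigned long cycles_pipelined(unsigned long n, unsigned long latency)
{
    return n == 0 ? 0 : latency + (n - 1);
}
```

Under these assumptions, with an assumed latency of 30 clocks and 1000 elements, cycles_nonpipelined(1000, 30) gives 30000 clocks while cycles_pipelined(1000, 30) gives 1029, even though the loop itself has no dependency between iterations.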