
IA-32 Intel® Architecture Optimization
The examples that follow illustrate coding adjustments that enable an
algorithm to benefit from SSE. The same techniques may be used for
single-precision floating-point, double-precision floating-point, and
integer data under SSE2, SSE, and MMX technology.
As a basis for the usage model discussed in this section, consider the
simple loop shown in Example 3-8.
Note that the loop runs for only four iterations. Because a 128-bit SSE
register holds four single-precision values, the loop can be replaced
directly with Streaming SIMD Extensions code.
For optimal use of the Streaming SIMD Extensions instructions that
require data aligned on a 16-byte boundary, all examples in this chapter
assume that the arrays passed to the routine, a, b, and c, are aligned to
16-byte boundaries by the calling routine. For methods to ensure this
alignment, refer to the application notes for the Pentium 4 processor.
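The alignment assumption above can be made concrete with a small sketch. This is not the method from the application notes; it assumes a C11 compiler and uses the standard alignas specifier. The names is_aligned_16 and demo_aligned_add are hypothetical helpers introduced only for illustration.

```c
#include <stdalign.h>  /* alignas (C11) */
#include <stdint.h>    /* uintptr_t */

/* Hypothetical helper: returns 1 if p lies on a 16-byte boundary. */
static int is_aligned_16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* The scalar routine from Example 3-8. */
void add(float *a, float *b, float *c)
{
    int i;
    for (i = 0; i < 4; i++) {
        c[i] = a[i] + b[i];
    }
}

/*
 * Hypothetical calling routine: it owns the arrays and is therefore
 * responsible for their alignment. Returns 1 if all three arrays are
 * 16-byte aligned, and copies the result into out.
 */
int demo_aligned_add(float out[4])
{
    alignas(16) float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    alignas(16) float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    alignas(16) float c[4];
    int i;

    add(a, b, c);
    for (i = 0; i < 4; i++) {
        out[i] = c[i];
    }
    return is_aligned_16(a) && is_aligned_16(b) && is_aligned_16(c);
}
```

Placing the alignment request in the caller keeps the routine's signature unchanged; the routine simply relies on the guarantee.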
The sections that follow provide details on the coding methodologies:
inlined assembly, intrinsics, C++ vector classes, and automatic
vectorization.
Example 3-8  Simple Four-Iteration Loop
/* Adds four floats element-wise: c[i] = a[i] + b[i]. */
void add(float *a, float *b, float *c)
{
    int i;
    for (i = 0; i < 4; i++) {
        c[i] = a[i] + b[i];
    }
}
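
As a preview of what replacing this loop with Streaming SIMD Extensions can look like, here is a minimal sketch using the SSE intrinsics declared in xmmintrin.h. It assumes an x86 compiler with SSE enabled and that a, b, and c are 16-byte aligned, as stated above. The function name add_sse is hypothetical, and the sketch is not necessarily the form used in the sections that follow.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, ... */

/*
 * Hypothetical SSE counterpart of add() from Example 3-8.
 * Assumption: a, b, and c each point to four floats on a 16-byte
 * boundary; _mm_load_ps and _mm_store_ps require that alignment.
 */
void add_sse(float *a, float *b, float *c)
{
    __m128 va = _mm_load_ps(a);        /* load a[0..3] into one register */
    __m128 vb = _mm_load_ps(b);        /* load b[0..3] */
    __m128 vc = _mm_add_ps(va, vb);    /* four additions in one operation */
    _mm_store_ps(c, vc);               /* write c[0..3] */
}
```

The four loop iterations collapse into one packed addition. If alignment cannot be guaranteed, the unaligned variants _mm_loadu_ps and _mm_storeu_ps could be substituted, typically at some cost in performance.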
IA-32 Intel Architecture Optimization Reference Manual, Order Number 248966-013US, April 2006