Refining C/C++ Code
Optimizing C/C++ Code
The performance of a software pipeline is limited by the number of resources
that can execute in parallel. In its word-aligned form (Example 3–27), the vector
sum loop delivers two results every two cycles because the two loads and
the store are all operating on two 16-bit values at a time.
Example 3–27. Word-Aligned Vector Sum
void vecsum4(short *restrict sum, const short *restrict in1,
             const short *restrict in2, unsigned int N)
{
    int i;
    /* View each array as 32-bit words so one access moves two shorts. */
    const int *restrict i_in1 = (const int *)in1;
    const int *restrict i_in2 = (const int *)in2;
    int *restrict i_sum = (int *)sum;

    #pragma MUST_ITERATE (10);
    for (i = 0; i < (N/2); i++)
        i_sum[i] = _add2(i_in1[i], i_in2[i]);   /* two 16-bit adds per word */
}
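To make the packed addition concrete, the following is a minimal host-side sketch of what the word-aligned loop computes. The names add2_emul and vecsum_words are hypothetical and do not come from the compiler; add2_emul is a portable stand-in that assumes _add2 adds the upper and lower 16-bit halves independently, without saturation and without a carry from the lower half into the upper half. The sketch uses memcpy rather than pointer casts so that it stays valid C on a host compiler.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical portable stand-in for the _add2 intrinsic (assumed behavior):
   the two 16-bit halves are added independently and wrap on overflow. */
static uint32_t add2_emul(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;           /* carry out of low half dropped */
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;   /* carry out of high half shifts away */
    return hi | lo;
}

/* Host-side model of the word-aligned loop; n counts 16-bit elements. */
static void vecsum_words(int16_t *sum, const int16_t *in1,
                         const int16_t *in2, unsigned n)
{
    unsigned i;
    assert(n % 2 == 0);                  /* each word holds exactly two shorts */
    for (i = 0; i < n / 2; i++) {
        uint32_t a, b, r;
        memcpy(&a, &in1[2 * i], sizeof a);   /* one "load" fetches two shorts */
        memcpy(&b, &in2[2 * i], sizeof b);
        r = add2_emul(a, b);
        memcpy(&sum[2 * i], &r, sizeof r);   /* one "store" writes two results */
    }
}
```

Because the two halves are treated symmetrically, the result does not depend on which short lands in the upper half of the word, so the sketch behaves the same on little-endian and big-endian hosts.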
If you unroll the loop once, the loop then performs six memory operations per
iteration, which means the unrolled vector sum loop can deliver four results
every three cycles (that is, 1.33 results per cycle). Example 3–28 shows four
results for each iteration of the loop: sum[i] and sum[i + sz] each store an int value
that represents two 16-bit values.
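The cycle counts quoted above can be reproduced with a small calculation. This is only a sketch: it assumes that memory operations are the sole bottleneck and that at most two loads or stores can issue per cycle, which is consistent with the figures in the text but is stated here as an assumption. The names MEM_OPS_PER_CYCLE, min_cycles, and results_per_cycle are hypothetical.

```c
#include <assert.h>

/* Assumption: at most two memory operations (loads or stores) issue per cycle. */
enum { MEM_OPS_PER_CYCLE = 2 };

/* Minimum cycles per loop iteration when memory operations are the limit. */
static unsigned min_cycles(unsigned mem_ops)
{
    assert(mem_ops > 0);
    /* ceiling division: 3 ops need 2 cycles, 6 ops need 3 cycles */
    return (mem_ops + MEM_OPS_PER_CYCLE - 1) / MEM_OPS_PER_CYCLE;
}

/* Results delivered per cycle for a loop body with the given memory traffic. */
static double results_per_cycle(unsigned results, unsigned mem_ops)
{
    return (double)results / (double)min_cycles(mem_ops);
}
```

With these assumptions, the original loop (two results, three memory operations) needs two cycles and yields one result per cycle, while the unrolled loop (four results, six memory operations) needs three cycles and yields about 1.33 results per cycle. Unrolling helps because six operations fill three cycles completely, whereas three operations leave one memory slot idle.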
Example 3–28 is not simple loop unrolling, in which the loop body is simply
replicated. The additional instructions use memory pointers that are offset to
point midway into the input arrays, and they rely on the assumption that the
arrays are a multiple of four shorts in size.
Example 3–28. Vector Sum Using const Keywords, MUST_ITERATE pragma, Word
Reads, and Loop Unrolling
void vecsum6(int *restrict sum, const int *restrict in1,
             const int *restrict in2, unsigned int N)
{
    int i;
    int sz = N >> 2;    /* N counts shorts; each half of an array is N/4 words */

    #pragma MUST_ITERATE (10);
    for (i = 0; i < sz; i++)
    {
        sum[i]    = _add2(in1[i], in2[i]);          /* first half */
        sum[i+sz] = _add2(in1[i+sz], in2[i+sz]);    /* second half */
    }
}
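The size assumption behind the split indexing can be checked on a host with a short sketch. The names add2_emul and vecsum_split are hypothetical; add2_emul is a portable stand-in that assumes _add2 performs two independent, non-saturating 16-bit additions, and the assert makes the multiple-of-four requirement explicit instead of leaving it implicit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable stand-in for _add2 (assumed behavior):
   two independent 16-bit additions that wrap on overflow. */
static uint32_t add2_emul(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;
    return hi | lo;
}

/* Host-side model of the split loop; n counts 16-bit elements,
   so each array holds n/2 words and each half holds n/4 words. */
static void vecsum_split(uint32_t *sum, const uint32_t *in1,
                         const uint32_t *in2, unsigned n)
{
    unsigned i;
    unsigned sz = n >> 2;
    assert(n % 4 == 0);   /* otherwise the two halves do not cover every word */
    for (i = 0; i < sz; i++) {
        sum[i]      = add2_emul(in1[i],      in2[i]);        /* first half  */
        sum[i + sz] = add2_emul(in1[i + sz], in2[i + sz]);   /* second half */
    }
}
```

If n were 6, for example, the arrays would hold three words but sz would be 1, so only words 0 and 1 would be written and the last word would be skipped. That is why the unrolled form depends on the array length being a multiple of four shorts.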