Packed-Data Processing on the ’C64x
’C64x Programming Considerations
Example 8–6. Vector Addition (Complete)
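void vec_sum(const short *restrict a, const short *restrict b,
             short *restrict c, int len)
{
    int i;
    unsigned a3_a2, a1_a0;
    unsigned b3_b2, b1_b0;
    unsigned c3_c2, c1_c0;

    for (i = 0; i < len; i += 4)
    {
        /* Load four 16-bit elements of a[] as one 64-bit double word,
           then split it into its upper and lower 32-bit halves. */
        a3_a2 = _hi(*(const double *) &a[i]);
        a1_a0 = _lo(*(const double *) &a[i]);

        /* Do the same for four elements of b[]. */
        b3_b2 = _hi(*(const double *) &b[i]);
        b1_b0 = _lo(*(const double *) &b[i]);

        /* Each _add2 performs two independent 16-bit additions. */
        c3_c2 = _add2(b3_b2, a3_a2);
        c1_c0 = _add2(b1_b0, a1_a0);

        /* Repack the two 32-bit words and store four results at once. */
        *(double *) &c[i] = _itod(c3_c2, c1_c0);
    }
}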
At this point, the vector sum is fully vectorized and can be optimized further
using other traditional techniques such as loop unrolling and software
pipelining. These and other optimizations are described in detail throughout
Chapter 6.
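To make the behavior of the packed add in Example 8–6 concrete, the following is a minimal portable sketch, not TI library code. The names add2_emul and vec_sum_ref are hypothetical and exist only for this illustration. The sketch assumes that _add2 adds the upper and lower 16-bit halves of its operands independently, with no carry passing from the lower half into the upper half, and that the packed loop expects len to be a multiple of 4 because it consumes four elements per iteration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable stand-in for the _add2 intrinsic (an assumption
   for illustration): two independent 16-bit additions in one 32-bit word. */
uint32_t add2_emul(uint32_t x, uint32_t y)
{
    /* Low halves: any carry out of bit 15 is discarded by the mask. */
    uint32_t lo = (x + y) & 0x0000FFFFu;
    /* High halves: added separately, wrapped to 16 bits, then repositioned. */
    uint32_t hi = (((x >> 16) + (y >> 16)) & 0x0000FFFFu) << 16;
    return hi | lo;
}

/* Hypothetical scalar reference: produces the same element-wise sums as
   the packed loop, one 16-bit element at a time. */
void vec_sum_ref(const short *a, const short *b, short *c, int len)
{
    int i;

    /* Mirrors the assumed precondition of the packed loop, which steps
       through the arrays four elements at a time. */
    assert(len % 4 == 0);

    for (i = 0; i < len; i++)
        c[i] = (short)(a[i] + b[i]);
}
```

A scalar reference like this can be useful on a host machine for checking the output of the packed version. Note that the double-word accesses in Example 8–6 presumably also require the arrays to be aligned on 8-byte boundaries, which the scalar version does not need.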
8.2.6.2 Vectorizing the Vector Multiply
The vector multiply shown in Figure 8–8 is similar to the vector sum in that
the algorithm is a pure vector algorithm. One major difference is that the
intermediate values change precision. In the context of vectorization, this
changes the format in which the data is stored, but it does not inhibit the
ability to vectorize the code.
The basic operation of vector multiply is to take two 16-bit elements, multiply
them together to produce a 32-bit product, right-shift the 32-bit product to
produce a 16-bit result, and then store this result. The entire process for a
single iteration is shown graphically in Figure 8–14.
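The four steps just described (load two 16-bit elements, form the 32-bit product, right-shift, store 16 bits) can be sketched in scalar C as follows. This is an illustrative sketch rather than the guide's own listing. The function name vec_mpy_ref and the shift parameter are assumptions, because the text above does not state the shift amount. The sketch also assumes that int is at least 32 bits wide and that right-shifting a negative value is an arithmetic shift, which is implementation-defined in standard C.

```c
#include <assert.h>

/* Hypothetical scalar reference for the vector multiply. The shift
   amount is passed in as a parameter because it is an assumption of
   this sketch, not a value taken from the text. */
void vec_mpy_ref(const short *a, const short *b, short *c,
                 int len, int shift)
{
    int i;

    assert(shift >= 0 && shift < 32);

    for (i = 0; i < len; i++)
    {
        /* 16-bit x 16-bit multiply: the intermediate value widens
           to 32 bits, which is the change of precision noted above. */
        int prod = (int)a[i] * (int)b[i];

        /* Scale the product back down and keep a 16-bit result. */
        c[i] = (short)(prod >> shift);
    }
}
```

For example, with shift set to 15, multiplying 16384 by 16384 yields a stored result of 8192, which corresponds to 0.5 × 0.5 = 0.25 if the inputs are read as Q15 fixed-point values.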