Packed-Data Processing on the ’C64x
’C64x Programming Considerations
_dotpu4 eliminates three adds. The following sections describe how to write
the Dot Product and Vector Complex Multiply examples to take advantage of
these macro operations.
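To make the saving concrete, the following portable C sketch models what
_dotpu4 computes: four unsigned 8-bit products summed into a single 32-bit
result. The name model_dotpu4 is hypothetical and is not part of the TI
compiler. It is only a host-side reference for checking results; on the
'C64x the intrinsic itself would be used.

```c
#include <stdint.h>

/* Hypothetical host-side model of _dotpu4 (not a TI API).
 * Each 32-bit argument is treated as four packed unsigned 8-bit values.
 * Corresponding bytes are multiplied and the four products are summed. */
static uint32_t model_dotpu4(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    int k;

    for (k = 0; k < 4; k++)
    {
        uint32_t a_k = (a >> (8 * k)) & 0xFFu;  /* byte k of a */
        uint32_t b_k = (b >> (8 * k)) & 0xFFu;  /* byte k of b */

        /* Accumulation that the single instruction performs internally */
        sum += a_k * b_k;
    }
    return sum;
}
```

For example, model_dotpu4(0x01020304, 0x05060708) returns
1*5 + 2*6 + 3*7 + 4*8 = 70. Written with separate multiplies, combining
those four products would take three additions.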
8.2.7.1 Combining Operations in the Dot Product Kernel
The Dot Product kernel, presented in Example 8–3, benefits from both
vectorization and macro operations. First, apply the vectorization
optimization as presented earlier, and then look at combining operations to
further improve the code.
Vectorization can be performed on the array reads and multiplies in this
kernel, as described in section 8.2.3. The result of those steps is the
intermediate code shown in Example 8–9.
Example 8–9. Vectorized Form of the Dot Product Kernel
int dot_prod(const short *restrict a, const short *restrict b,
             short *restrict c, int len)
{
    int i;
    int sum = 0;                  /* 32-bit accumulation          */
    unsigned a3_a2, a1_a0;        /* Packed 16-bit values         */
    unsigned b3_b2, b1_b0;        /* Packed 16-bit values         */
    double c3_c2_dbl, c1_c0_dbl;  /* 32-bit prod in 64-bit pairs  */
    int c3, c2, c1, c0;           /* Separate 32-bit products     */
    unsigned c3_c2, c1_c0;        /* Packed 16-bit values         */

    for (i = 0; i < len; i += 4)
    {
        a3_a2 = _hi(*(const double *) &a[i]);
        a1_a0 = _lo(*(const double *) &a[i]);
        b3_b2 = _hi(*(const double *) &b[i]);
        b1_b0 = _lo(*(const double *) &b[i]);

        /* Multiply elements together, producing four products */
        c3_c2_dbl = _mpy2(a3_a2, b3_b2);
        c1_c0_dbl = _mpy2(a1_a0, b1_b0);

        /* Add each of the four products to the accumulation. */
        sum += _hi(c3_c2_dbl);
        sum += _lo(c3_c2_dbl);
        sum += _hi(c1_c0_dbl);
        sum += _lo(c1_c0_dbl);
    }

    return sum;
}
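
Because the intrinsics in Example 8–9 exist only in the TI compiler, it can
be useful to check the data flow on a host machine first. The sketch below
does this under stated assumptions: pack2, model_mpy2, prod_pair,
dot_prod_ref, and dot_prod_model are hypothetical names introduced here,
not TI APIs. The model assumes that _mpy2 forms two signed 16x16 products
(upper halves in the high word, lower halves in the low word) and that
a[i] occupies the low half of a1_a0, as the variable names imply. Like the
kernel above, it assumes that len is a multiple of 4.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the 64-bit result of _mpy2. */
typedef struct { int32_t hi; int32_t lo; } prod_pair;

/* Pack two 16-bit values into one 32-bit word (hi in bits 31..16). */
static uint32_t pack2(short hi, short lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Host-side model of _mpy2: two signed 16x16 multiplies. */
static prod_pair model_mpy2(uint32_t x, uint32_t y)
{
    prod_pair p;
    p.lo = (int32_t)(int16_t)(x & 0xFFFFu) * (int16_t)(y & 0xFFFFu);
    p.hi = (int32_t)(int16_t)(x >> 16)     * (int16_t)(y >> 16);
    return p;
}

/* Plain scalar dot product, used as the reference result. */
static int dot_prod_ref(const short *a, const short *b, int len)
{
    int i, sum = 0;
    for (i = 0; i < len; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Same data flow as the vectorized kernel, four elements per pass. */
static int dot_prod_model(const short *a, const short *b, int len)
{
    int i, sum = 0;

    assert(len % 4 == 0);  /* the kernel consumes four elements per pass */

    for (i = 0; i < len; i += 4)
    {
        uint32_t a3_a2 = pack2(a[i + 3], a[i + 2]);
        uint32_t a1_a0 = pack2(a[i + 1], a[i]);
        uint32_t b3_b2 = pack2(b[i + 3], b[i + 2]);
        uint32_t b1_b0 = pack2(b[i + 1], b[i]);

        prod_pair c3_c2 = model_mpy2(a3_a2, b3_b2);
        prod_pair c1_c0 = model_mpy2(a1_a0, b1_b0);

        /* Four separate adds per pass, as in the vectorized kernel */
        sum += c3_c2.hi;
        sum += c3_c2.lo;
        sum += c1_c0.hi;
        sum += c1_c0.lo;
    }
    return sum;
}
```

Comparing dot_prod_model against dot_prod_ref on the same inputs confirms
that the packed form computes the same sum. The four additions inside the
loop are the operations that a combined multiply-and-add instruction can
absorb.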