Packed-Data Processing on the ’C64x
The ’C64x provides the _pack family of intrinsics to convert 32-bit results into
16-bit results. The _packXX2() intrinsics, described in section 8.2.4, extract
two 16-bit values from two 32-bit registers and return them in a single
32-bit register. This allows efficient conversion of the 32-bit intermediate
results to a packed 16-bit format.
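As a rough illustration of the packing step, the following host-side sketch models what _pack2() is understood to do: it takes the lower 16 bits of each 32-bit argument and places the first argument's half in the upper half of the result. The name pack2_model is hypothetical and is used only for this illustration. It is not a TI intrinsic and is not how the compiler implements _pack2().

```c
#include <stdint.h>

/* Hypothetical portable model of _pack2() for host-side checking.
 * Assumption: the lower halfword of the first argument lands in the
 * upper halfword of the result, and the lower halfword of the second
 * argument lands in the lower halfword. */
static uint32_t pack2_model(uint32_t hi_src, uint32_t lo_src)
{
    return ((hi_src & 0xFFFFu) << 16) | (lo_src & 0xFFFFu);
}
```

For example, pack2_model(0x12345678, 0x9ABCDEF0) yields 0x5678DEF0. The upper halves of both inputs are discarded, which is why the values of interest must already sit in the lower 16 bits before packing.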
In this case, after the right-shift, the bits of interest are in the lower half of
each 32-bit register. Use the _pack2() intrinsic to convert the 32-bit intermediate
values back to packed 16-bit results so they can be stored. The resulting C
code is shown in Example 8–8.
Example 8–8. Using _mpy2() and _pack2() to Perform the Vector Multiply
void vec_mpy1(const short *restrict a, const short *restrict b,
              short *restrict c, int len, int shift)
{
    int i;
    unsigned a_hi, a_lo;               /* Packed 16-bit values         */
    unsigned b_hi, b_lo;               /* Packed 16-bit values         */
    double   c_hi_dbl, c_lo_dbl;       /* 32-bit prod in 64-bit pairs  */
    int      c_hi3, c_hi2, c_lo1, c_lo0; /* Separate 32-bit products   */
    unsigned c_hi, c_lo;               /* Packed 16-bit values         */

    /* Four elements per iteration using 64-bit accesses: len must be a */
    /* multiple of 4 and a, b, and c must be double-word aligned.       */
    for (i = 0; i < len; i += 4)
    {
        /* Load four 16-bit elements from each input as two packed words */
        a_hi = _hi(*(const double *) &a[i]);
        a_lo = _lo(*(const double *) &a[i]);
        b_hi = _hi(*(const double *) &b[i]);
        b_lo = _lo(*(const double *) &b[i]);

        /* Multiply elements together, producing four products */
        c_hi_dbl = _mpy2(a_hi, b_hi);
        c_lo_dbl = _mpy2(a_lo, b_lo);

        /* Shift each of the four products right by our shift amount */
        c_hi3 = _hi(c_hi_dbl) >> shift;
        c_hi2 = _lo(c_hi_dbl) >> shift;
        c_lo1 = _hi(c_lo_dbl) >> shift;
        c_lo0 = _lo(c_lo_dbl) >> shift;

        /* Pack the results back together into packed 16-bit format */
        c_hi = _pack2(c_hi3, c_hi2);
        c_lo = _pack2(c_lo1, c_lo0);

        /* Store the results. */
        *(double *) &c[i] = _itod(c_hi, c_lo);
    }
}
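
When checking the packed-data version on a host machine, it can help to compare it against a plain scalar loop. The sketch below is one such reference, not part of the original example. The name vec_mpy_ref is hypothetical. It assumes that right-shifting a negative int is an arithmetic shift and that converting an out-of-range value to short wraps, both of which are implementation-defined in C but hold on common compilers.

```c
#include <stdint.h>

/* Hypothetical scalar reference for the packed vector multiply.
 * Each output is the low 16 bits of (a[i] * b[i]) >> shift, which is
 * what the multiply, shift, and pack sequence is expected to produce.
 * Assumes an arithmetic right shift for negative products. */
static void vec_mpy_ref(const short *a, const short *b, short *c,
                        int len, int shift)
{
    int i;

    for (i = 0; i < len; i++)
    {
        int32_t prod = (int32_t)a[i] * (int32_t)b[i];   /* 16x16 -> 32 */
        int32_t shifted = prod >> shift;                /* scale down  */

        /* Keep only the lower 16 bits, as the pack step does */
        c[i] = (short)(uint16_t)((uint32_t)shifted & 0xFFFFu);
    }
}
```

Note that the result is truncated rather than saturated: with shift set to 15, multiplying -32768 by -32768 gives 32768 after the shift, and keeping only the lower 16 bits wraps that to -32768.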