Packed-Data Processing on the ’C64x
’C64x Programming Considerations
Example 8–7. Using LDDW and STDW in Vector Multiply
void vec_mpy(const short *restrict a, const short *restrict b,
             short *restrict c, int len, int shift)
{
    int i;
    unsigned a_hi, a_lo;
    unsigned b_hi, b_lo;
    unsigned c_hi, c_lo;

    for (i = 0; i < len; i += 4)
    {
        a_hi = _hi(*(const double *) &a[i]);
        a_lo = _lo(*(const double *) &a[i]);
        b_hi = _hi(*(const double *) &b[i]);
        b_lo = _lo(*(const double *) &b[i]);

        /* ...somehow, the Multiply and Shift occur here,
           with results in c_hi, c_lo... */

        *(double *) &c[i] = _itod(c_hi, c_lo);
    }
}
The next step is to perform the multiplication. The ’C64x intrinsic _mpy2()
performs two 16 × 16 multiplies and returns the two 32-bit products packed in
a 64-bit double. The _lo() and _hi() intrinsics then separate the two 32-bit
products. Figure 8–15 illustrates how _mpy2() works.
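Figure 8–15. Packed 16 × 16 Multiplies Using _mpy2
[Figure: a_lo, a 32-bit register holding a[1] and a[0] (16 bits each), is
multiplied element by element with b_lo, a 32-bit register holding b[1] and
b[0]. The result, c_lo_dbl, is a 64-bit register pair holding a[1] * b[1] and
a[0] * b[0] (32 bits each).]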
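The packed multiply shown in the figure can be modeled in portable C for experimenting on a host machine. The sketch below is not the ’C64x intrinsic itself. The names pair64_model, pack2_model, upper16, lower16, and mpy2_model are hypothetical stand-ins introduced here, and the struct stands in for the 64-bit register pair that the real code reads with _hi() and _lo().

```c
#include <stdint.h>

/* Hypothetical stand-in for the 64-bit register pair returned by _mpy2(). */
typedef struct {
    int32_t hi;  /* product of the two upper 16-bit halves, a[1] * b[1] */
    int32_t lo;  /* product of the two lower 16-bit halves, a[0] * b[0] */
} pair64_model;

/* Pack two signed 16-bit values into one 32-bit word (hi in bits 31..16). */
static uint32_t pack2_model(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Extract the signed 16-bit halves of a packed word.
   Assumes the usual two's-complement conversion from uint16_t to int16_t. */
static int16_t upper16(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }
static int16_t lower16(uint32_t w) { return (int16_t)(uint16_t)(w & 0xFFFFu); }

/* Model of a packed signed 16 x 16 multiply: two independent products,
   each kept at full 32-bit precision. */
static pair64_model mpy2_model(uint32_t a, uint32_t b)
{
    pair64_model r;
    r.hi = (int32_t)upper16(a) * (int32_t)upper16(b);
    r.lo = (int32_t)lower16(a) * (int32_t)lower16(b);
    return r;
}
```

For example, with a_lo holding a[1] = 3 and a[0] = -2, and b_lo holding b[1] = -5 and b[0] = 7, the model yields -15 in the upper half and -14 in the lower half.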
Once the 32-bit products are obtained, use standard 32-bit shifts to bring them
to their final precision. However, this leaves the results in two separate 32-bit
registers.
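The shift step can be sketched in portable C as follows. This is an illustrative host-side model, not the manual's code. The helper names shift_product and shift_products are hypothetical, and the shift is assumed to be an arithmetic (sign-preserving) right shift, which the sketch spells out explicitly because C leaves right shifts of negative values implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic right shift of one 32-bit product (assumed behavior). */
static int32_t shift_product(int32_t prod, int shift)
{
    assert(shift >= 0 && shift < 32);
    /* For negative inputs, shift the complement and complement back so the
       result rounds toward negative infinity on any conforming compiler. */
    return (prod < 0) ? ~(~prod >> shift) : (prod >> shift);
}

/* Shift both products. The results come back as two separate 32-bit
   values, mirroring the two separate registers described in the text. */
static void shift_products(int32_t prod_hi, int32_t prod_lo, int shift,
                           int32_t *out_hi, int32_t *out_lo)
{
    *out_hi = shift_product(prod_hi, shift);
    *out_lo = shift_product(prod_lo, shift);
}
```

Note that each 32-bit result still occupies a full word after the shift, even if its final precision is only 16 bits.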