Packed-Data Processing on the ’C64x
’C64x Programming Considerations
Example 8–10. Vectorized Form of the Dot Product Kernel
int dot_prod(const short *restrict a, const short *restrict b,
             short *restrict c, int len)
{
    int i;
    int sum = 0;            /* 32-bit accumulation  */
    unsigned a3_a2, a1_a0;  /* Packed 16-bit values */
    unsigned b3_b2, b1_b0;  /* Packed 16-bit values */

    for (i = 0; i < len; i += 4)
    {
        a3_a2 = _hi(*(const double *) &a[i]);
        a1_a0 = _lo(*(const double *) &a[i]);
        b3_b2 = _hi(*(const double *) &b[i]);
        b1_b0 = _lo(*(const double *) &b[i]);

        /* Perform dot-products on pairs of elements, totalling the
           results in the accumulator. */
        sum += _dotp2(a3_a2, b3_b2);
        sum += _dotp2(a1_a0, b1_b0);
    }

    return sum;
}
At this point, the code takes full advantage of the new features that the ’C64x
provides. For this particular kernel, no further optimization should be
necessary: with the compiler version that was available at the time this book
was written, the tools produce an optimal single-cycle loop.
Example 8–11. Final Assembly Code for Dot-Product Kernel's Inner Loop
L2:
   [ B0]   SUB     .L2     B0,1,B0           ;
|| [!B0]   ADD     .S2     B8,B7,B7          ; |10|
|| [!B0]   ADD     .L1     A7,A6,A6          ; |10|
||         DOTP2   .M2X    B5,A5,B8          ; @@@@|10|
||         DOTP2   .M1X    B4,A4,A7          ; @@@@|10|
|| [ A0]   BDEC    .S1     L2,A0             ; @@@@@
||         LDDW    .D1T1   *A3++,A5:A4       ; @@@@@@@@@|10|
||         LDDW    .D2T2   *B6++,B5:B4       ; @@@@@@@@@|10|
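
The arithmetic of Example 8–10 can be checked on a host machine that lacks the ’C64x intrinsics. The following is a minimal sketch, not part of the original tools: the names dotp2_model, pack2_model, dot_prod_model, and dot_prod_scalar are hypothetical, and the model assumes little-endian packing (the lower-indexed element sits in the low half of each 32-bit word) and a len that is a multiple of 4. It models _dotp2 as the sum of the products of the signed upper halves and the signed lower halves, and compares the packed loop against a plain scalar loop.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of v without implementation-defined casts. */
int32_t sext16_model(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Hypothetical host-side model of _dotp2: signed hi*hi + signed lo*lo.
   The corner case where all four halves are -32768 is not handled. */
int32_t dotp2_model(uint32_t x, uint32_t y)
{
    int32_t hi = sext16_model(x >> 16) * sext16_model(y >> 16);
    int32_t lo = sext16_model(x) * sext16_model(y);
    return hi + lo;
}

/* Pack p[1]:p[0] into one word (assumed little-endian layout). */
uint32_t pack2_model(const int16_t *p)
{
    return ((uint32_t)(uint16_t)p[1] << 16) | (uint32_t)(uint16_t)p[0];
}

/* Same loop structure as the vectorized kernel: 4 elements per pass. */
int32_t dot_prod_model(const int16_t *a, const int16_t *b, int len)
{
    int32_t sum = 0;
    int i;

    assert(len % 4 == 0);   /* assumption carried over from the kernel */
    for (i = 0; i < len; i += 4)
    {
        uint32_t a1_a0 = pack2_model(&a[i]);
        uint32_t a3_a2 = pack2_model(&a[i + 2]);
        uint32_t b1_b0 = pack2_model(&b[i]);
        uint32_t b3_b2 = pack2_model(&b[i + 2]);

        sum += dotp2_model(a3_a2, b3_b2);
        sum += dotp2_model(a1_a0, b1_b0);
    }
    return sum;
}

/* Scalar reference: one multiply-accumulate per element. */
int32_t dot_prod_scalar(const int16_t *a, const int16_t *b, int len)
{
    int32_t sum = 0;
    int i;

    for (i = 0; i < len; i++)
        sum += (int32_t)a[i] * b[i];
    return sum;
}
```

For example, with a = {1, -2, 3, -4, 5, 6, -7, 8} and b = {8, 7, -6, 5, 4, -3, 2, 1}, both dot_prod_model and dot_prod_scalar return -48. The model says nothing about cycle counts; it only confirms that the packed form computes the same sum as the scalar form.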