Packed-Data Processing on the ’C64x
Although this code is fully vectorized, it can still be improved. The kernel
performs two LDDWs, two MPY2s, four ADDs, and one branch: nine instructions
in all, one more than the eight functional units can issue per cycle. Because of
the large number of ADDs, the loop cannot fit in a single cycle, so the ’C64x
datapath is not used efficiently.
The way to improve this is to combine some of the multiplies with some of the
adds. The ’C64x family of _dotp intrinsics provides the answer here.
Figure 8–18 illustrates how the _dotp2 intrinsic operates. Other _dotp
intrinsics operate similarly.
Figure 8–18. Graphical Representation of the _dotp2 Intrinsic c = _dotp2(b, a)
[Figure: the 32-bit registers a and b each hold two 16-bit values, a_hi and
a_lo, and b_hi and b_lo. The corresponding halves are multiplied to give the
32-bit products a_hi * b_hi and a_lo * b_lo, which are then added to form the
32-bit result c = a_hi * b_hi + a_lo * b_lo.]
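The arithmetic in Figure 8–18 can be checked on a host machine with a small portable sketch. This is not the intrinsic itself, only a plain-C emulation of the behavior the figure describes. The names dotp2_emul, pack2, and dotprod_packed are hypothetical helpers introduced for illustration, and the sketch assumes signed 16-bit halves on a two's-complement host.

```c
#include <stdint.h>

/* Hypothetical helper: pack two signed 16-bit values into one 32-bit word,
 * with hi in bits 31..16 and lo in bits 15..0. */
static uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Portable stand-in for c = _dotp2(b, a) as drawn in Figure 8-18:
 * multiply the high halves, multiply the low halves, add the products.
 * Assumes a two's-complement host; the corner case where all four
 * halves are -32768 (sum = 2^31) is not handled here. */
static int32_t dotp2_emul(uint32_t b, uint32_t a)
{
    int16_t a_hi = (int16_t)(a >> 16);
    int16_t a_lo = (int16_t)(a & 0xFFFFu);
    int16_t b_hi = (int16_t)(b >> 16);
    int16_t b_lo = (int16_t)(b & 0xFFFFu);

    return (int32_t)a_hi * b_hi + (int32_t)a_lo * b_lo;
}

/* Dot product over packed data: each 32-bit word holds two 16-bit
 * elements, so one dotp2_emul call covers two multiply-accumulates
 * and only one explicit add per word remains in the loop. */
static int32_t dotprod_packed(const uint32_t *a, const uint32_t *b, int n_words)
{
    int32_t sum = 0;
    for (int i = 0; i < n_words; i++) {
        sum += dotp2_emul(b[i], a[i]);
    }
    return sum;
}
```

For instance, with a = (3, 4) and b = (5, 6) packed as hi and lo halves, the emulation returns 3 * 5 + 4 * 6 = 39. On the ’C64x the multiply-and-add pair is expressed through the _dotp2 intrinsic rather than spelled out as above.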
This operation maps exactly to the operation the dot product kernel performs.
The modified version of the kernel absorbs two of the four ADDs into _dotp
intrinsics. The result is shown in Example 8–11. Notice that the variable c has
been eliminated by summing the results of the _dotp intrinsic directly.