Packed-Data Processing on the ’C64x
stores typically occur near the very beginning and end of the loop body. The
following examples use this outside-in approach to apply packed-data
optimization techniques to the example kernels.
Note:
The following examples assume that the compiler has not performed any
packed-data optimizations. The most recent release of the ’C6000 Code
Generation Tools applies many of the optimizations described in this
chapter automatically when given sufficient information about the code.
8.2.6.1 Vectorizing the Vector Sum
Consider the vector sum kernel presented in Example 8–1. In its default form,
it reads one half-word from the a[ ] array, one half-word from the b[ ] array,
adds them, and writes a single half-word result to the c[ ] array. This results
in a 2-cycle loop that moves 48 bits per iteration, or 24 bits per cycle. Given
that the ’C64x can read or write 128 bits every cycle, this is clearly very
inefficient.
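For reference, the default form described above can be sketched in plain C as
follows. This is a minimal sketch rather than a copy of Example 8–1; the
function name vec_sum_scalar and the parameter names are assumptions made for
illustration.

```c
/* Scalar half-word vector sum (illustrative sketch; names are assumptions). */
void vec_sum_scalar(const short *a, const short *b, short *c, int len)
{
    int i;

    for (i = 0; i < len; i++)
    {
        /* Two 16-bit loads and one 16-bit store: 48 bits moved per iteration. */
        c[i] = (short)(a[i] + b[i]);
    }
}
```

Each iteration touches only one element of each array, which is why so little
of the available memory bandwidth is used.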
One simple optimization is to replace the half-word accesses with double-word
accesses to read and write array elements in groups of four. When doing this,
array elements are read into the register file with four elements packed into a
register pair: two elements in each register, across two registers. Each
register contains data in the packed 16-bit data type illustrated in
Figure 8–2.
For the moment, assume that the arrays are double-word aligned, as shown
in Example 8–5. For more information about methods for working with arrays
that are not specially aligned, see section 8.2.8. The first thing to note is that
the ’C6000 Code Generation Tools lack a 64-bit integral type. This is not a
problem, however, as the tools allow the use of the type double, together with
the intrinsics _lo(), _hi(), and _itod(), to access integer data with
double-word loads. Because the loop now operates on multiple elements at a
time, the loop counter must be modified to count by fours instead of by single
elements.
The type-cast tells the compiler to treat the array as an array of type double.
This causes LDDW and STDW instructions to be issued for the array accesses.
The _lo() and _hi() intrinsics break apart a 64-bit double into its lower
and upper 32-bit halves. Each of these halves contains two 16-bit values
packed in a 32-bit word. To store the results, the _itod() intrinsic assembles
32-bit words back into 64-bit doubles to be stored. Figure 8–11 and
Figure 8–12 show these processes graphically.
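The splitting and reassembly can be illustrated off-target with a small
portable model. The helpers model_lo(), model_hi(), model_itod(), and
model_element() are hypothetical stand-ins written for this sketch; they are
not the compiler intrinsics, and they use a 64-bit unsigned integer in place of
the double container. The argument order of model_itod() (upper word first) and
the element numbering (element 0 in the least significant half-word) are
assumptions that should be checked against the compiler documentation and the
figures.

```c
#include <stdint.h>

/* Hypothetical stand-ins for _lo(), _hi(), and _itod(). They model the
 * 64-bit container as a uint64_t rather than a double. */
uint32_t model_lo(uint64_t dw)
{
    return (uint32_t)(dw & 0xFFFFFFFFu);   /* lower 32-bit half */
}

uint32_t model_hi(uint64_t dw)
{
    return (uint32_t)(dw >> 32);           /* upper 32-bit half */
}

uint64_t model_itod(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | (uint64_t)lo;
}

/* Extract one of the four packed 16-bit elements.
 * Assumption: index 0 is the least significant half-word. */
uint16_t model_element(uint64_t dw, int index)
{
    uint32_t word = (index < 2) ? model_lo(dw) : model_hi(dw);
    return (uint16_t)(word >> (16 * (index & 1)));
}
```

In this model, splitting a value with model_lo() and model_hi() and passing the
two words back to model_itod() reproduces the original 64-bit value.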
The adds themselves have not yet been addressed, so for now the add is
replaced with a comment.
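The loop structure described in this section can be modeled off-target as
follows. This is a sketch under stated assumptions, not the code of
Example 8–5: memcpy() stands in for the double-typed array accesses, the
function names are hypothetical, and a scalar placeholder performs the add so
that the sketch can be checked. The array length is assumed to be a multiple
of four.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Placeholder for the add step (an assumption of this sketch): adds the two
 * 16-bit halves of each word independently with ordinary scalar arithmetic. */
uint32_t placeholder_add_halves(uint32_t x, uint32_t y)
{
    uint32_t lo = (x + y) & 0x0000FFFFu;                  /* low halves, carry dropped */
    uint32_t hi = ((x >> 16) + (y >> 16)) & 0x0000FFFFu;  /* high halves */
    return (hi << 16) | lo;
}

/* Vector sum that reads and writes four 16-bit elements per iteration. */
void vec_sum_by_fours(const int16_t *a, const int16_t *b, int16_t *c, int len)
{
    int i;

    assert(len % 4 == 0);          /* assumption: len is a multiple of 4 */

    for (i = 0; i < len; i += 4)   /* loop counter counts by fours */
    {
        uint32_t aw[2], bw[2], cw[2];

        /* One 64-bit access per input array; memcpy stands in for the
         * double-typed access described in the text. */
        memcpy(aw, &a[i], sizeof aw);
        memcpy(bw, &b[i], sizeof bw);

        /* Each 32-bit word holds two packed 16-bit elements. */
        cw[0] = placeholder_add_halves(aw[0], bw[0]);
        cw[1] = placeholder_add_halves(aw[1], bw[1]);

        /* One 64-bit store of four results. */
        memcpy(&c[i], cw, sizeof cw);
    }
}
```

Because the placeholder treats both halves of each word identically, the
result does not depend on which half holds the lower-numbered array element.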