Packed-Data Processing on the ’C64x
’C64x Programming Considerations
8.2.8.2 When to Use Non-Aligned Memory Accesses
As noted earlier, the ’C64x can provide 128 bits/cycle bandwidth with aligned
memory accesses, but only 64 bits/cycle bandwidth with non-aligned memory
accesses. Therefore, it is important to use non-aligned memory accesses only
in places where they provide a true benefit over aligned memory accesses.
Generally, non-aligned memory accesses are a win in places where they allow a
routine to be vectorized where aligned memory accesses could not. These
places can be broken down into several cases:
- Generic routines which cannot impose alignment,
- Single-sample algorithms which update their input or output pointers by
  only one sample,
- Nested loop algorithms where the outer loop cannot be unrolled, and
- Routines which have an irregular memory access pattern, or whose access
  pattern is data-dependent and not known until run time.
An example of a generic routine which cannot impose alignment on the routines
that call it is a library function such as memcpy or strcmp. Single-sample
algorithms include adaptive filters, which preclude processing multiple
outputs at once. Nested loop algorithms include 2-D convolution and motion
estimation. Data-dependent access algorithms include motion compensation,
which must read image blocks from arbitrary locations in the source image.
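The data-dependent case can be sketched in portable C. This is only an
illustration under stated assumptions: the helper names (load_u64_unaligned,
store_u64, fetch_block8x8) are hypothetical and do not come from this guide,
and memcpy stands in for a non-aligned double-word access. On the ’C64x a
non-aligned load intrinsic such as _memd8_const would likely fill that role,
but check your compiler documentation before relying on it.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 8 bytes from any byte address.
 * memcpy is the portable way to express a non-aligned load. */
static uint64_t load_u64_unaligned(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: write 8 bytes to any byte address. */
static void store_u64(uint8_t *p, uint64_t v)
{
    memcpy(p, &v, sizeof v);
}

/* Copy an 8x8 block of pixels whose top-left corner (x, y) is known only
 * at run time. 'src' has 'stride' bytes per row; 'dst' receives 64
 * contiguous bytes. Each row is moved with one 8-byte load and store,
 * whatever the alignment of src + y * stride + x turns out to be. */
void fetch_block8x8(const uint8_t *src, int stride, int x, int y,
                    uint8_t *dst)
{
    const uint8_t *row = src + (size_t)y * (size_t)stride + (size_t)x;
    int r;

    for (r = 0; r < 8; r++) {
        store_u64(dst + 8 * r, load_u64_unaligned(row));
        row += stride;
    }
}
```

Because x is arbitrary, seven of every eight possible starting columns are
not double-word aligned, so an aligned-only version would need extra loads
and shifting or merging for every row.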
In each of these cases, it is extremely difficult to transform the problem into one
which uses aligned memory accesses while still vectorizing the code. Often,
the result with aligned memory accesses is worse than if the code were not
optimized for packed data processing at all. So, for these cases, non-aligned
memory accesses are a win.
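The single-sample case can be sketched in the same way. The function name
fir_one_output and its argument layout are hypothetical, chosen only for
illustration, and memcpy again stands in for a non-aligned 64-bit load. The
window pointer advances by one 16-bit sample per call, so it is guaranteed
only 2-byte alignment, yet the inner loop can still consume four samples at
a time.

```c
#include <stdint.h>
#include <string.h>

/* Compute one FIR output. 'win' points at the first sample of the current
 * window and moves forward by one short between calls, so it cannot be
 * assumed to sit on a double-word boundary. 'nh' must be a multiple of 4. */
int32_t fir_one_output(const short *win, const short *h, int nh)
{
    int32_t acc = 0;
    int i;

    for (i = 0; i < nh; i += 4) {
        short x4[4];

        /* Stands in for one non-aligned 64-bit load of four samples. */
        memcpy(x4, win + i, sizeof x4);

        acc += x4[0] * h[i]     + x4[1] * h[i + 1]
             + x4[2] * h[i + 2] + x4[3] * h[i + 3];
    }
    return acc;
}
```

A caller would invoke this once per new input sample, passing a window
pointer that is one element further along each time.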
In contrast, non-aligned memory accesses should not be used in more general
cases where they are not specifically needed. Rather, the program should be
structured to best take advantage of aligned memory accesses with a packed
data processing flow. The following checklist should help.
- Use signed short or unsigned char data types for arrays where possible.
  These are the types for which the ’C64x provides the greatest support.
- Place arrays on double-word boundaries, using #pragma DATA_ALIGN.
  This allows the program to use LDDW and STDW to access the array,
  providing up to 128 bits/cycle bandwidth for accessing the array.
- Round loop counts, numbers of samples, and so on to multiples of 4 or 8
  where possible. This allows the inner loop to be unrolled more readily to
  take advantage of packed data processing.
- In nested loop algorithms, unroll outer loops to process multiple output
  samples at once. This allows packed data processing techniques to be
  applied to elements that are indexed by the outer loop.
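The checklist can be illustrated with a short C sketch. All names here
(N_SAMPLES, N_PADDED, coef, data, dot_product4, fir_outer_unrolled) are
hypothetical. The pragma is guarded by the _TMS320C6X macro on the assumption
that the TI compiler predefines it, so the same source also builds on a host
compiler. The code shows only the loop shapes the checklist asks for, not
tuned ’C64x code.

```c
#include <stdint.h>

#define N_SAMPLES 10                      /* samples actually needed       */
#define N_PADDED  ((N_SAMPLES + 3) & ~3)  /* rounded up to a multiple of 4 */

/* 16-bit data, placed on double-word (8-byte) boundaries when built with
 * the TI compiler. The padding elements stay zero, so they do not change
 * a sum of products. */
#ifdef _TMS320C6X
#pragma DATA_ALIGN(coef, 8)
#pragma DATA_ALIGN(data, 8)
#endif
short coef[N_PADDED];
short data[N_PADDED];

/* Dot product over n samples, where n is a multiple of 4. Each iteration
 * consumes four shorts (one double word) from each array and keeps two
 * independent partial sums. */
int32_t dot_product4(const short *a, const short *b, int n)
{
    int32_t sum0 = 0, sum1 = 0;
    int i;

    for (i = 0; i < n; i += 4) {
        sum0 += a[i]     * b[i]     + a[i + 1] * b[i + 1];
        sum1 += a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3];
    }
    return sum0 + sum1;
}

/* FIR with the outer loop unrolled by 2: two outputs are produced per
 * outer iteration, so each h[i] is reused for both. 'ny' must be even and
 * 'x' must hold at least nh + ny - 1 samples. */
void fir_outer_unrolled(const short *x, const short *h, int32_t *y,
                        int nh, int ny)
{
    int i, j;

    for (j = 0; j < ny; j += 2) {
        int32_t acc0 = 0, acc1 = 0;

        for (i = 0; i < nh; i++) {
            acc0 += h[i] * x[i + j];
            acc1 += h[i] * x[i + j + 1];
        }
        y[j]     = acc0;
        y[j + 1] = acc1;
    }
}
```

With N_SAMPLES set to 10, N_PADDED becomes 12, so the dot-product loop runs
exactly three iterations with no cleanup code for leftover samples.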