Redundant Load Elimination
6-110
6.11 Redundant Load Elimination
Filter algorithms typically read the same value from memory multiple times and
are, therefore, prime candidates for optimization by eliminating redundant load
instructions. Rather than perform a load operation each time a particular value
is read, you can keep the value in a register and read the register multiple
times.
6.11.1 FIR Filter C Code
Example 6–60 shows C code for a simple FIR filter. There are two memory
reads (x[i+j] and h[i]) for each multiply. Because the ’C6000 can perform only
two LDHs per cycle, it seems, at first glance, that only one multiply-accumulate
per cycle is possible.
Example 6–60. FIR Filter C Code
void fir(short x[], short h[], short y[])
{
int i, j, sum;
for (j = 0; j < 100; j++) {
sum = 0;
for (i = 0; i < 32; i++)
sum += x[i+j] * h[i];
y[j] = sum >> 15;
}
}
One way to optimize this situation is to perform LDWs instead of LDHs to read
two data values at a time. Although using LDW works for the h array, the x array
presents a different problem because the ’C6x does not allow you to load
values across a word boundary.
For example, on the first outer loop (j = 0), you can read the x-array elements
(0 and 1, 2 and 3, etc.) as long as elements 0 and 1 are aligned on a 4-byte
word boundary. However, the second outer loop (j = 1) requires reading x-array
elements 1 through 32. The LDW operation must load elements that are not
word-aligned (1 and 2, 3 and 4, etc.).
6.11.1.1 Redundant Loads
In order to achieve two multiply-accumulates per cycle, you must reduce the
number of LDHs. Because successive outer loops read all the same h-array
values and almost all of the same x-array values, you can eliminate the redun-
dant loads by unrolling the inner and outer loops.
For example, x[1] is needed for the first outer loop (x[j+1] with j = 0) and for the
second outer loop (x[j] with j = 1). You can use a single LDH instruction to load
this value.