Software Pipelining the Outer Loop
6.13.2 Making the Outer Loop Parallel With the Inner Loop Epilog and Prolog
The final assembly code for the FIR filter with redundant load elimination and
no memory hits (shown in Example 6–69 on page 6-129) incurs 16 cycles of
overhead each time the inner loop is entered: ten cycles for the loop prolog and
six cycles for the outer loop instructions and the branch to the outer loop.
Most of this overhead can be eliminated as follows:
- Put the outer loop and branch instructions in parallel with the prolog.
- Create an epilog to the inner loop.
- Put some outer loop instructions in parallel with the inner-loop epilog.
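The overhead arithmetic above can be written out as a small cycle-count model. This is only an illustrative sketch: the 10-cycle and 6-cycle figures come from the text, while the function names and the `outer_iterations` parameter are hypothetical and do not appear in the original example.

```c
/* Hypothetical cycle-count model. Only the 10- and 6-cycle figures
   are taken from the text; all names here are illustrative. */
enum {
    PROLOG_CYCLES = 10,  /* inner-loop prolog */
    OUTER_CYCLES  = 6    /* outer loop instructions and branch */
};

/* Overhead paid on each entry to the inner loop before the rework. */
static int overhead_per_entry(void)
{
    return PROLOG_CYCLES + OUTER_CYCLES;
}

/* Total overhead for an assumed number of outer-loop iterations. */
static long total_overhead(long outer_iterations)
{
    return outer_iterations * overhead_per_entry();
}
```

Because the overhead is paid once per outer-loop iteration, it grows linearly with the outer loop count, which is why overlapping the prolog, the epilog, and the outer loop instructions is worthwhile.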
6.13.3 Final Assembly
Example 6–71 shows the final assembly for the FIR filter with a
software-pipelined outer loop. Below the inner loop (starting on page 6-134),
each instruction is marked in the comments with an e, p, or o for instructions
relating to epilog, prolog, or outer loop, respectively.
The inner loop now runs only seven times, because the eighth iteration is
performed in the epilog, in parallel with the prolog of the next inner loop
and the outer loop instructions.
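The restructuring can be pictured at the C level as peeling the last inner-loop iteration out of the loop body. The sketch below is not the code from Example 6–71. It assumes eight inner iterations (as the text implies) and, purely for illustration, four taps per iteration; the names `mac_group`, `fir_ref`, and `fir_peeled` are hypothetical. C executes sequentially, so the sketch shows only where the eighth iteration moves, not the parallel scheduling itself.

```c
#include <stddef.h>

enum {
    INNER_ITERS   = 8,  /* inner-loop iterations, as implied by the text */
    TAPS_PER_ITER = 4   /* assumption for illustration only */
};

/* Work done by inner-loop iteration k: TAPS_PER_ITER multiply-accumulates. */
static int mac_group(const short *x, const short *h, int k)
{
    int sum = 0;
    for (int t = 0; t < TAPS_PER_ITER; t++)
        sum += x[k * TAPS_PER_ITER + t] * h[k * TAPS_PER_ITER + t];
    return sum;
}

/* Reference form: the inner loop runs all eight iterations.
   x must hold nout + INNER_ITERS * TAPS_PER_ITER - 1 samples. */
void fir_ref(const short *x, const short *h, int *y, size_t nout)
{
    for (size_t j = 0; j < nout; j++) {
        int sum = 0;
        for (int k = 0; k < INNER_ITERS; k++)
            sum += mac_group(x + j, h, k);
        y[j] = sum;
    }
}

/* Peeled form: the inner loop runs seven times and the eighth
   iteration sits next to the outer-loop work. */
void fir_peeled(const short *x, const short *h, int *y, size_t nout)
{
    for (size_t j = 0; j < nout; j++) {
        int sum = 0;
        for (int k = 0; k < INNER_ITERS - 1; k++)
            sum += mac_group(x + j, h, k);
        /* "Epilog": the peeled eighth iteration. In the assembly this is
           where the next prolog and the outer loop instructions overlap. */
        sum += mac_group(x + j, h, INNER_ITERS - 1);
        y[j] = sum;   /* outer-loop work: store the result */
    }
}
```

Both functions produce the same outputs. The peeled form simply exposes the last iteration as straight-line code that a scheduler can overlap with the outer loop instructions and the following prolog.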