Software Pipelining
Optimizing Assembly Code via Linear Assembly
6.5.3.4 Priming the Loop
Although Example 6–28 and Example 6–29 execute as fast as possible, the
code size can be smaller without significantly sacrificing performance. To help
reduce code size, you can use a technique called priming the loop. Assuming
that you can handle extraneous loads, start with Example 6–26 or
Example 6–27, which do not have epilogs and, therefore, contain fewer
instructions. (This technique can be used equally well with Example 6–28 or
Example 6–29.)
Fixed-Point Example
To eliminate the prolog of the fixed-point dot product and, therefore, the extra
LDW and MPY instructions, begin execution at the loop body (at the LOOP
label). Eliminating the prolog means that:
- Two LDWs, two MPYs, and two ADDs occur in the first execution cycle of
  the loop.
- Because the first LDWs require five cycles to write results into a register,
  the MPYs do not multiply valid data until after the loop executes five times.
  The ADDs have no valid data until after seven cycles (five cycles for the
  first LDWs and two more cycles for the first valid MPYs).
Example 6–30 shows the loop without the prolog but with four new instructions
that zero the inputs to the MPY and ADD instructions. Making the MPYs and
ADDs use 0s before valid data is available ensures that the final accumulator
values are unaffected. (The loop counter is initialized to 57 to accommodate
the seven extra cycles needed to prime the loop.)
Because the first LDWs are not issued until after seven cycles, the code in
Example 6–30 requires a total of 65 cycles (7 + 57 + 1). You therefore reduce
the code size at the cost of a slight loss in performance.