Optimizing Assembly Code via Linear Assembly
6.5 Software Pipelining
This section describes the process for improving the performance of the
assembly code in the previous section through software pipelining.
Software pipelining is a technique used to schedule instructions from a loop
so that multiple iterations execute in parallel. The parallel resources on the
’C6x make it possible to initiate a new loop iteration before previous iterations
finish. The goal of software pipelining is to start a new loop iteration as soon
as possible.
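The overlap of iterations can be illustrated in C. The following is a minimal sketch, not code from this guide: it assumes a simplified three-stage split of each dot product iteration (load, multiply, add), and the function names dotp_simple and dotp_pipelined are hypothetical. In C the statements in the kernel still run one after another; on the ’C6x the corresponding instructions would be candidates for execution in parallel.

```c
#include <stddef.h>

/* Reference version: each iteration finishes before the next one starts. */
int dotp_simple(const short *a, const short *b, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Hand-pipelined version (illustrative assumption: three stages).
 * In the kernel, one pass performs the add of iteration i-2, the
 * multiply of iteration i-1, and the load of iteration i. */
int dotp_pipelined(const short *a, const short *b, size_t n)
{
    int sum = 0;
    int prod;      /* output of the multiply stage */
    short la, lb;  /* output of the load stage */

    if (n == 0)
        return 0;

    /* Prolog: fill the pipeline. */
    la = a[0]; lb = b[0];            /* load iteration 0 */
    if (n == 1)
        return la * lb;
    prod = la * lb;                  /* multiply iteration 0 */
    la = a[1]; lb = b[1];            /* load iteration 1 */

    /* Kernel: three iterations are in flight on every pass. */
    for (size_t i = 2; i < n; i++) {
        sum += prod;                 /* add      iteration i-2 */
        prod = la * lb;              /* multiply iteration i-1 */
        la = a[i]; lb = b[i];        /* load     iteration i   */
    }

    /* Epilog: drain the pipeline. */
    sum += prod;                     /* add      iteration n-2 */
    prod = la * lb;                  /* multiply iteration n-1 */
    sum += prod;                     /* add      iteration n-1 */
    return sum;
}
```

Both functions return the same result for the same inputs; the pipelined form only changes the order in which the work of neighboring iterations is interleaved.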
The modulo iteration interval scheduling table is introduced in this section as
an aid to creating software-pipelined loops.
The fixed-point dot product code in Example 6–19 needs eight cycles for each
iteration of the loop: five cycles for the LDWs, two cycles for the MPYs, and one
cycle for the ADDs.
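The cycle count above is a sum of the cycles along the load, multiply, add chain. The following C sketch reproduces that arithmetic and, as an idealized comparison, estimates the total cycle count when a new iteration starts every ii cycles. The function names and the parameter ii are hypothetical, and the pipelined formula is a simplified model that ignores resource limits; it is not taken from this guide.

```c
/* Cycles for one non-overlapped iteration: the per-instruction
 * cycle counts along the dependency chain are added together. */
unsigned iteration_cycles(const unsigned *cycles, unsigned count)
{
    unsigned total = 0;
    for (unsigned i = 0; i < count; i++)
        total += cycles[i];
    return total;
}

/* Total cycles when every iteration runs to completion before the
 * next one begins. */
unsigned long sequential_cycles(unsigned per_iter, unsigned long trips)
{
    return (unsigned long)per_iter * trips;
}

/* Idealized total when a new iteration starts every ii cycles
 * (ii is an assumed iteration interval): the last iteration starts
 * at cycle (trips - 1) * ii and needs per_iter cycles to finish. */
unsigned long pipelined_cycles(unsigned per_iter, unsigned ii,
                               unsigned long trips)
{
    if (trips == 0)
        return 0;
    return (trips - 1) * ii + per_iter;
}
```

With the values from the text (5 for the LDWs, 2 for the MPYs, 1 for the ADDs), iteration_cycles returns 8. For 100 iterations, sequential_cycles gives 800, while pipelined_cycles with an assumed interval of 1 gives 107.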
Figure 6–9 shows the dependency graph for the fixed-point dot product
instructions. Example 6–21 shows the same dot product assembly code as
Example 6–17 on page 6-24, except that the SUB instruction is now
conditional on the loop counter (A1).
Note:
Making the SUB instruction conditional on A1 ensures that A1 stops
decrementing when it reaches 0. Otherwise, as the loop executes five more
times, the loop counter becomes a negative number. A negative A1 is nonzero
and therefore makes the condition on the branch true again. If the SUB
instruction were not conditional on A1, you would have an infinite loop.
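The effect described in this note can be modeled with a small C sketch. It is an illustration under stated assumptions, not the assembly from the examples: counter_after and branch_taken are hypothetical helper names, and the model tracks only the value of the loop counter and the nonzero test used by the conditional branch.

```c
/* Value of the loop counter after `passes` executions of the loop body.
 * conditional != 0 models a SUB that is conditional on A1 (decrement
 * only while A1 is nonzero); conditional == 0 models a SUB that
 * always decrements. */
int counter_after(int start, int passes, int conditional)
{
    int a1 = start;
    for (int i = 0; i < passes; i++) {
        if (!conditional || a1 != 0)
            a1 -= 1;
    }
    return a1;
}

/* A branch conditional on A1 is taken whenever A1 is nonzero,
 * which includes negative values. */
int branch_taken(int a1)
{
    return a1 != 0;
}
```

Starting from 100 and running 105 passes (five more than the count), the conditional form leaves the counter at 0 and the branch is not taken. The unconditional form leaves it at -5, which is nonzero, so the branch condition is true again.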
The floating-point dot product code in Example 6–20 needs ten cycles for each
iteration of the loop: five cycles for the LDDWs, four cycles for the MPYSPs,
and one cycle for the ADDSPs.
Figure 6–10 shows the dependency graph for the floating-point dot product
instructions. Example 6–22 shows the same dot product assembly code as
Example 6–18 on page 6-25, except that the SUB instruction is now
conditional on the loop counter (A1).
Note:
The ADDSP has 3 delay slots associated with it. The extra delay slots are
taken up by the LDDW, SUB, and NOP when executing the next iteration of the
loop. Thus a NOP 3 is not required inside the loop but is required outside
the loop before adding sum0 and sum1 together.