Note:
Because the ADDSP instruction has three delay slots, the accumulated results
are staggered by four. That is, the first result from the ADDSP is added to the
fifth result, which is then added to the ninth, and so on. The second result is
added to the sixth, which is then added to the tenth. This is shown in
Table 6–9.
In this case, multiple iterations of the loop execute in parallel in a software
pipeline that is ten iterations deep, with iterations n through n + 9 executing
in parallel. Floating-point software pipelines are rarely deeper than the one
created by this single-cycle loop. As loop sizes grow, the number of iterations
that can execute in parallel tends to decrease.
6.5.1.5 Staggered Accumulation With a Multicycle Instruction
When accumulating results with a multicycle instruction (that is, one with one
or more delay slots), you must either unroll the loop or stagger the results.
When you unroll the loop, multiple accumulators collect the results, so that
each result has finished executing and has been written into its accumulator
before the next result is added to that accumulator. If you do not unroll the
loop, the accumulator contains staggered results.
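The unrolling alternative can be sketched in a few lines of Python. This is only an illustrative model, not code from this guide: the function name accumulate_unrolled and the parameter n_acc are hypothetical, and four accumulators are assumed to match a 4-cycle add such as ADDSP. Each accumulator receives every fourth input, so no add depends on a result that is still in flight, and the partial sums are combined after the loop.

```python
def accumulate_unrolled(xs, n_acc=4):
    """Sum xs with n_acc independent accumulators, as an unrolled loop would.

    n_acc=4 is an assumption chosen to match a 4-cycle add instruction.
    """
    acc = [0.0] * n_acc
    # Each pass of the unrolled body feeds one new value to each accumulator,
    # so successive adds into the same accumulator are n_acc inputs apart.
    for base in range(0, len(xs), n_acc):
        for k, x in enumerate(xs[base:base + n_acc]):
            acc[k] += x
    # After the loop, the partial sums are combined into the final result.
    total = 0.0
    for partial in acc:
        total += partial
    return acc, total


acc, total = accumulate_unrolled([float(i) for i in range(1, 13)])
# acc == [15.0, 18.0, 21.0, 24.0], total == 78.0
```

The cost of this approach is extra registers and a larger loop body; the benefit is that each accumulator always holds a complete, known partial sum.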
Staggered results occur when you accumulate successive results while previous
additions are still in their delay slots. This can be done without error if you
know what is in the accumulator, what will be added to it, and on which cycle
each result will be written (see the pseudo-code in Example 6–23).
Example 6–23. Pseudo-Code for Single-Cycle Accumulator With ADDSP
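LOOP:       ADDSP   x,sum,sum       ; sum = sum + x, written 4 cycles later
||          LDW     *xptr++,x       ; load the next x
|| [cond]   B       LOOP            ; branch back while cond is nonzero
|| [cond]   SUB     cond,1,cond     ; decrement the loop counter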
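The behavior of this kernel can be modeled cycle by cycle. The Python sketch below is a simplified illustration under stated assumptions, not a description of the hardware: one ADDSP issues per cycle, its result is written four cycles after issue (three delay slots), the value x is assumed to be already loaded (the LDW delay slots are ignored), and sum is assumed to start at zero. The names simulate_kernel, in_flight, and history are hypothetical.

```python
def simulate_kernel(xs, latency=4):
    """Model one ADDSP issued per cycle; each result lands `latency` cycles later.

    latency=4 assumes an add with three delay slots. Loads are not modeled.
    """
    sum_reg = 0.0      # accumulator register, assumed zeroed before the loop
    in_flight = []     # (cycle on which the write lands, value) per pending add
    history = []       # value readable in sum_reg on each cycle
    for cycle, x in enumerate(xs):
        # A result whose delay slots have elapsed is written before the read.
        while in_flight and in_flight[0][0] == cycle:
            sum_reg = in_flight.pop(0)[1]
        history.append(sum_reg)
        # The add issued this cycle reads sum_reg as it stands right now.
        in_flight.append((cycle + latency, sum_reg + x))
    # The adds still pending after the loop hold the staggered partial sums.
    partials = [value for _, value in in_flight]
    return history, partials


history, partials = simulate_kernel([float(i) for i in range(1, 13)])
# history  == [0, 0, 0, 0, 1, 2, 3, 4, 6, 8, 10, 12] (as floats)
# partials == [15.0, 18.0, 21.0, 24.0]; their sum, 78.0, is the full total
```

In this model the accumulator never holds the complete sum during the loop. It cycles through four interleaved partial sums, which must be added together after the loop finishes.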
Table 6–9 shows the results of the loop kernel for a single-cycle accumulator
using a multicycle add instruction; in this case, the ADDSP, which has three
delay slots (a 4-cycle instruction).