Software Pipelining
Optimizing Assembly Code via Linear Assembly
Floating-Point Example
To eliminate the prolog of the floating-point dot product and, therefore, the
extra LDDW and MPYSP instructions, begin execution at the loop body (at the
LOOP label). Eliminating the prolog means that:
- Two LDDWs, two MPYSPs, and two ADDSPs occur in the first execution
  cycle of the loop.
- Because the first LDDWs require five cycles to write results into a register,
  the MPYSPs do not multiply valid data until after the loop executes five
  times. The ADDSPs have no valid data until after nine cycles (five cycles
  for the first LDDWs and four more cycles for the first valid MPYSPs).
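The timing in the list above can be checked with a small C sketch. This is only an illustration, not part of the example code: the macro and function names are hypothetical, loop trips are counted from 0, and the loop is assumed to execute one trip per cycle.

```c
/* Latencies quoted in the text: an LDDW result is usable five cycles after
   issue, and an MPYSP result four cycles after issue. */
#define LDDW_LATENCY  5
#define MPYSP_LATENCY 4

/* Returns 1 if the MPYSP issued on this 0-based loop trip reads data that an
   LDDW inside the loop has already written (hypothetical helper). */
int mpysp_input_valid(int trip)
{
    return trip >= LDDW_LATENCY;
}

/* Returns 1 if the ADDSP issued on this 0-based loop trip reads a product
   computed from loaded data (hypothetical helper). */
int addsp_input_valid(int trip)
{
    return trip >= LDDW_LATENCY + MPYSP_LATENCY;
}
```

Under these assumptions, trips 0 through 4 multiply whatever the load destination registers held before the loop, and trips 0 through 8 accumulate products that do not come from loaded data.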
Example 6–31 shows the loop without the prolog but with four new instructions
that zero the inputs to the MPYSP and ADDSP instructions. Making the
MPYSPs and ADDSPs use 0s before valid data is available ensures that the
final accumulator values are unaffected. (The loop counter is initialized to 59
to accommodate the nine extra cycles needed to prime the loop.)
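The effect of zeroing the inputs can be modeled in C. The sketch below is a simplified, single-lane model under stated assumptions, not a translation of Example 6–31: it processes one element per trip instead of the pairs that LDDW loads, keeps one accumulator, and ignores the latency of ADDSP itself. All names are hypothetical.

```c
#define LDDW_LATENCY  5   /* cycles before a loaded value can be used */
#define MPYSP_LATENCY 4   /* cycles before a product can be used */

/* Trip count once the prolog is removed: one trip per element plus the
   trips spent waiting for the first load and the first product. */
int primed_trip_count(int n)
{
    return n + LDDW_LATENCY + MPYSP_LATENCY;
}

/* Simplified model of the prolog-free loop. The arrays stand in for the
   registers that are zeroed before the loop starts. */
float dot_without_prolog(const float *a, const float *b, int n)
{
    float loaded_a[LDDW_LATENCY] = {0};   /* zeroed MPYSP inputs */
    float loaded_b[LDDW_LATENCY] = {0};
    float product[MPYSP_LATENCY] = {0};   /* zeroed ADDSP inputs */
    float sum = 0.0f;                     /* zeroed accumulator */
    int trips = primed_trip_count(n);

    for (int t = 0; t < trips; t++) {
        int ld = t % LDDW_LATENCY;
        int mp = t % MPYSP_LATENCY;

        /* "ADDSP": consumes the product issued MPYSP_LATENCY trips ago. */
        sum += product[mp];
        /* "MPYSP": consumes the operands loaded LDDW_LATENCY trips ago. */
        product[mp] = loaded_a[ld] * loaded_b[ld];
        /* "LDDW": loads issued after the last element never reach the
           accumulator, so the model substitutes 0 instead of reading
           past the end of the arrays. */
        loaded_a[ld] = (t < n) ? a[t] : 0.0f;
        loaded_b[ld] = (t < n) ? b[t] : 0.0f;
    }
    return sum;
}
```

In this model the first nine trips add 0 to the accumulator, so the result matches a plain dot product over the same n elements. With n = 50, `primed_trip_count` returns 59, which is the loop counter value used in the example.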
Because the first LDDWs are not issued until after nine cycles, the code in
Example 6–31 requires a total of 81 cycles (7 + 59 + 15). Therefore, you
reduce the code size at the cost of a slight loss in performance.
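The cycle total can be written out as a small C sketch. The text gives only the three terms, so the field names for the 7 and the 15 are assumptions about what those terms cover; treating the 59 as one cycle per loop trip is also an assumption, although it agrees with the loop counter value.

```c
/* Cycle budget quoted in the text as 7 + 59 + 15 (hypothetical type). */
struct cycle_budget {
    int before_loop;  /* ASSUMPTION: cycles spent before the first loop trip */
    int loop_trips;   /* loop counter from the text, assumed one cycle per trip */
    int after_loop;   /* ASSUMPTION: cycles spent after the last loop trip */
};

/* Sums the three terms of the budget. */
int total_cycles(struct cycle_budget c)
{
    return c.before_loop + c.loop_trips + c.after_loop;
}
```

Filling the structure with 7, 59, and 15 gives the total of 81 cycles stated above.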
Example 6–31. Assembly Code for Floating-Point Dot Product (Software Pipelined With Removal of Prolog and Epilog)
        MVK     .S1     59,A1           ; set up loop counter

        ZERO    .L1     A7              ; zero out mpysp input
||      ZERO    .L2     B7              ; zero out mpysp input
|| [A1] SUB     .S1     A1,1,A1         ; decrement loop counter

   [A1] B       .S2     LOOP            ; branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;* decrement loop counter
||      ZERO    .L1     A8              ; zero out sum0 accumulator
||      ZERO    .L2     B8              ; zero out sum0 accumulator

   [A1] B       .S2     LOOP            ;* branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;** decrement loop counter
||      ZERO    .L1     A5              ; zero out addsp input
||      ZERO    .L2     B5              ; zero out addsp input

   [A1] B       .S2     LOOP            ;** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;*** decrement loop counter
||      ZERO    .L1     A6              ; zero out mpysp input
||      ZERO    .L2     B6              ; zero out mpysp input