Software Pipelining
Optimizing Assembly Code via Linear Assembly
6.5.3.2 Floating-Point Example
The first branch in the floating-point dot product is issued on cycle 4 but does
not actually branch until the end of cycle 9 (after five delay slots). The branch
target is the execute packet defined by the label LOOP. On cycle 9, the first
branch returns to the same execute packet, resulting in a single-cycle loop. On
every cycle after cycle 9, a branch executes back to LOOP until the loop
counter finally decrements to 0. Once the loop counter is 0, five more branches
execute because they are already in the pipe.
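
The branch timing described above can be checked with a small C model. This is only a sketch: the function names are hypothetical, and it assumes cycles are numbered from 0 (which matches a first branch on cycle 4 in the listing) and that every branch has exactly the five delay slots stated in the text.

```c
#include <assert.h>

/* Number of delay slots that follow a branch, as stated in the text. */
enum { BRANCH_DELAY_SLOTS = 5 };

/* Last cycle that still executes before the branch takes effect,
   that is, the cycle of the final delay slot. Cycles count from 0. */
int last_delay_slot_cycle(int issue_cycle)
{
    assert(issue_cycle >= 0);
    return issue_cycle + BRANCH_DELAY_SLOTS;
}

/* First cycle on which the branch target execute packet runs
   as a result of this branch. */
int branch_target_cycle(int issue_cycle)
{
    return last_delay_slot_cycle(issue_cycle) + 1;
}
```

For the branch issued on cycle 4, `last_delay_slot_cycle(4)` gives 9, the cycle at whose end the branch is taken, and `branch_target_cycle(4)` gives 10, the first cycle on which LOOP runs because of a branch. Because a new branch is issued on every cycle, five branches are always in flight, which is why five more execute after the counter reaches 0.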
Executing the floating-point dot product code with the software pipelining as
shown in Example 6–27 requires a total of 74 cycles (9 + 50 + 15), which is a
significant improvement over the 508 cycles required by the code in
Example 6–20.
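
The cycle count can be reproduced with a short C sketch. The helper names are hypothetical, and the reading of the three terms (9 cycles before the loop, 50 single-cycle loop iterations, 15 cycles after the loop) is an assumption; the text gives only the sum.

```c
#include <assert.h>

/* Total cycles = cycles before the loop
                + loop iterations * cycles per iteration
                + cycles after the loop. */
int pipelined_cycles(int before_loop, int loop_iterations,
                     int cycles_per_iteration, int after_loop)
{
    assert(before_loop >= 0 && loop_iterations >= 0);
    assert(cycles_per_iteration > 0 && after_loop >= 0);
    return before_loop + loop_iterations * cycles_per_iteration + after_loop;
}

/* Ratio of the baseline cycle count to the optimized cycle count. */
double speedup(int baseline_cycles, int optimized_cycles)
{
    assert(optimized_cycles > 0);
    return (double)baseline_cycles / (double)optimized_cycles;
}
```

With these inputs, `pipelined_cycles(9, 50, 1, 15)` returns 74, and `speedup(508, 74)` is a little under 6.9, so the software-pipelined code runs nearly seven times faster than the 508-cycle version.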
Example 6–27. Assembly Code for Floating-Point Dot Product (Software Pipelined)
        MVK    .S1   50,A1          ; set up loop counter
||      ZERO   .L1   A8             ; sum0 = 0
||      ZERO   .L2   B8             ; sum1 = 0
||      LDDW   .D1   *A4++,A7:A6    ; load ai & ai + 1 from memory
||      LDDW   .D2   *B4++,B7:B6    ; load bi & bi + 1 from memory

        LDDW   .D1   *A4++,A7:A6    ;* load ai & ai + 1 from memory
||      LDDW   .D2   *B4++,B7:B6    ;* load bi & bi + 1 from memory

        LDDW   .D1   *A4++,A7:A6    ;** load ai & ai + 1 from memory
||      LDDW   .D2   *B4++,B7:B6    ;** load bi & bi + 1 from memory

        LDDW   .D1   *A4++,A7:A6    ;*** load ai & ai + 1 from memory
||      LDDW   .D2   *B4++,B7:B6    ;*** load bi & bi + 1 from memory
||[A1]  SUB    .S1   A1,1,A1        ; decrement loop counter

        LDDW   .D1   *A4++,A7:A6    ;**** load ai & ai + 1 from memory
||      LDDW   .D2   *B4++,B7:B6    ;**** load bi & bi + 1 from memory
||[A1]  B      .S2   LOOP           ; branch to loop
||[A1]  SUB    .S1   A1,1,A1        ;* decrement loop counter

        LDDW   .D1   *A4++,A7:A6    ;***** load ai & ai + 1 from memory
||      LDDW   .D2   *B4++,B7:B6    ;***** load bi & bi + 1 from memory
||      MPYSP  .M1X  A6,B6,A5       ; pi = a0 * b0
||      MPYSP  .M2X  A7,B7,B5       ; pi1 = a1 * b1
||[A1]  B      .S2   LOOP           ;* branch to loop
||[A1]  SUB    .S1   A1,1,A1        ;** decrement loop counter

        LDDW   .D1   *A4++,A7:A6    ;****** load ai & ai + 1 from memory
||      LDDW   .D2   *B4++,B7:B6    ;****** load bi & bi + 1 from memory
||      MPYSP  .M1X  A6,B6,A5       ;* pi = a0 * b0
||      MPYSP  .M2X  A7,B7,B5       ;* pi1 = a1 * b1
||[A1]  B      .S2   LOOP           ;** branch to loop
||[A1]  SUB    .S1   A1,1,A1        ;*** decrement loop counter