Software Pipelining
ADDSPs all execute exactly 50 times. (The shaded areas of Example 6–29
indicate the changes in this code.)
Executing the dot product code in Example 6–29 with no extraneous LDDWs
still requires a total of 74 cycles (9 + 41 + 9 + 15), but the code size is now
larger.
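For orientation, the arithmetic that the fixed-point listing in Example 6–28 pipelines can be sketched in C. This is a hypothetical reference model, not code from this guide: the function name dotp_pairs and its signature are assumptions, and the final addition of the two partial sums is assumed because it is not visible in the excerpt of the listing. Each loop pass consumes one pair of 16-bit elements from each array, mirroring one LDW per array, with the even-indexed product going to sum0 (the MPY result) and the odd-indexed product going to sum1 (the MPYH result).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model (not taken from the guide).
 * a and b hold n signed 16-bit elements; n is assumed to be even so that
 * every pass handles one pair, as one LDW per array does in the listing. */
static int32_t dotp_pairs(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t sum0 = 0; /* plays the role of the sum0 accumulator (A7) */
    int32_t sum1 = 0; /* plays the role of the sum1 accumulator (B7) */

    assert(n % 2 == 0); /* the model only handles whole pairs */

    for (size_t i = 0; i < n; i += 2) {
        sum0 += (int32_t)a[i] * b[i];         /* ai * bi,     as MPY  */
        sum1 += (int32_t)a[i + 1] * b[i + 1]; /* ai+1 * bi+1, as MPYH */
    }

    /* Assumption: the two partial sums are combined after the loop. */
    return sum0 + sum1;
}
```

Keeping two independent accumulators is what lets the two multiplies in each execute packet proceed in parallel on the .M1 and .M2 units without depending on each other's result.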
Example 6–28. Assembly Code for Fixed-Point Dot Product (Software Pipelined With No Extraneous Loads)
        LDW     .D1     *A4++,A2    ; load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2    ; load bi & bi+1 from memory
||      MVK     .S1     43,A1       ; set up loop counter
||      ZERO    .L1     A7          ; zero out sum0 accumulator
||      ZERO    .L2     B7          ; zero out sum1 accumulator

  [A1]  SUB     .S1     A1,1,A1     ; decrement loop counter
||      LDW     .D1     *A4++,A2    ;* load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2    ;* load bi & bi+1 from memory

  [A1]  SUB     .S1     A1,1,A1     ;* decrement loop counter
||[A1]  B       .S2     LOOP        ; branch to loop
||      LDW     .D1     *A4++,A2    ;** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2    ;** load bi & bi+1 from memory

  [A1]  SUB     .S1     A1,1,A1     ;** decrement loop counter
||[A1]  B       .S2     LOOP        ;* branch to loop
||      LDW     .D1     *A4++,A2    ;*** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2    ;*** load bi & bi+1 from memory

  [A1]  SUB     .S1     A1,1,A1     ;*** decrement loop counter
||[A1]  B       .S2     LOOP        ;** branch to loop
||      LDW     .D1     *A4++,A2    ;**** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2    ;**** load bi & bi+1 from memory

        MPY     .M1X    A2,B2,A6    ; ai * bi
||      MPYH    .M2X    A2,B2,B6    ; ai+1 * bi+1
||[A1]  SUB     .S1     A1,1,A1     ;**** decrement loop counter
||[A1]  B       .S2     LOOP        ;*** branch to loop
||      LDW     .D1     *A4++,A2    ;***** ld ai & ai+1 from memory
||      LDW     .D2     *B4++,B2    ;***** ld bi & bi+1 from memory

        MPY     .M1X    A2,B2,A6    ;* ai * bi
||      MPYH    .M2X    A2,B2,B6    ;* ai+1 * bi+1
||[A1]  SUB     .S1     A1,1,A1     ;***** decrement loop counter
||[A1]  B       .S2     LOOP        ;**** branch to loop
||      LDW     .D1     *A4++,A2    ;****** ld ai & ai+1 from memory
||      LDW     .D2     *B4++,B2    ;****** ld bi & bi+1 from memory