Software Pipelining
Example 6–25. Linear Assembly for Full Floating-Point Dot Product
        .global _dotp
_dotp:  .cproc  a, b
        .reg    sum, sum0, sum1, cntr
        .reg    ai:ai1, bi:bi1, pi, pi1

        MVK     50,cntr            ; cntr = 100/2
        ZERO    sum0               ; sum0 = 0
        ZERO    sum1               ; sum1 = 0

LOOP:   .trip   50
        LDDW    *a++,ai:ai1        ; load ai & ai+1 from memory
        LDDW    *b++,bi:bi1        ; load bi & bi+1 from memory
        MPYSP   ai,bi,pi           ; ai * bi
        MPYSP   ai1,bi1,pi1        ; ai+1 * bi+1
        ADDSP   pi,sum0,sum0       ; sum0 += (ai * bi)
        ADDSP   pi1,sum1,sum1      ; sum1 += (ai+1 * bi+1)
 [cntr] SUB     cntr,1,cntr        ; decrement loop counter
 [cntr] B       LOOP               ; branch to loop

        ADDSP   sum0,sum1,sum      ; compute final result

        .return sum
        .endproc
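
For readers who find the linear assembly hard to follow, the computation in Example 6–25 can be sketched in C. This is only an illustration, not code from the guide: the function name dotp mirrors the _dotp label, the macro N is a hypothetical name, and the array length of 100 is assumed from the "cntr = 100/2" comment. Each loop iteration stands for one pair of LDDW loads, the two MPYSP/ADDSP pairs, and the counter decrement; the return statement stands for the final ADDSP.

```c
#define N 100  /* assumed element count, taken from the "cntr = 100/2" comment */

/*
 * Dot product of two N-element single-precision arrays, unrolled by two
 * with two independent accumulators, as in the linear assembly listing.
 */
float dotp(const float *a, const float *b)
{
    float sum0 = 0.0f;   /* accumulates products of even-indexed elements */
    float sum1 = 0.0f;   /* accumulates products of odd-indexed elements  */
    int cntr;

    for (cntr = N / 2; cntr > 0; cntr--) {
        sum0 += a[0] * b[0];   /* first element of the doubleword pair  */
        sum1 += a[1] * b[1];   /* second element of the doubleword pair */
        a += 2;                /* advance past the two floats just used */
        b += 2;
    }

    return sum0 + sum1;        /* combine the partial sums */
}
```

Calling dotp with a[i] = i + 1 and b[i] = 2 for i = 0 to 99 returns 10100. Because floating-point addition is not associative, the two-accumulator form can round differently from a single running sum on other inputs.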
6.5.3 Final Assembly
Example 6–26 shows the assembly code for the fixed-point software-pipelined
dot product in Table 6–7 on page 6-35. Example 6–27 shows the assembly code
for the floating-point software-pipelined dot product in Table 6–8 on
page 6-36. The accumulators are initialized to 0 and the loop counter is set up
in the first execute packet in parallel with the first load instructions. The
asterisks in the comments correspond with those in Table 6–7 and Table 6–8,
respectively.
Note:
All instructions executing in parallel constitute an execute packet. An
execute packet can contain up to eight instructions.
See the TMS320C6000 CPU and Instruction Set Reference Guide for more
information about pipeline operation.