Modulo Scheduling of Multicycle Loops
6.6.3.2 Translating Unrolled Inner Loop to Linear Assembly
Example 6–37 shows the linear assembly that calculates c[i] and c[i+1] for the
weighted vector sum in Example 6–36.
- The two store pointers (*ciptr and *ci+1ptr) are separated so that one
  (*ciptr) increments by 2 through the c[i] elements of the array and the
  other (*ci+1ptr) increments by 2 through the c[i+1] elements.
- AND and SHR separate bi and bi+1 into two separate registers.
- This code assumes that mask is preloaded with 0x0000FFFF to clear the
  upper 16 bits. The shift right of 16 places bi+1 into the 16 LSBs.
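The AND/SHR separation described above can be sketched in C. This is a minimal illustration rather than code from this guide: the helper names pack_pair, unpack_low, and unpack_high are hypothetical, and the 32-bit word is built explicitly with bi in the 16 LSBs so that the sketch does not depend on host byte order. The shift is written out so that it sign-extends, on the assumption that SHR behaves as an arithmetic shift.

```c
#include <stdint.h>

/* Hypothetical helper: build the 32-bit word that one LDW would return,
 * with bi in the 16 LSBs and bi+1 in the 16 MSBs. */
static uint32_t pack_pair(int16_t bi, int16_t bi1)
{
    return ((uint32_t)(uint16_t)bi1 << 16) | (uint32_t)(uint16_t)bi;
}

/* AND bi_i+1,mask,bi : clears the upper 16 bits, leaving bi zero-extended. */
static int32_t unpack_low(uint32_t word, uint32_t mask)
{
    return (int32_t)(word & mask);
}

/* SHR bi_i+1,16,bi+1 : moves the upper halfword into the 16 LSBs.
 * The sign extension is spelled out so the result is portable C. */
static int32_t unpack_high(uint32_t word)
{
    int32_t hi = (int32_t)(word >> 16);          /* 0 .. 65535 */
    return (hi & 0x8000) ? hi - 0x10000 : hi;    /* sign-extend bit 15 */
}
```

For example, unpack_high(pack_pair(5, -3)) yields -3, while unpack_low(pack_pair(-2, 7), 0x0000FFFF) yields 0xFFFE: the AND leaves bi zero-extended rather than sign-extended. Because STH writes only the 16 LSBs of its source register, those upper bits do not affect the stored result.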
Example 6–37. Linear Assembly for Weighted Vector Sum Using LDW
        LDW     *aptr++,ai_i+1          ; ai & ai+1
        LDW     *bptr++,bi_i+1          ; bi & bi+1
        MPY     m,ai_i+1,pi             ; m * ai
        MPYHL   m,ai_i+1,pi+1           ; m * ai+1
        SHR     pi,15,pi_scaled         ; (m * ai) >> 15
        SHR     pi+1,15,pi+1_scaled     ; (m * ai+1) >> 15
        AND     bi_i+1,mask,bi          ; bi
        SHR     bi_i+1,16,bi+1          ; bi+1
        ADD     pi_scaled,bi,ci         ; ci = ((m * ai) >> 15) + bi
        ADD     pi+1_scaled,bi+1,ci+1   ; ci+1 = ((m * ai+1) >> 15) + bi+1
        STH     ci,*ciptr++[2]          ; store ci
        STH     ci+1,*ci+1ptr++[2]      ; store ci+1
 [cntr] SUB     cntr,1,cntr             ; decrement loop counter
 [cntr] B       LOOP                    ; branch to loop
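As a cross-check on the listing above, the following C sketch mirrors its data flow and compares it with a plain one-element-per-iteration loop. It is an illustration under stated assumptions, not the guide's code: the function names (w_vec_ref, w_vec_pairs) and helpers are hypothetical, the element count n is assumed to be even, and each 32-bit word is assembled explicitly with element i in the 16 LSBs instead of being loaded from memory. The multiplies are written simply as m times each 16-bit half, following the comments in the listing.

```c
#include <stddef.h>
#include <stdint.h>

/* Sign-extend the 16 LSBs of v (portable stand-in for halfword handling). */
static int32_t sext16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Arithmetic shift right by 15, written to be portable for negative x. */
static int32_t shr15(int32_t x)
{
    return (x >= 0) ? (x >> 15) : ~((~x) >> 15);
}

/* STH keeps only the 16 LSBs of the register value. */
static int16_t sth(int32_t v)
{
    return (int16_t)sext16((uint32_t)v);
}

/* Reference: one result per iteration. */
static void w_vec_ref(const int16_t *a, const int16_t *b, int16_t *c,
                      int16_t m, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = sth(shr15((int32_t)m * a[i]) + b[i]);
}

/* Two results per iteration, following the linear assembly step by step. */
static void w_vec_pairs(const int16_t *a, const int16_t *b, int16_t *c,
                        int16_t m, size_t n)
{
    const uint32_t mask = 0x0000FFFFu;   /* preloaded mask */
    size_t ci = 0, ci1 = 1;              /* two separate store indices */

    for (size_t i = 0; i + 1 < n; i += 2) {
        /* LDW: one word holds elements i (LSBs) and i+1 (MSBs). */
        uint32_t ai_i1 = ((uint32_t)(uint16_t)a[i + 1] << 16) | (uint16_t)a[i];
        uint32_t bi_i1 = ((uint32_t)(uint16_t)b[i + 1] << 16) | (uint16_t)b[i];

        int32_t pi  = (int32_t)m * sext16(ai_i1);        /* m * ai   */
        int32_t pi1 = (int32_t)m * sext16(ai_i1 >> 16);  /* m * ai+1 */
        int32_t bi  = (int32_t)(bi_i1 & mask);           /* AND      */
        int32_t bi1 = sext16(bi_i1 >> 16);               /* SHR 16   */

        c[ci]  = sth(shr15(pi)  + bi);   ci  += 2;       /* STH ...++[2] */
        c[ci1] = sth(shr15(pi1) + bi1);  ci1 += 2;       /* STH ...++[2] */
    }
}
```

With m = 16384, a = {1000, -2000, 32767, -32768}, and b = {5, -7, -100, 100}, both functions produce {505, -1007, 16283, -16284}.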
6.6.3.3 Determining a New Minimum Iteration Interval
Use the following considerations to determine the minimum iteration interval
for the assembly instructions in Example 6–37:
- Four memory operations (two LDWs and two STHs) must each use a .D
  unit. With two .D units available, this loop still requires only two cycles.
- Four instructions must use the .S units (three SHRs and one branch). With
  two .S units available, the minimum iteration interval is still 2.
- The two MPYs do not increase the minimum iteration interval.
- Because the remaining four instructions (two ADDs, AND, and SUB) can
  all use a .L unit, the minimum iteration interval for this loop is the same as
  in Example 6–35.
By using LDWs instead of LDHs, the program can do twice as much work in
the same number of cycles.
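The resource counting in the considerations above amounts to taking, for each unit type, the instruction count divided by the number of units (rounded up), and then the maximum over all unit types. The C sketch below works through that arithmetic. The function names are hypothetical, the two .M units are an assumption consistent with the statement that the two MPYs do not raise the bound, and the sketch covers only the resource bound, ignoring any loop-carried dependencies.

```c
#include <stdio.h>

/* Cycles needed to issue `instrs` instructions on `units` identical units. */
static int cycles_needed(int instrs, int units)
{
    return (instrs + units - 1) / units;   /* ceiling division */
}

/* Resource-bound minimum iteration interval: the busiest unit type wins.
 * Assumes two units of each type (.D, .S, .M, .L). */
static int resource_mii(int d_ops, int s_ops, int m_ops, int l_ops)
{
    const int units = 2;
    int counts[4] = { d_ops, s_ops, m_ops, l_ops };
    int mii = 0;

    for (int k = 0; k < 4; k++) {
        int c = cycles_needed(counts[k], units);
        if (c > mii)
            mii = c;
    }
    return mii;
}

/* Print the bound for the loop discussed in the text. */
static void print_mii(void)
{
    /* .D: 2 LDW + 2 STH, .S: 3 SHR + 1 B, .M: 2 multiplies,
     * .L: 2 ADD + AND + SUB */
    int mii = resource_mii(4, 4, 2, 4);
    printf("resource-bound MII = %d cycles for 2 results\n", mii);
}
```

With the counts from this section, resource_mii(4, 4, 2, 4) evaluates to 2, so each two-cycle iteration produces two results, or one cycle per result.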