Redundant Load Elimination
Optimizing Assembly Code via Linear Assembly
Example 6–63. Linear Assembly for Full FIR Code (Continued)
LOOP:   .trip 16

        LDH   .D2    *x_1++[2],x1     ; x1 = x[j+i+1]
        LDH   .D1    *h++[2],h0       ; h0 = h[i]
        MPY   .M1    x0,h0,p00        ; x0 * h0
        MPY   .M1X   x1,h0,p10        ; x1 * h0
        ADD   .L1    p00,sum0,sum0    ; sum0 += x0 * h0
        ADD   .L2X   p10,sum1,sum1    ; sum1 += x1 * h0

        LDH   .D1    *x++[2],x0       ; x0 = x[j+i+2]
        LDH   .D2    *h_1++[2],h1     ; h1 = h[i+1]
        MPY   .M2    x1,h1,p01        ; x1 * h1
        MPY   .M2X   x0,h1,p11        ; x0 * h1
        ADD   .L1X   p01,sum0,sum0    ; sum0 += x1 * h1
        ADD   .L2    p11,sum1,sum1    ; sum1 += x0 * h1

 [ctr]  SUB   .S2    ctr,1,ctr        ; decrement loop counter
 [ctr]  B     .S2    LOOP             ; branch to loop

        SHR          sum0,15,sum0     ; sum0 >> 15
        SHR          sum1,15,sum1     ; sum1 >> 15
        STH          sum0,*y++        ; y[j] = sum0 >> 15
        STH          sum1,*y++        ; y[j+1] = sum1 >> 15

        SUB          x,rstx,x         ; reset x pointer to x[j]
        SUB          h_1,rsth,h_1     ; reset h pointer to h[0]

 [octr] B            OUTLOOP          ; branch to outer loop

        .endproc
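The data flow of the listing above can be sketched in C. This is an illustrative model only, not code from the guide: the function names fir_reference and fir_unrolled and the parameters nh and ny are hypothetical. It assumes that x0 is loaded with x[j] and both sums are cleared before the inner loop starts (that part of the outer loop is not shown in this portion of the example), that nh and ny are even, and that x holds at least ny + nh - 1 samples. Each pass of the unrolled inner loop issues four loads for four multiply-accumulates, where the straightforward loop needs eight.

```c
/* Straightforward FIR: two loads for every multiply-accumulate. */
void fir_reference(const short *x, const short *h, short *y, int nh, int ny)
{
    for (int j = 0; j < ny; j++) {
        int sum = 0;
        for (int i = 0; i < nh; i++)
            sum += x[i + j] * h[i];
        y[j] = (short)(sum >> 15);
    }
}

/* Hypothetical C model of the unrolled loop: outputs j and j+1 are
 * computed together, and each loaded x or h value is used twice. */
void fir_unrolled(const short *x, const short *h, short *y, int nh, int ny)
{
    for (int j = 0; j < ny; j += 2) {
        int sum0 = 0, sum1 = 0;
        short x0 = x[j];               /* assumed to be loaded before the loop */

        for (int i = 0; i < nh; i += 2) {
            short x1 = x[j + i + 1];   /* x1 = x[j+i+1] */
            short h0 = h[i];           /* h0 = h[i]     */
            sum0 += x0 * h0;           /* sum0 += x0 * h0 */
            sum1 += x1 * h0;           /* sum1 += x1 * h0 */

            x0 = x[j + i + 2];         /* x0 = x[j+i+2], reused next pass */
            short h1 = h[i + 1];       /* h1 = h[i+1]   */
            sum0 += x1 * h1;           /* sum0 += x1 * h1 */
            sum1 += x0 * h1;           /* sum1 += x0 * h1 */
        }

        y[j]     = (short)(sum0 >> 15);   /* y[j]   = sum0 >> 15 */
        y[j + 1] = (short)(sum1 >> 15);   /* y[j+1] = sum1 >> 15 */
    }
}
```

Both functions accumulate the same products for each output, so they should produce identical results as long as the sums do not overflow an int. The sketch ignores functional units, cross paths, and pointer resets, which only matter in the assembly.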
6.11.6 Final Assembly
Example 6–64 shows the final assembly for the FIR filter without redundant
load instructions. After the inner loop completes, a branch to OUTLOOP starts
the next outer loop iteration. The outer loop counter is 50 because each pass
through the outer loop computes iterations j and j + 1. The inner loop counter
is 16 because each pass through the inner loop computes iterations i and i + 1.
The cycle count for this nested loop is 2352 cycles: 50 × (16 × 2 + 9 + 6) + 2.
Fifteen cycles are overhead for each outer loop:
- Nine cycles execute the inner loop prolog.
- Six cycles execute the branch to the outer loop.
See section 6.13, Software Pipelining the Outer Loop, on page 6-131 for
information on how to reduce this overhead.
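The cycle count arithmetic can be checked with a small C sketch. The function and parameter names are hypothetical. Reading the 16 × 2 term as 16 inner loop iterations at two cycles each is an assumption, and the trailing 2 cycles are taken directly from the formula because the text does not say what they cover.

```c
/* Hypothetical helper that reproduces the nested-loop cycle formula:
 * outer * (inner * cycles_per_inner + prolog + branch) + extra        */
int nested_loop_cycles(int outer, int inner, int cycles_per_inner,
                       int prolog, int branch, int extra)
{
    int per_outer = inner * cycles_per_inner   /* inner loop body (assumed meaning) */
                  + prolog                     /* inner loop prolog                 */
                  + branch;                    /* branch to the outer loop          */
    return outer * per_outer + extra;          /* extra cycles outside the loops    */
}

/* Overhead cycles spent in each outer loop outside the inner loop body. */
int outer_loop_overhead(int prolog, int branch)
{
    return prolog + branch;
}
```

With the values from this section, nested_loop_cycles(50, 16, 2, 9, 6, 2) evaluates to 50 × 47 + 2 = 2352, and outer_loop_overhead(9, 6) gives the fifteen overhead cycles, which account for 750 of the 2352 cycles.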