Software Pipelining the Outer Loop
Optimizing Assembly Code via Linear Assembly
Example 6–71. Final Assembly Code for FIR Filter With Redundant Load Elimination and No Memory Hits With Outer Loop Software-Pipelined (Continued)
            ADD   .D2   B7,B9,B9        ;e sum1 += x2 * h1
    ||      ADD   .L1   A5,A9,A9        ;e sum0 += x2 * h2
    ||      LDH   .D1   *A4++,B8        ;p x0 = x[j]
    ||      ADD   .L2X  A4,4,B1         ;o set up pointer to x[j+2]
    ||      ADD   .S1X  B4,2,A8         ;o set up pointer to h[1]
    ||      MVK   .S2   8,B2            ;o set up inner loop counter

            ADD   .L2X  A7,B9,B9        ;e sum1 += x3 * h2
    ||      ADD   .L1X  B8,A9,A9        ;e sum0 += x3 * h3
    ||      LDH   .D2   *B1++[2],B0     ;p x2 = x[j+i+2]
    ||      LDH   .D1   *A4++[2],A0     ;p x1 = x[j+i+1]
    ||[A2]  SUB   .S1   A2,1,A2         ;o decrement outer loop counter

            ADD   .L2   B7,B9,B9        ;e sum1 += x0 * h3
    ||      SHR   .S1   A9,15,A9        ;e sum0 >> 15
    ||      LDH   .D1   *A8++[2],B6     ;p h1 = h[i+1]
    ||      LDH   .D2   *B4++[2],A1     ;p h0 = h[i]

            SHR   .S2   B9,15,B9        ;e sum1 >> 15
    ||      LDH   .D1   *A4++[2],A5     ;p x3 = x[j+i+3]
    ||      LDH   .D2   *B1++[2],B5     ;p x0 = x[j+i+4]

            STH   .D1   A9,*A6++[2]     ;e y[j] = sum0 >> 15
    ||      STH   .D2   B9,*B11++[2]    ;e y[j+1] = sum1 >> 15
    ||      ZERO  .S1   A9              ;o zero out sum0
    ||      ZERO  .S2   B9              ;o zero out sum1
            ; outer loop branch occurs here
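
As a reading aid for the comments in the listing above, the following C sketch shows the computation the assembly appears to implement: two outputs (sum0 and sum1) per outer-loop iteration, with each x value loaded once and reused by both sums. It is an illustration, not the source the assembly was generated from. The sizes N_OUT = 100 and N_TAPS = 32 are assumptions inferred from the cycle formulas (50 outer iterations of 2 outputs, 8 inner iterations of 4 taps), and the function names are hypothetical.

```c
#include <assert.h>

#define N_OUT  100   /* assumed: 50 outer iterations x 2 outputs each */
#define N_TAPS 32    /* assumed: 8 inner iterations x 4 taps each     */

/* Straightforward FIR: one output per outer iteration. */
static void fir_ref(const short *x, const short *h, short *y)
{
    for (int j = 0; j < N_OUT; j++) {
        int sum = 0;
        for (int i = 0; i < N_TAPS; i++)
            sum += x[j + i] * h[i];
        y[j] = (short)(sum >> 15);
    }
}

/* Two outputs per outer iteration. Each x value is loaded once and
 * carried in x0 to the next tap, which is the idea behind redundant
 * load elimination. */
static void fir_two_outputs(const short *x, const short *h, short *y)
{
    for (int j = 0; j < N_OUT; j += 2) {
        int sum0 = 0, sum1 = 0;
        short x0 = x[j];
        for (int i = 0; i < N_TAPS; i++) {
            short x1 = x[j + i + 1];  /* used by sum1 now, by sum0 next tap */
            short h0 = h[i];
            sum0 += x0 * h0;          /* x[j+i]   * h[i] */
            sum1 += x1 * h0;          /* x[j+i+1] * h[i] */
            x0 = x1;
        }
        y[j]     = (short)(sum0 >> 15);
        y[j + 1] = (short)(sum1 >> 15);
    }
}

/* Runs both versions on small synthetic data (small enough that the
 * 32-bit sums cannot overflow) and checks that the outputs agree. */
static int fir_self_check(void)
{
    short x[N_OUT + N_TAPS], h[N_TAPS], y_ref[N_OUT], y_two[N_OUT];

    for (int i = 0; i < N_OUT + N_TAPS; i++)
        x[i] = (short)((i * 37) % 201 - 100);
    for (int i = 0; i < N_TAPS; i++)
        h[i] = (short)((i * 53) % 401 - 200);

    fir_ref(x, h, y_ref);
    fir_two_outputs(x, h, y_two);

    for (int j = 0; j < N_OUT; j++)
        assert(y_ref[j] == y_two[j]);
    return 1;
}
```

Calling fir_self_check() from a test driver confirms that the two-output form matches the reference. Right-shifting a negative int is implementation-defined in C; the sketch assumes an arithmetic shift, as the SHR instruction performs.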
6.13.4 Comparing Performance
The improved cycle count for this loop is 2006 cycles: 50 × ((7 × 4) + 6 + 6) + 6. The
outer-loop overhead for this loop has been reduced from 16 cycles to 8 cycles (6 + 6 – 4);
the – 4 accounts for the inner loop executing one fewer iteration (seven instead
of eight), since (7 × 4) + 6 + 6 = (8 × 4) + 8.
Table 6–26. Comparison of FIR Filter Code
| Code Example | Cycles | Cycle Count |
|---|---|---|
| Example 6–64: FIR with redundant load elimination | 50 × (16 × 2 + 9 + 6) + 2 | 2352 |
| Example 6–69: FIR with redundant load elimination and no memory hits | 50 × (8 × 4 + 10 + 6) + 2 | 2402 |
| Example 6–71: FIR with redundant load elimination and no memory hits with outer loop software-pipelined | 50 × (7 × 4 + 6 + 6) + 6 | 2006 |
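
The arithmetic behind the three cycle counts and the overhead figures can be checked with a few lines of C. This is only a sketch: the helper names total_cycles and verify_table are hypothetical, and the two per-iteration terms are passed through as plain numbers because the text does not name them individually.

```c
#include <assert.h>
#include <stdio.h>

/* total = outer * (inner * cycles_per_inner + term_a + term_b) + fixed
 * term_a and term_b are the two per-outer-iteration terms that appear in
 * each formula of Table 6-26; fixed is the constant added once at the end. */
static int total_cycles(int outer, int inner, int cycles_per_inner,
                        int term_a, int term_b, int fixed)
{
    return outer * (inner * cycles_per_inner + term_a + term_b) + fixed;
}

/* Checks that each formula reproduces the cycle count in the table and
 * that the overhead figures quoted in the text (16 and 8) are consistent. */
static void verify_table(void)
{
    assert(total_cycles(50, 16, 2,  9, 6, 2) == 2352);  /* Example 6-64 */
    assert(total_cycles(50,  8, 4, 10, 6, 2) == 2402);  /* Example 6-69 */
    assert(total_cycles(50,  7, 4,  6, 6, 6) == 2006);  /* Example 6-71 */

    /* Overhead per outer iteration, measured against 8 inner iterations. */
    assert(10 + 6 == 16);                        /* Example 6-69 */
    assert((7 * 4 + 6 + 6) - (8 * 4) == 8);      /* Example 6-71: 6 + 6 - 4 */
}

/* Prints the three totals in table order. */
static void print_comparison(void)
{
    printf("Example 6-64: %d\n", total_cycles(50, 16, 2,  9, 6, 2));
    printf("Example 6-69: %d\n", total_cycles(50,  8, 4, 10, 6, 2));
    printf("Example 6-71: %d\n", total_cycles(50,  7, 4,  6, 6, 6));
}
```

Per outer iteration, the software-pipelined version takes 40 cycles (7 × 4 + 12) instead of 48 (8 × 4 + 16), which over 50 iterations accounts for nearly all of the difference between 2402 and 2006.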