Software Pipelining
Optimizing Assembly Code via Linear Assembly
6.5.3.5 Removing Extra SUB Instructions
To reduce code size further, you can remove the extra SUB instructions. If you know that the loop count is at least 6, you can eliminate them as shown in Example 6–32 and Example 6–33. The first five branch instructions are made unconditional, because they always execute. (If you do not know that the loop count is at least 6, you must keep the SUB instructions that decrement the counter before each conditional branch, as in Example 6–30 and Example 6–31.) Because six SUB instructions are eliminated, the loop counter is now 51 (57 – 6) for the fixed-point dot product and 53 (59 – 6) for the floating-point dot product. This code shows some improvement over Example 6–30 and Example 6–31. The loop in Example 6–32 requires 63 cycles (5 + 57 + 1) and the loop in Example 6–33 requires 79 cycles (5 + 59 + 15).
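The counter and cycle figures for the fixed-point loop can be checked with a small timing model. The Python sketch below is illustrative only and is not part of the original listings. The names `simulate_branches` and `BRANCH_LATENCY` are hypothetical. The model assumes a single-cycle loop, five unconditional branches ahead of the loop (one per cycle), a branch that reaches its target six cycles after it issues (five delay slots), and a SUB and B in the same execute packet that both test A1 before it is decremented.

```python
BRANCH_LATENCY = 6  # assumed: branch issued in cycle c reaches LOOP in cycle c + 6


def simulate_branches(initial_counter, prolog_branches=5):
    """Return (loop executions, total cycles including the final ADD)."""
    counter = initial_counter
    landings = set()  # cycles in which a previously issued branch reaches LOOP
    cycle = 0

    # Execute packets ahead of LOOP: one unconditional B per cycle.
    for _ in range(prolog_branches):
        cycle += 1
        landings.add(cycle + BRANCH_LATENCY)

    cycle += 1  # fall through into the first execution of LOOP
    loop_runs = 0
    while True:
        loop_runs += 1
        if counter != 0:
            # [A1] B and [A1] SUB both see the value from before the decrement.
            landings.add(cycle + BRANCH_LATENCY)
            counter -= 1
        cycle += 1
        if cycle not in landings:
            break  # no branch arrives: execution falls through to the final ADD

    return loop_runs, cycle  # 'cycle' is the cycle of the ADD after the loop


if __name__ == "__main__":
    print(simulate_branches(51))  # (57, 63): 5 + 57 + 1 cycles
    print(simulate_branches(0))   # (6, 12): the loop never runs fewer than 6 times
```

Under these assumptions a counter of 0 still produces six executions of the loop, which is consistent with the requirement that the loop count be at least 6 before the extra SUB instructions can be removed.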
Example 6–32. Assembly Code for Fixed-Point Dot Product (Software Pipelined With Smallest Code Size)
        B       .S2     LOOP            ; branch to loop
||      MVK     .S1     51,A1           ; set up loop counter

        B       .S2     LOOP            ;* branch to loop

        B       .S2     LOOP            ;** branch to loop
||      ZERO    .L1     A7              ; zero out sum0 accumulator
||      ZERO    .L2     B7              ; zero out sum1 accumulator

        B       .S2     LOOP            ;*** branch to loop
||      ZERO    .L1     A6              ; zero out add input
||      ZERO    .L2     B6              ; zero out add input

        B       .S2     LOOP            ;**** branch to loop
||      ZERO    .L1     A2              ; zero out mpy input
||      ZERO    .L2     B2              ; zero out mpy input

LOOP:
        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1
||[A1]  SUB     .S1     A1,1,A1         ;****** decrement loop counter
||[A1]  B       .S2     LOOP            ;***** branch to loop
||      LDW     .D1     *A4++,A2        ;******* ld ai & ai+1 fm memory
||      LDW     .D2     *B4++,B2        ;******* ld bi & bi+1 fm memory
        ; Branch occurs here

        ADD     .L1X    A7,B7,A4        ; sum = sum0 + sum1
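To see why zeroing A2, B2, A6, B6, A7, and B7 lets the loop body run without a separate prolog, the data flow of Example 6–32 can be modeled in software. The sketch below is a rough Python model, not TI code. `pipelined_dotp` and `GARBAGE` are hypothetical names. The latencies (a loaded value readable 5 cycles after the LDW issues, a product 2 cycles after the MPY or MPYH, a sum 1 cycle after the ADD) are modeling assumptions, as is the choice that the loop runs 7 iterations more than the number of packed pairs (50 pairs would give the 57 iterations quoted in the text). 32-bit overflow is not modeled, and the padding value stands in for whatever the extra loads would read beyond the arrays.

```python
GARBAGE = 0x1234  # stand-in for data read by the loads issued after the last pair


def pipelined_dotp(a, b):
    """Model the single-cycle LOOP packet of the fixed-point dot product."""
    assert len(a) == len(b) and len(a) % 2 == 0
    pairs = len(a) // 2
    iterations = pairs + 7            # assumed: 50 pairs -> 57 iterations
    mem_a = list(a) + [GARBAGE] * 14  # 7 extra loads, 2 halfwords each
    mem_b = list(b) + [GARBAGE] * 14

    # Register state left by the ZERO instructions ahead of the loop.
    # A2/B2 hold a packed pair as (low halfword, high halfword).
    regs = {"A2": (0, 0), "B2": (0, 0), "A6": 0, "B6": 0, "A7": 0, "B7": 0}
    pending = []  # (cycle in which the value becomes readable, register, value)

    for cycle in range(iterations + 1):
        # Make results whose delay slots have elapsed visible.
        for when, reg, value in pending:
            if when == cycle:
                regs[reg] = value
        pending = [p for p in pending if p[0] > cycle]
        if cycle == iterations:
            break  # loop has exited; only the final ADD remains

        a2, b2 = regs["A2"], regs["B2"]
        # ADD .L1 / ADD .L2: accumulate the products finished earlier.
        pending.append((cycle + 1, "A7", regs["A7"] + regs["A6"]))
        pending.append((cycle + 1, "B7", regs["B7"] + regs["B6"]))
        # MPY uses the low halfwords, MPYH the high halfwords.
        pending.append((cycle + 2, "A6", a2[0] * b2[0]))
        pending.append((cycle + 2, "B6", a2[1] * b2[1]))
        # LDW .D1 / LDW .D2 with post-increment: fetch the next packed pair.
        pending.append((cycle + 5, "A2", (mem_a[2 * cycle], mem_a[2 * cycle + 1])))
        pending.append((cycle + 5, "B2", (mem_b[2 * cycle], mem_b[2 * cycle + 1])))

    return regs["A7"] + regs["B7"]  # final ADD: sum = sum0 + sum1


if __name__ == "__main__":
    a = list(range(100))
    b = list(range(100, 0, -1))
    print(pipelined_dotp(a, b) == sum(x * y for x, y in zip(a, b)))
```

In this model the first seven ADDs accumulate only zeros, and the values fetched by the last seven loads never reach the accumulators, so the result matches a plain dot product even though every iteration executes the full loop body.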