Writing Parallel Code
Optimizing Assembly Code via Linear Assembly
Figure 6–3. Dependency Graph of Fixed-Point Dot Product with Parallel Assembly
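[Figure: dependency graph. Node labels: LDH (ai), LDH (bi), MPY (pi), ADD (sum), SUB (i), B (LOOP). Functional-unit annotations: .D1, .D2, .M1X, .L1, .S1, .S1. Edge latencies: 5, 5, 2, 1, 1, 1.]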
Example 6–10. Parallel Assembly Code for Fixed-Point Dot Product
        MVK   .S1   100, A1      ; set up loop counter
||      ZERO  .L1   A7           ; zero out accumulator
LOOP:
        LDH   .D1   *A4++,A2     ; load ai from memory
||      LDH   .D2   *B4++,B2     ; load bi from memory
        SUB   .S1   A1,1,A1      ; decrement loop counter
  [A1]  B     .S2   LOOP         ; branch to loop
        NOP         2            ; delay slots for LDH
        MPY   .M1X  A2,B2,A6     ; ai * bi
        NOP                      ; delay slots for MPY
        ADD   .L1   A6,A7,A7     ; sum += (ai * bi)
                                 ; Branch occurs here
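As a cross-check on what the listing computes, the loop can be sketched in C. This is a minimal reference model, not code from this guide: the function name dotp_ref and its parameters are hypothetical, and it assumes the inputs are signed 16-bit values (as loaded by LDH) accumulated into a 32-bit sum (as in A7).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model of the loop in Example 6-10.
 * a and b hold signed 16-bit inputs; the products are accumulated
 * in a 32-bit sum. count plays the role of the value loaded by MVK. */
int32_t dotp_ref(const int16_t *a, const int16_t *b, int count)
{
    assert(a != NULL && b != NULL && count >= 0);

    int32_t sum = 0;                      /* ZERO A7                */
    for (int i = 0; i < count; i++) {     /* SUB / [A1] B           */
        sum += (int32_t)a[i] * b[i];      /* LDH, LDH, MPY, ADD     */
    }
    return sum;
}
```

The model assumes the accumulated sum stays within the 32-bit range. Unlike the assembly loop, which always executes its body at least once, the C loop does nothing when count is 0.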
Because the loads of ai and bi do not depend on one another, both LDH
instructions can execute in parallel as long as they do not share the same
resources. To schedule the load instructions in parallel, allocate the functional
units as follows:
- ai and the pointer to ai to a functional unit on the A side, .D1
- bi and the pointer to bi to a functional unit on the B side, .D2
Because the MPY instruction now has one source operand from A and one
from B, MPY uses the 1X cross path.
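The cycle cost of the schedule in Example 6-10 can be tallied with a short sketch. It assumes that each execute packet occupies one cycle, that NOP n occupies n cycles, and that there are no memory stalls. The struct and function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One execute packet from the loop body of Example 6-10. */
struct packet {
    const char *text;   /* instructions issued together in this packet */
    int cycles;         /* 1 for a normal packet, n for NOP n          */
};

/* Loop body as listed; the two LDH instructions share one packet. */
static const struct packet loop_body[] = {
    { "LDH .D1 || LDH .D2", 1 },
    { "SUB .S1",            1 },
    { "[A1] B .S2",         1 },
    { "NOP 2",              2 },
    { "MPY .M1X",           1 },
    { "NOP",                1 },
    { "ADD .L1",            1 },
};

/* Sum the cycles occupied by one pass through the loop body. */
int cycles_per_iteration(void)
{
    int total = 0;
    for (size_t i = 0; i < sizeof loop_body / sizeof loop_body[0]; i++) {
        total += loop_body[i].cycles;
    }
    return total;
}

/* One setup cycle (MVK || ZERO) plus the loop iterations. */
int total_cycles(int iterations)
{
    assert(iterations >= 1);
    return 1 + iterations * cycles_per_iteration();
}
```

Under these assumptions the loop body takes 8 cycles, so 100 iterations plus the single setup packet take 801 cycles. The tally also shows that the branch issued in the third cycle of the body is followed by exactly five cycles before the loop repeats, which matches the "Branch occurs here" comment after the ADD.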