Writing Parallel Code
Optimizing Assembly Code via Linear Assembly
Figure 6–4. Dependency Graph of Floating-Point Dot Product with Parallel Assembly
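[Figure: dependency graph with nodes LDW (ai), LDW (bi), MPYSP (pi), ADDSP (sum), SUB (i), and B (LOOP); functional unit labels .D1, .D2, .M1X, .L1, .S1, .S1; path latencies 5, 5, 4, 4, 1, 1.]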
Example 6–12. Parallel Assembly Code for Floating-Point Dot Product
        MVK     .S1     100, A1         ; set up loop counter
||      ZERO    .L1     A7              ; zero out accumulator

LOOP:
        LDW     .D1     *A4++,A2        ; load ai from memory
||      LDW     .D2     *B4++,B2        ; load bi from memory

        SUB     .S1     A1,1,A1         ; decrement loop counter

        NOP             2               ; delay slots for LDW

 [A1]   B       .S2     LOOP            ; branch to loop

        MPYSP   .M1X    A2,B2,A6        ; ai * bi

        NOP             3               ; delay slots for MPYSP

        ADDSP   .L1     A6,A7,A7        ; sum += (ai * bi)

                                        ; Branch occurs here
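The issue-cycle cost of the listing above can be tallied by hand, and the short C sketch below does the same arithmetic. It is only a rough model: the packet table is transcribed from the listing, the names `packet`, `cycles_per_iteration`, and `total_cycles` are hypothetical, and the count assumes no memory stalls and ignores the delay slots of the final ADDSP after the loop exits.

```c
#include <assert.h>
#include <stddef.h>

/* One entry per execute packet in the loop body of the listing.
   A "NOP n" occupies n cycles; every other packet occupies one cycle.
   (Hypothetical model, transcribed by hand from the listing.) */
struct packet {
    const char *text;
    int cycles;
};

static const struct packet loop_body[] = {
    { "LDW || LDW", 1 },  /* both loads issue in the same cycle */
    { "SUB",        1 },
    { "NOP 2",      2 },
    { "[A1] B",     1 },
    { "MPYSP",      1 },
    { "NOP 3",      3 },
    { "ADDSP",      1 },
};

/* Sum the cycles occupied by one pass through the loop body. */
int cycles_per_iteration(void)
{
    int total = 0;
    for (size_t i = 0; i < sizeof loop_body / sizeof loop_body[0]; i++)
        total += loop_body[i].cycles;
    return total;
}

/* One setup cycle (MVK || ZERO) plus the loop body repeated. */
int total_cycles(int iterations)
{
    assert(iterations >= 0);
    return 1 + iterations * cycles_per_iteration();
}
```

Under these assumptions, `cycles_per_iteration()` gives 10 and `total_cycles(100)` gives 1001 for the 100 iterations set up by the MVK instruction.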
Because the loads of ai and bi do not depend on one another, both LDW
instructions can execute in parallel as long as they do not share the same
resources. To schedule the load instructions in parallel, allocate the functional
units as follows:
- ai and the pointer to ai to a functional unit on the A side, .D1
- bi and the pointer to bi to a functional unit on the B side, .D2
Because the MPYSP instruction now has one source operand from the A side and one
from the B side, MPYSP uses the 1X cross path (written as .M1X in the listing).
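For readers who find it easier to follow the data flow in C, the sketch below mirrors the loop, with each statement annotated with the instruction it corresponds to. The function name `dotp` and its parameters are assumptions for illustration; the listing itself fixes the count at 100 and always executes the body at least once, whereas this version accepts any non-negative count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C counterpart of the parallel assembly loop. */
float dotp(const float *a, const float *b, int count)
{
    assert(a != NULL && b != NULL && count >= 0);

    float sum = 0.0f;                   /* ZERO  .L1  A7           */

    for (int i = count; i > 0; i--) {   /* MVK / SUB / [A1] B      */
        /* The two loads read different arrays through different
           pointers, so neither depends on the other. */
        float ai = *a++;                /* LDW   .D1  *A4++,A2     */
        float bi = *b++;                /* LDW   .D2  *B4++,B2     */

        /* The multiply needs both loaded values: one from the
           A side and one from the B side. */
        float pi = ai * bi;             /* MPYSP .M1X A2,B2,A6     */

        /* The accumulate needs the product. */
        sum += pi;                      /* ADDSP .L1  A6,A7,A7     */
    }
    return sum;
}
```

The only statements without a data dependency between them are the two loads, which is why they are the pair that can share an execute packet once they are placed on different sides.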