Writing Parallel Code
6-13
Optimizing Assembly Code via Linear Assembly
The dependency graph for this dot product algorithm has two separate parts
because the decrement of the loop counter and the branch do not read or write
any variables from the other part.
- The SUB instruction writes to the loop counter, cntr. The output of the SUB
  instruction feeds back and creates a loop carry path.
- The branch (B) instruction is a child of the loop counter.
6.3.4.2 Floating-Point Dot Product
Similarly, Figure 6–2 shows the dependency graph for the floating-point dot
product assembly instructions shown in Example 6–8 and their corresponding
register allocations.
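As a reading aid, the loop can be sketched in C with one statement per node of the dependency graph. This is a minimal sketch, not the code of Example 6–8: the function name dotp_sp and its signature are assumptions made for illustration, and the comments only indicate which instruction each statement roughly corresponds to.

```c
/* Hypothetical C rendering of the floating-point dot product loop.
 * Each statement is annotated with the instruction it loosely mirrors. */
float dotp_sp(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    int cntr = n;               /* loop counter */

    while (cntr > 0) {
        float ai = *a++;        /* LDW:   load ai */
        float bi = *b++;        /* LDW:   load bi */
        float pi = ai * bi;     /* MPYSP: pi depends on both loads */
        sum += pi;              /* ADDSP: sum feeds the next iteration */
        cntr -= 1;              /* SUB:   cntr feeds the next iteration */
    }                           /* B:     branch back while cntr is nonzero */
    return sum;
}
```

The counter statements never read ai, bi, pi, or sum, and the arithmetic never reads cntr, which is why the graph falls into two separate parts.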
Figure 6–2. Dependency Graph of Floating-Point Dot Product
[Figure not reproduced. Nodes (instruction mnemonic, functional unit): LDW (.D1), LDW (.D1), MPYSP (.M1), ADDSP (.L1), SUB (.S1), B (.S1, branch to LOOP). Variables being written: ai, bi, pi, sum, cntr. Register allocations shown: A2, A5, A6, A7, A1. Path labels (number of cycles required to complete an instruction): 5, 5, 4, 4, 1, 1.]
- The two LDW instructions, which write the values of ai and bi, are parents
  of the MPYSP instruction. It takes five cycles for the parent (LDW)
  instruction to complete. Therefore, if LDW is scheduled on cycle i, then
  its child (MPYSP) cannot be scheduled until cycle i + 5.
- The MPYSP instruction, which writes the product pi, is the parent of the
  ADDSP instruction. The MPYSP instruction takes four cycles to complete.
- The ADDSP instruction adds pi (the result of the MPYSP) to sum. The
  output of the ADDSP instruction feeds back to become an input on the next
  iteration and, thus, creates a loop carry path. (See section 6.7 on page
  6-77 for more information on loop carry paths.)
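The parent/child timing rule above can be sketched as a small calculation: a child may issue no earlier than the latest of (parent issue cycle + parent latency). The helper names below are hypothetical, the latencies are the five cycles (LDW) and four cycles (MPYSP) stated in the text, and the sketch deliberately ignores the loop carry path through sum.

```c
/* Latencies in cycles, taken from the text above. */
enum { LDW_LATENCY = 5, MPYSP_LATENCY = 4 };

/* Earliest cycle on which a child can be scheduled: the maximum over all
 * parents of (parent issue cycle + parent latency). Hypothetical helper. */
int earliest_child_cycle(const int *parent_cycle,
                         const int *parent_latency,
                         int n_parents)
{
    int earliest = 0;
    for (int k = 0; k < n_parents; k++) {
        int ready = parent_cycle[k] + parent_latency[k];
        if (ready > earliest) {
            earliest = ready;
        }
    }
    return earliest;
}

/* MPYSP has two LDW parents (ai and bi). */
int earliest_mpysp_cycle(int ldw_ai_cycle, int ldw_bi_cycle)
{
    int cycles[2] = { ldw_ai_cycle, ldw_bi_cycle };
    int latencies[2] = { LDW_LATENCY, LDW_LATENCY };
    return earliest_child_cycle(cycles, latencies, 2);
}

/* ADDSP has one MPYSP parent (pi); the loop-carried sum is not modeled. */
int earliest_addsp_cycle(int mpysp_cycle)
{
    int cycles[1] = { mpysp_cycle };
    int latencies[1] = { MPYSP_LATENCY };
    return earliest_child_cycle(cycles, latencies, 1);
}
```

For example, with both LDW instructions scheduled on cycle 0, earliest_mpysp_cycle(0, 0) gives cycle 5, and earliest_addsp_cycle(5) gives cycle 9.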