Writing Parallel Code
The dependency graph for this dot product algorithm has two separate parts
because the decrement of the loop counter and the branch do not read or write
any variables from the other part.
- The SUB instruction writes to the loop counter, cntr. The output of the SUB
  instruction feeds back and creates a loop carry path.
- The branch (B) instruction is a child of the loop counter.
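The two-part structure described above can be checked with a small sketch. The node labels (LDH_a, LDH_b, and so on) are illustrative names, not assembler syntax, and the edges for the load, multiply, and add part are an assumption inferred from how the dot product loop passes values between those instructions; only the SUB and B relationships are stated in the list above.

```python
# Dependency edges (producer -> consumer). Labels are illustrative only.
EDGES = [
    ("LDH_a", "MPY"),  # ai feeds the multiply (assumed from the loop body)
    ("LDH_b", "MPY"),  # bi feeds the multiply (assumed from the loop body)
    ("MPY", "ADD"),    # product feeds the accumulate
    ("ADD", "ADD"),    # loop carry: sum feeds the next ADD
    ("SUB", "SUB"),    # loop carry: cntr feeds the next SUB
    ("SUB", "B"),      # the branch is a child of the loop counter
]


def components(edges):
    """Group nodes into connected components, ignoring edge direction."""
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)
    seen, groups = set(), []
    for start in neighbors:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


def loop_carried(edges):
    """Return the nodes whose output feeds back into themselves."""
    return sorted(src for src, dst in edges if src == dst)


if __name__ == "__main__":
    print(components(EDGES))
    print(loop_carried(EDGES))
```

Under these assumed edges, the counter and branch form one component and the loads, multiply, and add form the other, which is why the two parts can be scheduled independently of each other.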
6.3.5 Nonparallel Versus Parallel Assembly Code
Nonparallel assembly code executes serially, that is, one instruction following
another in sequence. This section explains how to rewrite the instructions
so that they execute in parallel.
6.3.5.1 Fixed-Point Dot Product
Example 6–9 shows the nonparallel assembly code for the fixed-point dot
product loop. The MVK instruction initializes the loop counter to 100. The
ZERO instruction clears the accumulator. The NOP instructions allow for the
delay slots of the LDH, MPY, and B instructions.
Executing this dot product code serially requires 16 cycles for each iteration
plus two cycles to set up the loop counter and initialize the accumulator; 100
iterations require 1602 cycles.
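The cycle count above can be reproduced with a short tally. This is a minimal sketch, assuming each instruction occupies one cycle when issued serially and each NOP n occupies n cycles; the names LOOP_BODY, SETUP, and serial_cycles are hypothetical helpers, not part of any TI tool.

```python
# (instruction, cycles occupied when issued serially)
# Assumption: every instruction takes one issue cycle; NOP n takes n cycles.
LOOP_BODY = [
    ("LDH", 1),    # load ai
    ("LDH", 1),    # load bi
    ("NOP 4", 4),  # delay slots for LDH
    ("MPY", 1),    # ai * bi
    ("NOP", 1),    # delay slot for MPY
    ("ADD", 1),    # sum += (ai * bi)
    ("SUB", 1),    # decrement loop counter
    ("B", 1),      # branch to loop
    ("NOP 5", 5),  # delay slots for branch
]
SETUP = [("MVK", 1), ("ZERO", 1)]


def serial_cycles(iterations, body=LOOP_BODY, setup=SETUP):
    """Total cycles for serial execution: setup once, body per iteration."""
    per_iteration = sum(cycles for _, cycles in body)
    return sum(cycles for _, cycles in setup) + per_iteration * iterations


if __name__ == "__main__":
    print(sum(cycles for _, cycles in LOOP_BODY))  # cycles per iteration
    print(serial_cycles(100))                      # total for 100 iterations
```

Of the 16 cycles in each iteration, 10 are NOP cycles spent waiting on delay slots, which is the slack that parallel scheduling aims to fill.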
Example 6–9. Nonparallel Assembly Code for Fixed-Point Dot Product
        MVK   .S1   100, A1      ; set up loop counter
        ZERO  .L1   A7           ; zero out accumulator
LOOP:
        LDH   .D1   *A4++,A2     ; load ai from memory
        LDH   .D1   *A3++,A5     ; load bi from memory
        NOP         4            ; delay slots for LDH
        MPY   .M1   A2,A5,A6     ; ai * bi
        NOP                      ; delay slot for MPY
        ADD   .L1   A6,A7,A7     ; sum += (ai * bi)
        SUB   .S1   A1,1,A1      ; decrement loop counter
 [A1]   B     .S2   LOOP         ; branch to loop
        NOP         5            ; delay slots for branch
                                 ; Branch occurs here
Assigning the same functional unit (.D1) to both LDH instructions slows
performance of this loop, because the two loads cannot issue in the same cycle.
Therefore, reassign the functional units to execute the code in parallel, as
shown in the dependency graph in Figure 6–3. The parallel assembly code is
shown in Example 6–10.
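The functional-unit conflict in Example 6–9 can be spotted mechanically. The sketch below lists the loop-body instructions with the units assigned in that example and reports any unit used more than once; it assumes only that a functional unit can start one instruction per cycle. The names LOOP_UNITS and shared_units are hypothetical.

```python
from collections import Counter

# (mnemonic, functional unit) for the loop body of the nonparallel example.
LOOP_UNITS = [
    ("LDH", ".D1"),
    ("LDH", ".D1"),
    ("MPY", ".M1"),
    ("ADD", ".L1"),
    ("SUB", ".S1"),
    ("B", ".S2"),
]


def shared_units(instructions):
    """Map each unit used by more than one instruction to those mnemonics."""
    counts = Counter(unit for _, unit in instructions)
    return {
        unit: [mnemonic for mnemonic, u in instructions if u == unit]
        for unit, n in counts.items()
        if n > 1
    }


if __name__ == "__main__":
    print(shared_units(LOOP_UNITS))
```

Only .D1 is shared within the loop body, so the two loads are the one pair that must be assigned to different units before they can execute in parallel.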