Writing Parallel Code
Rearranging the order of the instructions also improves the performance of the
code. The SUB instruction can take the place of one of the NOP delay slots
for the LDH instructions. Moving the B instruction after the SUB removes the
need for the NOP 5 used at the end of the code in Example 6–9.
The branch now occurs immediately after the ADD instruction so that the MPY
and ADD execute in parallel with the five delay slots required by the branch
instruction.
6.3.5.2 Floating-Point Dot Product
Similarly, Example 6–11 shows the nonparallel assembly code for the
floating-point dot product loop. The MVK instruction initializes the loop
counter to 100. The ZERO instruction clears the accumulator. The NOP
instructions fill the delay slots of the LDW, MPYSP, ADDSP, and B
instructions.
Executing this dot product code serially requires 21 cycles for each
iteration, plus two cycles to set up the loop counter and initialize the
accumulator; 100 iterations therefore require 100 × 21 + 2 = 2102 cycles.
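The cycle count above can be checked by adding up the cost of each line in the serial loop body. The following is only a minimal sketch of that arithmetic in C: the names loop_body_cycles, serial_cycles_per_iteration, and serial_total_cycles are hypothetical and do not come from the guide, and the sketch assumes that every instruction issues in one cycle and that NOP n occupies n cycles.

```c
#include <assert.h>
#include <stddef.h>

/* Cycles consumed by each line of the serial loop body in Example 6-11,
 * in program order. Assumption: one cycle per issued instruction and
 * n cycles for "NOP n". */
static const int loop_body_cycles[] = {
    1, /* LDW   (load ai)              */
    1, /* LDW   (load bi)              */
    4, /* NOP 4 (LDW delay slots)      */
    1, /* MPYSP                        */
    3, /* NOP 3 (MPYSP delay slots)    */
    1, /* ADDSP                        */
    3, /* NOP 3 (ADDSP delay slots)    */
    1, /* SUB   (decrement counter)    */
    1, /* B     (branch to LOOP)       */
    5  /* NOP 5 (branch delay slots)   */
};

/* MVK and ZERO run once, before the loop. */
enum { SETUP_CYCLES = 2 };

/* Sum the per-line costs of one pass through the loop body. */
int serial_cycles_per_iteration(void)
{
    int total = 0;
    size_t count = sizeof loop_body_cycles / sizeof loop_body_cycles[0];
    for (size_t i = 0; i < count; i++) {
        total += loop_body_cycles[i];
    }
    return total;
}

/* Total cycles for the setup code plus the given number of iterations. */
long serial_total_cycles(int iterations)
{
    assert(iterations >= 0);
    return SETUP_CYCLES + (long)iterations * serial_cycles_per_iteration();
}
```

Calling serial_total_cycles(100) reproduces the figure quoted in the text. Changing an entry in the table shows how much each group of delay slots contributes to the per-iteration cost.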
Example 6–11. Nonparallel Assembly Code for Floating-Point Dot Product
        MVK     .S1     100, A1         ; set up loop counter
        ZERO    .L1     A7              ; zero out accumulator
LOOP:
        LDW     .D1     *A4++,A2        ; load ai from memory
        LDW     .D1     *A3++,A5        ; load bi from memory
        NOP             4               ; delay slots for LDW
        MPYSP   .M1     A2,A5,A6        ; ai * bi
        NOP             3               ; delay slots for MPYSP
        ADDSP   .L1     A6,A7,A7        ; sum += (ai * bi)
        NOP             3               ; delay slots for ADDSP
        SUB     .S1     A1,1,A1         ; decrement loop counter
 [A1]   B       .S2     LOOP            ; branch to loop
        NOP             5               ; delay slots for branch
                                        ; Branch occurs here
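For orientation, the computation performed by the assembly loop can be written in C roughly as follows. This is an illustrative sketch, not code from the guide: the function name dotp and the parameters a, b, and n are assumptions, and the comments map each statement to the instruction it most plausibly corresponds to. Note that the assembly version hard-codes the count to 100 and always executes the body at least once, whereas this sketch accepts any non-negative n.

```c
#include <assert.h>

/* Hypothetical C equivalent of the floating-point dot product loop.
 * a and b point to n single-precision values each. */
float dotp(const float *a, const float *b, int n)
{
    assert(n >= 0);

    float sum = 0.0f;               /* ZERO  A7: clear the accumulator   */
    for (int i = 0; i < n; i++) {   /* MVK/SUB/B: count the iterations   */
        float ai = a[i];            /* LDW   *A4++,A2                    */
        float bi = b[i];            /* LDW   *A3++,A5                    */
        sum += ai * bi;             /* MPYSP A2,A5,A6; ADDSP A6,A7,A7    */
    }
    return sum;
}
```

Each statement in the loop body depends on the one before it, which is why the serial assembly version must wait out every delay slot before issuing the next instruction.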
Assigning the same functional unit to both LDW instructions slows
performance of this loop. Therefore, reassign the functional units to execute
the code in parallel, as shown in the dependency graph in Figure 6–4. The
parallel assembly code is shown in Example 6–12.