Writing Parallel Code
Rearranging the order of the instructions also improves the performance of the
code. The SUB instruction replaces one of the NOP delay slots for the LDW
instructions. Moving the B instruction after the SUB removes the need for the
NOP 5 used at the end of the code in Example 6–11 on page 6-16.
The branch now occurs immediately after the ADDSP instruction so that the
MPYSP and ADDSP execute in parallel with the five delay slots required by
the branch instruction.
Since the ADDSP finishes execution before the result is needed, the NOP 3
for delay slots is removed, further reducing cycle count.
6.3.6 Comparing Performance
Executing the fixed-point dot product code in Example 6–10 requires eight
cycles for each iteration plus one cycle to set up the loop counter and initialize
the accumulator; 100 iterations require 801 cycles.
Table 6–1 compares the performance of the nonparallel code with the parallel
code for the fixed-point example.
Table 6–1. Comparison of Nonparallel and Parallel Assembly Code for Fixed-Point Dot Product

Code Example                                                  100 Iterations   Cycle Count
Example 6–9    Fixed-point dot product nonparallel assembly   2 + 100 × 16     1602
Example 6–10   Fixed-point dot product parallel assembly      1 + 100 × 8      801
Executing the floating-point dot product code in Example 6–12 requires ten
cycles for each iteration plus one cycle to set up the loop counter and initialize
the accumulator; 100 iterations require 1001 cycles.
Table 6–2 compares the performance of the nonparallel code with the parallel
code for the floating-point example.
Table 6–2. Comparison of Nonparallel and Parallel Assembly Code for Floating-Point Dot Product

Code Example                                                     100 Iterations   Cycle Count
Example 6–11   Floating-point dot product nonparallel assembly   2 + 100 × 21     2102
Example 6–12   Floating-point dot product parallel assembly      1 + 100 × 10     1001