Linear Assembly Considerations
Example 8–27 below shows the assembly output generated by the assembly
optimizer for the weighted vector sum loop kernel compiled with the
-mv6400 -o3 -mt -mi -k -mg options:
Example 8–27. Avoiding Cross Path Stalls: Assembly Output Generated for Weighted
Vector Sum Loop Kernel
LOOP:      ; PIPED LOOP KERNEL
           STH     .D1T1   A6,*A8++(4)     ; store 16 bits (cn)
||         ADD     .L2X    B9,A16,B9       ; add bn + copy of sprod0
||         MV      .L1     A3,A16          ; copy sprod0 to another register
||         SHR     .S1     A5,0x10,A3      ; shift bn & bn+1 by 16 to isolate bn+1
|| [ B0]   BDEC    .S2     LOOP,B0         ; branch to loop & decrement loop count
||         MPY     .M1X    B17,A7,A4       ; multiply an by a constant ; prod0
||         MPYHL   .M2     B17,B4,B16      ; multiply an+1 by a constant ; prod1
||         LDW     .D2T2   *B6++,B17       ; load 32 bits (an & an+1)

           STH     .D2T2   B9,*B7++(4)     ; store 16 bits (cn+1)
||         ADD     .L1X    A3,B8,A6        ; add bn+1 + sprod1
||         AND     .L2X    A5,B5,B9        ; AND bn & bn+1 with mask to isolate bn
||         SHR     .S2     B16,0xf,B8      ; shift prod1 right by 15 -> sprod1
||         SHR     .S1     A4,0xf,A3       ; shift prod0 right by 15 -> sprod0
||         LDW     .D1T1   *A9++,A5        ; load 32 bits (bn & bn+1)
In Example 8–27, the assembly optimizer has created a two-cycle loop without
a cross path stall. The loop count decrement instruction and the conditional
branch on the loop count have been replaced with a single BDEC instruction.
An MV instruction has been placed in the instruction slot freed by combining
these two instructions into one. The MV instruction copies the value in the
source register to the destination register: the value in A3 (sprod0) is
placed into A16, and A16 is then used as a cross path operand to the .L2
functional unit. A16 is updated every two cycles, for example in cycles 2, 4,
6, 8, and so on. The value of A16 from the previous loop iteration is used as
the cross path operand to the .L2 unit in those same cycles, so the operand
read over the cross path is never one that was written in the immediately
preceding cycle. This rescheduling prevents the cross path stall. Again, this
is a two-cycle loop that produces two 16-bit results per loop iteration.
Further optimization of this algorithm can be achieved by unrolling the loop
one more time.
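
As a rough illustration of the data flow described by the comments in
Example 8–27, the kernel can be restated in C. This is only a sketch under
stated assumptions, not the source that produced the listing: the function
names w_vec_ref and w_vec_x2 and their parameter lists are hypothetical, the
packed 32-bit loads and the mask/shift unpacking of bn and bn+1 are
simplified to ordinary 16-bit array accesses, and the right shift of a
negative value is assumed to be arithmetic (as on the usual compilers). The
variable copy0 stands in for the MV of sprod0 into a second register; in C
it has no timing effect and is shown only to mirror the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference form: one 16-bit result per iteration.
 * c[n] = ((m * a[n]) >> 15) + b[n]                               */
static void w_vec_ref(const int16_t *a, const int16_t *b, int16_t *c,
                      int16_t m, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = (int16_t)(((m * a[i]) >> 15) + b[i]);
}

/* Hypothetical form unrolled by two, mirroring the kernel's comments:
 * two products, two shifts, two adds, and two 16-bit stores per pass.
 * Assumes n is even.                                                  */
static void w_vec_x2(const int16_t *a, const int16_t *b, int16_t *c,
                     int16_t m, size_t n)
{
    assert(n % 2 == 0);                 /* sketch handles even counts only */

    for (size_t i = 0; i < n; i += 2) {
        int32_t prod0  = m * a[i];      /* MPY:   an   * constant -> prod0 */
        int32_t prod1  = m * a[i + 1];  /* MPYHL: an+1 * constant -> prod1 */
        int32_t sprod0 = prod0 >> 15;   /* SHR by 15 -> sprod0             */
        int32_t sprod1 = prod1 >> 15;   /* SHR by 15 -> sprod1             */
        int32_t copy0  = sprod0;        /* MV: copy of sprod0              */

        c[i]     = (int16_t)(b[i]     + copy0);   /* ADD, then STH (cn)   */
        c[i + 1] = (int16_t)(b[i + 1] + sprod1);  /* ADD, then STH (cn+1) */
    }
}
```

Calling both functions on the same inputs should give identical output
arrays; the unrolled form differs only in producing two results per pass,
which corresponds to the two 16-bit results per loop iteration noted above.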