Linear Assembly Considerations
’C64x Programming Considerations
The code above is sent to the assembly optimizer with the following compiler
options: -o3, -mi, -mt, -k, and -mg. Since a specific C6000 platform was not
specified, the default is to generate code for the ’C62x. The -o3 option enables
the highest level of the optimizer. The -mi option creates code with an interrupt
threshold equal to infinity. In other words, the generated code assumes that
interrupts will never occur while it runs. The -k option keeps the assembly
language file, and -mt indicates that the programmer is assuming no aliasing
(aliasing allows multiple pointers to point to the same object). The -mg option
allows profiling to occur in the debugger for benchmarking purposes.
Example 8–26 below is the assembly output generated by the assembly
optimizer for the weighted vector sum loop kernel:
Example 8–26. Avoiding Cross Path Stalls: Vector Sum Loop Kernel
LOOP:     ; PIPED LOOP KERNEL
           AND     .L2X    A3,B6,B8          ; AND bn & bn+1 with mask to isolate bn
||         SHR     .S1     A0,0xf,A0         ; shift prod0 right by 15 -> sprod0
||         MPY     .M1X    B2,A5,A0          ; multiply an by constant ; prod0
|| [ A1]   B       .S2     LOOP              ; branch to loop if loop count >0
|| [ A1]   ADD     .L1     0xffffffff,A1,A1  ; decrement loop count
||         LDW     .D1T1   *A7++,A3          ; load 32-bits (bn & bn+1)
||         LDW     .D2T2   *B5++,B2          ; load 32-bits (an & an+1)

   [ A2]   MPYSU   .M1     2,A2,A2           ;
|| [!A2]   STH     .D2T2   B1,*B4++(4)       ; store 16-bits (cn+1)
|| [!A2]   STH     .D1T1   A6,*A8++(4)       ; store 16-bits (cn)
||         ADD     .L1X    A4,B0,A6          ; add bn+1
||         ADD     .L2X    B8,A0,B1          ; add bn
||         SHR     .S2     B9,0xf,B0         ; shift prod1 right by 15 -> sprod1
||         SHR     .S1     A3,0x10,A4        ; shift bn & bn+1 by 16 to isolate bn+1
||         MPYHL   .M2     B2,B7,B9          ; multiply an+1 by a constant ; prod1
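As a reading aid, the arithmetic that one kernel iteration performs on a pair of packed 16-bit elements can be sketched in C. This is only an illustrative model based on the instruction comments above, not code from the assembly optimizer. The names wvs_pair, sext16, and asr are hypothetical, and the sketch assumes that element n sits in the low half of each 32-bit word and that the mask used by the AND instruction is 0x0000FFFF.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of x (hypothetical helper). */
static int32_t sext16(uint32_t x)
{
    x &= 0xFFFFu;
    return (x & 0x8000u) ? (int32_t)x - 0x10000 : (int32_t)x;
}

/* Arithmetic shift right, written so that it does not rely on
   implementation-defined behavior for negative values. */
static int32_t asr(int32_t v, int n)
{
    return (v < 0) ? ~(~v >> n) : (v >> n);
}

/* Model of one iteration: a_word holds an (low half) and an+1 (high half),
   b_word holds bn and bn+1, m is the 16-bit constant.
   out[0] receives the result for element n, out[1] for element n+1. */
void wvs_pair(uint32_t a_word, uint32_t b_word, int16_t m, uint16_t out[2])
{
    int32_t sprod0 = asr(sext16(a_word) * m, 15);        /* MPY, then SHR by 15   */
    int32_t sprod1 = asr(sext16(a_word >> 16) * m, 15);  /* MPYHL, then SHR by 15 */
    uint32_t b0 = b_word & 0xFFFFu;                      /* AND with assumed mask */
    int32_t  b1 = sext16(b_word >> 16);                  /* SHR by 16             */

    /* STH stores only the low 16 bits of each sum. */
    out[0] = (uint16_t)((uint32_t)sprod0 + b0);
    out[1] = (uint16_t)((uint32_t)sprod1 + (uint32_t)b1);
}
```

For example, with an = 16384, an+1 = -16384, a constant of 16384, bn = 100, and bn+1 = -100, the model yields 8292 for element n and the 16-bit pattern 0xDF9C (that is, -8292) for element n+1.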
This two-cycle loop produces two 16-bit results per loop iteration, as planned.
If the code is used on the ’C64x, note that in the first execute packet A0
(prod0) is shifted right by 15 and the result is written back into A0. In the
next execute packet, and therefore in the next clock cycle, A0 (sprod0) is used
as a cross path operand to the .L2 functional unit. If this code were run on the
’C64x, it would exhibit a one-cycle stall as described above: A0 is updated in
cycle 2 and used as a cross path operand in cycle 3. As a result, the loop that
was planned as a two-cycle loop would take three cycles to execute.
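The stall condition described above can be expressed as a small check over a list of register accesses. The sketch below is a deliberately simplified, hypothetical model (the Access type and count_xpath_stalls function are not part of any TI tool). It assumes the accesses are listed in cycle order with non-negative cycle numbers, that the listed write cycle is the cycle in which the register is actually updated, and that several stalled reads in one execute packet share a single stall cycle.

```c
#include <stddef.h>

/* Hypothetical record of one register access in a schedule. */
typedef struct {
    int cycle;      /* cycle in which the access happens (>= 0)        */
    int reg;        /* register number, in any consistent encoding     */
    int is_write;   /* 1: register is updated in this cycle, 0: read   */
    int via_xpath;  /* 1: the read uses a cross path (reads only)      */
} Access;

/* Count stall cycles under the rule from the text: a cross path read of a
   register that was updated in the immediately preceding cycle costs one
   stall cycle. */
int count_xpath_stalls(const Access *acc, size_t n)
{
    int stalls = 0;
    int last_stalled_cycle = -1;

    for (size_t i = 0; i < n; i++) {
        if (acc[i].is_write || !acc[i].via_xpath)
            continue;                       /* only cross path reads matter */
        for (size_t j = 0; j < n; j++) {
            if (acc[j].is_write &&
                acc[j].reg == acc[i].reg &&
                acc[j].cycle + 1 == acc[i].cycle) {
                if (acc[i].cycle != last_stalled_cycle) {
                    stalls++;               /* one stall per stalled cycle */
                    last_stalled_cycle = acc[i].cycle;
                }
                break;
            }
        }
    }
    return stalls;
}
```

Applied to the case in the text (A0 updated in cycle 2 and read through a cross path in cycle 3), the function reports one stall, which is what turns the planned two cycles into three. The same read made without a cross path, or made one cycle later, reports no stall.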
In most cases, the cross path stall can be avoided by adding the -mv6400
option to the compiler options list. This option indicates to the
compiler/assembly optimizer that the code below will be run on the ’C64x core.