Linear Assembly Considerations
Example 8–25. Avoiding Cross Path Stalls: Partitioned Linear Assembly
        .global _w_vec
_w_vec: .cproc  a, b, c, m
        .reg    ai_i1, bi_i1, pi, pi1, pi_i1, pi_s, pi1_s
        .reg    mask, bi, bi1, ci, ci1, c1, cntr
        MVK     -1, mask
        MVKH    0, mask                 ; generate a mask = 0x0000FFFF
        MVK     50, cntr                ; load loop count with 50
        ADD     2, c, c1                ; c1 is offset by 2 (16-bit values) from c
LOOP:   .trip   50                      ; this loop will run a minimum of 50 times
        LDW     .D2   *a++, ai_i1       ; load 32 bits (an & an+1)
        LDW     .D1   *b++, bi_i1       ; load 32 bits (bn & bn+1)
        MPY     .M1   ai_i1, m, pi      ; multiply an by a constant; prod0
        MPYHL   .M2   ai_i1, m, pi1     ; multiply an+1 by a constant; prod1
        SHR     .S1   pi, 15, pi_s      ; shift prod0 right by 15 -> sprod0
        SHR     .S2   pi1, 15, pi1_s    ; shift prod1 right by 15 -> sprod1
        AND     .L2X  bi_i1, mask, bi   ; AND bn & bn+1 w/ mask to isolate bn
        SHR     .S1   bi_i1, 16, bi1    ; shift bn & bn+1 by 16 to isolate bn+1
        ADD     .L2X  pi_s, bi, ci      ; add bn
        ADD     .L1X  pi1_s, bi1, ci1   ; add bn+1
        STH     .D2   ci, *c++[2]       ; store 16 bits (cn)
        STH     .D1   ci1, *c1++[2]     ; store 16 bits (cn+1)
 [cntr] SUB     cntr, 1, cntr           ; decrement loop count
 [cntr] B       LOOP                    ; branch to loop if loop count > 0
        .endproc
In the implementation above, 16-bit values are loaded two at a time with the
LDW instruction into a single 32-bit register, ai_i1. Each 16-bit value in ai_i1
is multiplied by the short (16-bit) constant m, and each 32-bit product is
shifted right by 15 bits. The second input array is also brought in two 16-bit
values at a time into a single 32-bit register, bi_i1. bi_i1 is ANDed with a
mask that zeros the upper 16 bits of the register to create bi (a single 16-bit
value). bi_i1 is also shifted right by 16 bits to create bi1, so that the upper
16-bit input value can be added to the corresponding weighted input value.
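
The arithmetic of one loop iteration can be sketched in C to make the packed
data flow easier to follow. This is only an illustrative model, not code from
the example: the function names w_vec_ref, w_vec_packed, and pack16 are
hypothetical, and the sketch assumes that an sits in the lower halfword of the
loaded word (as the MPY/MPYHL comments imply) and that the C compiler performs
an arithmetic right shift on signed values, as SHR does.

```c
#include <stdint.h>

/* Reference form, one element at a time: c = ((m * a) >> 15) + b. */
static int16_t w_vec_ref(int16_t a, int16_t b, int16_t m)
{
    return (int16_t)((((int32_t)m * a) >> 15) + b);
}

/* Hypothetical helper: pack two 16-bit values into one 32-bit word,
 * lower-indexed element in the lower halfword (assumed layout). */
static uint32_t pack16(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Model of one loop iteration working on packed 32-bit words.
 * Variable names follow the .reg names in the linear assembly. */
static uint32_t w_vec_packed(uint32_t ai_i1, uint32_t bi_i1, int16_t m)
{
    const uint32_t mask = 0x0000FFFFu;          /* MVK -1 / MVKH 0        */

    /* MPY uses the low halfword of ai_i1, MPYHL uses the high halfword. */
    int32_t pi  = (int32_t)(int16_t)(ai_i1 & mask) * m;
    int32_t pi1 = (int32_t)(int16_t)(ai_i1 >> 16) * m;

    int32_t pi_s  = pi  >> 15;                  /* SHR by 15 (arithmetic) */
    int32_t pi1_s = pi1 >> 15;

    int32_t bi  = (int32_t)(bi_i1 & mask);      /* AND: bn, zero-extended */
    int32_t bi1 = (int32_t)bi_i1 >> 16;         /* SHR: bn+1, sign-extended */

    uint32_t ci  = (uint32_t)(pi_s + bi);       /* ADD bn                 */
    uint32_t ci1 = (uint32_t)(pi1_s + bi1);     /* ADD bn+1               */

    /* STH writes only the low 16 bits of each sum, so the missing sign
     * extension of bi does not affect the stored result. */
    return ((ci1 & mask) << 16) | (ci & mask);
}
```

Under these assumptions, with m = 16384, packing an = 1000 and an+1 = -2000 into
one word and bn = 5 and bn+1 = -7 into another gives stored results of 505 and
-1007, matching the element-by-element reference.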