Texas Instruments TMS320C6000 DSP Programmer'S Manual Download Page 318 | Manualshive

Page: 318 / 373

background image

Packed-Data Processing on the ’C64x

8-18

stores typically occur near the very beginning and end of the loop body. The
following examples use this outside-in approach to perform packed data opti-
mization techniques on the example kernels.

Note:

The following examples assume that the compiler has not performed any
packed data optimizations. The most recent release of the ’C6000 Code
Generation Tools will apply many of the optimizations described in this chap-
ter automatically when presented with sufficient information about the code.

8.2.6.1

Vectorizing the Vector Sum

Consider the vector sum kernel presented in Example 8–1. In its default form,
it reads one half–word from the a[ ] array, one half-word from the b[ ] array,
adds them, and writes a single half–word result to the c[ ] array. This results
in a 2-cycle loop that moves 48 bits per iteration. When you consider that the
’C64x can read or write 128 bits every cycle, it becomes clear that this is very
inefficient.

One simple optimization is to replace the half-word accesses with double-word
accesses to read and write array elements in groups of four. When doing this,
array elements are read into the register file with four elements packed into a
register pair. The array elements are packed with, two elements in each regis-
ter, across two registers. Each register contains data in the packed 16-bit data
type illustrated in Figure 8–2.

For the moment, assume that the arrays are double-word aligned, as shown
in Example 8–5. For more information about methods for working with arrays
that are not specially aligned, see section 8.2.8. The first thing to note is that
the ’C6000 Code Generation Tools lack a 64-bit integral type. This is not a
problem, however, as the tools allow the use of

double, and the intrinsics _lo(),

_hi(), _itod() to access integer data with double-word loads. To account for the
fact that the loop is operating on multiple elements at a time, the loop counter
must be modified to count by fours instead of by single elements.

The type-cast tells the compiler to treat the array as an array of type

double.

This causes LDDW and STDW instructions to be issued for the array ac-
cesses. The _lo() and _hi() intrinsics break apart a 64-bit

double into its lower

and upper 32-bit halves. Each of these halves contain two 16-bit values
packed in a 32-bit word. To store the results, the _itod() intrinsics assemble
32-bit words back into 64-bit doubles to be stored. Figure 8–11 and
Figure 8–12 show these processes graphically.

The adds themselves have not been addressed, so for now, the add is re-
placed with a comment.

«
...
316
317
318
319
320
...
»

Summary of Contents for TMS320C6000 DSP

Page 1: ...TMS320C6000 Programmer s Guide Literature Number SPRU198E October 2000 Printed on Recycled Paper ...

Page 2: ...NED INTENDED AUTHORIZED OR WARRANTED TO BE SUITABLE FOR USE IN LIFE SUPPORT APPLICATIONS DEVICES OR SYSTEMS OR OTHER CRITICAL APPLICATIONS Inclusion of TI products in such applications is understood to be fully at the risk of the customer Use of TI products in such applications requires the written approval of an appropriate TI officer Questions concerning potential risk applications should be dir...

Page 3: ... you will use in each phase of development and an optimization checklist to help you achieve optimal performance from your code Part II C Code includes C code examples and discusses optimization methods for the code This information can help you choose the most appropriate optimization techniques for your code Part III Assembly Code describes the structure of assembly code It pro vides examples an...

Page 4: ...de for the C6000 generation of devices The assembly optimizer helps you optimize your assembly code TMS320C6000 CPU and Instruction Set Reference Guide literature number SPRU189 describes the C6000 CPU architecture instruction set pipeline and interrupts for these digital signal processors TMS320C6000 Peripherals Reference Guide literature number SPRU190 describes common peripherals available on t...

Page 5: ...Read This First Trademarks Solaris and SunOS are trademarks of Sun Microsystems Inc VelociTI is a trademark of Texas Instruments Incorporated Windows and Windows NT are registered trademarks of Microsoft Corporation ...

Page 6: ...g 2 2 2 2 Lesson 1 Loop Carry Path From Memory Pointers 2 5 2 3 Lesson 2 Balancing Resources With Dual Data Paths 2 12 2 4 Lesson 3 Packed Data Optimization of Memory Bandwidth 2 18 2 5 Lesson 4 Program Level Optimization 2 23 2 6 Lesson 5 Writing Linear Assembly 2 25 3 Optimizing C C Code 3 1 Explains how to maximize C performance by using compiler options intrinsics and code trans formations 3 1...

Page 7: ...rands and comments 5 1 Labels 5 2 5 2 Parallel Bars 5 2 5 3 Conditions 5 3 5 4 Instructions 5 4 5 5 Functional Units 5 5 5 6 Operands 5 8 5 7 Comments 5 9 6 Optimizing Assembly Code via Linear Assembly 6 1 Describes methods that help you develop more efficient assembly language programs 6 1 Assembly Code 6 2 6 2 Assembly Optimizer Options and Directives 6 4 6 2 1 The on Option 6 4 6 2 2 The mt Opt...

Page 8: ... 6 4 Drawing a Dependency Graph 6 61 6 6 5 Linear Assembly Resource Allocation 6 62 6 6 6 Modulo Iteration Interval Scheduling 6 62 6 6 7 Using the Assembly Optimizer for the Weighted Vector Sum 6 73 6 6 8 Final Assembly 6 74 6 7 Loop Carry Paths 6 77 6 7 1 IIR Filter C Code 6 77 6 7 2 Translating C Code to Linear Assembly Inner Loop 6 78 6 7 3 Drawing a Dependency Graph 6 79 6 7 4 Determining the...

Page 9: ...8 6 12 1 FIR Filter Inner Loop 6 120 6 12 2 Unrolled FIR Filter C Code 6 122 6 12 3 Translating C Code to Linear Assembly 6 123 6 12 4 Drawing a Dependency Graph 6 124 6 12 5 Linear Assembly for Unrolled FIR Inner Loop With mptr Directive 6 125 6 12 6 Linear Assembly Resource Allocation 6 127 6 12 7 Determining the Minimum Iteration Interval 6 128 6 12 8 Final Assembly 6 128 6 12 9 Comparing Perfo...

Page 10: ...ral Enhancements 8 2 8 1 1 Improved Scheduling Flexibility 8 2 8 1 2 Greater Memory Bandwidth 8 2 8 1 3 Support for Packed Data Types 8 2 8 1 4 Non aligned Memory Accesses 8 3 8 1 5 Additional Specialized Instructions 8 3 8 2 Packed Data Processing on the C64x 8 4 8 2 1 Introduction to Packed Data Processing Techniques 8 4 8 2 2 Packed Data Types 8 4 8 2 3 Storing Multiple Elements in a Single Reg...

Page 11: ...t Dot Product With LDW Showing Functional Units 6 24 6 8 Dependency Graph of Floating Point Dot Product With LDDW Showing Functional Units 6 25 6 9 Dependency Graph of Fixed Point Dot Product With LDW Showing Functional Units 6 30 6 10 Dependency Graph of Floating Point Dot Product With LDDW Showing Functional Units 6 31 6 11 Dependency Graph of Weighted Vector Sum 6 61 6 12 Dependency Graph of We...

Page 12: ...Operation 8 14 8 9 Graphical Representation of Dot Product 8 16 8 10 Graphical Representation of a Single Iteration of Vector Complex Multiply 8 17 8 11 Array Access in Vector Sum by LDDW 8 19 8 12 Array Access in Vector Sum by STDW 8 20 8 13 Vector Addition 8 20 8 14 Graphical Representation of a Single Iteration of Vector Multiply 8 22 8 15 Packed 16x16 Multiplies Using _mpy2 8 23 8 16 Fine Tuni...

Page 13: ...t Product 6 18 6 3 Comparison of Fixed Point Dot Product Code With Use of LDW 6 28 6 4 Comparison of Floating Point Dot Product Code With Use of LDDW 6 28 6 5 Modulo Iteration Interval Scheduling Table for Fixed Point Dot Product Before Software Pipelining 6 32 6 6 Modulo Iteration Interval Scheduling Table for Floating Point Dot Product Before Software Pipelining 6 33 6 7 Modulo Iteration Interva...

Page 14: ... Table for FIR Filter Code 6 128 6 25 Comparison of FIR Filter Code 6 128 6 26 Comparison of FIR Filter Code 6 135 6 27 Resource Table for FIR Filter Code 6 146 6 28 Comparison of FIR Filter Code 6 149 8 1 Packed data types 8 5 8 2 Supported Operations on Packed Data Types 8 7 8 3 Instructions for Manipulating Packed Data Types 8 8 8 4 Unpacking Packed 16 bit Quantities to 32 bit Values 8 10 8 5 I...

Page 15: ...6 Rewriting the iircas4 Function in Linear Assembly 2 29 2 17 Software Pipeline Feedback from Linear Assembly 2 30 3 1 Basic Vector Sum 3 8 3 2 Use of the Restrict Type Qualifier With Pointers 3 9 3 3 Use of the Restrict Type Qualifier With Arrays 3 10 3 4 Incorrect Use of the restrict Keyword 3 10 3 5 Including the clock Function 3 14 3 6 Saturated Add Without Intrinsics 3 15 3 7 Saturated Add Wi...

Page 16: ...l Objects Defined in Other Files 4 4 6 1 Linear Assembly Block Copy 6 4 6 2 Block copy With mdep 6 5 6 3 Linear Assembly Dot Product 6 5 6 4 Linear Assembly Dot Product With mptr 6 7 6 5 Fixed Point Dot Product C Code 6 9 6 6 Floating Point Dot Product C Code 6 9 6 7 List of Assembly Instructions for Fixed Point Dot Product 6 10 6 8 List of Assembly Instructions for Floating Point Dot Product 6 10...

Page 17: ...Dot Product Software Pipelined With Smallest Code Size 6 56 6 34 Weighted Vector Sum C Code 6 58 6 35 Linear Assembly for Weighted Vector Sum Inner Loop 6 58 6 36 Weighted Vector Sum C Code Unrolled 6 59 6 37 Linear Assembly for Weighted Vector Sum Using LDW 6 60 6 38 Linear Assembly for Weighted Vector Sum With Resources Allocated 6 62 6 39 Linear Assembly for Weighted Vector Sum 6 73 6 40 Assemb...

Page 18: ...r Loop 6 138 6 75 Unrolled FIR Filter C Code 6 139 6 76 Linear Assembly for FIR With Outer Loop Conditionally Executed With Inner Loop 6 141 6 77 Linear Assembly for FIR With Outer Loop Conditionally Executed With Inner Loop With Functional Units 6 143 6 78 Final Assembly Code for FIR Filter 6 147 7 1 Code With Multiple Assignment of A1 7 3 7 2 Code Using Single Assignment 7 4 7 3 Dot Product With...

Page 19: ...mpgtu4 and _xpnd4 Intrinsics 8 44 8 18 Loop Trip Count in C 8 46 8 19 Loop Trip Count in Linear Assembly without BDEC 8 46 8 20 Loop Trip Count Using BDEC 8 47 8 21 Loop Tip Count Using BDEC With Extra Loop Iterations 8 47 8 22 Using the call Directive in Linear Assembly 8 48 8 23 Compiler Output Using ADDKPC 8 48 8 24 Avoiding Cross Path Stalls Weighted Vector Sum Example 8 51 8 25 Avoiding Cross...

Page 20: ...ence to C6000 pertains to the C62x fixed point C64x fixed point and the C67x floating point devices Though most of the examples shown are fixed point specific all techniques are applicable to each device Topic Page 1 1 TMS320C6000 Architecture 1 2 1 2 TMS320C6000 Pipeline 1 2 1 3 Code Development Flow to Increase Performance 1 3 1 4 Understanding Feedback 1 9 Chapter 1 ...

Page 21: ...on the C6000 CPU which consists of Program fetch unit Instruction dispatch unit Instruction decode unit Two data paths each with four functional units Thirty two 32 bit registers C62x and C67x Sixty four 32 bit registers C64x Control registers Control logic Test emulation and interrupt logic 1 2 TMS320C6000 Pipeline The C6000 pipeline has several features that provide optimum performance low cost ...

Page 22: ...on parallelizing pipelining and register al location This allows the programmer to focus on getting the product to market quickly These features simplify the maintenance of the code as everything resides in a C framework that is simple to maintain support and upgrade The recommended code development flow for the C6000 involves the phases described below The tutorial section of the Programmer s Gui...

Page 23: ...ment flow when you are writing and debugging your code Efficient Yes No Complete Efficient Yes No Efficient Write C code Phase 1 Develop C Code Phase 2 Refine C Code Phase 3 Write Linear Assembly More C optimization No Yes No Yes Complete Compile Profile Refine C code Compile Profile Complete Write linear assembly Profile Assembly optimize ...

Page 24: ...cal areas from your C code and rewrite the code in linear assembly You can use the assembly optimizer to optimize this code Becausemost of the millions of instructions per second MIPS in DSP applica tions occur in tight loops it is important for the C6000 code generation tools to make maximal use of all the hardware resources in important loops Fortu nately loops inherently have more parallelism t...

Page 25: ...piler Uses memory bank pragmas and _nassert intrinsic to pass memory bank and alignment information to the compiler Phase 2 3 Optimize C code using other C6000 intrinsics and other methods Facilitates use of certain C6000 instructions not easily represented in C Optimizes data flow bandwidth uses word access for short C62x C64x and C67x data and double word access for word C64x and C67x data Phase...

Page 26: ...f optimizations that can fine tune all from the high level C language For the few loops that need even further optimiza tions the assembly optimizer gives the programmer more flexibility than C C can offer works within the framework of C C and is much like pro gramming in higher level C For more information on the assembly optimizer see the TMS320C6000 Optimizing C C Compiler User s Guide and Chap...

Page 27: ...t Addition ops LSD 6 3 L or S or D unit Bound L S LS 3 4 Bound L S D LS LSD 5 4 Searching for software pipeline schedule at ii 5 Register is live too long ii 6 Did not find schedule ii 7 Schedule found with 3 iterations in parallel done Epilog not entirely removed Collapsed epilog stages 1 Prolog not removed Collapsed prolog stages 0 Minimum required memory pad 2 bytes Minimum safe trip count 2 Th...

Page 28: ...e basic stages when compiling a loop Here we will focus on the comprehension of these stages and the feedback produced by them This combined with the Feedback Solutions in Appendix A will send you well on your way to fully optimizing your code with the C6000 compiler The three stages are 1 Qualify the loop for software pipelining 2 Collect loop resource and dependency graph information 3 Software ...

Page 29: ... example if the exact value of a loop counter is not known but it is known that the value is a multiple of some number the compiler may be able to unroll the loop to improve performance There are several conditions that must be met before software pipelining is al lowed or legal from the compiler s point of view These conditions are It cannot have too many instructions in the loop Loops that are t...

Page 30: ... A loop carry path occurs when one iteration of a loop writes a value that must be read in a future iteration Instructions that are part of the loop carry bound are marked with the symbol in the assembly code saved with the k option in the asm file The number shown for the loop carried dependency bound is the minimum iteration interval due to a loop carry dependency bound for the loop Often this l...

Page 31: ...ctional units down by the possible resource combinations The table entries are described below J Individual Functional Units L S D M show the total number of instructions that specifically require the L S D or M functional units Instructions that can operate on multiple different functional units are not included in these counts They are described below in the Logical Ops LS and Addition Ops LSD r...

Page 32: ...teger In Example 1 3 if the B side needs 3 L unit only instructions 4 S unit only instructions 1 logical LS instruction you would need at least d8 2e cycles or 4 cycles to issue these J Bound L S D LS LSD represents the resource bound value as determined by the number of instructions that use the D L and S unit It is calculated with the following formula Bound L S D LS LSD ceil L S D LS LSD 3 Wher...

Page 33: ...it takes to execute a loop All of the numbers shown in each row of the feedback imply something about what the minimum iteration in terval mii will be for the compiler to attempt initial software pipelining Several things will determine what the mii of the loop is and are described in the following sections The mii is simply the maximum of any of these individual mii s The first thing the compiler...

Page 34: ...minimum of 33 registers is necessary and this will not be pos sible with the 32 registers available on the C62x and C67x cores In addi tion this is broken down between A and B side so if there is uneven parti tioning with 30 values and there are 17 on one side and 13 on the other the same problem will exist This situation does not apply to the 64 regis ters available on the C64x core Max Cond Regs...

Page 35: ...stages 1 This refers to the number of epilog stages or loop iterations that were re moved This can produce a large savings in code size The mh enables spec ulative execution and improves the compiler s ability to remove epilogs and prologs However in some cases epilogs and prologs can be partially or en tirely removed without speculative execution Thus you may see nonzero val ues for this even wit...

Page 36: ... the example code is installed in c ti tutorial sim62xx optimiz ing_c Use the code in that directory to go through the examples in this chapter The examples in this chapter were run on the most recent version of the soft ware development tools that were available as of the publication of this book Because the tools are being continuously improved you may get different re sults if you are using a m...

Page 37: ...s where tuning your C code can offer great performance improvements In this tutorial a single code example is used to demonstrate all four areas The following example is the vector summation of two weighted vectors Example 2 1 Vector Summation of Two Weighted Vectors void lesson_c short xptr short yptr short zptr short w_sum int N int i w_vec1 w_vec2 short w1 w2 w1 zptr 0 w2 zptr 1 for i 0 i N i w...

Page 38: ...a cursor at c_int00 is displayed and high lighted in yellow Profile the C_tutorial project 1 From the menu bar select Profiler Enable Clocks The Profile Statistics window shows profile points that are already set up for each of the four functions tutor1 4 2 From the menu bar select Debug Run This updates the Profile Statistics and Dis Assembly window You can also click on the Run icon or F5 key to...

Page 39: ...run the project 1 From Project menu choose Rebuild All or click on the Rebuild All icon All of the files are built with compiler options gp k g mh o3 fr C ti tu torial sim62xx optimizing_c 2 From the file menu choose Reload Program This reloads tutor out and returns the cursor to c_int00 3 From the Debug menu choose Run or click the Run icon The count in the Profile Statistics window now equals 2 ...

Page 40: ...i 0 i N i w_vec1 xptr i w1 w_vec2 yptr i w2 w_sum i w_vec1 w_vec2 15 Compile the project and analyze the feedback in lesson_c asm When you rebuilt the project in Getting Ready for Lesson 1 each file was com piled with k gp mh o3 Because option k was used a asm file for each c file is included in the rebuilt project 1 From the File menu choose File Open From the Files of Type drop down menu select ...

Page 41: ...s 1 0 T address paths 2 1 Long read paths 1 0 Long write paths 0 0 Logical ops LS 1 0 L or S unit Addition ops LSD 0 1 L or S or D unit Bound L S LS 1 1 Bound L S D LS LSD 2 1 Searching for software pipeline schedule at ii 10 Schedule found with 1 iterations in parallel done Collapsed epilog stages 0 Collapsed prolog stages 0 Minimum safe trip count 1 SINGLE SCHEDULED ITERATION C17 LDH D1T1 A4 A0 ...

Page 42: ...s there is a loop carry path equal to ten cycles Loop Carried Dependency Bound 10 The symbol is interspersed in the assembly output in the comments of each instruction in the loop carry path and is visible in lesson_c asm Example 2 4 lesson_c asm L2 PIPED LOOP KERNEL LDH D1T1 A4 A0 32 LDH D2T2 B4 B6 32 NOP 2 B0 SUB L2 B0 1 B0 33 B0 B S2 L2 33 MPY M1 A0 A5 A0 32 MPY M2 B6 B5 B6 32 NOP 1 ADD L1X B6 ...

Page 43: ...s more information to the compiler to improve its perfor mance A The next example lesson1_c provides the answer Open lesson1_c c and lesson1_c asm Example 2 5 lesson1_c c void lesson1_c short restrict xptr short restrict yptr short zptr short w_sum int N int i w_vec1 w_vec2 short w1 w2 w1 zptr 0 w2 zptr 1 for i 0 i N i w_vec1 xptr i w1 w_vec2 yptr i w2 w_sum i w_vec1 w_vec2 15 The only change made...

Page 44: ...0 T address paths 2 1 Long read paths 1 0 Long write paths 0 0 Logical ops LS 1 0 L or S unit Addition ops LSD 0 1 L or S or D unit Bound L S LS 1 1 Bound L S D LS LSD 2 1 Searching for software pipeline schedule at ii 2 Schedule found with 5 iterations in parallel done Collapsed epilog stages 4 Prolog not entirely removed Collapsed prolog stages 2 Minimum required memory pad 8 bytes Minimum safe ...

Page 45: ... In the Aliasing drop down box select No Bad Alias Code The mt option will appear in the options window 5 Click OK to set the new options 6 Select lesson_c c by selecting it in the project environment or double clicking on it in the Project View window 7 From the Project menu choose Build or click on the Build icon If prompted reload lesson_c asm 8 From the File menu chooose Open and select lesson...

Page 46: ...on1_c Tutorial Example Lesson_c Lesson1_c Potential pointer aliasing info discussed in Lesson 1 p Loop count info minimum trip count discussed in Lesson 2 Loop count info max trip count factor discussed in Lesson 2 Alignment info xptr yptr aligned on a word boundary discussed in Lesson 3 Cycles per iteration discussed in Lesson 1 3 10 2 ...

Page 47: ...ormance gains in les son_c The result is lesson1_c with a 2 cycle loop Q Is this the best the compiler can do Is this the best that is possible on the VelociTI architecture A Again the answers lie in the amount of knowledge to which the compiler has access Let s analyze the feedback of lesson1_c to determine what improve ments could be made Open lesson1_c asm ...

Page 48: ...1 0 T address paths 2 1 Long read paths 1 0 Long write paths 0 0 Logical ops LS 1 0 L or S unit Addition ops LSD 0 1 L or S or D unit Bound L S LS 1 1 Bound L S D LS LSD 2 1 Searching for software pipeline schedule at ii 2 Schedule found with 5 iterations in parallel done Collapsed epilog stages 4 Prolog not entirely removed Collapsed prolog stages 2 Minimum required memory pad 8 bytes Minimum saf...

Page 49: ...se you will incorrectly execute too few or too many iterations In tutor_d c LOOPCOUNT is defined to be 40 which is a multiple of two so you are able to unroll the loop Q Why did the compiler not unroll the loop A In the limited scope of lesson1_c the loop counter is passed as a parameter to the function Therefore it might be any value from this limited view of the function To improve this scope yo...

Page 50: ...the compiler that the trip count in this case the trip count is N is a multiple of two and that the trip count is greater than or equal to 20 The first argument for MUST_ITERATE is the minimum number of times the loop will iterate The second argument is the maximum number of times the loop will iterate The trip count must be evenly divisible by the third argument See the TMS320C6000 Optimizing C C...

Page 51: ...at ii 3 Schedule found with 5 iterations in parallel done Epilog not entirely removed Collapsed epilog stages 2 Prolog not entirely removed Collapsed prolog stages 3 Minimum required memory pad 8 bytes Minimum safe trip count 4 Notice the following things in the feedback A schedule with three cycles ii 3 You can tell by looking at the D units and T address paths that this 3 cycle loop comes after ...

Page 52: ... to 2 cycles and now to 1 5 cycles Q Is this the lower limit A Check out Lesson 3 to find out Table 2 2 Status Update Tutorial example lesson_c lesson1_c lesson2_c Tutorial Example Lesson_c Lesson1_c Lesson2_c Potential pointer aliasing info discussed in Lesson 1 p p Loop count info minimum trip count discussed in Lesson 2 p Loop count info max trip count factor discussed in Lesson 2 p Alignment i...

Page 53: ... Unroll Multiple 2x Known Minimum Trip Count 10 Known Maximum Trip Count 1073741823 Known Max Trip Count Factor 1 Loop Carried Dependency Bound 0 Unpartitioned Resource Bound 3 Partitioned Resource Bound 3 Resource Partition A side B side L units 0 0 S units 2 1 D units 3 3 M units 2 2 X cross paths 1 1 T address paths 3 3 Long read paths 1 1 Long write paths 0 0 Logical ops LS 1 1 L or S unit Add...

Page 54: ...e Because this is a resource bottleneck in our loop increasing the memory access bandwidth further improves the performance of our loop In the unrolled loop generated from lesson2_c we load two consecutive 16 bit elements with LDHs from both the xptr and yptr array Q Why not use a single LDW to load one 32 bit element with the resulting reg ister load containing the first element in one half of th...

Page 55: ... i w_vec1 w_vec2 short w1 w2 WORD_ALIGNED xptr WORD_ALIGNED yptr w1 zptr 0 w2 zptr 1 pragma MUST_ITERATE 20 2 for i 0 i N i w_vec1 xptr i w1 w_vec2 yptr i w2 w_sum i w_vec1 w_vec2 15 By asserting that xptr and yptr addresses anded with 0x3 are equal to zero the compiler knows that they are word aligned This means the compiler can perform LDW and packed data optimization on these memory accesses Op...

Page 56: ...s paths 2 2 Long read paths 1 1 Long write paths 0 0 Logical ops LS 1 1 L or S unit Addition ops LSD 0 1 L or S or D unit Bound L S LS 2 1 Bound L S D LS LSD 2 2 Searching for software pipeline schedule at ii 2 Schedule found with 6 iterations in parallel done Epilog not entirely removed Collapsed epilog stages 2 Prolog not removed Collapsed prolog stages 0 Minimum required memory pad 8 bytes Mini...

Page 57: ...Lesson_c Lesson1_c Lesson2_c Lesson3_c Potential pointer aliasing info discussed in Les son 1 p p p Loop count info minimum trip count discussed in Lesson 2 p p Loop count info max trip count factor dis cussed in Lesson 2 p p Alignment info xptr yptr aligned on a word boundary discussed in Lesson 3 p Cycles per iteration discussed in Lessons 1 3 10 2 1 5 1 ...

Page 58: ...t declared locally the compiler can still have access to it in an automated way by giving it a program level view This module discusses how to do that The C6000 compiler provides two valuable switches which enable program level optimization pm and op2 When these two options are used together the compiler can automatically extract all of the information we passed in the previous examples To tell th...

Page 59: ...esson2_c Lesson3_c Potential pointer aliasing info discussed in Les son 1 p p p Loop count info minimum trip count discussed in Lesson 2 p p Loop count info max trip count factor dis cussed in Lesson 2 p p Alignment info xptr yptr aligned on a word boundary discussed in Lesson 3 p Cycles per iteration discussed in Lesson 1 3 10 2 1 5 1 Cycles per iteration with program level optimiza tion discusse...

Page 60: ...n tools you might need to modify your linear assembly code until you are satisfied with its performance When you do this you will probably want to add more detail to your linear assembly For example you might want to specify which functional unit should be used Before you use the assembly optimizer you need to know the following things about how it works A linear assembly file must be specified wi...

Page 61: ...Composer Studio you must open the ccs project l_tutorial pjt located in c ti tutorial sim62xx linear_asm Build the program and look at the software pipeline information feedback in the gener ated assembly files Example 2 14 Using the iircas4 Function in C void iircas4_1 const int n const short restrict c 4 int d 2 int y int k0 k1 i int y0 y 0 int y1 y 1 _nassert int c 0x3 0 pragma MUST_ITERATE 10 ...

Page 62: ...hs 0 0 Logical ops LS 2 1 L or S unit Addition ops LSD 4 3 L or S or D unit Bound L S LS 2 1 Bound L S D LS LSD 3 3 Searching for software pipeline schedule at ii 5 Schedule found with 4 iterations in parallel done Epilog not entirely removed Collapsed epilog stages 2 Prolog not removed Collapsed prolog stages 0 Minimum required memory pad 16 bytes Minimum safe trip count 2 From the feedback in th...

Page 63: ...assembly Notice that there are 5 cross path reads on the A side and only 3 on the B side We would like 4 cross path reads on the A side and 4 cross path reads on the B side This would allow us to schedule at an iteration interval ii of 4 instead of the current ii of 5 Example 2 16 shows how to rewrite the iircas4 function Using Linear Assembly ...

Page 64: ... D1T2 BD 1 BD1 d1 d i 1 MPYH 1 BD1 AA AE0 e0 d1 16 a1 MPYHL 1 BD0 AA AJ0 j0 d0 16 a0 MPYH 1 BD1 AB AG0 g0 d1 16 b1 MPYHL 1 BD0 AB AF0 f0 d0 16 b0 ADD 1 AJ0 AE0 AH0 h0 j0 e0 ADD 1 AH0 AY0 AK0 k0 h0 y0 ADD 1 AF0 AG0 AM0 m0 f0 g0 ADD 1 AM0 AK0 AY0 y0 m0 k0 MV 2 AA BA2 MV 2 AB BB2 MV 2 BD0 BD00 STW D1T1 AK0 BD 1 d i 1 k0 MPYH 2 BD00 BA2 BE1 e1 d0 16 a1 MPYHL 2 AK0 BA2 BJ1 j1 k0 16 a0 MPYH 2 BD00 BB2 B...

Page 65: ... read paths 1 1 Long write paths 0 0 Logical ops LS 0 2 L or S unit Addition ops LSD 5 5 L or S or D unit Bound L S LS 1 1 Bound L S D LS LSD 4 3 Searching for software pipeline schedule at ii 4 Schedule found with 5 iterations in parallel done Epilog not entirely removed Collapsed epilog stages 3 Prolog not removed Collapsed prolog stages 0 Minimum required memory pad 24 bytes Minimum safe trip c...

Page 66: ... intrinsics and code transformations This chapter discusses the following topics The compiler and its options Intrinsics Software pipelining Loop unrolling Topic Page 3 1 Writing C C Code 3 2 3 2 Compiling C C Code 3 4 3 3 Profiling Your Code 3 13 3 4 Refining C C Code 3 15 Chapter 3 ...

Page 67: ... functional unit selection Use the short data type for fixed point multiplication inputs whenever pos sible because this data type provides the most efficient use of the 16 bit multiplier in the C6000 1 cycle for short short versus 5 cycles for int int Use int or unsigned int types for loop counters rather than short or un signed short data types to avoid unnecessary sign extension instructions Wh...

Page 68: ...th the g option The profile results will be stored in a file with the vaa extension Refer to the TMS320C6000 Optimizing C C Compiler User s Guide for more information Enable the clock and use profile points and the RUN command in the Code Composer debugger to track the number of CPU clock cycles consumed by a particular section of code Use View Statistics to view the number of cycles consumed The ...

Page 69: ...iderations of optimization versus performance are also discussed The options described in Table 3 1 are obsolete or intended for debugging and could potentially decrease performance and increase code size Avoid us ing these options with performance critical code Table 3 1 Compiler Options to Avoid on Performance Critical Code Option Description g s ss gp These options limit the amount of optimizat...

Page 70: ...eculative execution The appropriate amount of pad ding must be available in data memory to insure correct execu tion This is normally not a problem but must be adhered to mi n Describes the interrupt threshold to the compiler If you know that NO interrupts will occur in your code the compiler can avoid enabling and disabling interrupts before and after soft ware pipelined loops for a code size and...

Page 71: ...automatic size controlled inlining which is en abled by o3 User specified inlining of functions is still al lowed ms2 ms3 Optimizes primarily for code size and secondly for perfor mance Although o3 is preferable at a minimum use the o option Use the pm option for as much of your program as possible The options described in Table 3 5 provide information but do not affect per formance or code size T...

Page 72: ...ctions are independent of one another it can schedule them in parallel Often it is difficult for the compiler to determine if instructions that access memory are independent The following techniques help the compiler deter mine which instructions are independent Use the restrict keyword to indicate that a pointer is the only pointer that can point to a particular object in the scope in which the p...

Page 73: ...ndency graph for this basic vec tor sum For more information about drawing a dependency graph see section 6 3 4 on page 6 11 Example 3 1 Basic Vector Sum void vecsum short sum short in1 short in2 unsigned int N int i for i 0 i N i sum i in1 i in2 i Figure 3 1 Dependency Graph for Vector Sum 1 in1 i 5 5 Load Load in2 i sum i 1 mem 1 1 Store to memory Number of cycles required to complete an instruc...

Page 74: ...estrict keyword The restrict keyword is a type qualifier that may be applied to pointers references and arrays Its use represents a guarantee by the programmer that within the scope of the pointer declaration the object pointed to can be accessed only by that pointer Any violation of this guarantee renders the program undefined This practice helps the compiler optimize certain sections of code bec...

Page 75: ...efore reading the location pointed to by b This can cause an incorrect program because both a and b point to the same object array Example 3 4 Incorrect Use of the restrict Keyword void func short a short restrict b Bad int i for i 11 i 44 i a b void main short array 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 short ptr...

Page 76: ...the mt option when compiling the code in Example 3 1 the compiler uses the assumption that that in1 and in2 do not alias memory pointed to by sum and therefore eliminates memory de pendencies among the instructions that access those variables If your code does not follow the assumptions generated by the mt option you can get incorrect results For more information on the mt option refer to the TMS3...

Page 77: ...tire pro gram it performs several additional optimizations rarely applied during file lev el optimization If a particular argument in a function always has the same value the com piler replaces the argument with the value and passes the value instead of the argument If a return value of a function is never used the compiler deletes the return code in the function If a function is not called direct...

Page 78: ... with the g option runs the standalone simulator in profiling mode Source files must be compiled with the mg profiling option for profiling to work on the standalone simulator The profile results are stored in a file called by the same name as the out file but with the vaa extension If you used the mg profiling option when compiling and linking example out use the g option to create a file in whic...

Page 79: ...n in your C code The following exam ple demonstrates how to include the clock function in your C code Example 3 5 Including the clock Function include stdio h include time h need time h in order to call clock main int argc char argv const short coefs 150 short optr 150 short state 2 const short a 150 const short b 150 int c 0 int dotp 1 0 int sum 0 short y 150 short scalar 3345 const short x 150 c...

Page 80: ...x C64x C67x instructions to optimize your C C code quickly All instructions that are not easily expressed in C C code are supported as intrinsics Intrinsics are specified with a leading underscore _ and are ac cessed by calling them as you call a function For example saturated addition can be expressed in C C code only by writ ing a multicycle function such as the one in Example 3 6 Example 3 6 Sa...

Page 81: ...each pair of signed 16 bit values C64x unsigned _avgu4 uint src1 uint src2 AVGU4 Calculates the average for each pair of un signed 8 bit values C64x unsigned _bitc4 uint src2 BITC4 For each of the 8 bit quantities in src the number of 1 bits is written to the corre sponding position in the return value C64x unsigned _bitr uint src2 BITR Reverses the order of the bits C64x uint _clr uint src2 uint ...

Page 82: ...16 bit values of src1 and src2 is added to the product of signed upper 16 bit values of src1 and src2 C64x int _dotpn2 int src1 int src2 DOTPN2 The product of signed lower 16 bit values of src1 and src2 is subtracted from the product of signed upper 16 bit values of src1 and src2 C64x int _dotpnrsu2 int src1 uint src2 DOTPNR SU2 The product of unsigned lower 16 bit val ues in src1 and src2 is subt...

Page 83: ... csta uint cstb EXTU Extracts the specified field in src2 zero extended to 32 bits The extract is per formed by a shift left followed by a un signed shift right csta and cstb are the shift left and shift right amounts respec tively uint _extur uint src2 int src1 EXTU Extracts the specified field in src2 zero extended to 32 bits The extract is per formed by a shift left followed by a un signed shif...

Page 84: ...teger int _max2 int src1 int src2 int _min2 int src1 int src2 unsigned _maxu4 uint src1 uint src2 unsigned _minu4 uint src1 uint src2 MAX2 MIN2 MAXU4 MINU4 Places the larger smaller of each pair of values in the corresponding position in the return value Values can be 16 bit signed or 8 bit unsigned C64x double _mpy2 int src1 int src2 MPY2 Returns the products of the lower and higher 16 bit values...

Page 85: ...plies the 16 LSBs of src1 by the 16 MSBs of src2 and returns the result Val ues can be signed or unsigned int _mvd int src2 MVD Moves the data from the src to the return value over 4 cycles using the multipler pipeline C64x void _nassert int Generates no code Tells the optimizer that the expression declared with the assert function is true This gives a hint to the compiler as to what optimizations...

Page 86: ...erforms saturated addition between pairs of 8 bit unsigned values in src1 and src2 C64x int _sadd2 int src1 int src2 int _saddus2 uint src1 int src2 SADD2 SADDUS2 Performs saturated addition between pairs of 16 bit values in src1 and src2 Src1 values can be signed or unsigned C64x int _sat long src2 SAT Converts a 40 bit value to an 32 bit value and saturates if necessary uint _set uint src2 uint ...

Page 87: ...gned packed 16 bit values with an additional 1 bit left shift and saturate into a double result C64x int _spack2 int src1 int src2 SPACK2 Two signed 32 bit values are saturated to 16 bit values and packed into the return value C64x unsigned _spacku4 int src1 int src2 SPACKU4 Four signed 16 bit values are saturated to 8 bit values and packed into the return value C64x int _spint float SPINT Convert...

Page 88: ...BABS4 Calculates the absolute value of the differ ences for each pair of packed 8 bit values C64x uint _swap4 uint src2 SWAP4 Exchanges pairs of bytes an endian swap within each 16 bit value C64x uint _unpkhu4 uint src2 UNPKHU4 Unpacks the two high unsigned 8 bit val ues into unsigned packed 16 bit values C64x uint _unpklu4 uint src2 UNPKLU4 Unpacks the two low unsigned 8 bit val ues into unsigned...

Page 89: ...With restrict Keywords MUST_ITERATE Pragma Word Reads void vecsum4 short restrict sum const short restrict in1 const short restrictin2 unsigned int N int i const int restrict i_in1 const int in1 const int restrict i_in2 const int in2 int restrict i_sum int sum pragma MUST_ITERATE 10 for i 0 i N 2 i i_sum i _add2 i_in1 i i_in2 i Note The MUST_ITERATE intrinsic tells the compiler that the following ...

Page 90: ...ict in1 const short restrict in2 unsigned int N int i pragma MUST_ITERATE 10 for i 0 i N i 2 _mem4 void sum i _add2 _mem4 void in1 i _mem4 void in2 i Another consideration is that the loop must now run for an even number of it erations You can ensure that this happens by padding the short arrays so that the loop always operates on an even number of elements ...

Page 91: ...hat handles all types of alignments and array sizes Example 3 10 Vector Sum With restrict Keywords MUST_ITERATE pragma and Word Reads Generic Version void vecsum5 short restrict sum const short restrict in1 const short re strict in2 unsigned int N int i test to see if sum in2 and in1 are aligned to a word boundary if int sum int in2 int in1 0x2 pragma MUST_ITERATE 20 for i 0 i N i sum i in1 i in2 ...

Page 92: ...ample also uses two sum variables sum1 and sum2 Using only one sum variable inhibits parallelism by creating a dependency between the write from the first sum calculation and the read in the second sum calculation With in a small loop body avoid writing to the same variable because it inhibits par allelism and creates dependencies Example 3 11 Dot Product Using Intrinsics int dotprod const short r...

Page 93: ... j long y0 long round 1L s 1 for j 0 j m j y0 round for i 0 i n i y0 x i j h i y j int y0 s Example 3 13 shows an optimized version of Example 3 12 The optimized version passes an int array instead of casting the short arrays to int arrays and therefore helps ensure that data passed to the function is word aligned As suming that a prototype is used each invocation of the function ensures that the ...

Page 94: ...ict short y restrict int n int m int s int i j long y0 y1 long round 1L s 1 pragma MUST_ITERATE 8 for j 0 j m 1 j y0 y1 round pragma MUST_ITERATE 8 for i 0 i n 1 i y0 _mpy x i j h i y0 _mpyh x i j h i y1 _mpyhl x i j h i y1 _mpylh x i j 1 h i y int y0 s y int y1 s short x SIZE_X h SIZE_H y SIZE_Y void main fir1 int x int h y n m s ...

Page 95: ...const float b restrict int i float sum 0 for i 0 i 512 i sum a i b i return sum In Example 3 15 the dot product example is rewritten to use double word loads and intrinsics are used to extract the high and low 32 bit values con tained in the 64 bit double The _hi and _lo instrinsics return integer values the _itof intrinsic subverts the C typing system by interpreting an integer val ue as a float ...

Page 96: ...1 _itof _lo a i _itof _lo b i return sum0 sum1 pragma DATA_ALIGN a 8 pragma DATA_ALIGN b 8 float ret_val a SIZE_A b SIZE_B void main ret_val dotprod2 double a double b In Example 3 16 the dot product example is unrolled to maximize perfor mance The preprocessor is used to define convenient macros FHI and FLO for accessing the high and low 32 bit values in a double word In this version of the loop ...

Page 97: ... 4 sum0 FHI a i FHI b i sum1 FLO a i FLO b i sum2 FHI a i 1 FHI b i 1 sum3 FLO a i 1 FLO b i 1 sum4 FHI a i 2 FHI b i 2 sum5 FLO a i 2 FLO b i 2 sum6 FHI a i 3 FHI b i 3 sum7 FLO a i 3 FLO b i 3 sum0 sum1 sum2 sum3 sum4 sum5 sum6 sum7 sum0 sum2 sum4 sum6 return sum0 sum4 void main Using 0 as the bank parameter for the DATA_MEM_BANK pragma aligns variable to a double word boundary for the C62xx C64...

Page 98: ...es a multiple of 4 times Example 3 17 Int Dot Product with Nonaligned Doubleword Reads int dotp4 const short restrict a const short restrict b unsigned int N int i sum1 0 sum2 0 sum3 0 sum4 0 for i 0 i N i 4 sum1 _mpy _lo memd8 void a i _lo _memd8 void b i sum2 _mpyh _lo memd8 void a i _lo _memd8 void b i sum3 _mpy _hi memd8 void a i _hi _memd8 void b i sum4 _mpyh _hi memd8 void a i _hi _memd8 voi...

Page 99: ...nassert int a 0x3 0 _nassert int b 0x3 0 pragma MUST_ITERATE 40 40 for i 0 i N i sum a i b i return sum Compile Example 3 18 with the following options o k Open up the assem bly file and look at the loop kernel The results are the exact same as those produced by Example 3 11 The first 2 _nassert intrinsics in Example 3 18 tell the compiler that the arrays pointed to by a and b are aligned to a wor...

Page 100: ...third argument tells the compiler what the trip count is a multiple of See the TMS320C6000 C C Compiler User s Guide for more information about the MUST_ITERATE pragma Example 3 19 and Example 3 20 show how to use the _nassert intrinsic and MUST_ITERATE pragma to get word accesses on the vector sum and the FIR filter Example 3 19 Using the _nassert Intrinsic to Generate Word Accesses for Vector Su...

Page 101: ...om Example 3 20 the optimization done by the compiler is not as optimal as the code produced in Example 3 13 but it is more optimal than the code in Example 3 12 Example 3 21 Compiler Output From Example 3 20 L3 PIPED LOOP KERNEL B0 ADD L1 A9 A7 A6 A7 A6 21 MPY M2X A3 B3 B2 21 MPYHL M1X B3 A0 A0 21 A1 B S2 L3 21 LDH D2T2 B9 8 B3 21 LDH D1T1 A8 4 A3 21 B0 ADD L2 B3 B5 B4 B5 B4 21 MPY M1X A0 B1 A9 2...

Page 102: ...B3 B7 B6 B7 B6 MPYH M1X B2 A8 A3 MPYHL M2X A8 B9 B3 A1 SUB S1 A1 1 A1 B0 LDW D1T1 A0 A8 B0 LDW D2T2 B8 B9 Example 3 23 Compiler Output From Example 3 12 L4 PIPED LOOP KERNEL A2 SUB S1 A2 1 A2 ADD L1 A5 A1 A0 A1 A0 MPY M1X B5 A4 A5 B0 B S2 L4 B0 SUB L2 B0 1 B0 A2 LDH D1T1 A3 A4 A2 LDH D2T2 B4 B5 Note The _nassert intrinsic may not solve all of your short to int or float to double accesses but it ca...

Page 103: ...f Word Accesses Without the _nassert Intrinsic file1 c int dotp short restrict a short restrict b int c int sum 0 i for i 0 i c i sum a i b i return sum file2 c include stdio h short x 40 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 short y 40 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 ...

Page 104: ...uses word accesses and the C6000 intrinsics L2 PIPED LOOP KERNEL A1 ADD L2 B6 B7 B7 A1 ADD L1 A6 A0 A0 MPY M2X B5 A4 B6 MPYH M1X B5 A4 A6 B0 B S1 L2 LDW D1T1 A5 4 A4 LDW D2T2 B4 4 B6 A1 SUB S1 A1 1 A1 A1 ADD S2 B5 B8 B8 A1 ADD L1 A6 A3 A3 MPY M2X B6 A4 B5 MPYH M1X B6 A4 A6 B0 SUB L2 B0 1 B0 LDW D1T1 A5 8 A4 LDW D2T2 B4 8 B5 ...

Page 105: ...ive iterations of the loop can execute at one time The shaded area represents the loop ker nel In the loop kernel all five stages execute in parallel The area immediately before the kernel is known as the pipelined loop prolog and the area immedi ately following the kernel is known as the pipelined loop epilog Figure 3 2 Software Pipelined Loop A1 B1 A2 C1 B2 A3 Pipelined loop prolog D1 C2 B3 A4 E...

Page 106: ... nec essary to safely execute the software pipelined version of the loop All software pipelined loops have a minimum safe trip count requirement If the known minimum trip count is not above the minimum safe trip count redundant loops will be generated The known minimum trip count and the minimum safe trip count for a given software pipelined loop can be found in the compiler generated comment bloc...

Page 107: ...ated In order to help the compiler generate only the software pipelined version of the loop use the MUST_ITERATE pragma and or the pm option to help the compiler determine the known minimum trip count Note Use of ms0 or ms1 may result in a performance degredation Using ms0 or ms1 may cause the compiler not to software pipeline a loop This can cause the performance of the loop to suffer When safe t...

Page 108: ...s equal some value This loop will always execute exactly 30 times pragma MUST_ITERATE 30 30 for j 0 j x j The MUST_ITERATE pragma can convey that the trip count will be great er than some minimum value or smaller than some maximum value The latter is useful when interrupts need to occur inside of loops and you are using the mi n option Refer to section 7 4 Interruptible Loops This loop will always...

Page 109: ... loops so that each iteration of the loop appears in your code This optimization increases the number of instructions available to execute in parallel You can use loop unrolling when the operations in a single iteration do not use all of the resources of the C6000 architecture There are three ways loop unrolling can be performed 1 The compiler may automatically unroll the loop 2 You can suggest th...

Page 110: ...oop then performs six memory operations per iteration which means the unrolled vector sum loop can deliver four results ev ery three cycles that is 1 33 results per cycle Example 3 28 shows four re sults for each iteration of the loop sum i and sum i sz each store an int value that represents two 16 bit values Example 3 28 is not simple loop unrolling where the loop body is simply repli cated The ...

Page 111: ...rict const short coefs restrict short out restrict int i j int sum 0 for i 0 i 40 i for j 0 j 16 j sum coefs j input i 15 j out i sum 15 For loops with a simple loop structure the compiler uses a heuristic to deter mine if it should unroll the loop Because unrolling can increase code size in some cases the compiler does not unroll the loop If you have identified this loop as being critical to your...

Page 112: ... ing the software pipeline occurs only once per invocation of the function rather than for each iteration of the outer loop The heuristic the compiler uses to determine if it should unroll the loops needs to know either of the following pieces of information Without knowing either of these the compiler will never unroll a loop The exact trip count of the loop The trip count of the loop is some mul...

Page 113: ...2 for i 0 i n i a i b i c i compiler output for above code L2 PIPED LOOP KERNEL ADD L1X B7 A3 A3 5 B0 B S1 L2 5 LDH D1T1 A4 4 A3 5 LDH D2T2 B4 4 B7 5 A1 STH D1T1 A3 A0 4 5 ADD L2X B6 A5 B6 5 LDH D2T2 B4 2 B6 5 A1 SUB L1 A1 1 A1 A1 STH D2T2 B6 B5 4 5 B0 SUB L2 B0 1 B0 5 LDH D1T1 A4 2 A5 5 Note When the interrupt threshold option is used unrolling can be used to regain lost performance Refer to sect...

Page 114: ...are pipelined See section 6 6 6 2 Live Too Long on page 6 67 and section 6 10 Live Too Long Issues on page 6 101 for examples of code that is live too long If the loop has complex condition code within the body that requires more than the five C6000 condition registers on the C62x and C67x or six con dition registers for the C64x the loop is not software pipelined Try to elim inate or combine thes...

Page 115: ...ever the compiler can actually perform some transformations and software pipeline this loop better than it can the modified code in Example 3 33 Example 3 32 Use of If Statements in Float Collision Detection Original Code int colldet const float restrict x const float restrict p float point float distance int I retval 0 float sum0 sum1 dist0 dist1 for I 0 I 28 3 I 6 sum0 x I 0 p 0 x I 1 p 1 x I 2 ...

Page 116: ...3 option to convert as many loops as possible into down counting loops If the trip counter is modified within the body of the loop it typically cannot be converted into a downcounting loop If possible rewrite the loop to not modify the trip counter For example the following code will not software pipeline for i 0 i n i i x A conditionally incremented loop control variable is not software pipelined...

Page 117: ...do with the relocation value truncated linker and assembler mes sages How to save on chip memory by moving the RTS off chip How to build your application with RTS calls either near or far How to change the default RTS data from far to near Topic Page 4 1 How to Use Linker Error Messages 4 2 4 2 How to Save On Chip Memory by Placing RTS Off Chip 4 6 Chapter 4 ...

Page 118: ... file obj Or an MVK can be the source of this message Signed 16 bit relocation out of range value truncated Located in file obj section text SPC offset 000000a4 4 1 1 How to Find The Problem These messages are similar The file is file obj the section is text and the SPC offset is 0xa4 If this happens to you when you are linking C code here is what you do to find the problem Recompile the C source ...

Page 119: ...our linker command file looking at a map file usually helps so that all the calls to atoi are close within 0x100000 words to where atoi is linked 4 1 1 2 Far Global Data If the problem instruction is an MVK then you need to understand why the constant expression does not fit For C code you might find the instruction looks like 50 000000a4 0200002A MVK _ary bss B4 5 In this case the address of the ...

Page 120: ... missing then the compiler will incorrectly assume that ary is in bss and can be accessed via the data page pointer extern far in ary ary 4 1 1 3 The MVKL Mnemonic If the MVK instruction is just a simple load of an address 123 000000a4 0200002A MVK sym B4 Then the linker warning message is telling you that sym is greater than 32767 and you will end up with something other than the value of sym in ...

Page 121: ...which have yet to be changed to MVKL then this warning may safely be ignored The loaders supplied by TI will still load and execute this out file If you implement your own loader please be aware this warning message means the F_EXEC flag in the file header is not set If your loader depends on this flag then you will have to fix your MVK instructions or use the switches described above to turn off ...

Page 122: ... floating point math on the C62x and C64x Example _divu performs 32 bit unsigned divide near calls Function calls performed with a ordinary PC relative branch instruction The destination of such branches must be within 1 048 576 0x100000 words of the branch Such calls use 1 instruction word and 1 cycle far calls Function calls performed by loading the address of the function into a register and th...

Page 123: ...header file which corre sponds to that function For instance when you call memcmp you must in clude string h If you do not include the header the memcmp call looks like a normal user call to the compiler and the effect of using mr1 does not occur 4 2 3 RTS Data Most RTS functions do not have any data of their own Data is typically passed as arguments or through pointers However a few functions do ...

Page 124: ...ch puts RTS off chip c heap 0x2000 stack 0x4000 Memory Map 1 the default MEMORY PMEM o 00000000h l 00010000h EXT0 o 00400000h l 01000000h EXT1 o 01400000h l 00400000h EXT2 o 02000000h l 01000000h EXT3 o 03000000h l 01000000h BMEM o 80000000h l 00010000h SECTIONS Sections defined only in RTS stack BMEM sysmem BMEM cio EXT0 Sections of user code and data text PMEM bss BMEM const BMEM data BMEM switc...

Page 125: ...te these sec tions separately from user sections Typically you place the stack system stack and sysmem heap of memory used by malloc etc sections in on chip memory for performance reasons The cio section is a buffer used by printf and related functions You can typically afford slower performance of such I O functions so it is placed in off chip memory The rtstext section collects all the text or c...

Page 126: ... to avoid the message If your application changes such that you later do include an RTS function with a switch section it will be linked next to the switch sections from your code This is fine except it is taking up that valuable on chip memory So you may want to check for this situation occasionally by looking at the linker map file you create with the m linker option Note Library Listed in Comma...

Page 127: ...TS Default Same as user Same as user Undefined mr0 Near Near 0 mr1 Far Far 1 The _DATA_ACCESS macro is set to always be far The _IDECL macro determines how inline functions are declared All of the RTS header files which define functions or data include linkage h header file Functions are modified with _CODE_ACCESS extern _CODE_ACCESS void exit int _status and data is modified with _DATA_ACCESS ext...

Page 128: ...o the lib directory Replace the linkage h entry in the source library ar6x r rts src linkage h Delete linkage h Rename or delete the object library you use when linking Rebuild the object library you use with the library build command listed in the readme file for that release Note that you will have to perform this process each time you install an update of the code generation toolset ...

Page 129: ...Any line of assembly code can include up to seven items Label Parallel bars Conditions Instruction Functional unit Operands Comment Topic Page 5 1 Labels 5 2 5 2 Parallel Bars 5 2 5 3 Conditions 5 3 5 4 Instructions 5 4 5 5 Functional Units 5 5 5 6 Operands 5 8 5 7 Comments 5 9 Chapter 5 ...

Page 130: ...nditions The first character of a label must be a letter or an underscore _ followed by a letter The first character of the label must be in the first column of the text file Labels can include up to 32 alphanumeric characters 5 2 Parallel Bars An instruction that executes in parallel with the previous instruction signifies this with parallel bars This field is left blank for an instruction that d...

Page 131: ...ons in Assembly Code label condition instruction unit operands comments parallel bars All C6000 instructions are conditional If no condition is specified the instruction is always performed If a condition is specified and that condition is true the instruction executes For example With this condition The instruction executes if A1 A1 0 A1 A1 0 If a condition is specified and that condition is fals...

Page 132: ...et User s Guide Figure 5 4 shows the position of the instruction in a line of assembly code Figure 5 4 Instructions in Assembly Code label condition instruction unit operands comments parallel bars Table 5 1 Selected TMS320C6x Directives Directives Description sect name Creates section of information data or code double value Reserve two consecutive 32 bits 64 bits in memory and fill with double p...

Page 133: ...ssembly Code 5 5 Functional Units The C6000 CPU contains eight functional units which are shown in Figure 5 5 and described in Table 5 2 Figure 5 5 TMS320C6x Functional Units Memory Register file A M2 L2 S2 D2 Register file B D1 M1 L1 S1 ...

Page 134: ...min max operations Quad 8 bit min max operations Arithmetic operations DP SP INT DP INT SP conversion operations S unit S1 S2 32 bit arithmetic operations 32 40 bit shifts and 32 bit bit field operations 32 bit logical operations Branches Constant generation Register transfers to from control register file S2 only Byte shifts Data packing unpacking Dual 16 bit compare operations Quad 8 bit compare...

Page 135: ...lation Loads and stores with 5 bit constant offset Loads and stores with 15 bit constant offset D2 only Dual 16 bit arithmetic operations Load and store double words with 5 bit constant Load and store non aligned words and double words 5 bit constant generation 32 bit logical operations Load doubleword with 5 bit constant offset Note Fixed pointoperations are available on all three devices Floatin...

Page 136: ...e packet can come from the register file opposite that of the other source operand When an operand comes from the other register file the unit includes an X as shown in Figure 5 8 indicating that the instruction is using one of the cross paths See the TMS320C6000 CPU and Instruction Set Reference Guide for more information on cross paths Figure 5 8 Operands in Instructions L1 A0 A1 A3 L1X A0 B1 A3...

Page 137: ... in a line of assembly code Figure 5 9 Comments in Assembly Code label condition instruction unit operands comments parallel bars The following are guidelines for using comments in assembly code A comment may begin in any column when preceded by a semicolon A comment must begin in first column when preceded by an asterisk Comments are not required but are recommended ...

Page 138: ... been hand optimized in order to direct your attention to particular coding issues The actual output from the assembly optimizer may look different depending on the version you are us ing Topic Page 6 1 Assembly Code 6 2 6 2 Assembly Optimizer Options and Directives 6 4 6 3 Writing Parallel Code 6 9 6 4 Using Word Access for Short Data and Doubleword Access for Floating Point Data 6 19 6 5 Softwar...

Page 139: ...anually Each section introduces optimization techniques in increasing complexity Section 6 3 and section 6 4 begin with a dot product algorithm to show you how to translate the C code to assembly code and then how to optimize the linear assembly code with several simple techniques Section 6 5 and section 6 6 introduce techniques for the more complex al gorithms associated with software pipelining ...

Page 140: ... the C C compiler linear assembly code which is input for the assembly optimizer and assembly code which is input for the assembler In the three sections following section 6 2 we use the dot product to demon strate how to use various programming techniques to optimize both perfor mance and code size Most of the examples provided in this book use fixed point arithmetic however the three sections fo...

Page 141: ...loop ldw reg1 reg2 add reg2 reg3 reg4 stw reg4 reg5 reg6 add 1 reg6 reg6 reg6 b loop The assembly optimizer will make sure that each store to reg5 completes be fore the next load of reg1 A suboptimal loop would result if the store to ad dress in reg5 in not in the next location to be read by reg1 For loops where reg5 is pointing to the next location of reg1 this is necessary and implies that the l...

Page 142: ...is a memory dependence from the LDW instruction to the STW instruction This means that the STW instruction must come after the LDW instruction The mdep directive does not imply that there is a memory dependence from the STW to the LDW Another mdep directive would be needed to handle that case 6 2 4 The mptr Directive The mptr directive gives the assembly optimizer information on how to avoid memor...

Page 143: ...1 ADD L2 B4 B6 B4 MPY M2X B7 A0 B6 B0 B S1 loop LDH D2T2 B5 2 B6 LDH D1T1 A4 2 A0 A1 SUB S1 A1 1 A1 A1 ADD L1 A5 A3 A5 MPY M1X B6 A0 A3 B0 ADD L2 1 B0 B0 LDH D2T2 B5 4 B7 LDH D1T1 A4 4 A0 If the arrays pointed to by ptr_a and ptr_b begin on the same bank then there will be memory bank conflicts at every cycle of the loop due to how the LDH instructions are paired By adding the mptr directive infor...

Page 144: ...b x 4 loop trip 20 20 ldh ptr_a val1 ldh ptr_b val2 mpy val1 val2 prod1 add sum1 prod1 sum1 ldh ptr_a val3 ldh ptr_b val4 mpy val3 val4 prod2 add sum2 prod2 sum2 cnt add 1 cnt cnt cnt b loop add sum1 sum2 sum1 return sum1 endproc loop kernel generated loop PIPED LOOP KERNEL A1 ADD L2 B4 B6 B4 MPY M2X B8 A0 B6 B0 B S1 loop LDH D2T2 B5 4 B8 LDH D1T1 A4 2 A0 A1 SUB S1 A1 1 A1 A1 ADD L1 A5 A3 A5 MPY M...

Page 145: ...he trip directive with just the first para meter This example tells the assembly optimizer that the loop will iterate at least ten times loop trip 10 You can also tell the assembly optimizer that your loop will execute exactly some number of times by setting the minimum_value and maximum_value pa rameters to exactly the same value This next example tells the assembly opti mizer that the loop will ...

Page 146: ...product is a sum in which each element in array a is multiplied by the corresponding element in array b Each of these products is then accumulated into sum The C code in Example 6 5 is a fixed point dot product algorithm The C code in Example 6 6 is a floating point dot product algorithm Example 6 5 Fixed Point Dot Product C Code int dotp short a short b int sum i sum 0 for i 0 i 100 i sum a i b i...

Page 147: ...tes the total of the results from the multiply MPY instruc tion The subtract SUB instruction decrements the loop counter An additional instruction is included to execute the branch back to the top of the loop The branch B instruction is conditional on the loop counter A1 and executes only until A1 is 0 6 3 2 2 Floating Point Dot Product Example 6 8 shows the linear assembly instructions used for t...

Page 148: ...unit Branch B instructions must use a S unit Note The ADD and SUB can be on the S L or D units however for Example 6 7 and Example 6 8 they are assigned as listed above The ADDSP instruction in Example 6 8 must use a L unit 6 3 4 Drawing a Dependency Graph Dependency graphs can help analyze loops by showing the flow of instruc tions and data in an algorithm These graphs also show how instructions ...

Page 149: ...A7 A1 M1 L1 D1 D1 S1 S1 Number of cycles required to complete an instruction Variable being written Instruction mnemonic Functional unit Register allocation The two LDH instructions which write the values of ai and bi are parents of the MPY instruction It takes five cycles for the parent LDH instruction to complete Therefore if LDH is scheduled on cycle i then its child MPY cannot be scheduled unt...

Page 150: ... Dot Product ai bi 4 4 5 5 SUB ADDSP MPYSP LDW LDW pi sum 1 B cntr LOOP 1 A2 A5 A6 A7 A1 M1 L1 D1 D1 S1 S1 Number of cycles required to complete an instruction Variable being written Instruction mnemonic Functional unit Register allocation The two LDW instructions which write the values of ai and bi are parents of the MPYSP instruction It takes five cycles for the parent LDW instruc tion to comple...

Page 151: ...er to 100 The ZERO instruction clears the accumulator The NOP instructions allow for the delay slots of the LDH MPY and B instructions Executing this dot product code serially requires 16 cycles for each iteration plus two cycles to set up the loop counter and initialize the accumulator 100 it erations require 1602 cycles Example 6 9 Nonparallel Assembly Code for Fixed Point Dot Product MVK S1 100...

Page 152: ... decrement loop counter A1 B S2 LOOP branch to loop NOP 2 delay slots for LDH MPY M1X A2 B2 A6 ai bi NOP delay slots for MPY ADD L1 A6 A7 A7 sum ai bi Branch occurs here Because the loads of ai and bi do not depend on one another both LDH instructions can execute in parallel as long as they do not share the same resources To schedule the load instructions in parallel allocate the functional units ...

Page 153: ...ay slots of the LDW ADDSP MPYSP and B instructions Executing this dot product code serially requires 21 cycles for each iteration plus two cycles to set up the loop counter and initialize the accumulator 100 it erations require 2102 cycles Example 6 11 Nonparallel Assembly Code for Floating Point Dot Product MVK S1 100 A1 set up loop counter ZERO L1 A7 zero out accumulator LOOP LDW D1 A4 A2 load a...

Page 154: ...decrement loop counter NOP 2 delay slots for LDW A1 B S2 LOOP branch to loop MPYSP M1X A2 B2 A6 ai bi NOP 3 delay slots for MPYSP ADDSP L1 A6 A7 A7 sum ai bi Branch occurs here Because the loads of ai and bi do not depend on one another both LDW instructions can execute in parallel as long as they do not share the same resources To schedule the load instructions in parallel allocate the functional...

Page 155: ...or 100 iterations require 801 cycles Table 6 1 compares the performance of the nonparallel code with the parallel code for the fixed point example Table 6 1 Comparison of Nonparallel and Parallel Assembly Code for Fixed Point Dot Product Code Example 100 Iterations Cycle Count Example 6 9 Fixed point dot product nonparallel assembly 2 100 16 1602 Example 6 10 Fixed point dot product parallel assem...

Page 156: ...eword LDDW instruction to read a i and a i 1 at the same time and load both into a register pair The data must be doubleword aligned in memory See the TMS320C6000 CPU and In struction Set User s Guide for more specific information on the LDDW instruc tion Note The load doubleword LDDW instruction is available on the C64x fixed point and C67x floating point device 6 4 1 Unrolled Dot Product C Code ...

Page 157: ...c variable names are used instead of ac tual registers Using symbolic names for data and pointers makes code easier to write and allows the optimizer to allocate registers However you must use the reg assembly optimizer directive See the TMS320C6000 Optimizing C C Compiler User s Guide for more information on writing linear assembly code Example 6 15 Linear Assembly for Fixed Point Dot Product Inn...

Page 158: ...rolled floating point dot product loop Symbolic variable names are used instead of actual registers Using symbolic names for data and pointers makes code eas ier to write and allows the optimizer to allocate registers However you must use the reg assembly optimizer directive See the TMS320C6000 Optimizing C C Compiler User s Guide for more information on writing linear assembly code Example 6 16 L...

Page 159: ...of LDWs MPYs and ADDs on each side To keep both sides even place the remaining two instructions B and SUB on opposite sides Figure 6 5 Dependency Graph of Fixed Point Dot Product With LDW B side A side bi bi 1 ai ai 1 pi 1 pi 5 5 5 5 2 2 sum0 sum1 cntr LOOP 1 1 1 1 LDW MPY MPYH ADD SUB B ADD LDW Similarly the dependency graph in Figure 6 6 for the floating point dot prod uct shows that the LDDW in...

Page 160: ... LDDW B side A side bi bi 1 ai ai 1 pi 1 pi 5 5 5 5 4 4 sum0 sum1 cntr LOOP 4 4 1 1 LDDW MPYSP MPYSP ADDSP SUB B ADDSP LDDW 6 4 4 Linear Assembly Resource Allocation After splitting the dependency graph for both the fixed point and floating point dot products you can assign functional units and registers as shown in the dependency graphs in Figure 6 7 and Figure 6 8 and in the instructions in Exam...

Page 161: ... 1 pi 5 5 5 5 2 ADD SUB sum0 cntr LOOP 1 B MPYH MPY 1 1 D1 D2 M1X L1 S1 S2 Example 6 17 Linear Assembly for Fixed Point Dot Product Inner Loop With LDW With Allocated Resources LDW D1 A4 A2 load ai and ai 1 from memory LDW D2 B4 B2 load bi and bi 1 from memory MPY M1X A2 B2 A6 ai bi MPYH M2X A2 B2 B6 ai 1 bi 1 ADD L1 A6 A7 A7 sum0 ai bi ADD L2 B6 B7 B7 sum1 ai 1 bi 1 SUB S1 A1 1 A1 decrement loop ...

Page 162: ... bi bi 1 ai ai 1 pi 5 5 5 5 4 ADDSP SUB sum0 cntr LOOP 4 B MPYSP MPYSP 1 1 D1 D2 M1X L1 S1 S2 Example 6 18 Linear Assembly for Floating Point Dot Product Inner Loop With LDDW With Allocated Resources LDDW D1 A4 A3 A2 load ai and ai 1 from memory LDDW D2 B4 B3 B2 load bi and bi 1 from memory MPYSP M1X A2 B2 A6 ai bi MPYSP M2X A3 B3 B6 ai 1 bi 1 ADDSP L1 A6 A7 A7 sum0 ai bi ADDSP L2 B6 B7 B7 sum1 ai...

Page 163: ...A4 A2 load ai ai 1 from memory LDW D2 B4 B2 load bi bi 1 from memory SUB S1 A1 1 A1 decrement loop counter A1 B S1 LOOP branch to loop NOP 2 MPY M1X A2 B2 A6 ai bi MPYH M2X A2 B2 B6 ai 1 bi 1 NOP ADD L1 A6 A7 A7 sum0 ai bi ADD L2 B6 B7 B7 sum1 ai 1 bi 1 Branch occurs here ADD L1X A7 B7 A4 sum sum0 sum1 The code in Example 6 19 includes the following optimizations The setup code for the loop is inc...

Page 164: ...oop counter NOP 2 A1 B S1 LOOP branch to loop MPYSP M1X A2 B2 A6 ai bi MPYSP M2X A3 B3 B6 ai 1 bi 1 NOP 3 ADDSP L1 A6 A7 A7 sum0 ai bi ADDSP L2 B6 B7 B7 sum1 ai 1 bi 1 Branch occurs here NOP 3 ADDSP L1X A7 B7 A4 sum sum0 sum1 NOP 3 The code in Example 6 20 includes the following optimizations The setup code for the loop is included to initialize the array pointers and the loop counter and to clear...

Page 165: ...Example 6 10 Fixed point dot product parallel assembly 1 100 8 801 Example 6 19 Fixed point dot product parallel assembly with LDW 1 50 8 1 402 Executing the floating point dot product with the optimizations in Example 6 20 requires only 50 iterations because you operate in parallel on both the even and odd array elements With the setup code and the final ADDSP instruction 100 iterations of this l...

Page 166: ...mbly code in Example 6 17 on page 6 24 except that the SUB instruction is now condition al on the loop counter A1 Note Making the SUB instruction conditional on A1 ensures that A1 stops decre menting when it reaches 0 Otherwise as the loop executes five more times the loop counter becomes a negative number When A1 is negative it is non zero and therefore causes the condition on the branch to be tr...

Page 167: ...B MPYH MPY 1 1 D1 D2 M2X L2 M1X L1 S1 S2 Example 6 21 Linear Assembly for Fixed Point Dot Product Inner Loop With Conditional SUB Instruction LDW D1 A4 A2 load ai and ai 1 from memory LDW D2 B4 B2 load bi and bi 1 from memory MPY M1X A2 B2 A6 ai bi MPYH M2X A2 B2 B6 ai 1 bi 1 ADD L1 A6 A7 A7 sum0 ai bi ADD L2 B6 B7 B7 sum1 ai 1 bi 1 A1 SUB S1 A1 1 A1 decrement loop counter A1 B S2 LOOP branch to t...

Page 168: ... sum1 cntr LOOP 4 4 B MPYSP MPYSP 1 1 D1 D2 M2X L2 M1X L1 S1 S2 Example 6 22 Linear Assembly for Floating Point Dot Product Inner Loop With Conditional SUB Instruction LDDW D1 A4 A2 load ai and ai 1 from memory LDDW D2 B4 B2 load bi and bi 1 from memory MPYSP M1X A2 B2 A6 ai bi MPYSP M2X A2 B2 B6 ai 1 bi 1 ADDSP L1 A6 A7 A7 sum0 ai bi ADDSP L2 B6 B7 B7 sum1 ai 1 bi 1 A1 SUB S1 A1 1 A1 decrement lo...

Page 169: ...able 6 5 shows a modulo iteration interval scheduling table for the fixed point dot product loop before software pipelining Example 6 19 Each row repre sents a functional unit There is a column for each cycle in the loop showing the instruction that is executing on a particular cycle LDWs on the D units are issued on cycles 0 8 16 24 etc MPY and MPYH on the M units are issued on cycles 5 13 21 29 ...

Page 170: ...p show ing the instruction that is executing on a particular cycle LDDWs on the D units are issued on cycles 0 10 20 30 etc MPYSPs and on the M units are issued on cycles 5 15 25 35 etc ADDSPs on the L units are issued on cycles 9 19 29 39 etc SUB on the S1 unit is issued on cycles 3 13 23 33 etc B on the S2 unit is issued on cycles 4 14 24 34 etc Table 6 6 Modulo Iteration Interval Scheduling Tab...

Page 171: ... instructions using the same resource cannot execute in parallel and therefore require at least four separate cycles to execute each instruction With the SUB and branch instructions on opposite sides of the dependency graph in Figure 6 9 and Figure 6 10 all eight instructions use a different func tional unit and no two instructions use the same cross paths 1X and 2X Because no two instructions use...

Page 172: ...re loop setup code or loop prolog Asterisks define which iteration of the loop the instruction is executing each cycle For example the rightmost column shows that on any given cycle inside the loop The ADD instructions are adding data for iteration n The MPY instructions are multiplying data for iteration n 2 The LDW instructions are loading data for iteration n 7 The SUB instruction is executing ...

Page 173: ...ÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁ ÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁÁ Note The asterisks indicate the iteration of the loop shading indicates the single cycle loop The rightmost column in Table 6 8 is a single cycle loop that contains the entire loop Cycles 0 8 are loop setup code or loop prolog Asterisks define which iteration of the loop the instruction is executing each cycle For example the rightmost co...

Page 174: ...instruction that is multicycle that is has delay slots other than 0 you must either unroll the loop or stagger the results When unrolling the loop multiple accumulators collect the results so that one result has finished executing and has been written into the accumulator before adding the next result of the accumulator If you do not unroll the loop then the accumulator will contain staggered resu...

Page 175: ...i 4 j x i j S S S where i is a multiple of 4 The first value of the array x x 0 is added to the accumulator sum on cycle 0 but the result is not ready until cycle 4 This means that on cycle 1 when x 1 is added to the accumulator sum sum has no value in it from x 0 Thus when this result is ready on cycle 5 sum will have the value x 1 in it instead of the value x 0 x 1 When you reach cycle 4 sum wil...

Page 176: ...ptimizer Example 6 24 Linear Assembly for Full Fixed Point Dot Product global _dotp _dotp cproc a b reg sum sum0 sum1 cntr reg ai_i1 bi_i1 pi pi1 MVK 50 cntr cntr 100 2 ZERO sum0 multiply result 0 ZERO sum1 multiply result 0 LOOP trip 50 LDW a ai_i1 load ai ai 1 from memory LDW b bi_i1 load bi bi 1 from memory MPY ai_i1 bi_i1 pi ai bi MPYH ai_i1 bi_i1 pi1 ai 1 bi 1 ADD pi sum0 sum0 sum0 ai bi ADD ...

Page 177: ...l result return sum endproc 6 5 3 Final Assembly Example 6 26 shows the assembly code for the fixed point software pipe lined dot product in Table 6 7 on page 6 35 Example 6 27 shows the assem bly code for the floating point software pipelined dot product in Table 6 8 on page 6 36 The accumulators are initialized to 0 and the loop counter is set up in the first execute packet in parallel with the ...

Page 178: ... 7 a branch executes back to LOOP until the loop counter finally decrements to 0 Once the loop counter is 0 five more branches execute because they are already in the pipe Executing the dot product code with the software pipelining as shown in Example 6 26 requires a total of 58 cycles 7 50 1 which is a significant improvement over the 402 cycles required by the code in Example 6 19 Note The code ...

Page 179: ...B4 B2 load bi bi 1 from memory A1 SUB S1 A1 1 A1 decrement loop counter A1 B S2 LOOP branch to loop LDW D1 A4 A2 load ai ai 1 from memory LDW D2 B4 B2 load bi bi 1 from memory MPY M1X A2 B2 A6 ai bi MPYH M2X A2 B2 B6 ai 1 bi 1 A1 SUB S1 A1 1 A1 decrement loop counter A1 B S2 LOOP branch to loop LDW D1 A4 A2 ld ai ai 1 from memory LDW D2 B4 B2 ld bi bi 1 from memory MPY M1X A2 B2 A6 ai bi MPYH M2X ...

Page 180: ... for Floating Point Dot Product Software Pipelined MVK S1 50 A1 set up loop counter ZERO L1 A8 sum0 0 ZERO L2 B8 sum1 0 LDDW D1 A4 A7 A6 load ai ai 1 from memory LDDW D2 B4 B7 B6 load bi bi 1 from memory LDDW D1 A4 A7 A6 load ai ai 1 from memory LDDW D2 B4 B7 B6 load bi bi 1 from memory LDDW D1 A4 A7 A6 load ai ai 1 from memory LDDW D2 B4 B7 B6 load bi bi 1 from memory LDDW D1 A4 A7 A6 load ai ai ...

Page 181: ...OOP branch to loop A1 SUB S1 A1 1 A1 decrement loop counter LOOP LDDW D1 A4 A7 A6 load ai ai 1 from memory LDDW D2 B4 B7 B6 load bi bi 1 from memory MPYSP M1X A6 B6 A5 pi a0 b0 MPYSP M2X A7 B7 B5 pi1 a1 b1 ADDSP L1 A5 A8 A8 sum0 ai bi ADDSP L2 B5 B8 B8 sum1 ai 1 bi 1 A1 B S2 LOOP branch to loop A1 SUB S1 A1 1 A1 decrement loop counter Branch occurs here ADDSP L1X A8 B8 A0 sum 0 sum0 0 sum1 0 ADDSP...

Page 182: ...pilog like that included in the second part of Example 6 28 on page 6 47 and Example 6 29 on page 6 48 Fixed Point Example To eliminate LDWs in the fixed point dot product from iterations 51 through 57 run the loop seven fewer times This brings the loop counter to 43 50 7 which means you still must execute seven more cycles of ADD instructions and five more cycles of MPY instructions Five pairs of...

Page 183: ...B4 B2 load bi bi 1 from memory A1 SUB S1 A1 1 A1 decrement loop counter A1 B S2 LOOP branch to loop LDW D1 A4 A2 load ai ai 1 from memory LDW D2 B4 B2 load bi bi 1 from memory A1 SUB S1 A1 1 A1 decrement loop counter A1 B S2 LOOP branch to loop LDW D1 A4 A2 load ai ai 1 from memory LDW D2 B4 B2 load bi bi 1 from memory A1 SUB S1 A1 1 A1 decrement loop counter A1 B S2 LOOP branch to loop LDW D1 A4 ...

Page 184: ...ADD L2 B6 B7 B7 sum1 ai 1 bi 1 MPY M1X A2 B2 A6 ai bi MPYH M2X A2 B2 B6 ai 1 bi 1 1 1 ADD L1 A6 A7 A7 sum0 ai bi ADD L2 B6 B7 B7 sum1 ai 1 bi 1 MPY M1X A2 B2 A6 ai bi MPYH M2X A2 B2 B6 ai 1 bi 1 2 2 ADD L1 A6 A7 A7 sum0 ai bi ADD L2 B6 B7 B7 sum1 ai 1 bi 1 MPY M1X A2 B2 A6 ai bi MPYH M2X A2 B2 B6 ai 1 bi 1 3 3 ADD L1 A6 A7 A7 sum0 ai bi ADD L2 B6 B7 B7 sum1 ai 1 bi 1 MPY M1X A2 B2 A6 ai bi MPYH M2...

Page 185: ...op A1 SUB S1 A1 1 A1 decrement loop counter LDDW D1 A4 A7 A6 load ai ai 1 from memory LDDW D2 B4 B7 B6 load bi bi 1 from memory MPYSP M1X A6 B6 A5 pi a0 b0 MPYSP M2X A7 B7 B5 pi1 a1 b1 A1 B S2 LOOP branch to loop A1 SUB S1 A1 1 A1 decrement loop counter LDDW D1 A4 A7 A6 load ai ai 1 from memory LDDW D2 B4 B7 B6 load bi bi 1 from memory MPYSP M1X A6 B6 A5 pi a0 b0 MPYSP M2X A7 B7 B5 pi1 a1 b1 A1 B ...

Page 186: ...SP L1 A5 A8 A8 sum0 ai bi ADDSP L2 B5 B8 B8 sum1 ai 1 bi 1 MPYSP M1X A6 B6 A5 pi a0 b0 MPYSP M2X A7 B7 B5 pi1 a1 b1 ADDSP L1 A5 A8 A8 sum0 ai bi ADDSP L2 B5 B8 B8 sum1 ai 1 bi 1 MPYSP M1X A6 B6 A5 pi a0 b0 MPYSP M2X A7 B7 B5 pi1 a1 b1 ADDSP L1 A5 A8 A8 sum0 ai bi ADDSP L2 B5 B8 B8 sum1 ai 1 bi 1 MPYSP M1X A6 B6 A5 pi a0 b0 MPYSP M2X A7 B7 B5 pi1 a1 b1 ADDSP L1 A5 A8 A8 sum0 ai bi ADDSP L2 B5 B8 B8...

Page 187: ... Continued ADDSP L1X A8 B8 A0 sum 0 sum0 0 sum1 0 ADDSP L2X A8 B8 B0 sum 1 sum0 1 sum1 1 ADDSP L1X A8 B8 A0 sum 2 sum0 2 sum1 2 ADDSP L2X A8 B8 B0 sum 3 sum0 3 sum1 3 NOP wait for B0 ADDSP L1X A0 B0 A5 sum 01 sum 0 sum 1 NOP wait for next B0 ADDSP L2X A0 B0 B5 sum 23 sum 2 sum 3 NOP 3 ADDSP L1X A5 B5 A4 sum sum 01 sum 23 NOP 3 ...

Page 188: ...prolog means that Two LDWs two MPYs and two ADDs occur in the first execution cycle of the loop Because the first LDWs require five cycles to write results into a register the MPYs do not multiply valid data until after the loop executes five times The ADDs have no valid data until after seven cycles five cycles for the first LDWs and two more cycles for the first valid MPYs Example 6 30 shows the...

Page 189: ... 1 A1 decrement loop counter A1 B S2 LOOP branch to loop ZERO L1 A2 zero out mpy input ZERO L2 B2 zero out mpy input A1 SUB S1 A1 1 A1 decrement loop counter A1 B S2 LOOP branch to loop A1 SUB S1 A1 1 A1 decrement loop counter A1 B S2 LOOP branch to loop A1 SUB S1 A1 1 A1 decrement loop counter A1 B S2 LOOP branch to loop LOOP ADD L1 A6 A7 A7 sum0 ai bi ADD L2 B6 B7 B7 sum1 ai 1 bi 1 MPY M1X A2 B2...

Page 190: ... valid data is available ensures that the final accumulator values are unaffected The loop counter is initialized to 59 to accommodate the nine extra cycles needed to prime the loop Because the first LDDWs are not issued until after nine cycles the code in Example 6 31 requires a total of 81 cycles 7 59 15 Therefore you are reducing the code size with a slight loss in performance Example 6 31 Asse...

Page 191: ...load bi bi 1 from memory MPYSP M1X A6 B6 A5 pi a0 b0 MPYSP M2X A7 B7 B5 pi1 a1 b1 ADDSP L1 A5 A8 A8 sum0 ai bi ADDSP L2 B5 B8 B8 sum1 ai 1 bi 1 A1 B S2 LOOP branch to loop A1 SUB S1 A1 1 A1 decrement loop counter Branch occurs here ADDSP L1X A8 B8 A0 sum 0 sum0 0 sum1 0 ADDSP L2X A8 B8 B0 sum 1 sum0 1 sum1 1 ADDSP L1X A8 B8 A0 sum 2 sum0 2 sum1 2 ADDSP L2X A8 B8 B0 sum 3 sum0 3 sum1 3 NOP wait for...

Page 192: ... This code shows some improvement over Example 6 30 and Example 6 31 The loop in Example 6 32 requires 63 cycles 5 57 1 and the loop in Example 6 31 requires 79 cycles 5 59 15 Example 6 32 Assembly Code for Fixed Point Dot Product Software Pipelined With Smallest Code Size B S2 LOOP branch to loop MVK S1 51 A1 set up loop counter B S2 LOOP branch to loop B S2 LOOP branch to loop ZERO L1 A7 zero ou...

Page 193: ...loop ZERO L1 A6 zero out mpysp input ZERO L2 B6 zero out mpysp input LOOP LDDW D1 A4 A7 A6 load ai ai 1 from memory LDDW D2 B4 B7 B6 load bi bi 1 from memory MPYSP M1X A6 B6 A5 pi a0 b0 MPYSP M2X A7 B7 B5 pi1 a1 b1 ADDSP L1 A5 A8 A8 sum0 ai bi ADDSP L2 B5 B8 B8 sum1 ai 1 bi 1 A1 B S2 LOOP branch to loop A1 SUB S1 A1 1 A1 decrement loop counter Branch occurs here ADDSP L1X A8 B8 A0 sum 0 sum0 0 sum...

Page 194: ... with no extrane ous loads 7 43 7 1 58 Example 6 30 Fixed point software pipelined dot product with no prolog or epilog 7 57 1 65 Example 6 32 Fixed point software pipelined dot product with smallest code size 5 57 1 63 Table 6 11 Comparison of Floating Point Dot Product Code Examples Code Example 100 Iterations Cycle Count Example 6 11 Floating point dot product nonparallel assembly 2 100 21 2102...

Page 195: ...hted Vector Sum C Code void w_vec short a short b short c short m int i for i 0 i 100 i c i m a i 15 b i 6 6 2 Translating C Code to Linear Assembly Example 6 35 shows the linear assembly that executes the weighted vector sum in Example 6 34 This linear assembly does not have functional units as signed The dependency graph will help in those decisions However before looking at the dependency graph...

Page 196: ... other resource is used more than twice the minimum iteration interval for this loop is 2 Memory operations determine the minimum iteration interval in this example Therefore before scheduling this assembly code unroll the loop and perform LDWs to help improve the performance 6 6 3 1 Unrolling the Weighted Vector Sum C Code Example 6 36 shows the C code for an unrolled version of the weighted vect...

Page 197: ... 15 AND bi_i 1 mask bi bi SHR bi_i 1 16 bi 1 bi 1 ADD pi_scaled bi ci ci m ai 15 bi ADD pi 1_scaled bi 1 ci 1 ci 1 m ai 1 15 bi 1 STH ci ciptr 2 store ci STH ci 1 ci 1ptr 2 store ci 1 cntr SUB cntr 1 cntr decrement loop counter cntr B LOOP branch to loop 6 6 3 3 Determining a New Minimum Iteration Interval Use the following considerations to determine the minimum iteration interval for the assembl...

Page 198: ...hree operations in one unit on a side would result in an minimum iteration interval of 3 Figure 6 11 shows the dependency graph divided evenly with a minimum it eration interval of 2 Figure 6 11 Dependency Graph of Weighted Vector Sum B side A side 2 STH 1 1 1 1 1 1 1 ADD 1 5 5 SHR AND LDW B SUB STH ADD SHR 2 SHR MPYHL 5 5 MPY LOOP cntr mem mem ci bi_bi 1 bi 1 bi ci 1 pi 1_scaled pi_scaled LDW pi ...

Page 199: ...i STH D2 B9 B0 2 store ci 1 A1 SUB L1 A1 1 A1 decrement loop counter A1 B S1 LOOP branch to loop 6 6 6 Modulo Iteration Interval Scheduling Table 6 12 provides a method to keep track of resources that are a modulo iteration interval away from each other In the single cycle dot product exam ple every instruction executed every cycle and therefore required only one set of resources Table 6 12 includ...

Page 200: ...s on the even cycles The MPY and MPYH are scheduled on cycle 5 because the LDW has four delay slots The MPY instructions appear in two rows because they use the M and cross path resources on cycles 5 7 9 etc The two SHR instructions are scheduled two cycles after the MPY to allow for the MPY s single delay slot The AND is scheduled on cycle 5 four delay slots after the LDW ...

Page 201: ... 1 LDW bi_i 1 LDW bi_i 1 LDW bi_i 1 LDW bi_i 1 M1 M2 L1 L2 S1 S2 1X 2X Unit Cycle 1 3 5 7 9 11 D1 D2 M1 MPY pi MPY pi MPY pi MPY pi M2 MPYHL pi 1 MPYHL pi 1 MPYHL pi 1 MPYHL pi 1 L1 AND bi AND bi AND bi AND bi L2 S1 SHR pi_s SHR pi_s SHR pi_s S2 SHR pi 1_s SHR pi 1_s SHR pi 1_s 1X MPY pi MPY pi MPY pi MPY pi 2X MPYHL pi 1 MPYHL pi 1 MPYHL pi 1 MPYHL pi 1 Note The asterisks indicate the iteration o...

Page 202: ... shows the addition of the SHR bi 1 instruction This must avoid a conflict of resources in cycles 5 and 7 which are one iteration interval away from each other Even though LDW bi_i 1 D2 cycle 0 finishes on cycle 5 its child SHR bi 1 cannot be scheduled on S2 until cycle 6 because of a resource conflict with SHR pi 1_scaled which is on S2 in cycle 7 Figure 6 12 Dependency Graph of Weighted Vector S...

Page 203: ...2 LDW bi_i 1 LDW bi_i 1 LDW bi_i 1 LDW bi_i 1 LDW bi_i 1 LDW bi_i 1 M1 M2 L1 L2 S1 S2 SHR bi 1 SHR bi 1 SHR bi 1 1X 2X Unit Cycle 1 3 5 7 9 11 13 15 D1 D2 M1 MPY pi MPY pi MPY pi MPY pi M2 MPYHL pi 1 MPYHL pi 1 MPYHL pi 1 MPYHL pi 1 L1 AND bi AND bi AND bi AND bi L2 S1 SHR pi_s SHR pi_s SHR pi_s S2 SHR pi 1_s SHR pi 1_s SHR pi 1_s 1X MPY pi MPY pi MPY pi MPY pi 2X MPYHL pi 1 MPYHL pi 1 MPYHL pi 1 ...

Page 204: ...s incorrect The ADD ci demonstrates a live too long problem No value can be live in a register for more than the number of cycles in the loop Otherwise iteration n 1 writes into the register before iteration n has read that register Therefore in a 2 cycle loop a value is written to a register at the end of cycle n then all children of that value must read the register before the end of cycle n 2 6...

Page 205: ...esource Conflict Resolved B side A side 2 STH 1 1 1 1 1 1 1 ADD 1 5 5 SHR AND LDW B SUB STH ADD SHR 2 SHR MPYHL 5 5 MPY LOOP cntr mem mem ci bi_i 1 bi 1 bi ci 1 pi 1_scaled pi_scaled LDW pi 1 pi ai_i 1 0 0 5 5 7 7 6 8 6 Note Shaded numbers indicate the cycle in which the instruction is first scheduled ...

Page 206: ...i_i 1 LDW bi_i 1 LDW bi_i 1 LDW bi_i 1 M1 M2 L1 ADD ci ADD ci L2 AND bi AND bi AND bi S1 S2 SHR bi 1 SHR bi 1 SHR bi 1 1X 2X Unit Cycle 1 3 5 7 9 11 D1 D2 M1 MPY pi MPY pi MPY pi MPY pi M2 MPYHL pi 1 MPYHL pi 1 MPYHL pi 1 MPYHL pi 1 L1 L2 S1 SHR pi_s SHR pi_s SHR pi_s S2 SHR pi 1_s SHR pi 1_s SHR pi 1_s 1X MPY pi MPY pi MPY pi MPY pi 2X MPYHL pi 1 MPYHL pi 1 MPYHL pi 1 MPYHL pi 1 Note The asterisk...

Page 207: ...orrectly is shown in Table 6 15 Figure 6 14 Dependency Graph of Weighted Vector Sum Scheduling ci 1 B side A side 2 STH 1 1 1 1 1 1 1 ADD 1 5 5 SHR AND LDW B SUB STH ADD SHR 2 SHR MPYHL 5 5 MPY LOOP cntr mem mem ci bi_i 1 bi 1 bi ci 1 pi 1_scaled pi_scaled LDW pi 1 pi ai_i 1 D1 M1 S1 L1X D1 L1 M2 S2 L2 D2 L2 D2 S2 0 5 8 7 11 6 5 9 6 9 8 7 2 10 Note Shaded numbers indicate the cycle in which the in...

Page 208: ...source conflicts and live too long problems Table 6 15 also includes the following additional changes LDW bi_i 1 D2 moved from cycle 0 to cycle 2 AND bi L2 moved from cycle 6 to cycle 7 SHR pi 1_scaled S2 moved from cycle 7 to cycle 9 MPYHL pi 1 moved from cycle 5 to cycle 6 SHR bi 1 moved from cycle 6 to 8 From the table you can see that this loop is pipelined six iterations deep be cause iterati...

Page 209: ...YHL pi 1 MPYHL pi 1 L1 ADD ci ADD ci L2 ADD ci 1 S1 B LOOP B LOOP B LOOP S2 SHR bi 1 SHR bi 1 1X ADD ci ADD ci 2X MPYHL pi 1 MPYHL pi 1 MPYHL pi 1 Unit Cycle 1 3 5 7 9 11 13 15 D1 STH ci STH ci D2 STH ci 1 M1 MPY pi MPY pi MPY pi MPY pi M2 L1 SUB cntr SUB cntr SUB cntr SUB cntr L2 AND bi AND bi AND bi S1 SHR pi_s SHR pi_s SHR pi_s S2 SHR pi 1_s SHR pi 1_s 1X MPY pi MPY pi MPY pi MPY pi 2X Note The...

Page 210: ...a b c m reg ai_i1 bi_i1 pi pi1 pi_i1 pi_s pi1_s reg mask bi bi1 ci ci1 c1 cntr MVK 1 mask set to all 1s to create 0xFFFFFFFF MVKH 0 mask clear upper 16 bits to create 0xFFFF MVK 50 cntr cntr 100 2 ADD 2 c c1 point to c 1 LOOP trip 50 LDW D2 a ai_i1 ai ai 1 LDW D1 b bi_i1 bi bi 1 MPY M1 ai_i1 m pi m ai MPYHL M2 ai_i1 m pi1 m ai 1 SHR S1 pi 15 pi_s m ai 15 SHR S2 pi1 15 pi1_s m ai 1 15 AND L2X bi_i1...

Page 211: ...eration n 1 of STH ci is executing To prevent the STH ci instruction from executing itera tion 51 while STH ci 1 executes iteration 50 execute the loop only 49 times and schedule the final executions of ADD ci 1 and STH ci 1 after exiting the loop The mask for the AND instruction is created with MVK and MVKH in paral lel with the loop prolog The pointer to the odd elements in array c is also set u...

Page 212: ...m ai 1 A1 B S1 LOOP branch to loop LDW D2 B4 B2 bi bi 1 LDW D1 A4 A2 ai ai 1 SHR S1 A5 15 A7 m ai 15 AND L2 B2 B10 B8 bi MPY M1X A2 B6 A5 m ai A1 SUB L1 A1 1 A1 decrement loop counter SHR S2 B2 16 B1 bi 1 ADD L1X A7 B8 A9 ci m ai 15 bi MPYHL M2X A2 B6 B5 m ai 1 A1 B S1 LOOP branch to loop LDW D2 B4 B2 bi bi 1 LDW D1 A4 A2 ai ai 1 SHR S2 B5 15 B7 m ai 1 15 STH D1 A9 A6 2 store ci SHR S1 A5 15 A7 m ...

Page 213: ... Vector Sum Continued STH D2 B9 B0 2 store ci 1 SHR S2 B5 15 B7 m ai 1 15 STH D1 A9 A6 2 store ci SHR S1 A5 15 A7 m ai 15 AND L2 B2 B10 B8 bi A1 SUB L1 A1 1 A1 decrement loop counter MPY M1X A2 B6 A5 m ai Branch occurs here ADD L2 B7 B1 B9 ci 1 m ai 1 15 bi 1 STH D2 B9 B0 store ci 1 ...

Page 214: ...tead of resources determine the minimum iteration interval IIR filter code contains a loop carry path output samples are used as input to the computation of the next output sample 6 7 1 IIR Filter C Code Example 6 41 shows C code for a simple IIR filter In this example y i is an input to the calculation of y i 1 Before y i can be read for the next iteration y i 1 must be computed from the previous...

Page 215: ...e same address when loading both xi 1 for one iteration and xi for the next iteration yptr is also not postincremented after storing yi 1 because yi of the next iteration is yi 1 for the current iteration Example 6 42 Linear Assembly for IIR Inner Loop LDH xptr xi xi 1 MPY c1 xi p0 c1 xi LDH xptr xi 1 xi 1 MPY c2 xi 1 p1 c2 xi 1 ADD p0 p1 s0 c1 xi c2 xi 1 LDH yptr yi yi MPY c3 yi p2 c3 yi ADD s0 p...

Page 216: ... load and store instructions use the same memory pipeline Therefore if a store is issued to a particular address on cycle n and a load from that same address is issued on the next cycle the load reads the value that was written by the store instruction Figure 6 15 Dependency Graph of IIR Filter LDH ADD SUB cntr LOOP 1 B 1 SHR yi 1 mem STH s1 s0 MPY xi 5 p0 2 LDH MPY xi 1 5 p1 2 ADD LDH MPY yi p2 1...

Page 217: ...uses only three non M units so this does not affect the minimum iteration interval and no other unit is used more than twice Table 6 16 Resource Table for IIR Filter a A side b B side Unit s Instructions Total Unit Unit s Instructions Total Unit M1 2 MPYs 2 M2 MPY 1 S1 B 1 S2 SHR 1 D1 2 LDHs 2 D2 STH 1 L1 S1 or D1 ADD SUB 2 L2 or S2 D2 ADD 1 Total non M units 5 Total non M units 3 However the IIR ...

Page 218: ...e the MPY p2 instruction can read yi 1 while it is still in a register you can reduce the loop carry path by six cycles LDH yi is no longer in the graph Instead you can issue LDH y 0 once outside the loop In every iteration after that the y 1 values written by the SHR instruction are valid y inputs to the MPY instruction Figure 6 16 Dependency Graph of IIR Filter With Smaller Loop Carry LDH ADD SU...

Page 219: ...yi ADD s0 p2 s1 c1 xi c2 xi 1 c3 yi SHR s1 15 y yi 1 STH y yptr store yi 1 cntr SUB cntr 1 cntr decrement loop counter cntr B LOOP branch to loop 6 7 5 Linear Assembly Resource Allocation Example 6 44 shows the same linear assembly instructions as those in Example 6 43 with the functional units and registers assigned Example 6 44 Linear Assembly for IIR Inner Loop With Allocated Resources LDH D1 A...

Page 220: ...Á ÁÁÁÁ ÁÁÁÁÁ ÁÁÁÁÁ ÁÁÁÁÁ ÁÁÁÁÁ ÁÁÁÁ ÁÁÁÁ 1X ÁÁÁÁÁ ÁÁÁÁÁ ÁÁÁÁ ÁÁÁÁ ÁÁÁÁÁ ÁÁÁÁÁ ÁÁÁÁÁ ÁÁÁÁÁ 1X ÁÁÁÁ ÁÁÁÁ ÁÁÁÁÁ ÁÁÁÁÁ ÁÁÁÁÁ ÁÁÁÁÁ ÁÁÁÁ ÁÁÁÁ 2X ÁÁÁÁÁ ÁÁÁÁÁ ÁÁÁÁ ÁÁÁÁ ÁÁÁÁÁ ÁÁÁÁÁ ÁÁÁÁÁ ÁÁÁÁÁ 2X ÁÁÁÁ ÁÁÁÁ ÁÁÁÁÁ ÁÁÁÁÁ ÁÁÁÁÁ ÁÁÁÁÁ ADD s1 ÁÁÁÁ ÁÁÁÁ Unit Cycle ÁÁÁÁÁ ÁÁÁÁÁ 2 ÁÁÁÁ ÁÁÁÁ 6 ÁÁÁÁÁ ÁÁÁÁÁ 10 14 18 ÁÁÁÁÁ ÁÁÁÁÁ Unit Cycle ÁÁÁÁ ÁÁÁÁ 3 ÁÁÁÁÁ ÁÁÁÁÁ 7 ÁÁÁÁÁ ÁÁÁÁÁ 11 15 19 ÁÁÁÁ ÁÁÁÁ D1 ÁÁÁÁÁ ÁÁÁÁÁ ÁÁÁÁ ÁÁ...

Page 221: ... hand Example 6 45 Linear Assembly for IIR Filter global _iir _iir cproc x y c1 c2 c3 reg xi xi1 yi1 reg p0 p1 p2 s0 s1 cntr MVK 100 cntr cntr 100 LDH D2 y yi1 yi 1 LOOP trip 100 LDH D1 x xi xi MPY M1 c1 xi p0 c1 xi LDH D1 x xi1 xi 1 MPY M1X c2 xi1 p1 c2 xi 1 ADD L1 p0 p1 s0 c1 xi c2 xi 1 MPY M2X c3 yi1 p2 c3 yi ADD L2X s0 p2 s1 c1 xi c2 xi 1 c3 yi SHR S2 s1 15 yi1 yi 1 STH D2 yi1 y store yi 1 cnt...

Page 222: ...D2 B4 B2 load y 0 outside of loop MVK S1 100 A1 set up loop counter LDH D1 A4 A2 xi A1 SUB L1 A1 1 A1 decrement loop counter MPY M1 A6 A2 A5 c1 xi LDH D1 A4 A3 xi 1 MPY M1X B6 A3 A7 c2 xi 1 A1 B S1 LOOP branch to loop MPY M2X A8 B2 B3 c3 yi LOOP ADD L1 A5 A7 A9 c1 xi c2 xi 1 LDH D1 A4 A2 xi ADD L2X B3 A9 B5 c1 xi c2 xi 1 c3 yi A1 SUB L1 A1 1 A1 decrement loop counter MPY M1 A6 A2 A5 c1 xi LDH D1 A...

Page 223: ...e int if_then short a int codeword int mask short theta int i sum cond sum 0 for i 0 i 32 i cond codeword mask if theta cond sum a i else sum a i mask mask 1 return sum Branching is one way to execute the if then else statement branch to the ADD when the if statement is true and branch to the SUB when the if statement is false However because each branch has five delay slots this method requires a...

Page 224: ...cond cond CMPEQ theta cond if theta cond LDH aptr ai a i if ADD sum ai sum sum a i if SUB sum ai sum sum a i SHL mask 1 mask mask mask 1 cntr ADD 1 cntr cntr decrement counter cntr B LOOP for LOOP CMPEQ is used to create IF The ADD is conditional when IF is nonzero corre sponds to then the SUB is conditional when IF is 0 corresponds to else A conditional MVK performs the cond C statement If the re...

Page 225: ... a SUB each of these nodes is a possible input to the next itera tion of either node The LDH ai instruction is a parent of both ADD sum and SUB sum be cause both instructions read ai CMPEQ if is also a parent to ADD sum and SUB sum because both read IF for the conditional execution The result of SHL mask is read on the next iteration by the AND cond instruction Figure 6 17 Dependency Graph of If T...

Page 226: ...e on the S L or D units The AND can be on a S or L unit or D unit C64x only From Table 6 18 you can see that no one resource is used more than two times so the minimum iteration interval is still 2 Table 6 18 Resource Table for If Then Else Code a A side b B side Unit s Instructions Total Unit Unit s Instructions Total Unit M1 0 M2 0 S1 SHL B 2 S2 MVK 1 D1 LDH 1 L2 CMPEQ 1 L1 S1 or D1 ADD SUB 2 L2...

Page 227: ...nctional units and regis ters that are used in the inner loop Example 6 49 Linear Assembly for Full If Then Else Code global _if_then _if_then cproc a cword mask theta reg cond if ai sum cntr MVK 32 cntr cntr 32 ZERO sum sum 0 LOOP trip 32 AND S2X cword mask cond cond codeword mask cond MVK S2 1 cond cond CMPEQ L2 theta cond if theta cond LDH D1 a ai a i if ADD L1 sum ai sum sum a i if SUB D1 sum ...

Page 228: ...1 LOOP for LOOP LDH D1 A4 A5 a i SHL S1 A6 1 A6 mask mask 1 AND S2X B4 A6 B2 cond codeword mask B2 MVK S2 1 B2 cond B0 ADD L2 1 B0 B0 decrement counter B0 B S1 LOOP for LOOP LDH D1 A4 A5 a i CMPEQ L2 B6 B2 B1 theta cond SHL S1 A6 1 A6 mask mask 1 AND S2X B4 A6 B2 cond codeword mask ZERO L1 A7 zero out accumulator LOOP B0 ADD L2 1 B0 B0 decrement counter B2 MVK S2 1 B2 cond B0 B S1 LOOP for LOOP LD...

Page 229: ... the actual number of times you want the loop to execute in this case 29 32 3 Example 6 51 Assembly Code for If Then Else With Loop Count Greater Than 3 B S1 LOOP for LOOP LDH D1 A4 A5 a i MVK S2 29 B0 set up loop counter SHL S1 A6 1 A6 mask mask 1 AND S2X B4 A6 B2 cond codeword mask B2 MVK S2 1 B2 cond B S1 LOOP for LOOP LDH D1 A4 A5 a i CMPEQ L2 B6 B2 B1 theta cond SHL S1 A6 1 A6 mask mask 1 AND...

Page 230: ...g Assembly Code via Linear Assembly Table 6 19 Comparison of If Then Else Code Examples Code Example Cycles Cycle Count Example 6 50 If then else assembly code 2 32 6 70 Example 6 51 If then else assembly code with loop count greater than 3 2 32 4 68 ...

Page 231: ...nimum iteration interval of 3 provides a 25 improvement in throughput three cycles to do two iterations rather than the four cycles required in Example 6 51 6 9 1 Unrolled If Then Else C Code Example 6 52 shows the unrolled version of the if then else C code in Example 6 47 on page 6 86 Example 6 52 If Then Else C Code Unrolled int unrolled_if_then short a int codeword int mask short theta int i s...

Page 232: ...odeword maski condi condi codeword maski condi MVK 1 condi condi CMPEQ theta condi ifi theta condi LDH aptr ai a i ifi ADD sumi ai sumi sum a i ifi SUB sumi ai sumi sum a i SHL maski 1 maski 1 maski 1 maski 1 AND codeword maski 1 condi 1 condi 1 codeword maski 1 condi 1 MVK 1 condi 1 condi 1 CMPEQ theta condi 1 ifi 1 theta condi 1 LDH aptr ai 1 a i ifi 1 ADD sumi 1 ai 1 sumi 1 sum a i 1 ifi 1 SUB ...

Page 233: ...ee non M instructions can execute per cycle Figure 6 18 shows the dependency graph for the unrolled if then else code Nine instructions are on the A side and seven instructions are on the B side Figure 6 18 Dependency Graph of If Then Else Code Unrolled ADD cntr Loop 1 B 1 CMPEQ ifi 1 sumi 1 1 SUB condi 1 1 MVK A side B side ADD 1 1 1 LDH ai 1 5 5 maski 1 1 1 sumi 1 1 1 SHL condi 1 1 AND LDH ai 5 ...

Page 234: ...l of nine instructions can be performed with the minimum iteration interval of 3 because only seven non M instructions are on the B side the minimum iteration interval is still 3 Table 6 20 Resource Table for Unrolled If Then Else Code a A side b B side Unit s Instructions Total Unit Unit s Instructions Total Unit M1 0 M2 0 S1 MVK and 2 SHLs 3 S2 MVK and B 2 D1 2 LDHs 2 L2 CMPEQ 1 L1 CMPEQ 1 L2 pr...

Page 235: ...ERO sumi sumi 0 ZERO sumi1 sumi 1 0 LOOP trip 32 AND L1X cword mask cdi cdi codeword maski cdi MVK S1 1 cdi cdi CMPEQ L1X theta cdi ifi theta cdi LDH D1 a ai a i ifi ADD L1 sumi ai sumi sum a i ifi SUB D1 sumi ai sumi sum a i SHL S1 mask 1 mask maski 1 maski 1 AND L2X cword mask cdi1 cdi 1 codeword maski 1 cdi1 MVK S2 1 cdi1 cdi 1 CMPEQ L2 theta cdi1 ifi1 theta cdi 1 LDH D1 a ai1 a i 1 ifi1 ADD L2...

Page 236: ...2X B4 A6 B2 condi 1 codeword maski 1 ZERO L1 A7 zero accumulator B2 MVK S2 1 B2 condi 1 CMPEQ L1X B6 A2 A1 theta condi SHL S1 A6 1 A6 maski maski 1 1 LDH D1 A4 A5 a i ZERO L2 B7 zero accumulator LOOP CMPEQ L2 B6 B2 B1 theta condi 1 B0 ADD D2 1 B0 B0 decrement counter LDH D1 A4 B5 a i 1 B0 B S2 LOOP for LOOP SHL S1 A6 1 A6 maski 1 maski 1 AND L1X B4 A6 A2 condi codeword maski A1 ADD L1 A7 A5 A7 sum...

Page 237: ... the if then else code examples Table 6 21 Comparison of If Then Else Code Examples Code Example Cycles Cycle Count Example 6 50 If then else assembly code 2 32 6 70 Example 6 51 If then else assembly code with loop count greater than 3 2 32 4 68 Example 6 55 Unrolled if then else assembly code 3 16 5 53 ...

Page 238: ...oblem simply by moving the parent to a later cycle This is not always a valid solution 6 10 1 C Code With Live Too Long Problem Example 6 56 shows C code with a live too long problem that cannot be solved by rescheduling the parent instruction Although it is not obvious from the C code the dependency graph in Figure 6 19 on page 6 103 shows a split join path that causes this live too long problem ...

Page 239: ...0 15 a1 a1 a0 15 MPY a1 d a2 a2 a1 d ADD a2 a0 a3 a3 a2 a0 ADD sum0 a3 sum0 sum0 a3 MPY bi c b0 b0 bi c SHR b0 15 b1 b1 b0 15 MPY b1 e b2 b2 b1 e ADD b2 b0 b3 b3 b2 b0 ADD sum1 b3 sum1 sum1 b3 cntr SUB cntr 1 cntr decrement loop counter cntr B LOOP branch to loop 6 10 3 Drawing a Dependency Graph Figure 6 19 shows the dependency graph for the live too long code This algorithm includes three separa...

Page 240: ... via Linear Assembly Figure 6 19 Dependency Graph of Live Too Long Code SUB cntr LOOP 1 B 1 A side B side SHR a1 a0 2 MPY ADD sum0 1 ai 5 LDH 2 1 MPY a2 1 ADD a3 2 SHR b1 b0 2 MPY ADD sum1 1 bi 5 LDH 2 1 MPY b2 1 ADD b3 2 Split join path Split join path ...

Page 241: ...L1 S1 or D1 2 ADDs 2 L2 S2 or D2 2 ADDs and SUB 3 Total non M units 5 Total non M units 5 However the minimum iteration interval is determined by both resources and data dependency A loop carry path determined the minimum iteration interval of the IIR filter in section 6 7 Loop Carry Paths on page 6 77 In this example a live too long problem determines the minimum iteration interval 6 10 4 1 Split...

Page 242: ...sources and the data dependencies of the split join path Although unrolling the loop allows you to achieve the highest possible loop throughput unrolling the loop does increase the code size 6 10 4 3 Inserting Moves Another solution to the live too long problem is to break up the lifetime of a0 and b0 by inserting move MV instructions The MV instruction breaks up the left path of the split join pa...

Page 243: ...d The choice of units for the ADDs and SUB is flexible and represents one of a number of possibilities One goal is to ensure that no functional unit is used more than the minimum iteration interval or two times The two 2X paths and one 1X path are required because the values c d and e reside on the side opposite from the instruction that is reading them If these values had created a bottleneck of ...

Page 244: ...rom memory LDH D2 b bi load bi from memory MPY M1 ai c a_0 a0 ai c SHR S1 a_0 15 a_1 a1 a0 15 MPY M1X a_1 d a_2 a2 a1 d MV D1 a_0 a0p save a0 across iterations ADD L1 a_2 a0p a_3 a3 a2 a0 ADD L1 sum0 a_3 sum0 sum0 a3 MPY M2X bi c b_0 b0 bi ci SHR S2 b_0 15 b_1 b1 b0 15 MPY M2X b_1 e b_2 b2 b1 e MV D2 b_0 b0p save b0 across iterations ADD L2 b_2 b0p b_3 b3 b2 b0 ADD L2 sum1 b_3 sum1 sum1 b3 cntr SU...

Page 245: ...ry B2 SUB S2 B2 1 B2 decrement loop counter MPY M1 A0 A6 A3 a0 ai c MPY M2X B0 A6 B10 b0 bi c LDH D1 A4 A0 load ai from memory LDH D2 B4 B0 load bi from memory B2 SUB S2 B2 1 B2 decrement loop counter B2 B S1 LOOP branch to loop SHR S1 A3 15 A5 a1 a0 15 SHR S2 B10 15 B5 b1 b0 15 MPY M1 A0 A6 A3 a0 ai c MPY M2X B0 A6 B10 b0 bi c LDH D1 A4 A0 load ai from memory LDH D2 B4 B0 load bi from memory MPY ...

Page 246: ... D1 A3 A2 save a0 across iterations MPY M2X B5 A8 B7 b2 b1 e MV D2 B10 B8 save b0 across iterations B2 SUB S2 B2 1 B2 decrement loop counter B2 B S1 LOOP branch to loop ADD L1 A1 A9 A1 sum0 a3 ADD L2 B1 B9 B1 sum1 b3 SHR S1 A3 15 A5 a1 a0 15 SHR S2 B10 15 B5 b1 b0 15 MPY M1 A0 A6 A3 a0 ai c MPY M2X B0 A6 B10 b0 bi c LDH D1 A4 A0 load ai from memory LDH D2 B4 B0 load bi from memory Branch occurs he...

Page 247: ...tuation is to perform LDWs instead of LDHs to read two data values at a time Although using LDW works for the h array the x array presents a different problem because the C6x does not allow you to load values across a word boundary For example on the first outer loop j 0 you can read the x array elements 0 and 1 2 and 3 etc as long as elements 0 and 1 are aligned on a 4 byte word boundary However ...

Page 248: ...ions for every four multiply accumulate operations Now the memory accesses no longer limit the performance Example 6 61 FIR Filter C Code With Redundant Load Elimination void fir short x short h short y int i j sum0 sum1 short x0 x1 h0 h1 for j 0 j 100 j 2 sum0 0 sum1 0 x0 x j for i 0 i 32 i 2 x1 x j i 1 h0 h i sum0 x0 h0 sum1 x1 h0 x0 x j i 2 h1 h i 1 sum0 x1 h1 sum1 x0 h1 y j sum0 15 y j 1 sum1 ...

Page 249: ...ut successive even elements are loaded inside the loop Example 6 62 Linear Assembly for FIR Inner Loop LDH D2 x_1 2 x1 x1 x j i 1 LDH D1 h 2 h0 h0 h i MPY M1 x0 h0 p00 x0 h0 MPY M1X x1 h0 p10 x1 h0 ADD L1 p00 sum0 sum0 sum0 x0 h0 ADD L2X p10 sum1 sum1 sum1 x1 h0 LDH D1 x 2 x0 x0 x j i 2 LDH D2 h_1 2 h1 h1 h i 1 MPY M2 x1 h1 p01 x1 h1 MPY M2X x0 h1 p11 x0 h1 ADD L1X p01 sum0 sum0 sum0 x1 h1 ADD L2 ...

Page 250: ...ncy graph of the FIR filter with redundant load elimination Figure 6 21 Dependency Graph of FIR Filter With Redundant Load Elimination L2 L2X A side B side ADD ADD 1 1 B 1 1 SUB 1 1 2 ADD ADD 2 5 5 5 LDH LDH 5 MPY 5 LDH 5 2 2 MPY MPY 5 5 LDH cntr loop sum0 sum0 h0 p10 p00 x0 x1 h1 p01 sum1 sum1 p11 D1 D1 D2 D2 M2 M2X M1 M1X L1 L1X S2 S2 MPY ...

Page 251: ...n M units 6 1X paths 2 2X paths 2 6 11 5 Linear Assembly Resource Allocation Example 6 63 shows the linear assembly with functional units and registers assigned Example 6 63 Linear Assembly for Full FIR Code global _fir _fir cproc x h y reg x_1 h_1 sum0 sum1 ctr octr reg p00 p01 p10 p11 x0 x1 h0 h1 rstx rsth ADD h 2 h_1 set up pointer to h 1 MVK 50 octr outer loop ctr 100 2 MVK 64 rstx used to rst...

Page 252: ...um1 15 SUB x rstx x reset x pointer to x j SUB h_1 rsth h_1 reset h pointer to h 0 octr B OUTLOOP branch to outer loop endproc 6 11 6 Final Assembly Example 6 64 shows the final assembly for the FIR filter without redundant load instructions At the end of the inner loop is a branch to OUTLOOP that executes the next outer loop The outer loop counter is 50 because iterations j and j 1 execute each t...

Page 253: ...A9 zero out sum0 ZERO L2 B9 zero out sum1 LDH D2 B4 2 B0 h1 h i 1 LDH D1 A4 2 A0 x0 x j i 2 LDH D1 A5 2 A1 h0 h i LDH D2 B5 2 B1 x1 x j i 1 B2 SUB S2 B2 1 B2 decrement inner loop counter LDH D2 B4 2 B0 h1 h i 1 LDH D1 A4 2 A0 x0 x j i 2 B2 B S2 LOOP branch to inner loop LDH D1 A5 2 A1 h0 h i LDH D2 B5 2 B1 x1 x j i 1 MPY M1 A0 A1 A7 x0 h0 B2 SUB S2 B2 1 B2 decrement inner loop counter LDH D2 B4 2 ...

Page 254: ...LDH D2 B5 2 B1 x1 x j i 1 ADD L1X B7 A9 A9 sum0 x1 h1 ADD L2 B8 B9 B9 sum1 x0 h1 MPY M2X A0 B0 B8 x0 h1 MPY M1 A0 A1 A7 x0 h0 B2 SUB S2 B2 1 B2 decrement inner loop cntr LDH D2 B4 2 B0 h1 h i 1 LDH D1 A4 2 A0 x0 x j i 2 inner loop branch occurs here A2 B S1 OUTLOOP branch to outer loop SUB L1 A4 A3 A4 reset x pointer to x j SUB L2 B4 B6 B4 reset h pointer to h 0 SHR S1 A9 15 A9 sum0 15 SHR S2 B9 1...

Page 255: ... in bank 0 An LDW from address 0 loads bytes 0 through 3 in banks 0 and 1 Because each bank is single ported memory only one access to each bank is allowed per cycle Two accesses to a single bank in a given cycle result in a memory stall that halts all pipeline operation for one cycle while the second value is read from memory Two memory operations per cycle are allowed without any stall as long a...

Page 256: ...N 5 8N 4 13 12 5 4 2 3 10 11 8N 2 8N 3 Bank 1 Bank 0 8N 1 8N 9 8 1 0 8M 6 8M 5 8M 4 8M 2 8M 3 8M 1 8M Memory block 0 Memory block 1 Bank 3 Bank 2 Bank 1 Bank 0 If each array in a loop resides in a separate memory block the 2 cycle loop in Example 6 61 on page 6 111 is sufficient This section describes a solution when two arrays must reside in the same memory block ...

Page 257: ...anch to inner loop LDH D1 A5 2 A1 h0 h i LDH D2 B5 2 B1 x1 x j i 1 ADD L1X B7 A9 A9 sum0 x1 h1 ADD L2 B8 B9 B9 sum1 x0 h1 MPY M2X A0 B0 B8 x0 h1 MPY M1 A0 A1 A7 x0 h0 B2 SUB S2 B2 1 B2 decrement inner loop cntr LDH D2 B4 2 B0 h1 h i 1 LDH D1 A4 2 A0 x0 x j i 2 It is not always possible to fully control how arrays are aligned especially if one of the arrays is passed into a function as a pointer an...

Page 258: ...f LDH x0 All MPYs must be five or six cycles after their LDH parents No MPYs on the same side A or B can be on the same loop cycle Figure 6 24 Dependency Graph of FIR Filter With Even and Odd Elements of Each Array on Same Loop Cycle 2 7 6 1 0 6 1 A side B side 2 MPY 5 5 5 LDH LDH 5 MPY 5 LDH 5 MPY MPY 5 5 LDH h0 p10 p00 x0 x1 h1 p01 p11 Note Numbers in bold represent the cycle the instruction is ...

Page 259: ...C code after unrolling the inner loop one more time This solution adds to the flexibility of scheduling and allows you to write FIR filter code that never has memory hits regardless of array alignment and memory block Example 6 66 FIR Filter C Code Unrolled void fir short x short h short y int i j sum0 sum1 short x0 x1 x2 x3 h0 h1 h2 h3 for j 0 j 100 j 2 sum0 0 sum1 0 x0 x j for i 0 i 32 i 4 x1 x ...

Page 260: ...ADD p00 sum0 sum0 sum0 x0 h0 ADD p10 sum1 sum1 sum1 x1 h0 LDH x x2 x2 x j i 2 LDH h h1 h1 h i 1 MPY x1 h1 p01 x1 h1 MPY x2 h1 p11 x2 h1 ADD p01 sum0 sum0 sum0 x1 h1 ADD p11 sum1 sum1 sum1 x2 h1 LDH x x3 x3 x j i 3 LDH h h2 h2 h i 2 MPY x2 h2 p02 x2 h2 MPY x3 h2 p12 x3 h2 ADD p02 sum0 sum0 sum0 x2 h2 ADD p12 sum1 sum1 sum1 x3 h2 LDH x x0 x0 x j i 4 LDH h h3 h3 h i 3 MPY x3 h3 p03 x3 h3 MPY x0 h3 p1...

Page 261: ...5 Dependency Graph of FIR Filter With No Memory Hits 1 1 B SUB LOOP cntr A side B side 2 2 1 2 1 1 1 2 1 1 1 1 2 2 2 5 5 5 5 5 5 5 5 MPY MPY LDH ADD ADD ADD ADD ADD ADD ADD MPY MPY LDH LDH MPY MPY LDH LDH LDH MPY ADD MPY LDH sum1 sum1 sum1 sum1 h3 p13 p03 h1 x2 p11 p01 sum0 sum0 sum0 sum0 x3 h2 p02 p12 p00 p10 x1 h0 2 5 LDH 5 5 5 5 5 5 5 x0 ...

Page 262: ... a base location off set and is incremented a number of times each time through the loop Without the mptr directives the loads of x1 and h0 are scheduled in parallel and the loads of x2 and h1 are scheduled in parallel This results in a 50 chance of a memory conflict on every cycle Example 6 68 Linear Assembly for Full Unrolled FIR Filter global _fir _fir cproc x h y reg x_1 h_1 sum0 sum1 ctr octr...

Page 263: ... h1 LDH D1 x_1 2 x3 x3 x j i 3 LDH D1 h 2 h2 h2 h i 2 MPY M1X x2 h2 p02 x2 h2 MPY M1 x3 h2 p12 x3 h2 ADD L1 p02 sum0 sum0 sum0 x2 h2 ADD L2X p12 sum1 sum1 sum1 x3 h2 LDH D2 x 2 x0 x0 x j i 4 LDH D2 h_1 2 h3 h3 h i 3 MPY M2X x3 h3 p03 x3 h3 MPY M2 x0 h3 p13 x0 h3 ADD L1X p03 sum0 sum0 sum0 x3 h3 ADD L2 p13 sum1 sum1 sum1 x0 h3 ctr SUB S2 ctr 1 ctr decrement loop counter ctr B S2 LOOP branch to loop...

Page 264: ... cycle 2 of the loop it is live for two cycles cycles 1 and 2 of the loop If another value is written at the end of cycle 2 and read on cycle 0 the next iteration of the loop it is also live for two cycles cycles 3 and 0 of the loop Because both of these values are not live on the same cycles they can occupy the same register Only after scheduling these instructions and their children do you know ...

Page 265: ...tion and no memory hits At the end of the inner loop there is a branch to OUTLOOP to execute the next outer loop The outer loop counter is set to 50 because iterations j and j 1 are executing each time the inner loop is run The inner loop counter is set to 8 because iterations i i 1 i 2 and i 3 are executing each inner loop iteration 6 12 9 Comparing Performance The cycle count for this nested loo...

Page 266: ...x j i 1 ZERO L1 A9 zero out sum0 ZERO L2 B9 zero out sum1 LDH D1 A8 2 B6 h1 h i 1 LDH D2 B4 2 A1 h0 h i LDH D1 A4 2 A5 x3 x j i 3 LDH D2 B1 2 B5 x0 x j i 4 LDH D2 B4 2 A7 h2 h i 2 LDH D1 A8 2 B8 h3 h i 3 B2 SUB S2 B2 1 B2 decrement loop counter LDH D2 B1 2 B0 x2 x j i 2 LDH D1 A4 2 A0 x1 x j i 1 LDH D1 A8 2 B6 h1 h i 1 LDH D2 B4 2 A1 h0 h i MPY M1X B5 A1 A0 x0 h0 MPY M2X A0 B6 B6 x1 h1 LDH D1 A4 2...

Page 267: ...A9 A9 sum0 x3 h3 B2 B S1 LOOP branch to loop MPY M2 B0 B6 B7 x2 h1 MPY M1 A0 A1 A1 x1 h0 B2 LDH D2 B4 2 A7 h2 h i 2 B2 LDH D1 A8 2 B8 h3 h i 3 B2 SUB S2 B2 1 B2 decrement loop counter ADD L2 B7 B9 B9 sum1 x0 h3 ADD L1 A0 A9 A9 sum0 x0 h0 MPY M2X A5 B8 B8 x3 h3 MPY M1X B0 A7 A5 x2 h2 B2 LDH D2 B1 2 B0 x2 x j i 2 B2 LDH D1 A4 2 A0 x1 x j i 1 inner loop branch occurs here A2 B S2 OUTLOOP branch to ou...

Page 268: ...d FIR Filter C Code Example 6 70 shows the FIR filter C code after unrolling the inner loop identi cal to Example 6 66 on page 6 122 Example 6 70 Unrolled FIR Filter C Code void fir short x short h short y int i j sum0 sum1 short x0 x1 x2 x3 h0 h1 h2 h3 for j 0 j 100 j 2 sum0 0 sum1 0 x0 x j for i 0 i 32 i 4 x1 x j i 1 h0 h i sum0 x0 h0 sum1 x1 h0 x2 x j i 2 h1 h i 1 sum0 x1 h1 sum1 x2 h1 x3 x j i...

Page 269: ...d as follows Put the outer loop and branch instructions in parallel with the prolog Create an epilog to the inner loop Put some outer loop instructions in parallel with the inner loop epilog 6 13 3 Final Assembly Example 6 71 shows the final assembly for the FIR filter with a software pipe lined outer loop Below the inner loop starting on page 6 134 each instruc tion is marked in the comments with...

Page 270: ...et up inner loop counter A2 SUB S1 A2 1 A2 decrement outer loop counter LDH D2 B1 2 B0 x2 x j i 2 LDH D1 A4 2 A0 x1 x j i 1 ZERO L1 A9 zero out sum0 ZERO L2 B9 zero out sum1 LDH D1 A8 2 B6 h1 h i 1 LDH D2 B4 2 A1 h0 h i LDH D1 A4 2 A5 x3 x j i 3 LDH D2 B1 2 B5 x0 x j i 4 OUTLOOP LDH D2 B4 2 A7 h2 h i 2 LDH D1 A8 2 B8 h3 h i 3 B2 SUB S2 B2 2 B2 decrement loop counter LDH D2 B1 2 B0 x2 x j i 2 LDH D...

Page 271: ...1X B5 A1 A0 x0 h0 MPY M2X A0 B6 B6 x1 h1 LDH D1 A4 2 A5 x3 x j i 3 LDH D2 B1 2 B5 x0 x j i 4 ADD L2X A7 B9 B9 sum1 x3 h2 ADD L1X B8 A9 A9 sum0 x3 h3 B2 B S1 LOOP branch to loop MPY M2 B0 B6 B7 x2 h1 MPY M1 A0 A1 A1 x1 h0 LDH D2 B4 2 A7 h2 h i 2 LDH D1 A8 2 B8 h3 h i 3 B2 SUB S2 B2 1 B2 decrement loop counter ADD L2 B7 B9 B9 sum1 x0 h3 ADD L1 A0 A9 A9 sum0 x0 h0 MPY M2X A5 B8 B8 x3 h3 MPY M1X B0 A7...

Page 272: ...p h1 h i 1 LDH D2 B4 2 A1 p h0 h i SHR S2 B9 15 B9 e sum1 15 LDH D1 A4 2 A5 p x3 x j i 3 LDH D2 B1 2 B5 p x0 x j i 4 STH D1 A9 A6 2 e y j sum0 15 STH D2 B9 B11 2 e y j 1 sum1 15 ZERO S1 A9 o zero out sum0 ZERO S2 B9 o zero out sum1 outer loop branch occurs here 6 13 4 Comparing Performance The improved cycle count for this loop is 2006 cycles 50 7 4 6 6 6 The outer loop overhead for this loop has ...

Page 273: ...es the overhead entirely 6 14 1 Unrolled FIR Filter C Code Example 6 72 shows the same unrolled FIR filter C code that used in the previous example Example 6 72 Unrolled FIR Filter C Code void fir short x short h short y int i j sum0 sum1 short x0 x1 x2 x3 h0 h1 h2 h3 for j 0 j 100 j 2 sum0 0 sum1 0 x0 x j for i 0 i 32 i 4 x1 x j i 1 h0 h i sum0 x0 h0 sum1 x1 h0 x2 x j i 2 h1 h i 1 sum0 x1 h1 sum1...

Page 274: ...i MPY x0 h0 p00 x0 h0 MPY x1 h0 p10 x1 h0 ADD p00 sum0 sum0 sum0 x0 h0 ADD p10 sum1 sum1 sum1 x1 h0 LDH x x2 x2 x j i 2 LDH h h1 h1 h i 1 MPY x1 h1 p01 x1 h1 MPY x2 h1 p11 x2 h1 ADD p01 sum0 sum0 sum0 x1 h1 ADD p11 sum1 sum1 sum1 x2 h1 LDH x x3 x3 x j i 3 LDH h h2 h2 h i 2 MPY x2 h2 p02 x2 h2 MPY x3 h2 p12 x3 h2 ADD p02 sum0 sum0 sum0 x2 h2 ADD p12 sum1 sum1 sum1 x3 h2 LDH x x0 x0 x j i 4 LDH h h3...

Page 275: ...ters then the store counter becomes 0 to shift and store the result Example 6 74 Linear Assembly for FIR Outer Loop sctr SUB sctr 1 sctr dec store lp cntr sctr SHR sum07 15 y0 sum0 15 sctr SHR sum17 15 y1 sum1 15 sctr STH y0 y 2 y j sum0 15 sctr STH y1 y_1 2 y j 1 sum1 15 sctr MVK 4 sctr reset store lp cntr pctr SUB pctr 1 pctr dec pointer reset lp cntr pctr SUB x rstx2 x reset x ptr pctr SUB x_1 ...

Page 276: ... h3 h4 h5 h6 h7 for j 0 j 100 j 2 sum0 0 sum1 0 x0 x j for i 0 i 32 i 8 x1 x j i 1 h0 h i sum0 x0 h0 sum1 x1 h0 x2 x j i 2 h1 h i 1 sum0 x1 h1 sum1 x2 h1 x3 x j i 3 h2 h i 2 sum0 x2 h2 sum1 x3 h2 x4 x j i 4 h3 h i 3 sum0 x3 h3 sum1 x4 h3 x5 x j i 5 h4 h i 4 sum0 x4 h4 sum1 x5 h4 x6 x j i 6 h5 h i 5 sum0 x5 h5 sum1 x6 h5 x7 x j i 7 h6 h i 6 sum0 x6 h6 sum1 x7 h6 x0 x j i 8 h7 h i 7 sum0 x7 h7 sum1 ...

Page 277: ...tional on the same value as the store counter because when sctr is 0 the end of one inner loop has been reached and the first ADD which adds the previous sum07 to p00 must not be executed The first ADD for sum0 writes to the same register as the first MPY p00 The second ADD reads p00 and p01 At the beginning of each inner loop the first ADD is not performed so the second ADD correctly reads the re...

Page 278: ...j sum0 15 sctr STH y1 y_1 2 y j 1 sum1 15 MV x01 x01b move to other reg file MPYLH h01 x01b p10 p10 h i 0 x j i 1 sctr ADD p10 sum17 p10 sum1 p10 p10 sum1 MPYHL h01 x23 p11 p11 h i 1 x j i 2 ADD p11 p10 sum11 sum1 p11 MPYLH h23 x23 p12 p12 h i 2 x j i 3 ADD p12 sum11 sum12 sum1 p12 MPYHL h23 x45 p13 p13 h i 3 x j i 4 ADD p13 sum12 sum13 sum1 p13 MPYLH h45 x45 p14 p14 h i 4 x j i 5 ADD p14 sum13 su...

Page 279: ... 1 octr dec outer lp cntr octr B LOOP Branch outer loop 6 14 6 Translating C Code to Linear Assembly Inner Loop and Outer Loop Example 6 77 shows the linear assembly with functional units assigned As in Example 6 68 on page 6 125 symbolic names now have an A or B in front of them to signify the register file where they reside Although this allocation is one of many possibilities one goal is to kee...

Page 280: ...ach outer loop MVK 60 rstx2 used to rst x pointer each outer loop MVK 64 rsth1 used to rst h pointer each outer loop MVK 64 rsth2 used to rst h pointer each outer loop MVK 201 octr loop ctr 201 100 2 32 8 1 MVK 4 pctr pointer reset lp cntr 32 8 MVK 5 sctr reset store lp cntr 32 8 1 ZERO sum07 sum07 0 ZERO sum17 sum17 0 mptr x x 0 mptr x_1 x 4 mptr h h 0 mptr h_1 h 4 LOOP trip 8 LDW D1T1 h 2 h01 h ...

Page 281: ...13 MPYLH M1 h45 x45 p14 p14 h i 4 x j i 5 ADD L2X p14 sum13 sum14 sum1 p14 MPYHL M2X h45 x67 p15 p15 h i 5 x j i 6 ADD S2 p15 sum14 sum15 sum1 p15 MPYLH M2 h67 x67 p16 p16 h i 6 x j i 7 ADD L2 p16 sum15 sum16 sum1 p16 MPYHL M1X h67 x8 p17 p17 h i 7 x j i 8 ADD L2X p17 sum16 sum17 sum1 p17 MPY M1 h01 x01 p00 p00 h i 0 x j i 0 sctr ADD L1 p00 sum07 p00 sum0 p00 p00 sum0 MPYH M1 h01 x01 p01 p01 h i 1...

Page 282: ... L1X p06 sum05 sum06 sum0 p06 MPYH M2 h67 x67 p07 p07 h i 7 x j i 7 ADD L1X p07 sum06 sum07 sum0 p07 sctr MVK S1 4 sctr reset store lp cntr pctr SUB S1 pctr 1 pctr dec pointer reset lp cntr pctr SUB S2 x rstx2 x reset x ptr pctr SUB S1 x_1 rstx1 x_1 reset x_1 ptr pctr SUB S1 h rsth1 h reset h ptr pctr SUB S2 h_1 rsth2 h_1 reset h_1 ptr pctr MVK S1 4 pctr reset pointer reset lp cntr octr SUB S2 oct...

Page 283: ...ltiply accumulates per cycle are still executing Table 6 27 Resource Table for FIR Filter Code a A side b B side Unit s Total Unit Unit s Total Unit M1 8 M2 8 S1 7 S2 6 D1 5 D2 6 L1 8 L2 8 Total non M units 20 Total non M units 20 1X paths 7 2X paths 5 6 14 8 Final Assembly Example 6 78 shows the final assembly for the FIR filter with the outer loop conditionally executing in parallel with the inn...

Page 284: ... lp cntr MVK S1 64 A5 used to reset h ptr 16 4 MVK S2 64 B5 used to reset h ptr 16 4 ADD L2X A6 2 B6 point to y j 1 LDW D1 A0 2 A9 h i 4 h i 5 LDW D2 B2 2 B8 h i 6 h i 7 A1 SUB S1 A4 A3 A4 reset x ptr A1 SUB S2 B1 B14 B1 reset x ptr A1 SUB S1 A0 A5 A0 reset h ptr LDH D2 B1 A8 x j i 8 ADD S2X A10 0 B8 move to other reg file MVK S1 5 A2 set store lp cntr MPYLH M2X A8 B8 B4 p10 h i 0 x j i 1 A1 SUB S...

Page 285: ... i 4 x j i 4 ADD L1X B9 A13 A13 sum0 p02 MPYLH M2 B8 B10 B13 p16 h i 6 x j i 7 ADD L2X A10 B7 B7 sum1 p13 LDW D1 A0 2 A9 h i 4 h i 5 LDW D2 B2 2 B8 h i 6 h i 7 A1 SUB S1 A4 A3 A4 reset x ptr MPY M2 B8 B10 B11 p06 h i 6 x j i 6 MPYH M1 A9 A11 A11 p05 h i 5 x j i 5 ADD L1X B13 A13 A9 sum0 p03 ADD L2X A10 B7 B7 sum1 p14 A1 SUB S2 B1 B14 B1 reset x ptr A1 SUB S1 A0 A5 A0 reset h ptr LDH D2 B1 A8 x j i...

Page 286: ...i 1 Branch occurs here A2 SHR S1 A10 15 A12 Asum0 15 A2 STH D2 B11 B6 2 y j 1 Bsum1 15 A2 STH D1 A12 A6 2 y j Asum0 15 6 14 9 Comparing Performance The cycle count of this code is 1612 50 8 4 0 12 The overhead due to the outer loop has been completely eliminated Table 6 28 Comparison of FIR Filter Code Code Example Cycles Cycle Count Example 6 61 FIR with redundant load elimination 50 16 2 9 6 2 2...

Page 287: ...register assignment is included followed by code generation of interruptible code and finally descriptions of interrupt subroutines Topic Page 7 1 Overview of Interrupts 7 2 7 2 Single Assignment vs Multiple Assignment 7 3 7 3 Interruptible Loops 7 5 7 4 Interruptible Code Generation 7 6 7 5 Interrupt Subroutines 7 11 Chapter 7 ...

Page 288: ...s that allow this to occur automatically Once an interrupt is taken an interrupt subroutine performs certain tasks or actions as required by the event Servicing an interrupt involves switching contexts while saving all state of the machine Thus upon return from the inter rupt operation of the interrupted algorithm is resumed as if there had been no interrupt Saving state involves saving various re...

Page 289: ...upt is taken on cycle 3 At this point all instructions which have begun execution are allowed to complete and no new instructions execute So 3 cycles after the interrupt is taken on cycle 3 the LDW instruction writes to A1 After the interrupt service routine has been processed program execution continues on cycle 4 with the ADD instruction In this case the ADD reads register A1 and will be reading...

Page 290: ...6 NOP 2 7 MPY M1 A6 A4 A5 uses new A1 result of LDW Both examples involve exactly the same schedule of instructions The only dif ference is the register allocation The single assignment register allocation as shown in Example 7 2 can result in higher register pressure Example 7 2 uses one more register than Example 7 1 The next section describes how to generate interruptible and non interruptible ...

Page 291: ...ible There are two options for making a loop with an iteration interval less than 6 interruptible 1 Simply slow down the loop and force an iteration interval of 6 cycles This is not always desirable since there will be a performance degradation 2 Unroll the loop until an iteration interval of 6 or greater is achieved This ensures at least the same performance level and in some cases can im prove p...

Page 292: ...s with minimum iteration intervals less than 6 are not inter ruptible higher iteration intervals might be used which results in lower per formance Unrolling the loop however prevents this reduction in perfor mance See section 7 4 4 Higher register pressure in single assignment can cause data spilling to memory in both looped code and non looped code when there are not enough registers to store all...

Page 293: ...the specified threshold number is not exceeded In other words the user can specify a threshold or maximum interrupt delay that allows the compiler to use multiple assignment in loops that do not exceed this threshold The code outside of loops can have interrupts disabled and also use multiple assignment as long as the threshold of uninterruptible cycles is not exceeded If the compiler cannot deter...

Page 294: ...as follows Example 7 3 Dot Product With MUST_ITERATE Pragma Guaranteeing Minimum Trip Count int dot_prod short a short b int n int i sum 0 pragma MUST_ITERATE 20 for i 0 i n i sum a i b i return sum With the MUST_ITERATE pragma the compiler only knows that this loop will execute at least 20 times Even with the interrupt threshold set at 100 by the mi option the compiler will still produce a 6 cycl...

Page 295: ...in linear assembly code by specifying both the minimum and maximum trip counts with the trip directive Note The compiler does not take stalls memory bank conflict external memory access time cache miss etc into account Because of this it is recom mended that you are conservative with the threshold value Let us now assume the worst case scenario the application needs to be inter ruptible at any giv...

Page 296: ...loop that will execute a factor of 4 between 16 and 48 times Example 7 6 Dot Product With MUST_ITERATE Pragma Guaranteeing Trip Count Range and Factor of 4 int dot_prod short a short b int n int i sum 0 pragma MUST_ITERATE 16 48 4 for i 0 i n i sum a i b i return sum The compiler knows that the trip count is some factor of four The compiler will unroll this loop such that four iterations of the lo...

Page 297: ... C Compiler The C C compiler automatically generates ISRs with the keyword interrupt The interrupt function must be declared with no arguments and should return void For example interrupt void int_handler unsigned int flags Alternatively you can use the interrupt pragma to define a function to be an ISR pragma INTERRUPT func The result of either case is that the C C compiler automatically creates ...

Page 298: ...ust be used to return from the routine If this is the NMI ISR a B NRP must be used instead An NOP 4 is required after the last LDW in this case to ensure that B0 is restored before returning from the interrupt Example 7 7 Hand Coded Assembly ISR Assume Register B0 B4 A0 are the only registers used by the ISR and no other functions are called STW B0 B15 store B0 to stack STW A0 B15 store A0 to stac...

Page 299: ...R and CSR to a register which is not being used or to or some other memory location usually the stack Once these have been saved you can reenable the appropriate interrupts This involves resetting the GIE bit and then doing any necessary modifications to the IER providing only certain interrupts are allowed to interrupt the particular ISR On return from the ISR the original val ues of the IRP IER ...

Page 300: ... to stack contains IRP STW B1 B15 store B1 to stack contains IER STW A0 B15 store A0 to stack original CSR Beginning of ISR code End of ISR code B restore Branch to restore routine disable CSR in delay slots of branch MVKL 0FFFEh A0 create mask to disable GIE bit MVKLH 0FFFFh A0 MVC CSR B5 read current CSR AND A0 B5 B5 AND B5 with mask MVC B5 CSR write new CSR with GIE disabled restore restore rou...

Page 301: ... presented earlier for the entire C6000 family as these concepts also ap ply to the C64x The sample code that is used in this chapter is included on the Code Genera tion Tools and Code Composer Studio CD ROM When you install your code generation tools the example code is installed in the c6xtools directory Use the code in that directory to go through the examples in this chapter Topic Page 8 1 Ove...

Page 302: ...tions associated with 40 bit operations allowing more flexible scheduling of high precision code 8 1 2 Greater Memory Bandwidth The C64x provides double word load and store instructions LDDW and STDW which can access 64 bits of data at a time Up to two double word load or store instructions can be issued every cycle This provides a peak band width of 128 bits per cycle to on chip memory 8 1 3 Supp...

Page 303: ...d for accessing packed data types without the restric tions imposed by 32 bit or 64 bit alignment boundaries The C64x can access up to 64 bits per cycle at any byte boundary with non aligned load and store instructions LDNW LDNDW STNW and STNDW 8 1 5 Additional Specialized Instructions The C64x also provides a number of new bit manipulation and other special ized instructions for improving perform...

Page 304: ...ream of instructions This saves code size and dramatically boosts performance The C64x provides a rich family of instructions which are designed to work with packed data processing At the core of this paradigm are packed data types which are designed to store multiple data elements in a single machine register Packed data processing instructions are built around these data types to provide a flexi...

Page 305: ...g er 32 bit register These partitions are merely logical partitions As with all C64x instructions instructions which operate on packed data types operate directly on the 64 general purpose registers in the register file There are no special packed data registers How data in a register is interpreted is deter mined entirely by the instruction that is using the data Figure 8 1 and Figure 8 2 illustr...

Page 306: ...r This allows the instructions which are not sensitive to sign bits such as adds and subtracts to operate on signed and unsigned data equally The distinction between signed and un signed comes into play primarily when performing multiplication comparing elements and unpacking data since either sign or zero extension must be performed Table 8 2 provides a summary of the operations that the C64x pro...

Page 307: ...ght shift only Multiply Yes Yes Dot Product Yes Yes Max Min Compare Yes Yes CMPEQ works with signed or unsigned Pack Yes Yes Yes Yes Unpack Yes Yes Yes See Table 8 4 for 16 bit un packs Only signed by unsigned support in these categories 8 2 4 Packing and Unpacking Data The C64x provides a family of packing and unpacking instructions which are used for converting between various packed and non pac...

Page 308: ...t values UNPKHU4 UNPKLU4 _unpkhu4 _unpklu4 Unpacking unsigned 8 bit data into 16 bits Preparing 8 bit data to be interleaved SPACKU4 _spacku4 Saturating 16 bit quantities down to unsigned 8 bit values packing them together SHLMB SHRMB SWAP4 ROTL _shlmb _shrmb _swap4 _rotl Rearranging packed 8 bit quantities The _packXX2 group of intrinsics work by extracting selected half words from two 32 bit reg...

Page 309: ...b_hi b_lo a_hi a_lo b a c b a c b a c b a c The saturating pack intrinsic _spack2 is closely related to the _pack2 intrin sic The main difference is that the saturating pack first saturates the signed 32 bit source values to signed 16 bit quantities and then packs these results into a single 32 bit word This makes it easier to support applications which generate some intermediate 32 bit results bu...

Page 310: ...s Table 8 4 Unpacking Packed 16 bit Quantities to 32 bit Values Type Position C code Assembly code Signed 16 bit Upper half dst signed src 16 SHR src 16 dst Lower half dst _ext src 16 16 EXT src 16 16 dst Unsigned 16 bit Upper half dst unsigned src 16 SHRU src 16 dst Lower half dst _ext src 16 16 EXTU src 16 16 dst For 8 bit data the C64x provides the _packX4 _spacku4 and _unpkX4 intrin sics for p...

Page 311: ...ckX4 and _spacku4 b_3 b_1 a_3 a_1 c c _packh4 b a b_3 b_1 b a b a c b_2 b_0 a_2 a_0 c _packl4 b a signed 16 bit signed 16 bit c _spacku4 b a sat a_hi sat b_lo sat b_hi c sat a_lo Saturation Unsigned 8 bit signed 16 bit signed 16 bit b a b_2 b_0 a_3 a_2 a_1 a_0 b_3 b_2 b_1 b_0 a_3 a_2 a_1 a_0 sat b_hi sat b_lo sat a_hi sat a_lo ...

Page 312: ...4x also provides a handful of additional byte manipulating operations that have proven useful in various algorithms These operations are neither packs nor unpacks but rather shuffle bytes within a word Uses include con volution oriented algorithms complex number arithmetic and so on Opera tions in this category include the intrinsics _shlmb _shrmb _swap4 and _rotl The first three in this list are ...

Page 313: ..._2 a_3 b_0 c _shrmb b a a_1 a_3 a_2 a_0 a_0 a_2 a_3 a_1 b a b _swap4 a 8 2 5 Optimizing for Packed Data Processing The C64x supports two basic forms of packed data optimization namely vec torization and macro operations Vectorization works by applying the exact same simple operations to several elements of data simultaneously Kernels such as vector sum and vector multi ply shown in Example 8 1 and...

Page 314: ... vector of elements and the same operation is being applied to each ele ment Pure vector code has no computation between adjacent elements when calculating results Also input and output arrays tend to have the same num ber of elements Figure 8 8 illustrates the general form of a simple vector op eration that operates on inputs from arrays A and B producing an output C such as our Vector Sum and Ve...

Page 315: ...hat there is significant mathematical interaction between adjacent ele ments Simple examples include dot product operations and complex multi plies as shown in Example 8 3 and Example 8 4 Example 8 3 Dot Product int dot_prod const short restrict a const short restrict b int len int i int sum 0 for i 0 i len i sum b i a i return sum Example 8 4 Vector Complex Multiply void vec_cx_mpy const short re...

Page 316: ...s a pure vector operation per formed on vectors of complex numbers as its name implies However it is not in implementation because neither the language type nor the hardware itself directly supports a complex multiply as a single operation The complex multiply is built up from a number of real multiplies with the com plex numbers stored in arrays of interleaved real and imaginary values As a resul...

Page 317: ...rate how single in struction multiple data optimizations apply to each of these 8 2 6 Vectorizing With Packed Data Processing The most basic packed data optimization is to use wide memory accesses in other words word and double word loads and stores to access narrow data such as byte or half word data This is a simple form of vectorization as de scribed above applied only to the array accesses Wid...

Page 318: ...ents packed into a register pair The array elements are packed with two elements in each regis ter across two registers Each register contains data in the packed 16 bit data type illustrated in Figure 8 2 For the moment assume that the arrays are double word aligned as shown in Example 8 5 For more information about methods for working with arrays that are not specially aligned see section 8 2 8 T...

Page 319: ...signed b_hi b_lo unsigned c_hi c_lo for i 0 i len i 4 a_hi _hi const double a i a_lo _lo const double a i b_hi _hi const double b i b_lo _lo const double b i somehow the ADD occurs here with results in c_hi c_lo double c i _itod c_hi c_lo Figure 8 11 Array Access in Vector Sum by LDDW a 0 16 bits a 1 a 2 a 3 a 7 a 4 a 5 a 6 64 bits a 1 a 0 a 2 a 3 32 bits 32 bits a 3 a 2 a 1 a 0 LDDW 64 bit regist...

Page 320: ...ts of data The next step is to find a method to quickly add them The _add2 intrinsic provides just that It adds corresponding packed elements in two different words producing two packed sums It provides exactly what is needed a vector addition Figure 8 13 illustrates Figure 8 13 Vector Addition a 1 a 0 a_lo c_lo _add2 b_lo a_lo b_lo b 1 b 0 c_lo c 1 b 1 a 1 c 0 b 0 a 0 So putting in _add2 to perfo...

Page 321: ...ch as loop unrolling and software pipelin ing These and other optimizations are described in detail throughout Chapter 6 8 2 6 2 Vectorizing the Vector Multiply The vector multiply shown in Figure 8 8 is similar to the vector sum in that the algorithm is a pure vector algorithm One major difference is the fact that the intermediate values change precision In the context of vectorization this chang...

Page 322: ...s a i mult b i 32 bits a i b i c i 16 bits Right shift Notice that the values are still loaded and stored as 16 bit quantities There fore you should use the same basic flow as the vector sum Example 8 7 shows this starting point Figure 8 11 and Figure 8 12 also apply to this exam ple to illustrate how data is being accessed ...

Page 323: ..._itod c_hi c_lo The next step is to perform the multiplication The C64x intrinsic _mpy2 performs two 16 16 multiplies providing two 32 bit results packed in a 64 bit double This provides the multiplication The _lo and _hi intrinsics allow separation of the two separate 32 bit products Figure 8 15 illustrates how _mpy2 works Figure 8 15 Packed 16 16 Multiplies Using _mpy2 a 1 a 0 a_lo b_lo b 1 b 0 ...

Page 324: ...to Perform the Vector Multiply void vec_mpy1 const short restrict a const short restrict b short restrict c int len int shift int i unsigned a_hi a_lo Packed 16 bit values unsigned b_hi b_lo Packed 16 bit values double c_hi_dbl c_lo_dbl 32 bit prod in 64 bit pairs int c_hi3 c_hi2 c_lo1 c_lo0 Separate 32 bit products unsigned c_hi c_lo Packed 16 bit values for i 0 i len i 4 a_hi _hi const double a ...

Page 325: ...k2 before the shift If the shift amount is exactly 16 eliminate the shift The _packh2 effectively performs part of the shift shifting right by 16 so that the job can be finished with a _shr2 intrinsic Figure 8 17 illustrates how this works If the shift amount is less than 16 only use the _shr2 intrinsic if the 32 bit products can be safely truncated to 16 bits first without losing significant digi...

Page 326: ...bit product Right shifts Signed 32 bit product c 1 c 0 _pack2 Original data flow Signed 32 bit product Modified data flow Signed 32 bit product Discarded sign bits Discarded sign bits 16 bit result 16 bit result c 1 c 0 16 bits 16 bits Right shifts _packh2 16 bit result 16 bit result 16 bits 16 bits ...

Page 327: ...lt 16 bit result Truncated bits Signed 32 bit product Signed 32 bit product _pack2 16 bits 16 bits Discarded sign bits Discarded sign bits 16 bits 16 bits 16 bit result 16 bit result Whether or not the 16 bit shift version is used consider the vector multiply to be fully optimized from a packed data processing standpoint It can be further optimized using the more general techniques such as loop un...

Page 328: ...together _dotpn2 DOTPN2 Performs two 16x16 multiplies and subtracts the sec ond product from the first _dotprsu2 DOTPRSU2 Performs two 16x16 multiplies adds products togeth er and shifts rounds the sum _dotpnrsu2 DOTPNRSU2 Performs two 16x16 multiplies subtracts the 2nd product from the 1st and shifts rounds the difference _dotpu4 _dotpsu4 DOTPU4 DOTPSU4 Performs four 8x8 multiplies and adds produ...

Page 329: ...on 8 2 3 The result of those steps is the intermedi ate code shown in Example 8 9 Example 8 9 Vectorized Form of the Dot Product Kernel int dot_prod const short restrict a const short restrict b short restrict c int len int i int sum 0 32 bit accumulation unsigned a3_a2 a1_a0 Packed 16 bit values unsigned b3_b2 b1_b0 Packed 16 bit values double c3_c2_dbl c1_c0_dbl 32 bit prod in 64 bit pairs int c...

Page 330: ...swer here Figure 8 18 illustrates how the _dotp2 intrinsic operates Other _dotp intrin sics operate similarly Figure 8 18 Graphical Representation of the _dotp2 Intrinsic c _dotp2 b a a_hi a_lo a b b_hi b_lo 32 bit register 32 bit register a_hi b_hi a_lo b_lo 16 bit 16 bit 32 bit 32 bit add a_hi b_hi a_lo b_lo c c _dotp2 b a 32 bit This operation exactly maps to the operation the dot product kerne...

Page 331: ...dot products on pairs of elements totalling the results in the accumulator sum _dotp2 a3_a2 b3_b2 sum _dotp2 a1_a0 b1_b0 return sum At this point the code takes full advantage of the new features that the C64x provides In the particular case of this kernel no further optimization should be necessary The tools produce an optimal single cycle loop using the com piler version that was available at th...

Page 332: ...r to the one that used with the Dot Product kernel in Section 8 2 4 1 First the loads and stores are vectorized in order to bring data in more efficiently Next operations are combined together into intrinsics to make full use of the machine Example 8 12 illustrates the vectorization step For details consult the earlier examples such as the Vector Sum The complex multiplication step itself has not ...

Page 333: ...sented as a3 a2 j and a1 a0 j That is the real components are a3 and a1 and the imaginary components are a2 and a0 a3_a2 _hi const double a i a1_a0 _lo const double a i Load two complex numbers from the b array b3_b2 _hi const double b i b1_b0 _lo const double b i Separate the 16 bit coefficients so that the complex multiply may be performed This portion needs further optimization a3 signed a3_a2 ...

Page 334: ...rinsic can be used to calculate the real component of the output Figure 8 19 shows how this flow would work Figure 8 19 The _dotpn2 Intrinsic Performing Real Portion of Complex Multiply a_real a_imaginary a b b_real b_imaginary 32 bit register 32 bit register a_real b_real a_imaginary b_imaginary 16 bit 16 bit 32 bit 32 bit sub a_real b_real a_imag b_imag c c _dotpn2 b a 32 bit The calculation for...

Page 335: ...nary component of the output Figure 8 20 _packlh2 and _dotp2 Working Together Real Imaginary a b a_imaginary b_real a_real b_imaginary add a_imag b_real a_real b_imag c c _dotp2 b _packl2 a a 32 bit a _packlh2 a a a Imaginary Real Imaginary Real Once both the real and imaginary components of the result are calculated it is necessary to convert the 32 bit results to 16 bit results and store them In...

Page 336: ...are a2 and a0 a3_a2 _hi const double a i a1_a0 _lo const double a i Load two complex numbers from the b array b3_b2 _hi const double b i b1_b0 _lo const double b i Perform the complex multiplies using _dotp2 _dotpn2 c3 _dotpn2 b3_b2 a3_a2 Real c2 _dotp2 b3_b2 _packlh2 a3_a2 a3_a2 Imaginary c1 _dotpn2 b1_b0 a1_a0 Real c0 _dotp2 b1_b0 _packlh2 a1_a0 a1_a0 Imaginary Pack the 16 bit results from the u...

Page 337: ...ed Non Aligned Data must be aligned on a boundary equal to its width Data may be aligned on any byte boundary Can read or write bytes half words words and double words Can only read or write words and double words Up to two accesses may be issued per cycle for a peak bandwidth of 128 bits cycle Only one non aligned access may be issued per cycle for a peak bandwidth of 64 bits cycle Bank conflicts...

Page 338: ... bytes from a 115 through a 122 d _memd8 void a 115 It is easy to modify code to use non aligned accesses Example 8 15 below shows the Vector Sum from Example 8 6 rewritten to use non aligned memory accesses As with ordinary array references the compiler will opti mize away the redundant references Example 8 15 Vector Sum Modified to use Non Aligned Memory Accesses void vec_sum const short restric...

Page 339: ...ndent access algorithms include motion compensation which must read image blocks from arbitrary locations in the source image In each of these cases it is extremely difficult to transform the problem into one which uses aligned memory accesses while still vectorizing the code Often the result with aligned memory accesses is worse than if the code were not optimized for packed data processing at al...

Page 340: ..._cmpXX4 in conjunction with the expand intrinsics _xpnd2 and _xpnd4 The packed compare intrinsics compare packed data elements producing a small bitfield which describes the results of the independent comparisons For _cmpeq2 _cmpgt2 and _cmplt2 the intrinsic returns a two bit field containing the results of the two separate comparisons For _cmpeq4 _cmpgtu4 and _cmpltu4 the intrinsic returns a four...

Page 341: ...ics work from a bitfield such as the bitfield returned by the compare intrinsics The _xpnd2 and _xpnd4 intrinsics expand the lower 2 or 4 bits of a word to fill the entire 32 bit word of the result The _xpnd2 intrinsic expands the lower two bits of the input to two half words whereas _xpnd4 ex pands the lower four bits to four bytes The expanded output is suitable for use as a mask for instance fo...

Page 342: ...essing on the C64x 8 42 Figure 8 23 Graphical Illustration of _xpnd2 Intrinsic a b_hi b_lo b b xpnd2 a xpnd xpnd Figure 8 24 Graphical Illustration of _xpnd4 Intrinsic a b_3 b b xpnd4 a b_2 b_1 b_0 xpnd xpnd xpnd xpnd ...

Page 343: ...h unsigned char restrict image int count unsigned char threshold int i for i 0 i count i if image i threshold image i 0 Vectorization techniques are applied to the code as described in Section 8 2 giving the result shown in Example 8 17 The _cmpgtu4 intrinsic compares against the threshold values and the _xpnd4 intrinsic generates a mask for setting pixels to 0 Note that the new code has the restr...

Page 344: ...pixels from input image one double word p7_p6_p5_p4 _hi double image i p3_p2_p1_p0 _lo double image i Compare each of the pixels to the threshold c7_c6_c5_c4 _cmpgtu4 p7_p6_p5_p4 t3_t2_t1_t0 c3_c2_c1_c0 _cmpgtu4 p3_p2_p1_p0 t3_t2_t1_t0 Expand the comparison results to generate a bitmask x7_x6_x5_x4 _xpnd4 c7_c6_c5_c4 x3_x2_x1_x0 _xpnd4 c3_c2_c1_c0 Apply mask to the pixels Pixels that were less tha...

Page 345: ...rs The branch is not taken and the loop counter is not updated If the loop counter was not initially negative decrement the loop counter and write the new value back to the register file This step does not occur for BPOS If the loop counter was not initially negative issue the branch Code will begin executing at the branch s destination after the branch s delay slots From linear assembly the branc...

Page 346: ...proc count reg i iters flag ZERO iters Initialize our return value to 0 CMPLT count 1 flag flag B does_not_iterate Do not iterate if count MV count i i count loop trip 1 This loop is guaranteed to iterate at least once ADD iters 1 iters iters SUB i 1 i i i B loop while i 0 does_not_iterate return iters Return our number of iterations endproc Using BDEC the loop is written similarly However the loo...

Page 347: ...loop to execute extra itera tions and then compensate for these iterations after the loop This is particu larly effective in cases where the cost of the conditional flow before the loop is greater than the cost of executing the body of the loop as in the example above Example 8 21 shows one way to apply this modification Example 8 21 Loop Tip Count Using BDEC With Extra Loop Iterations global _cou...

Page 348: ... call directive The Assembly Op timizer issues an ADDKPC as part of the function call sequence for this call as shown in the compiler output in Example 8 23 Example 8 22 Using the call Directive in Linear Assembly data hello string Hello World 0 text global _puts global _main _main cproc reg pointer loop MVKL hello pointer Generate a 32 bit pointer to the MVKH hello pointer phrase Hello World call...

Page 349: ...igned memory accesses do not cause bank conflicts This is due to the fact that no other memory access can execute in parallel with a non aligned memory access As a result the mptr directive has no effect on non aligned load and store instructions 8 3 2 Avoiding Cross Path Stalls 8 3 2 1 Architectural Considerations The C6000 CPU components consist of Two general purpose register files A and B Eigh...

Page 350: ...ss path accesses allowing multiple functional units per side to read the same cross path source simulta neously Thus the cross path operand for one side may be used by up to two of the functional units on that side in an execute packet In the C62x C67x only one functional unit per data path per execute packet can get an operand from the opposite register file On the C64x a delay clock cycle is int...

Page 351: ...for i 0 i n i c i m a i 15 b i This algorithm requires two loads a multiply a shift an add and a store Only the D units on the C6000 architecture are capable of loading storing values from to memory Since there are two D units available it would appear this algorithm would require two cycles to produce one result considering three D operations are required Be aware however that the input and outpu...

Page 352: ...n 1 w mask to isolate bn SHR S1 bi_i1 16 bi1 shift bn bn 1 by 16 to isolate bn 1 ADD L2X pi_s bi ci add sprod0 bn ADD L1X pi1_s bi1 ci1 add sprod1 bn 1 STH D2 ci c 2 store 16 bits cn STH D1 ci1 c1 2 store 16 bits cn 1 cntr SUB cntr 1 cntr decrement loop count cntr B LOOP branch to loop if loop count 0 endproc In the implementation above 16 bit values two at a time with the LDW instruc tion into a ...

Page 353: ...f loop count 0 A1 ADD L1 0xffffffff A1 A1 decrement loop count LDW D1T1 A7 A3 load 32 bits bn bn 1 LDW D2T2 B5 B2 load 32 bits an an 1 A2 MPYSU M1 2 A2 A2 A2 STH D2T2 B1 B4 4 store 16 bits cn 1 A2 STH D1T1 A6 A8 4 store 16 bits cn ADD L1X A4 B0 A6 add sprod1 bn 1 ADD L2X B8 A0 B1 add sprod0 bn SHR S2 B9 0xf B0 shift prod1 right by 15 sprod1 SHR S1 A3 0x10 A4 shift bn bn 1 by 16 to isolate bn 1 MPY...

Page 354: ...1 A4 0xf A3 shift prod0 right by 15 sprod0 LDW D1T1 A9 A5 load 32 bits bn bn 1 In Example 8 27 the assembly optimizer has created a two cycle loop with out a cross path stall The loop count decrement instruction and the conditional branch to loop based on the value of loop count instruction have been replaced with a single BDEC instruction In the instruction slot created by combining these two ins...

Page 355: ...rence to techniques that can be used to optimize loops and in most cases refers you to specific sections within this book providing additional detail Topic Page A 1 Loop Disqualification Messages A 2 A 2 Pipeline Failure Messages A 4 A 3 Investigative Feedback A 10 Appendix A ...

Page 356: ...trol flow or parallel instructions as input to linear assembly Loop Contains a Call Description There are occasions when the compiler may not be able to inline a function call that is in a loop This may be due to the compiler being unable to inline the function call the loop could not be software pipelined Solution If the caller and the callee are C or C use pm and op2 See the TMS320C6000 Opimizin...

Page 357: ...ng Disabled Software pipelining has been disabled by a command line option Pipelining will be turned off when using the mu option not using o2 o3 or using ms2 ms3 Uninitialized Trip Counter The trip counter may not have been set to an initial value Suppressed to Prevent Code Expansion Software pipelining may be suppressed because of the ms1 flag When the ms1 flag is used software pipelining is dis...

Page 358: ...n the loop to specific machine registers A0 A15 and B0 B15 for the C62x and C67x or A0 A31 and B0 B31 for the C64x There are oc casions when software pipelining this particular ii is not possible This may be due to the loop schedule found requiring more registers than the C6000 has available The analyzing feedback example shows ii 12 Cannot allocate machine registers Regs Live Always 1 5 A B side ...

Page 359: ...ge 6 94 See section 3 4 3 4 Loop Unrolling in C on page 3 44 TMS320C6000 C C Compiler User s Guide Cycle Count Too High Not Profitable Description In rare cases the iteration interval of a software pipelined loop is higher than a non pipelined list scheduled loop In this case it is more efficient to execute the non software pipelined version Solution Split into multiple loops or reduce the complex...

Page 360: ...profitably pipelined Based on the available information on the largest possible trip count the compiler estimates that it will always be more profitable to execute a non pipelined version than to execute the pipe lined version given the schedule that it found at the current iteration interval Solution Probably best optimized by another technique i e unroll the loop completely For more information ...

Page 361: ... trip count information For more information See section 3 4 3 3 Communicating Trip Count Information to the Compiler on page 3 43 See section 6 2 5 The trip Directive on page 6 8 Register is Live Too Long Description Sometimes the compiler finds a valid software pipeline schedule but one or more of the values is live too long Lifetime of a register is determined by the cycle a value is written in...

Page 362: ...nto two separate loops If multiple conditionals are used in the loop allocation of these conditionals is the reason for the failure Try writing linear assembly and partition all instruc tions writing to condition registers evenly between the A and B sides of the machine For the C62x and C67x if there is an uneven number put more on the B side since there are 3 condition registers on the B side and...

Page 363: ...pelining on the C6000 If possible rewrite the loop to not modify the trip counter by adding a separate variable to be modified The fact that the loop counter is used in the loop is actually determined much earlier in the loop qualification stage of the compiler Why did the compiler try to schedule this anyway The reason has to do with the mh option This op tion allows for extraneous loads and faci...

Page 364: ...piler must assume they might This can cause a dependency often between the load of one pointer and the store of another that does not really exist For software pipelined loops this can greatly degrade performance Solution Use pm program level optimization to reduce memory pointer aliasing Add restrict declarations to all pointers passed to a function whose objects do not overlap Use mt option to a...

Page 365: ...n c code Use the trip directive to specify loop count information in linear assembly Alternatively the compiler may be able to determine this information on its own when you compile the function and callers with pm and op2 For More Information See section 3 4 3 3 Communicating Trip Count Information to the Compiler on page 3 43 See section 6 2 5 The trip Directive on page 6 8 Uneven Resources Desc...

Page 366: ... Bank Conflicts Description In cases where the compiler generates 2 memory accesses in one cycle and those accesses are either 8 bytes apart on a C620x device 16 bytes apart on a C670x device or 32 bytes apart on a C640x device AND both accesses reside within the same memory block a memory bank stall will occur To avoid this degradation memory bank conflicts can be completely avoided by either pla...

Page 367: ...accesses LDW STW for any short accesses being performed Solution Use word accesses for short arrays declare int or use _nassert and use mpy intrinsics to multiply upper and lower halves of registers Try to employ redundant load elimination technique if possible Use LDW STW instructions for accesses to memory For More Information See section 3 4 2 Using Word Accesses for Short Data C on page 3 24 S...

Page 368: ...d vector sum 6 76 functional units in 5 5 instructions in 5 4 labels in 5 2 linear dot product fixed point 6 10 6 20 6 24 6 30 6 39 dot product floating point 6 21 6 25 6 31 6 40 FIR filter 6 113 6 115 6 124 6 126 FIR filter outer loop 6 139 FIR filter outer loop conditionally executed with inner loop 6 142 6 144 FIR filter unrolled 6 138 if then else 6 88 6 91 6 96 6 99 IIR filter 6 79 6 83 live ...

Page 369: ... 5 8 cproc directive 2 25 CPU elements 1 2 D D functional units 5 7 data types 3 2 dependency graph dot product fixed point 6 12 dot product fixed point parallel execution 6 15 with LDW 6 22 6 24 6 30 dot product floating point with LDW 6 23 6 25 6 31 drawing 6 11 steps in 6 12 FIR filter with arrays aligned on same loop cycle 6 122 with no memory hits 6 125 with redundant load elimination 6 114 i...

Page 370: ...ry hits 6 130 with redundant load elimination no memory hits outer loop software pipelined 6 134 linear assembly for inner loop 6 113 for outer loop 6 139 for unrolled inner loop 6 124 for unrolled inner loop with mptr directive 6 126 with inner loop unrolled 6 138 with outer loop conditionally executed with in ner loop 6 142 6 144 software pipelining the outer loop 6 132 using word access in 3 29...

Page 371: ...weighted vector sum 6 62 little endian mode and MPY operation 6 21 live too long code 6 68 C code 6 102 inserting move MV instructions 6 106 unrolling the loop 6 106 issues 6 102 and software pipelining 3 50 created by split join paths 6 105 load doubleword LDDW instruction 6 19 word LDW instruction 6 19 long data type 3 2 loop carry path described 6 78 counter handling odd numbered 3 27 unrolling...

Page 372: ...a MUST_ITERATE 3 44 preparation for tutorial 2 1 priming the loop described 6 51 printf function 3 3 program level optimization 3 7 prolog 3 41 6 51 6 53 pseudo code for single cycle accumulator with ADDSP 6 37 R redundant load elimination 6 111 loops 3 43 reg directive 2 25 6 20 6 21 register allocation 6 128 operands 5 8 resource conflicts described 6 65 live too long issues 6 68 6 102 table FIR...

Page 373: ...5 V vector sum function See also weighted vector sum C code 3 8 with const keywords _nassert word reads 3 25 with const keywords _nassert word reads and loop unrolling 3 46 with const keywords _nassert and word reads generic 3 26 3 27 with three memory operations 3 45 word aligned 3 46 dependency graph 3 8 handling odd numbered loop counter with 3 27 handling short aligned data with 3 27 rewriting...

Reviews:

No comments

Related manuals for TMS320C6000 DSP

Professional Series

Brand: PACOM Pages: 15

Univerge SV9300

Brand: NEC Pages: 26

Brand: NEC Pages: 20

DS2000 IntraMail

Brand: NEC Pages: 2

Brand: Hal Communications Pages: 25

Brand: Paradyne Pages: 1

Brand: IBA Pages: 29

Brand: Maxim Pages: 26

Brand: Barco Pages: 121

Brand: Cameo Pages: 23

Compact NVR-AS 4000

Brand: IndigoVision Pages: 38

Brand: Intel Pages: 958

Brand: IEI Technology Pages: 4

Brand: Rockwell Automation Pages: 21

Brand: Primera Pages: 82

HYPER-BT-E38 1 Series

Brand: IEI Technology Pages: 95

Brand: FabiaTech Pages: 3

Brand: Stahl Pages: 52

Brands by name

0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Popular brands

Load more brands