
Optimizing DSP56300/DSP56600 Applications

MOTOROLA

Using the DMA

Data Transfer Optimization Hints

4.5 DATA TRANSFER OPTIMIZATION HINTS

Some points should be borne in mind when optimizing the code for 
performance:

• While transferring words between two data memory 
locations takes approximately the same number of cycles 
whether it is done by software or by DMA, the DMA has an 
advantage when transferring to or from program memory. 
This is due to the 6 cycles required for every software access 
(MOVEM instruction) to program memory.

• The DSP56300 RAM is organized in 256-word 
blocks (addresses in a block share the sixteen Most 
Significant Bits of the address). A ROM block is 3 K words 
long. If both the core and the DMA access addresses in the 
same block, the DMA access stalls until the core stops its 
access to that block. To avoid such stalls, the core and the 
DMA should be made to work on separate memory blocks. 
However, if requirements for overall efficiency 
outweigh possible stalls, the programmer should still be 
aware of the possible DMA stall, and perhaps write the loops 
so that the core does not access the same memory block on 
every clock cycle. The following loop generates an access to the 
source memory block on every clock cycle, and will stall a parallel 
DMA transfer to that block for as long as the loop lasts:

        move    x:(r0)+,a
        move    x:(r0)+,b
        DO      #(N/2-1),_BE_NASTY_TO_DMA
        move    x:(r0)+,a   a,y:(r4)+   ;r0 points to the 
                                        ;memory block that is 
                                        ;also used by the DMA
        move    x:(r0)+,b   b,y:(r4)+   ;r4 points to other 
                                        ;internal memory
_BE_NASTY_TO_DMA
        move    a,y:(r4)+
        move    b,y:(r4)+

• The following more considerate loop lasts longer, but enables 
the DMA to access the memory block, too:

        DO      #N,_IM_OK_DMA_OK
        move    x:(r0)+,x0      ;r0 points to memory block
                                ;also used by the DMA
        move    x0,y:(r4)+      ;r4 points to other 
                                ;internal memory
_IM_OK_DMA_OK
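
The block rule and the difference between the two loops can be checked on a host machine before the buffers are placed. The following is a minimal Python sketch under stated assumptions, not a cycle-accurate model: the function names are illustrative, both buffers are assumed to lie in the same internal RAM space, each instruction is assumed to take one clock cycle, and the DMA is assumed to be able to use any cycle in which the core does not touch the shared block.

```python
RAM_BLOCK_WORDS = 256  # from the text: internal RAM is organized in 256-word blocks


def ram_block(addr):
    """Return the RAM block index of a word address (address divided by 256)."""
    return addr // RAM_BLOCK_WORDS


def blocks_touched(base, length):
    """Return the set of block indices covered by `length` words starting at `base`."""
    if length <= 0:
        return set()
    return set(range(ram_block(base), ram_block(base + length - 1) + 1))


def shared_blocks(core_buf, dma_buf):
    """Return the blocks used by both buffers; each buffer is a (base, length) pair.

    Assumption: both buffers are in the same memory space (for example, both in X RAM).
    """
    return blocks_touched(*core_buf) & blocks_touched(*dma_buf)


def dma_free_slots(body, iterations):
    """Simplified model of a hardware loop.

    `body` holds one boolean per instruction, True when that instruction
    accesses the block that the DMA also uses. Assumption: one clock cycle
    per instruction. Returns (total_cycles, cycles_left_free_for_the_dma).
    """
    total = len(body) * iterations
    free = body.count(False) * iterations
    return total, free


if __name__ == "__main__":
    N = 100  # hypothetical array length
    # Buffers in adjacent blocks do not collide; one extra word makes them collide.
    print(shared_blocks((0x0000, 256), (0x0100, 64)))
    print(shared_blocks((0x0000, 257), (0x0100, 64)))
    # _BE_NASTY_TO_DMA: both moves in the loop body read the shared block.
    print(dma_free_slots([True, True], N // 2 - 1))
    # _IM_OK_DMA_OK: only the first move reads the shared block.
    print(dma_free_slots([True, False], N))
```

Under these assumptions, with N = 100 the first loop body runs for 98 cycles and leaves no free cycle for the DMA, while the second runs for 200 cycles and leaves 100 of them free. This matches the trade-off described above: the considerate loop lasts longer, but the DMA is no longer locked out of the block.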

Summary of Contents for DSP56300

Page 1: ...M o t o r o l a s H i g h P e r f o r m a n c e D S P T e c h n o l o g y APR20 D DSP56300 DSP56600 Application Optimization forthe DigitalSignal Processors ...

Page 2: ...st Performance 1 7 1 4 3 Appendixes 1 8 SECTION 2 DATA OPERATIONS 2 1 2 1 USING THE DUAL DATA PATHS 2 1 2 2 16 BIT ARITHMETIC MODE DSP56300 ONLY 2 6 2 3 THE MAX INSTRUCTION 2 7 2 4 USING THE BARREL SHIFTER 2 8 2 5 BIT MANIPULATION INSTRUCTIONS 2 10 2 6 DOUBLE PRECISION ARITHMETIC 2 11 2 7 USING LESS STRAIGHT FORWARD INSTRUCTIONS 2 13 SECTION 3 PROGRAM CONTROL 3 1 3 1 HARDWARE LOOPS 3 1 3 2 THE HAR...

Page 3: ...urst Mode 5 6 5 2 MEMORY SWITCH 5 9 5 3 USING THE BOOTSTRAP ROM 5 11 SECTION 6 PIPELINE INTERLOCKS 6 1 6 1 DATA ALU PIPELINE INTERLOCKS 6 1 6 1 1 What are the Data ALU Pipeline Interlocks 6 2 6 1 2 Avoiding Data ALU Pipeline Interlocks 6 3 6 1 2 1 Code Reorder 6 3 6 1 2 2 Loop Unrolling 6 4 6 1 2 2 1 Loop Unrolling in N Array Scale Routine 6 4 6 1 2 2 2 Unrolling in Memory Array Copy Routine 6 5 6...

Page 4: ...oop LA or LA 1 6 11 6 4 1 5 MOVE from the System Stack High SSH 6 11 6 4 1 6 Conditional Instructions 6 11 6 4 2 Avoiding Program Flow Control Pipeline Interlocks 6 11 SECTION 7 COMPACT OPCODE USE 7 1 7 1 CYCLE COUNT OF AN INSTRUCTION 7 1 7 1 1 Opening small REP and DO Loops 7 1 7 1 2 Replacing Jumps with Conditional Execution Instructions 7 2 7 1 3 Inverting Condition in Conditional Jump Instruct...

Page 5: ...S A 3 APPENDIX B DEBUG AND TEST SUPPORT B 1 B 1 ONCE PORT FEATURES B 1 B 2 JTAG PORT FEATURES B 2 B 3 ADDRESS TRACING B 3 APPENDIX C USING THE PROFILER C 1 C 1 SCOPE C 1 C 2 CREATING A PROFILER C 1 C 3 THE PROFILING REPORT C 2 C 3 1 Basic Report C 2 C 3 2 Symbol Report C 3 C 3 3 Instruction Set Usage Report C 3 C 3 4 Code Coverage Report C 5 C 3 5 Basic Subroutine Report C 6 C 3 6 Subroutine Call ...

Page 6: ... 1 The Fast Normalization Operation for the DSP56300 2 9 Figure 2 2 48 48 bit Multiplication with 48 Bits of the Result Kept 2 12 Figure 3 1 State of the Stack When IRQA Is Serviced 3 5 Figure 4 1 DMA Addressing Modes for SCI Transmitters 4 10 Figure 5 1 DSP56302 Memory Maps 5 10 ...

Page 7: ...ta Operations Using Multi shift 2 8 Table 2 5 Bit manipulation instructions 2 10 Table 3 1 Implicit Stack Activity 3 4 Table 3 2 Registers Involved in Stack Extension Operation 3 7 Table 3 3 Stack Status Information 3 9 Table 3 4 Options for Parallel Moves and Conditional Execution 3 12 Table 3 5 Instructions with Program Memory Arguments 3 14 Table 5 1 Example for Cycle Count with Cache Enabled V...

Page 8: ... this document is to describe the new and the DSP56000 based features of the DSP56300 and DSP56600 cores in order to help the DSP software engineer to fully utilize the processor resources and generate an optimized application The document is a supplement to the detailed DSP56300 and DSP56600 Family Manuals 1 1 DSP56300 CORE FAMILY The DSP56300 core consists of the Expansion Port and DRAM Controll...

Page 9: ...een these derivatives are the size of the on chip memory and the types of on chip peripherals and hardware accelerators 1 2 DSP56600 CORE FAMILY The DSP56600 core consists of the External Memory Interface port Data ALU Address Generation Unit Program Control Unit PLL Clock Oscillator On Chip Emulation module and the Peripheral and Memory Expansion Busses The main differences between the DSP56300 a...

Page 10: ...der generation 24 bit DSP family the DSP56000 The following tables shortly describe these enhancements 1 3 1 Instruction Set Enhancements Many instructions were added in order to support the target applications of the new DSP cores Table 1 1 New Instructions in DSP56300 and DSP56600 Opcodes Opcodes Exist in DSP56300 Exist in DSP56600 MAX Transfer by Signed Value MAXM Transfer by Magnitude INSERT I...

Page 11: ...LRA Load Relative Address BSR BScc Branch Subroutine always conditionally BRA Bcc Branch Target always conditionally BSset BSclr Branch Subroutine on Bit Set Clear BRset BRclr Branch Target on Bit Set Clear DO Forever DO Loop Forever DOR Forever DO Loop Forever Relative BRKcc Break Loop Conditionally TRAPcc TRAP Conditionally IFcc Execute Instruction Conditionally VSL Viterbi Shift Left Table 1 1 ...

Page 12: ... multibit operations The address and offset registers of the DSP56300 R0 R7 N0 N7 were extended to 24 bit wide to support larger memory sizes The DSP56300 has a 16 bit Arithmetic operating mode such that 16 bit exact algorithms can be implemented without any overhead The DSP56300 and the DSP56600 have an on chip Hardware Stack Extension mechanism that makes the Stack depth practically unlimited Ro...

Page 13: ...n be used to optimize applications Section 1 Introduction DSP56300 core family DSP56600 core family Enhancements over the DSP56000 Section 2 Data Operations How to organize data in memory to use parallel moves How to use the barrel shifter in various applications The benefit and use of the 16 bit Arithmetic support Some examples that show the benefit of few of the new arithmetic and logical instru...

Page 14: ... task switching Burst mode for DRAMs Memory banks between program and data Using the bootstrap ROM 1 4 2 Optimizing the Code for Best Performance The next two sections include general explanation of the various pipeline stall conditions and how they can be avoided in order to get faster execution times In addition some observations on the instruction set are included along with recommended usage f...

Page 15: ...cle count of an instruction Addressing modes Word count of an instruction Peripheral addressing 1 4 3 Appendixes There are three appendices providing supplementary information about application design guidelines Appendix A Saving Power Appendix B Debug and Test Support Appendix C Using the Profiler ...

Page 16: ...emory sections the X memory and the Y memory with the Data ALU This parallelism allows the DSP56300 DSP56600 core to execute more effectively for example executing a FIR tap in one clock cycle Data that is moved in parallel into a register is ready for use in the next instruction and does not interfere with the current value of the operands in execution In the above example the values of X0 and Y0...

Page 17: ...to generate the address for both the X memory and the Y memory For example mac x0 x1 a l r0 X The syntax L R0 X is equivalent to moving X R0 to X0 and Y R0 to X1 then incrementing R0 The name long addressing refers to the fact that such addressing enables to access two data registers as if they were one 48 bit long register Not all the DSP56300 DSP56600 instructions support parallel moves In gener...

Page 18: ...AC Signed operands Signed Multiply and Accumulate and Round MACR Transfer by Signed Value MAX Transfer by Magnitude MAXM Signed Multiply MPY Signed Multiply and Round MPYR Negate Accumulator NEG Logical Complement NOT Logical Inclusive OR OR Non immediate Round Accumulator RND Rotate Left ROL Rotate Right ROR Subtract Long with Carry SBC Subtract SUB Non immediate Shift Right and Subtract Accumula...

Page 19: ...available for long addressing are listed in Table 2 3 Note Some syntax combinations of the accumulators differ only in shifting limiting if the register is the source or implicit register updates if they are destination For example compare A10 with A In the AB and BA combinations each accumulator has same behavior as a regular move such as move a x r0 Transfer Data ALU Register TFR Test Accumulato...

Page 20: ...thms involving complex numbers the efficient solution uses one memory space for the real part of the numbers while the other memory space is used for the imaginary part In those examples there is a logical separating criterion between the data placed in the X and Y memories In many applications however variables may be split up between the X and Y memories based on no other criterion than the abil...

Page 21: ...s For more information on the 16 bit Arithmetic mode see Section 3 4 in the DSP56300 Family Manual for a general description and Appendix A in the same manual Instruction set for a detailed description on the functionality of each instruction affected by this mode Using the 16 bit Arithmetic mode may give many advantages from a general system point of view Ability to implement a 16 bit exact algor...

Page 22: ...instruction is effectively executed in one clock cycle Previously such functionality was achieved in two cycles for example cmp a b tlt a b Note This example differs from the MAX functionality only in the status update The MAXM instruction can be used to find the largest number in an array of values in N 10 clock cycles cycles move DATA_POINTER r0 1 clr b x r0 a 1 3 interlock rep n 5 maxm a b x r0...

Page 23: ... deserves special attention as it can effectively replace several instructions in many common algorithms The NORMF instruction arithmetically shifts the data from the destination accumulator D in the direction and amount specified by the first operand C If C 0 then D is arithmetically shifted to the left by C bits If C 0 then D is arithmetically shifted to the right by C bits The operand C should ...

Page 24: ... maximizing calculation accuracy implement floating point routines normalizing data blocks and more For example consider the following routine for efficiently normalizing a data block The first pass finds the normalization factor using MAXM and CLB and the second pass performed the normalization itself NORMALIZING A DATA BLOCK X base base address of un normalized data Y base base address of normal...

Page 25: ...mulator These instructions are summarized in Table 2 5 The EXTRACT U and INSERT instruction use a control operand C that specifies the bit field to be extracted or inserted The bit field Table 2 5 Bit manipulation instructions Mnemonic Function Operands EXTRACT Extract a bit field C S D C Source field position width S Data source D Data Destination 6 bit immediate or X0 X1 Y0 Y1 A1 B1 A B A B EXTR...

Page 26: ...rithmetic operations if the operands are longer than standard accumulator size Using these instructions can help achieve enhanced precision with minimum software overhead The examples below relate to the DSP56300 core register size 24 bits for data registers 56 bits for an accumulator but can be adapted for the DSP56600 core by changing the register size accordingly The normal ADD and SUB instruct...

Page 27: ... 48x48 bit multiplication with 48 bit result first operand X1 X0 second operand Y1 y0 result is in accumulator A mpyuu x0 y0 a x0 u y0 u a dmacsu y1 x0 a a 24 y1 s x0 u a macsu x1 y0 a a x1 s y0 u a dmacss x1 y1 a a 24 x1 s y1 s a Figure 2 2 48 48 bit Multiplication with 48 Bits of the Result Kept 0 23 0 23 0 23 0 23 0 47 0 47 0 47 0 47 0 47 X0 X1 Y0 Y1 X0 u Y0 u Y1 s X0 u X1 s Y0 u X1 s Y1 s Resu...

Page 28: ...AIGHT FORWARD INSTRUCTIONS The rich instruction set includes many instructions that are in fact combinations of smaller atomic operations Among these instructions are ADDL ADDR MAX EXTRACT INSERT MACR and MPYR A good example of using some of these less straight forward instructions is the SQROOT routine The following is a straight forward implementation of that routine sqroot determine 2nd term an...

Page 29: ...causing some reduction in total cycle count sqroot determine 2nd term and add contribution asr a 40 y1 sub y1 a 80 x1 a L_Temp1 sub x1 a a1 x0 a L_Temp1 x0 swTemp determine 3rd term and add contribution mpy x0 x0 b b swTemp 2 addr a b b1 x1 x1 swTemp2 b L_Temp0 determine 4th term and add contribution mpy x0 x1 a 70 y1 a swTemp x swTemp2 addr b a a1 y0 y0 swTemp3 a L_Temp1 determine partial 5th ter...

Page 30: ...ssembly implementation of the main loop of the code may look like this move MEMORY_AREA r0 clr a 100 b move x r0 x0 _LOOP_TOP add x0 a x r0 x0 sub 1 b tst b jne _LOOP_TOP Using hardware looping this code looks like move MEMORY_AREA r0 clr a x r0 x0 do 100 _LOOP_END add x0 a x r0 x0 _LOOP_END There is more to hardware loops than easy programming The loop control hardware is optimized for maximum pi...

Page 31: ...gh level for loops are normally implemented in assembly with the DO instruction The instructions DO FOREVER and BRKcc break on condition may be used to implement high level while or repeat loops efficiently The following example is a wave generator that sends data to a peripheral Host Interface HI08 in this example until a hardware interrupt IRQA sets a flag signalling the end of the loop The core...

Page 32: ...rand A1 Note Both ENDDO and BRKcc have sequence restrictions as shown in the DSP56300 and DSP56600 Family Manuals Appendix B 3 2 THE HARDWARE STACK The DSP56300 DSP56600 hardware stack enables the user to nest DO loops and subroutines called by software or interrupts with no software overhead With the Stack Extension enabled the hardware stack can accommodate an unlimited nesting level of DO loops...

Page 33: ...structions on the stack Some instructions update other registers as well For complete information on an instruction refer to Appendix A in the DSP56300 and DSP56600 Family Manuals Table 3 1 Implicit Stack Activity Activity Triggered by Instruction or Condition Implicit Stack Actions Taken jump to subroutine JSR BSR JScc BScc condition true JSCLR BSCLR condition true JSSET BSSET condition true SP S...

Page 34: ...ave a subroutine call push data into the stack Had IRQA been a long interrupt another push would have been done the saved values being SSH 529 PC and SSL C18300 SR The different values of the LF and FV bits in SR are saved as the nesting proceeds no loop finite loop infinite loop exit DO loop immediately BRKcc condition true PC LA 1 SR SSL SP SP 1 LA SSH LC SSL SP SP 1 Figure 3 1 State of the Stac...

Page 35: ...p after normal loop termination sp 1 rts after execution SP 0 execution returns to main Direct user access with the MOVEC instruction is possible to SSL SSH Note that MOVEC to from SSH implicitly increments or decrements SP while the same instruction on SSL has no effect on SP A manual pop operation will usually have the format movec ssl destination 1 movec ssh destination 2 implicit sp decrement ...

Page 36: ...If the stack extension is disabled the values of SP are bounded to 0 15 and selection of other values cause a stack error exception When the stack extension is enabled SP may hold values from 0 up to the value stored in SZ A push increments SP by 1 a pop decrements it by 1 SZ stores the maximum stack depth During stack extension operation if SP becomes greater than SZ a stack overflow exception oc...

Page 37: ...ed connection between the values of SP and SC EP holds the pointer to the data memory location where the extension is stored The address space X or Y is selected by setting the XYS bit in the OMR EP has no default value and should be initialized by the user Each push that activates the extension causes two memory writes after which EP is incremented by 2 since one stack entry is composed of 2 word...

Page 38: ...eset initial state EXTEN_START equ 1024 start address of stack extension in data area MEM_SIZE equ 512 stack ext size in data area maximum stack size hardware extension in units of two 24 bit words STACK_LIMIT equ MEM_SIZE 2 14 move EXTEN_START ep set ext pointer in data memory move STACK_LIMIT sz set stack limit bset M_XYS omr select y space bset M_SEN omr enable stack extension Note The stack ex...

Page 39: ...ove T1_task_reg_area r7 Load pointer move x0 x r7 Save registers move r6 x r7 Save registers move x OS_temp r0 Pull r7 move r0 x r7 Save r7 move n0 x r7 Save n0 move n1 x r7 Save n1 move lc x r7 move la x r7 3 At this point all the registers were saved as a mirror of the T1 task but the stack has some data in it that belongs to the T1 task as well This data should also be copied to some memory are...

Page 40: ...sp Restore SP move x r0 ep Restore EP move 2 sc reset Stack Counter rep 14 move ssh x OS_dummy move x OS_sc_temp sc Restore sc move x OS_r7_temp r7 Restore r7 7 The last thing the Operating System dispatcher should do is to execute an RTI instruction which will give control back to the new task T2 Activate T2 rti 3 5 CONDITIONAL DALU INSTRUCTIONS The DSP56300 600 instruction set has a group of ari...

Page 41: ... moves exclude each other The options the user has to modify these instructions are summarized in Table 3 4 Another data changing instruction that could be executed conditionally is Tcc transfer on condition This instruction could also be used to transfer AGU registers conditionally On the other hand it does not have a parallel move option see Appendix A in the DSP56300 and DSP56600 Family Manuals...

Page 42: ... level code line and its translation in assembly listed below if X0 A X1 A then A A y0 b b y0 cmp X0 a x r0 x1 test for X0 a parallel field used to set X1 for next cmp cmp X1 a IFgt U test for X1 a only if the last condition was true add y0 a IFgt A A y0 B B y0 only if both conditions were true add y0 b IFgt 3 6 PC RELATIVE INSTRUCTIONS Many of the DSP56300 control instructions require a program l...

Page 43: ...ctions that use PC relative arguments Note The instructions LRA load PC relative address and LUA load effective address can be used to calculate and load PC relative address or an effective address respectively LRA is a very efficient and common way for a program to monitor directly the PC value during runtime Note The DSP56600 does not include the complete set of PC Relative instructions like the...

Page 44: ... clear set destination JSCLR JSSET BSCLR BSSET DO loop last address DO DOR lock unlock cache sector address in sector PLOCK PUNLOCK PLOCKR PUNLOCKR calculate and load absolute address effective address LUA calculate and load PC relative address absolute address1 or disp register LRA move from to program memory program memory source dest MOVEM addr 64 Table 3 5 Instructions with Program Memory Argu...

Page 45: ...ample using PC relative addressing cmp x0 a bne _CONT1 inc b _CONT1 rts After assembly the labels _CONT and _CONT1 have definite values In the first example if _CONT 4095 then the Assembler must use the 2 word opcode placing the value of _CONT as the second word _CONT1 however has the value of 2 therefore fitting into the 1 word opcode version of the instruction Bcc Branch on Condition Furthermore...

Page 46: ... flush or stall the pipeline behaves as if the two interrupt words were originally part of the program sequence Long interrupts If one of the instructions is of a change of flow type execution cost is much higher Normally the minimum activity includes a jump to a subroutine which is relatively a long instruction since it pushes data to the stack Normally at the subroutine end there is a correspond...

Page 47: ...D essi0 receive data interrupt movep x M_RX0 x M_HTX from ESSI receive register to host interface transmit register occupies 2 words org p I_SI0TD essi0 transmit data interrupt movep x M_HRX x M_TX0 from host interface receive register to ESSI transmit register occupies 2 words In a second example provided below the MOVEP instruction is used with a pointer It is a 1 word instruction and that leave...

Page 48: ... routine using a don t care bit in the modifier register org p I_SI0TD essi0 transmit data interrupt movep x r5 x M_TX0 r5 transmit data buffer pointer bset 22 m5 flag for data process routine using a don t care bit in the modifying register somewhere in the program org p INITIALIZE move RECIEVE_DATA_BUF r4 move RECIEVE_DATA_BUF_SIZE 1 m4 bclr 22 m4 move TRANSMIT_DATA_BUF r5 move TRANSMT_DATA_BUF_...

Page 49: ...3 20 Optimizing DSP56300 DSP56600 Applications MOTOROLA Program Control Using Fast Interrupts ...

Page 50: ...ore Transfer modes support flexible triggering for words lines and blocks This section provides some application examples for using the DMA functions It assumes basic familiarity with the DMA For detailed information about the DMA see Section 8 of the DSP56300 Family Manual Note Although may of the DMA registers can be used as general purpose registers if not otherwise used do not use the control ...

Page 51: ...ing DMA channel 0 The block size is BLOCK_SIZE data words The core uses R0 as the pointer to the data area under work and R1 to point to memory locations where the top addresses of the two memory areas are stored Modulo 1 Addressing mode is used with R1 to quickly load R0 with the block address and switch between the two memory areas In this application it is up to the user to stop processing the ...

Page 52: ...r DPR 11 highest channel priority DCON 0 continuous mode disabled DRS 00000 DMA request source don t care arbitrary value D3D 0 3 dimensional mode disabled DAM 5 3 101 destination address post increment DAM 2 0 101 source address post increment DS 3 2 00 transfer destination x memory DS 1 0 01 transfer source y memory movep 9e02d1 x M_DCR0 initialize DMA control reg and initiate first transfer jmp...

Page 53: ...evices DRAMs SRAMs and SSRAMs and has the following supporting features Programmable number of wait states Specialized address attributes pins which can be used as programmable chip selects masking address ranges and memory spaces x y or p each may support a different memory type On chip DRAM controller with programmable in page and out of page wait states and refresh control More detailed informa...

Page 54: ...ADDRequ 200 internal address of transfer destination NUM_24_Wequ 512 number of 24 bit words to read initialize BIU AAR_WORDequ MSP_8_BIT 16 8a1 AAR0 value only relevant non zero values listed BAC MSP_8_BIT 8 bits for address compare bits BNC 1000 number of address bits to compare 8 BPAC 1 packing mode enabled BYEN 1 Y data space enabled BAT 01 external access type Synchronous RAM movep AAR_WORD x ...

Page 55: ...d can transfer data to and from them thus giving the user a powerful alternative for driving peripherals Examples for interrupt driven core handling were given earlier in Section 3 Using the DMA to handle peripheral requests has the following advantages 1 Saves core MIPS because the DMA is triggered independently and transfers the data in parallel to the core 2 Frees core address registers that pr...

Page 56: ...errupts enabled DTM 101 word transfer triggered by request source DE is not disarmed at end of word trans DPR 11 highest channel priority DCON 0 continuous mode disabled DRS 01010 DMA request source ESSI0 receive data D3D 0 3 dimensional mode disabled DAM 5 3 101 destination address post increment DAM 2 0 100 source address no update DS 3 2 00 transfer destination x memory DS 1 0 00 transfer sourc...

Page 57: ...gisters SRXL SRXM SRXH The byte that is read is positioned in the 24 bit data bus accordingly the other two bytes read as zeros Therefore the three words that were read must be OR ed to give the 24 bit data The basic DMA addressing scheme needed for transmitting one 24 bit word from the DSP56300 is a block of three transfers used to write the three bytes of the original word The source address is ...

Page 58: ... implements the second option using one DMA channel for feeding the SCI transmitter Words for transmission are arranged in a data buffer As shown in Figure 4 1 on page 4 10 each word in the buffer should be written 3 times to the SCI transmit registers so that all 3 bytes are transmitted After all the buffer is transmitted the core is interrupted This addressing is defined in the 3 D Addressing mo...

Page 59: ...2 3 1 2 in DCOM bits 11 6 0 in DCOL bits 5 0 movep M_STXL x M_DDR0 destination base address SCI Transmit Low register movep TX_BUF x M_DSR0 source base address address of memory transmit data area movep COUNT0 x M_DCO0 load number of transfers before core is interrupted movep 0 x M_DOR0 offset register 0 added every word DCOL to source address movep 1 x M_DOR1 offset register 1 added every 3 words...

Page 60: ...led DRS 01111 DMA request source SCI transmit D3D 1 3 dimensional mode enabled DAM 5 3 111 dest addressing 3D with offsets DOR2 DOR3 DAM 2 0 source address 3D DOR0 DOR1 DAM 1 0 00 3D counter mode DCOH 12 bits DCOM 6 bits DCOL 6 bits DS 3 2 01 transfer destination y memory DS 1 0 00 transfer source x memory movep 4c7f84 x M_DCR0 load control register not triggered initialize SCI movep 8200 x M_SCR ...

Page 61: ...core stops its access to that block To avoid such stalls the core and the DMA should be made to work on separate memory blocks However in case requirements for overall efficiency outweigh possible stalls the programmer should still be aware of the possible DMA stall and perhaps write the loops so that the core will not access the same memory block in every clock The following loop generates an acc...

Page 62: ...f the DSP56300 core and a cacheable memory array part of the on chip Program RAM that may be used to store the cached instructions When the cache controller is disabled the Cache Enable bit in the SR is cleared the cacheable memory behaves like regular Program RAM and is accessible to the user as part of the internal program memory space When the cache controller is enabled the Cache Enable bit in...

Page 63: ...from a benchmark for FIR lattice filter the DSP56300 Family Manual Appendix C In the example each external fetch inserts 3 wait states Therefore the execute time needed for each instruction in the loop is 4 cycles 1 cycle for execution and 3 wait states for the instruction that is being fetched in parallel In other words due to the pipelining the wait states of an instruction stalls the execution ...

Page 64: ...ode section will be executed again e g if it was a part of a subroutine then it will be all hits and run according to the 3N 10 formula as if it were in the internal memory There is no penalty for a cache miss above the needed wait states associated with the external access itself All cache operations are done in parallel to program execution without any performance cost 5 1 1 Cache Sectors A chip...

Page 65: ...number of sectors keeping fragments of unused sector space at a minimum 5 1 2 Control of Sector Allocation Allocation of sectors is done automatically they are allocated as instructions are fetched If an instruction does not have a sector with a fitting tag it is a sector miss If a sector is available it s tag register is written with the tag field of the instruction s address and the instruction ...

Page 66: ...its tag register Selecting a sector for allocation from one of the available unlocked sectors is always done by the controller hardware using the LRU algorithm The PLOCK and PUNLOCK are given the address argument by using one of the regular addressing modes Absolute Address or Memory Indirect Using an Address Register The PLOCKR and PUNLOCKR use a PC relative displacement to calculate the address ...

Page 67: ...refore it will be a cache hit The instructions that were fetched during the burst cause the regular wait states as defined for that type of access The Burst mode is intended for working with DRAM external memory where an out of page access causes more wait states than an in page access In an application that uses the same DRAM for both data and program memory the program s serial flow of fetches w...

Page 68: ...atrix move MAT_B r4 3x3 matrix move MAT_X r1 1x3 output matrix move 2 m0 Modulo 3 move 8 m4 Modulo 9 move m0 m1 Modulo 3 Table 5 2 shows the example program and the relevant cycle count In the external access columns po and pi designate out of page or in page program fetches while do and di designate out of page or in page data accesses respectively Table 5 2 Cycle Count Example With and Without B...

Page 69: ...rformed during its execution An access cycle count is the number of wait states 1 The instruction s execution time is in parallel to the access cycles i11 mac x0 y0 a x r0 x0 y r4 y0 1do 1di 1po 21 2di 6 i12 macr x0 y0 a 1do 1di 1po 21 2di 1po 3pi 24 i13 nop 1do 1po 18 1do 9 i14 move a y r1 1do 1di 1po 21 2di 6 i15 nop 1pi 3 i16 nop 1pi 3 1po 3pi 18 i17 nop 1do 9 1do 9 TOTAL 257 186 i0 nop 1po 9 1...

Page 70: ...e cut in external access time achieved by using the burst mode is a net increase in performance 5 2 MEMORY SWITCH Each chip has a fixed amount of internal RAM divided between x y and p spaces This architecture allows fetching an instruction in parallel to two data moves but does not allow the use of data space for program instructions and vice versa Some members of the DSP56300 600 families suppor...

Page 71: ...Memory Internal Memory 1400 1C00 4C00 5000 1400 5C00 6000 1400 1C00 1400 Program Memory Y Memory X Memory Program Memory Y Memory X Memory Default Mode Memory Switch Mode External Memory External Memory External Memory External Memory External Memory External Memory Internal Memory Internal Memory Internal Memory Internal Memory Icache Icache AA0835 ...

Page 72: ...d be done with care the user may not use memory addresses that change their locations while executing the switch instruction at least six instructions afterwards The cache must also be disabled A proper switching sequence should be run from program memory addresses that do not change their physical mapping during the switch when the cache is disabled and without data accesses to the data areas tha...

Page 73: ... program control to that address Consult the user s manual of the chip you are using to see if the chip has a boot ROM program and if it does what boot options are available and what is the expected data format A full program listing is also provided After the bootstrap program has finished the mode bits in the OMR may be modified by the user s program and serve as general purpose flags In case of...

Page 74: ...hip both data movements and instruction fetches then the impact of these interlocks will be negligible However it is very important for the user to be familiar with and know ways to avoid those interlocks that are caused from certain dependencies between instructions and operands Note The DSP56300 and DSP56600 assemblers generate a warnings for every occurrence of a pipeline interlock These warnin...

Page 75: ...ts from previous instruction with X1, Y1. Transfer Interlock: A transfer interlock causes a single-cycle delay in the execution of the second MOVE instruction (the one that reads the accumulator or one of its parts). It is caused by moving the contents (or parts) of an accumulator that was the destination of the preceding or second preceding MOVE instruction. Example: move X R0 A Move memory to A move A1...

Page 76: ... c mpy x1 y0 B x r3 r7 A bWr mac x0 y1 B x r3 r1 B bWr dWi T1 get first index sub B A A a T1 c get second index addl A B A x r1 B a T1 a PUT c to x b mpy y1 y0 A B x r7 B dWr B c PUT a mac x1 x0 Ay r4 n4 B A dWi bWr T2 B c r4 ptr to next c sub B A x r2 x0y r6 y0 A T2 c d x0 next Wi y0 next Wr addl A B A y r1 B T2 c b update r4 A next a PUT d move x r0 AB y r7 PUT b A next a move y r4 B B next c Th...

Page 77: ...les demonstrate possible applications for this method 6 1 2 2 1 Loop Unrolling in N Array Scale routine The following code segment is used for scaling an array of N positive numbers clr A x r0 B rep N max B A x r0 A Largest value of N numbers clb A B Count leading bits of the largest number move x r1 A do N _end normf B1 A Scaling block of N numbers move x r1 AA y r4 _end The read operation of acc...

Page 78: ...f this task would be to unroll the loop while using double parallel moves move X start r0 starting address of source array in x memory move Y start r4 starting address of destination array in y memory move x r0 a read first word from source memory DO N 2 1 _end move x r0 aa y r4 read source array write previous data move x r0 aa y r4 write destination memory read next data _end move a y r4 write l...
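The read-next/write-previous pattern described above can be modelled outside the DSP. The following Python sketch is only an illustration under stated assumptions: `copy_naive` and `copy_pipelined` are hypothetical names, each list operation stands in for one MOVE, and a combined read and write is counted as one instruction slot because the text describes it as a double parallel move. DO-loop overhead and pipeline interlocks are ignored, so the slot counts are not cycle counts.

```python
def copy_naive(src):
    """One read MOVE and one write MOVE per element: 2*N instruction slots."""
    dst, slots = [], 0
    for word in src:
        acc = word          # models: read from X memory into the accumulator
        slots += 1
        dst.append(acc)     # models: write the accumulator to Y memory
        slots += 1
    return dst, slots


def copy_pipelined(src):
    """Read the next word while writing the previous one in the same slot."""
    dst, slots = [], 0
    if not src:
        return dst, slots
    acc = src[0]            # prologue: read the first word
    slots += 1
    for word in src[1:]:
        dst.append(acc)     # write previous data ...
        acc = word          # ... and read next data (one assumed parallel move)
        slots += 1
    dst.append(acc)         # epilogue: write the last word
    slots += 1
    return dst, slots
```

Under this model an N-word copy needs N + 1 slots instead of 2N, which is the saving the unrolled loop is after. The real gain depends on the interlocks and loop overhead that the manual discusses.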

Page 79: ...cles move var_a r4 a array in Y memory space move var_b r0 b array in X memory space move var_c x0 constant to add do N _1Loop handle Y array move y r4 a read data word add x0 a add constant move a y r4 store result and increment pointer _1Loop do N _2Loop handle X array move x r0 a read data word add x0 a add constant move a x r0 store result and increment pointer _2Loop By combining the two loop...

Page 80: ...struction will be delayed by one instruction cycle. Example: Tcc A1 B r0 r1 (Tcc instruction; R1 is conditional destination), move r1 x0 (R1 is source operand). An Address Generation Interlock is caused by a MOVE instruction that uses one of the AGU registers (R0-R7) for address generation while one of the three preceding instructions used one of the register-set members (Ri, Ni, or Mi) as a destination register. Ex...

Page 81: ...ions inside the sequence such that the new sequence of instructions will not generate interlock cycles. Code reordering is illustrated by the following example: move 1 r1; move 3 r2; move 50 y0; move table r0; move x r0 x0. In the above example, 3 address-generation pipeline interlock cycles are added to the execution of the last instruction. By reordering the instructions in that code, however ...
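The effect of reordering can be checked with a small model. The Python sketch below is a simplified illustration, not the assembler's actual interlock detection. It assumes that an address-register use stalls for 3, 2, or 1 cycles when the matching Ri, Ni, or Mi register was written 1, 2, or 3 instructions earlier. The 3-cycle case matches the example above; the 2- and 1-cycle cases are an assumption. The helper names and the reordered sequence are hypothetical; the manual's own reordering is not visible in this extract. Instruction labels are copied as they appear in the text.

```python
def agu_set(reg):
    """Return the AGU register-set index for r/n/m registers, else None."""
    if len(reg) == 2 and reg[0] in "rnm" and reg[1] in "01234567":
        return int(reg[1])
    return None


def interlock_cycles(program):
    """Sum the assumed stall cycles for a list of (label, written, addressed)."""
    total = 0
    for i, (_, _, addressed) in enumerate(program):
        used = {agu_set(r) for r in addressed} - {None}
        worst = 0
        for distance in (1, 2, 3):          # the three preceding instructions
            if i - distance < 0:
                break
            written = {agu_set(r) for r in program[i - distance][1]} - {None}
            if used & written:
                worst = max(worst, 4 - distance)   # assumed penalty: 3, 2, 1
        total += worst
    return total


# (label, registers written, registers used for address generation)
ORIGINAL = [
    ("move 1 r1", ["r1"], []),
    ("move 3 r2", ["r2"], []),
    ("move 50 y0", ["y0"], []),
    ("move table r0", ["r0"], []),
    ("move x r0 x0", ["x0"], ["r0"]),
]

# One possible reordering: load r0 first so three instructions separate
# the write of r0 from its use as an address.
REORDERED = [ORIGINAL[3], ORIGINAL[0], ORIGINAL[1], ORIGINAL[2], ORIGINAL[4]]
```

With these assumptions `interlock_cycles(ORIGINAL)` gives 3 and `interlock_cycles(REORDERED)` gives 0, at no cost in instruction count.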

Page 82: ...chip hardware stack has only two words inside. The stack is declared as "stack empty" and any additional pop operation will activate the stack extension mechanism. 6.3.2 Avoiding Stack Extension Delays: The best way to avoid stack extension delays is to make sure that the number of stack levels used during execution of critical code segments will not be larger than fourteen. If this is the case and upon...

Page 83: ...e described in detail in the Family Manual For the other cases the following legend is used I1 An address of an instruction where I2 I3 I4 are used to indicate the next instructions in the program flow MOVE any type of MOVE MOVEM MOVEP MOVEC BSET BCHG BCLR BTST JMP any type of JMP Jcc BRA Bcc JSR JScc BSR BScc JSET JCLR JSSET JSCLR BRSET BRCLR BSSET BSCLR LA the last address of a DO LOOP LA i the ...

Page 84: ...nal Instructions: Whenever I1 is a conditional change-of-flow instruction (e.g., Jcc) and the condition is false, then I2 will be delayed by 1 clock cycle. 6.4.2 Avoiding Program Flow Control Pipeline Interlocks: The common way to avoid a flow control pipeline interlock is to reorder the code or to use the locations near the end of a DO loop for some other useful instructions. Note: Some sequences are restr...

Page 85: ...t version loop reordered the main point the CMP and subsequent branch are split between two iterations execution time of one iteration condition true 7 clocks move X r0 B read first data to B cmp B x0 first compare before loop DO N 1 LoopEnd1 blt cont SR updated in previous loop iteration move r4 sub x0 b cont add b a move X r0 B read next data to B cmp B A LoopEnd1 cmp B A after SR pop new CMP is...

Page 86: ...pects of the instruction set. Please consult Appendix B of the DSP56300 and DSP56600 Family Manuals for details on the exact cycle count and word count of each instruction. 7.1 CYCLE COUNT OF AN INSTRUCTION: Most instructions execute in one clock cycle, among them most of the arithmetic instructions and the move instructions. But some instructions need more clock cycles to execute ...

Page 87: ...ve 9 n0 move 9 n4 do N _loop move x r0 x0 y r4 y0 DUP 8 mac x0 y0 a x r0 x0 y r4 y0 ENDM mac x0 y0 a x r0 n0 x0 y r4 n4 y0 mac x0 y0 a x r1 x1 _loop. 7.1.2 Replacing Jumps with Conditional Execution Instructions: The various JUMP and BRANCH instructions are multi-cycle instructions that need several cycles to decode before actually branching to the target. When code needs to branch to certain lo...

Page 88: ...he target is not taken. It is advised to choose the exact condition of the JUMP such that in most cases the target will be taken. Example 7-1, First Example, Original Code with Conditional Branch: tst a; bgt _else; add x0 b; bra _endif; _else add y0 b; _endif. Example 7-2, First Example, Code with Conditional Branch Replaced by Conditional Execution Opcodes (IFcc): tst a; add x0 b ifle; add y0 b ifgt. Example 7-3, Se...

Page 89: ...be saved tst a bge frequent_code rare_error frequent_code Another example is the implementation of a CASE structure or an FSM Finite State Machine in code switch a case 0 a 2 break case 4 a b break case 9 a a break default a x0 The straight forward implementation would be tst a beq _case_0 cmp 4 a beq _case_4 cmp 9 a beq _case_9 _default add x0 a bra _end_case _case_0 add 2 a bra _end_case _case_4...

Page 90: ...count of an instruction may depend upon the specific addressing mode used with this instruction. It is essential that the user recognize these addressing modes in order to decrease the cycle count of the entire application. 7.2.1 Single-Cycle Addressing Modes: Many addressing modes, especially in the MOVE instructions, are single cycle. Some addressing modes add an additional cycle to the execution...

Page 91: ... without a significant increase in code length. 7.2.3 Short Immediate Mode: There are some MOVE instructions that permit the specification of immediate data numbers in a small range, so that a second word is not required. Example: move 5 r0. This instruction executes in one clock cycle. This makes it possible to initialize registers without executing 2-word, 2-cycle instructions. 7.2.4 Short Immediate Oper...

Page 92: ... 4TH_R If MC MB MA 011 goto 4th routine dc 5TH_R If MC MB MA 100 goto 5th routine dc 6TH_R If MC MB MA 101 goto 6th routine dc 7TH_R If MC MB MA 110 goto 7th routine dc 8TH_R If MC MB MA 111 goto 8th routine 7 2 6 Word Count Some instructions have single word versions that should be used when possible It is advisable to consult the Family Manuals for details on the word count of the various instru...
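The table of `dc` entries listed above selects a routine directly from the three mode bits instead of testing each value in turn. A minimal Python sketch of the same idea follows. It is an illustration only: the routine bodies are placeholders, `make_routine` and `dispatch` are hypothetical names, and the first three table entries (for 000, 001, and 010), which are cut off in this extract, are assumed to follow the same pattern as the five that are visible.

```python
def make_routine(n):
    """Build a placeholder handler standing in for the n-th routine."""
    def routine():
        return f"routine {n}"
    return routine


# Entry i holds the handler for MC:MB:MA == i, like the dc table of addresses.
JUMP_TABLE = [make_routine(n) for n in range(1, 9)]


def dispatch(mc, mb, ma):
    """Index the table with the 3-bit value MC:MB:MA and call the handler."""
    index = (mc << 2) | (mb << 1) | ma
    return JUMP_TABLE[index]()
```

For example, `dispatch(0, 1, 1)` selects the fourth routine and `dispatch(1, 1, 1)` the eighth, with a single indexed lookup in place of a chain of compares and branches.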

Page 93: ...n this code, two data arrays were put into the same memory space while the code had to access an item from each array, one after the other. Instead, if one of the arrays can be put into the other data memory space (Y in this example), then the two items can be accessed in the same instruction: move x r0 x0 y r4 y0. 7.4.2 Using the TFR Instructions: The TFR instruction is unique in giving the ability to comb...

Page 94: ...ter or accumulator in the code. Optimization can be accomplished in this area also. Example: move r1 r0; move 0 a; move y0 a0. This example can be optimized by using the CLR instruction and by combining a move instruction with the CLR into a parallel opcode: clr a r1 r0; move y0 a0. Another example: add x0 a; clr b; move y0 b0. This can be optimized by: add x0 a 0 b; move y0 b0 ...

Page 96: ...The WAIT instruction turns off most of the core and chip logic until one of the following events occurs: An interrupt request from one of the following sources: an external interrupt request pin; an interrupt request from an on-chip peripheral. Note: An interrupt request will terminate the Wait mode only if it is enabled and given the appropriate interrupt priority by programming the Interrupt Priority ...

Page 97: ... the JTAG port. Assertion of the RESET input signal. During Stop mode, the entire chip function is shut down. A common use of the Stop mode is in systems that process data on time intervals. When processing is complete for a specific interval, the chip can enter Stop mode until the next time slot. This reduces overall power consumption. Power consumption during Stop Standby Mode is almost zero in the rang...
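The saving from entering Stop mode between time slots can be estimated with a duty-cycle average. The Python sketch below is a rough illustration. The function name and all numbers in the usage note are hypothetical and are not taken from any data sheet, and the model assumes that the wake-up time from Stop mode is negligible compared with the time slot.

```python
def average_power_mw(active_mw, stop_mw, active_ms, period_ms):
    """Time-weighted average power for one processing interval.

    active_mw: power while processing (assumed constant)
    stop_mw:   power while in Stop mode (assumed constant)
    active_ms: processing time per interval
    period_ms: length of the interval (time slot)
    """
    if period_ms <= 0 or not 0 <= active_ms <= period_ms:
        raise ValueError("need 0 <= active_ms <= period_ms and period_ms > 0")
    duty = active_ms / period_ms
    return duty * active_mw + (1.0 - duty) * stop_mw
```

With hypothetical figures of 100 mW while processing, 4 mW in Stop mode, and 2 ms of work in every 8 ms slot, the average comes to 28 mW, compared with 100 mW if the chip kept running.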

Page 98: ...led by clearing the CE (Cache Enable) bit in the Status Register (SR, Bit 19) (DSP56300 only). Direct Memory Access (DMA) Controller: Each DMA channel has a special DMA Enable bit (DE) in its control register. When all these bits are cleared, the DMA controller is disabled and will not consume any power supply current (DSP56300 only). External Bus: When the application does not require access to external devices I O...

Page 100: ...emulator system. B.1 OnCE PORT FEATURES: The OnCE port is a Motorola-designed module used in DSP chips to debug application software used with the chip. The port allows non-intrusive interaction with the DSP and is accessible through the pins of the JTAG interface. The OnCE module supports a special debug environment that makes it possible to examine the contents of registers, memory, or on-chip periphe...

Page 101: ...ctions between integrated circuits after they are assembled onto a printed circuit board. Boundary scan allows a tester to observe and control signal levels at each component pin through a shift register placed next to each pin. This is important for testing continuity and determining if pins are stuck at a one or zero level. The JTAG port has the following capabilities: Perform boundary scan operatio...

Page 102: ... AT mode is enabled (by setting the ATE bit in OMR), the DSP56300/DSP56600 core reflects the addresses of internal fetches and program space moves (MOVEM) to the External Address Bus, if the Address Bus is not needed by the DSP56300/DSP56600 Core for external accesses. During an Address Trace (AT) cycle, the RD and WR strobes and the chip select (or address attribute) signals are deasserted. This guarantees th...

Page 104: ...ral part of the Motorola DSP Simulator, the code that is to be profiled is first loaded into the Simulator. The embedded profiler is activated using the Simulator's log command by specifying the p command option. To invoke the profiler, type the command: LOG P filename. Note: filename is the name of the output file into which the profile report will be written. The DSP program should be assembled using th...

Page 105: ...zed and uninitialized, and how many instruction words the program occupies. The dynamic subsection describes how many data and instruction words were moved between the DSP core and memory during execution of the DSP program. It also describes the number of instructions executed, the number of clock cycles executed, and the number of clock cycles spent on stalls and interlocks. Example C-1 depicts the ba...

Page 106: ...age report section provides a profile of how the DSP program utilizes programming aspects of the DSP instruction set architecture. For each assembly mnemonic, the number of occurrences and the percentage out of the total number of instruction occurrences is given (both static and dynamic counts). This information is displayed twice: once ordered alphabetically by mnemonic, and once ordered in descending per...

Page 107: ...
  mnemonic   static count   static %   dynamic count   dynamic %
  ...nd            13         0.19          36372         0.14
  andi             50         0.75           7357         0.03
  asl             133         1.98         866526         3.35
  asr             166         2.47         534554         2.07
Example C-4: Typical MOVE Instruction Statistics
  Parallel move instruction dynamic breakdown
  move type    single     double     L space
  unpaired     5567175    2367038    930924
  paired       4213351    8960271    379552
Example C-5: Typical Dynamic Addressing Mode Breakdown
  Dynamic addressing mode breakdown: instruction group operand modes...

Page 108: ...truction itself For each source line containing a macro invocation which resulted in expansion into more than one instruction the instructions expanded by the macro invocation are displayed in disassembly form Example C 6 depicts part of the Code Coverage Report in ASCII format Example C 6 Code Coverage Report 0164 00010D 100 100 clr B A y0 0165 00010E 1700 500 100 rep 16 0166 00010F 1700 100 1600...

Page 109: ... from which it has been invoked and the subroutines which it has invoked. For each caller/callee pair, the report provides the number of times the caller has called the callee and the number of cycles spent during those invocations. The format of the Subroutine Call Graph report follows that of the Unix gprof utility. Example C-8 on page C-7 depicts part of the Subroutine Call Graph report in ...

Page 110: ...ot been invoked during the program simulation will appear in this report as disconnected nodes Example C 8 Typical Subroutine Call Graph Report Subroutine Call Graph report speechEncoder calls 100 100 cycles 9189668 aflat calls 100 cycles 15100 9174568 flat calls 100 100 cycles 1676900 rcToCorrDpL calls 100 100 cycles 188300 vad_algorithm calls 100 100 cycles 323968 swComfortNoise calls 100 100 cy...

Page 111: ...am using the g option and execute the program on the simulator profiler. The profile report that is generated can then serve as a powerful tool for selecting the code sections to be optimized. The profiler report provides a variety of metrics which can improve the DSP programmer's understanding of the program's characteristics. Several sections of the report can be useful in applying specific debu...

Page 112: ...ion where personal injury or death may occur. Should Buyer purchase or use Motorola products for any such unintended or unauthorized application, Buyer shall indemnify and hold Motorola and its officers, employees, subsidiaries, affiliates, and distributors harmless against all claims, costs, damages, and expenses, and reasonable attorney fees arising out of, directly or indirectly, any claim of personal inju...
