Texas Instruments DSP/BIOS Real-Time Analysis, User Manual, Page: 27 / 29


SPRAA56

DSP/BIOS Real-Time Analysis (RTA) and Debugging Applied to a Video Application


Appendix A. Performance Impact

A.1 Overhead of Performance Measurement Techniques

Because most of the benchmarking APIs are called once every 30 frames, the additional CPU
load expected after adding the instrumentation is low. The measured performance of the
benchmarking techniques is given in Table 3. A spreadsheet containing the expected and actual
timing values is provided with the software distribution.

Table 3. Measured Performance of Benchmarking Techniques

Benchmark                     Execution Time   Execution Time   CPU Load        Execution Rate
                              (Avg) [instr]    (Max) [instr]                    [per N frames]
----------------------------  ---------------  ---------------  --------------  --------------
MBX check in process Task     3641             17112            0.00018205      1
LOAD module call              1182             2432             0.00000197      30
Single Call to UTL_stsStart   517              13968            0.00043945      16
Single Call to UTL_stsStop    325              488              0.00027625      16
Capture Task benchmarking     1848             15064            0.00000308      30
Display Task benchmarking     2288             7824             0.00000381333   30
Process Task benchmarking     3196             18568            0.00000532667   30
Control Task                  1533             2856             0.00007665      ?
SubTotal Load (Task bchmrk)   7332             -                0.0003666       30
SubTotal Load (UTL calls)     13472            -                0.0006736       30
Total Load of benchmarking    17357.4          -                0.00086787      1

These benchmarks are given in instructions. The CPU load of each function is calculated by
dividing its benchmark by 20M instructions per frame, which is the number of cycles available
per frame on a 600 MHz 64x device in a 30 fps NTSC system (600 MHz / 30 fps = 20M).
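
The load arithmetic can be sketched in C as follows. This is a minimal illustration, not code
from the software distribution: the function names cycles_per_frame and load_fraction are
hypothetical, and the division by the number of frames between calls is an assumption inferred
from the Table 3 rows for code that runs once every 30 frames (for example, the LOAD module
call).

```c
#include <assert.h>

/* Cycles available per video frame: CPU clock rate divided by frame rate.
 * For a 600 MHz device at 30 fps this is 20,000,000. */
double cycles_per_frame(double cpu_hz, double frames_per_sec)
{
    assert(cpu_hz > 0.0 && frames_per_sec > 0.0);
    return cpu_hz / frames_per_sec;
}

/* Fraction of the per-frame budget used by code that costs avg_instr
 * instructions per call and runs once every frames_per_call frames.
 * (Hypothetical helper; the amortization over frames_per_call is an
 * assumption inferred from Table 3, not stated in the text.) */
double load_fraction(double avg_instr, double frames_per_call, double budget)
{
    assert(frames_per_call > 0.0 && budget > 0.0);
    return avg_instr / (frames_per_call * budget);
}
```

With a budget of cycles_per_frame(600e6, 30.0), load_fraction(3641.0, 1.0, budget) gives the
0.00018205 listed for the MBX check, and load_fraction(1182.0, 30.0, budget) gives the
0.00000197 listed for the LOAD module call.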

These benchmarks were measured using UTL_stsStart and UTL_stsStop API calls bracketing
the regions of code to be benchmarked. For example, to benchmark the LOAD_getcpuload
function, the measurement code was the following:

UTL_stsStart( stsBenchmark1 );
benchVid.cpuLoad.current = LOAD_getcpuload();
UTL_stsStop( stsBenchmark1 );

This method of benchmarking allows execution time to be measured in real time. However, if an
interrupt or context switch occurs between the two UTL calls, the time spent executing the
interrupt or out-of-context code is also included in the benchmark.
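
The start/stop bracketing pattern can be sketched in portable C as shown below. This is not
the UTL or STS implementation: BenchStats, bench_start, bench_stop, and bench_average are
hypothetical names, and the timestamps are passed in by the caller so the sketch does not
depend on any particular timer. It assumes a free-running 32-bit counter. Because the stop
call records everything that elapsed since the start call, any preemption between the two
calls inflates the recorded value, which may explain part of the gap between the average
and maximum columns in Table 3.

```c
#include <stdint.h>

/* Hypothetical accumulator that keeps a count, a running total, and a
 * maximum for one bracketed region of code. */
typedef struct {
    uint32_t start;  /* timestamp captured by the most recent start call */
    uint32_t count;  /* number of completed start/stop pairs */
    uint64_t total;  /* sum of all measured intervals */
    uint32_t max;    /* longest single interval seen so far */
} BenchStats;

/* Record the timestamp at the beginning of the bracketed region. */
void bench_start(BenchStats *s, uint32_t now)
{
    s->start = now;
}

/* Close the bracket and fold the elapsed time into the statistics.
 * Unsigned subtraction gives the right interval even if the 32-bit
 * counter wrapped once between start and stop. */
void bench_stop(BenchStats *s, uint32_t now)
{
    uint32_t delta = now - s->start;
    s->count++;
    s->total += delta;
    if (delta > s->max) {
        s->max = delta;
    }
}

/* Average interval, or 0.0 if nothing has been measured yet. */
double bench_average(const BenchStats *s)
{
    return s->count ? (double)s->total / (double)s->count : 0.0;
}
```

A caller would zero-initialize a BenchStats object, then call bench_start and bench_stop
around the code of interest with the current timer value on each side.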

A.2 RTA Effects on CPU Load

The CPU load was measured with RTA debugging turned off and UTL_DBGLEVEL set to 40. With the
instrumentation turned off, the total CPU load of the application was 93% average and 95%
peak. With the instrumentation enabled, the CPU load was also 93% average and 95% peak when
using the same video content, a repeating high-motion sequence from a DVD. The benchmarking
therefore did not have a statistically significant impact on the CPU load.
