PowerQUICC III Performance Monitors, Rev. 2

Freescale Semiconductor

Examples

Figure 2. Cache Example

The code used for this example is compiled with -O2 optimization. As shown in Figure 2, it executes a constant number of instructions, 0x0147_3F03. Based on the total time it takes to execute the application, a comparison can easily be made between four configurations: caches disabled (at the far right of the figure), only L1 caches enabled, L1 and L2 caches enabled, and L1 and L2 caches plus the Branch Prediction Unit (BPU) enabled.
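The comparison can be reproduced from the raw counter values. The following Python sketch takes the instruction count quoted above and the four core-cycle counts reported in Figure 2 (CE:Ref:1), then derives instructions per cycle (IPC) and the slowdown relative to the fastest configuration. The function names and the dictionary layout are illustrative assumptions, not part of the application note.

```python
# Counter values copied from Figure 2 (reported there in hexadecimal).
INSTRUCTIONS_COMPLETED = 0x01473F03  # CE:Ref:2, the same in every scenario

# CE:Ref:1 core cycles, keyed by the column labels used in Figure 2.
CORE_CYCLES = {
    "L2 & BPU enabled": 0x11E73D8,
    "L2 enabled": 0x1425A5E,
    "L2 disabled": 0x145D091,
    "Caches disabled": 0xC236A2D,
}


def ipc(instructions: int, cycles: int) -> float:
    """Return instructions per cycle for one run."""
    if cycles <= 0:
        raise ValueError("cycle count must be positive")
    return instructions / cycles


def summarize(instructions: int, cycles_by_scenario: dict) -> dict:
    """Map each scenario to (cycles, IPC, slowdown versus the fastest scenario)."""
    fastest = min(cycles_by_scenario.values())
    return {
        name: (cycles, ipc(instructions, cycles), cycles / fastest)
        for name, cycles in cycles_by_scenario.items()
    }


if __name__ == "__main__":
    for name, (cycles, value, slowdown) in summarize(
        INSTRUCTIONS_COMPLETED, CORE_CYCLES
    ).items():
        print(f"{name:18s} cycles={cycles:>11,d}  IPC={value:.4f}  slowdown={slowdown:.2f}x")
```

With these inputs the computed IPC values agree with the IPC row of Figure 2, roughly 1.14 with L2 and BPU enabled and roughly 0.105 with caches disabled.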

Due to the limited number of core performance counters, the application must be run twice for each scenario (eight runs in total) to collect all the data represented in Figure 2. Through analysis such as this, it is possible to tune L2 cache usage as a data-only cache, an instruction-only cache, or a unified cache.
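The two-runs-per-scenario figure is consistent with the counter budget: the e500 core provides four performance monitor counters (PMC0 to PMC3), while Figure 2 reports six core events. The sketch below splits an event list into runs of at most four events. It assumes that any listed event can be counted on any counter, which does not hold for counter-specific events; `plan_runs` is a hypothetical helper, not something defined in the application note.

```python
CORE_COUNTERS = 4  # PMC0..PMC3 on the e500 core

# Core events that appear in Figure 2.
FIGURE_2_CORE_EVENTS = [
    "CE:Ref:2",   # instructions completed
    "CE:Ref:1",   # core cycles
    "CE:Com:60",  # instruction L1 cache reloads
    "CE:Com:41",  # data L1 cache reloads
    "CE:Com:9",   # loads completed
    "CE:Com:10",  # stores completed
]


def plan_runs(events, counters=CORE_COUNTERS):
    """Split events into consecutive groups that each fit the available counters."""
    if counters < 1:
        raise ValueError("at least one counter is required")
    return [events[i:i + counters] for i in range(0, len(events), counters)]


if __name__ == "__main__":
    scenarios = 4  # the four cache/BPU configurations compared in Figure 2
    runs = plan_runs(FIGURE_2_CORE_EVENTS)
    for number, run in enumerate(runs, start=1):
        print(f"run {number}: {', '.join(run)}")
    print(f"{len(runs)} runs per scenario, {len(runs) * scenarios} runs in total")
```

In practice it can be useful to repeat a reference event such as CE:Ref:1 in every run, so that the runs can be checked against each other; the simple grouping above does not do this.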

6.2 Example: Branch Prediction

As seen in Figure 2, enabling branch prediction can significantly increase performance. In this application, it reduced the core cycle count from 21,125,726 to 18,772,952 and raised IPC from 1.0152 to 1.1424. It is also possible to collect more detailed information about the branch target buffer (BTB).

Figure 3. BPU Miss Rate

This example uses the same code as the example in Section 6.1, “Example: Cache Performance.” L1 and L2 caches are enabled, as is the BPU. By collecting data on branches finished and branch hits, it is possible to calculate a branch miss rate for this particular application.
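As a worked instance of that calculation, the Python sketch below uses the two counter values reported in Figure 3: branches finished (CE:Com:12) and branch hits (CE:Com:17). It treats the difference between the two counts as the number of missed branches. The function name is illustrative.

```python
# Counter values copied from Figure 3 (reported there in hexadecimal).
BRANCHES_FINISHED = 0x73CD6  # CE:Com:12
BRANCH_HITS = 0x73CBF        # CE:Com:17


def branch_miss_rate(finished: int, hits: int) -> float:
    """Return the fraction of finished branches that were not hits."""
    if finished <= 0:
        raise ValueError("branches finished must be positive")
    if not 0 <= hits <= finished:
        raise ValueError("branch hits must lie between 0 and branches finished")
    return (finished - hits) / finished


if __name__ == "__main__":
    misses = BRANCHES_FINISHED - BRANCH_HITS
    rate = branch_miss_rate(BRANCHES_FINISHED, BRANCH_HITS)
    print(f"{misses} misses in {BRANCHES_FINISHED:,} branches: {rate:.4%}")
```

For these counts the result is 23 misses out of 474,326 finished branches, which matches the miss rate of about 0.0048% shown in Figure 3.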

6.3 Example: DDR Performance

It may be desirable to determine the performance of the DDR controller and possibly optimize parameters.
This example illustrates the impact of tweaking the BSTOPRE field.

Figure 2 data (counter values in hexadecimal). Columns, left to right: L2 & BPU enabled, L2 enabled, L2 disabled, Caches disabled. Rows with fewer than four values are listed as they appear in the source.

Instructions completed (CE:Ref:2): 1473f03
Core cycles (CE:Ref:1): 11e73d8, 1425a5e, 145d091, c236a2d
Instruction L1 cache reloads (CE:Com:60): 4aa
Data L1 cache reloads (CE:Com:41): 351a, 3523, 3522
Loads completed (CE:Com:9): 2f5588
Stores completed (CE:Com:10): 17a750
Instruction accesses to L2 that hit (SE:Ref:22 (0x1): 440
Instruction accesses to L2 that miss (SE:2:59 (0x7b)
Data accesses to L2 that hit (SE:Ref:23 (0x1): 2ed1, 2ed3
Data accesses to L2 that miss (SE:4:57 (0x79): 3732, 372d
L1 instruction miss ratio: 2.93756E-06, 5.56737E-05
L1 data miss ratio: 0.0029, #DIV/0!
L2 miss ratio: 0.542089985, 0.520377095, #DIV/0!, #DIV/0!
Total miss ratio: 0.001584798, 0.001536049, #DIV/0!, #DIV/0!
Total hit ratio: 99.8415%, 99.8464%, #DIV/0!, #DIV/0!
Core cycles (decimal): 18,772,952, 21,125,726, 21,352,593, 203,647,533
IPC: 1.1424E+00, 1.0152E+00, 1.0044E+00, 1.0531E-01

Figure 3 data (counter values in hexadecimal).

Instructions completed (CE:Ref:2): 1473f09
Branches finished (CE:Com:12): 73cd6
Branch hits (CE:Com:17): 73cbf
Branch miss rate: 0.0048%
