Callgrind: a call-graph generating cache and branch prediction profiler
--simulate-wb=<yes|no> [default: no]
Specify whether write-back behavior should be simulated, which makes it possible to distinguish LL cache misses with
and without write-backs. The cache model of Cachegrind/Callgrind does not specify write-through vs. write-back
behavior, and the choice does not affect the number of misses counted. However, with explicit write-back simulation it
can be determined whether a miss not only triggers the loading of a new cache line, but also required a dirty cache line
to be written back first. The new dirty miss events are ILdmr, DLdmr, and DLdmw, for misses caused by instruction
reads, data reads, and data writes, respectively. As each of them produces two memory transactions, a dirty miss
should be weighted with twice the time estimate of a normal miss.
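The doubled weighting can be illustrated with a small calculation. This is only a sketch: the function name and the sample counts are hypothetical, and it assumes the caller passes clean misses and dirty misses as two disjoint counts (this section does not state whether the dirty miss events are also included in the ordinary miss events).

```python
def weighted_ll_misses(clean_misses: int, dirty_misses: int) -> int:
    """Return LL misses in units of 'one memory transaction'.

    Assumption: clean_misses and dirty_misses are disjoint counts.
    A dirty miss causes a write-back plus a line load, so it counts twice.
    """
    return clean_misses + 2 * dirty_misses


# Hypothetical counts for one function: 1000 misses without a write-back
# and 250 misses that first had to write back a dirty line.
total = weighted_ll_misses(1000, 250)
print(total)
```

Multiplying the result by an assumed per-transaction latency gives a rough time estimate for the memory traffic of that function.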
--simulate-hwpref=<yes|no> [default: no]
Specify whether to simulate a hardware prefetcher that detects stream accesses in the second-level cache by tracking
accesses separately for each page. As the simulation cannot model the timing of prefetching, it is assumed that every
triggered hardware prefetch completes before the real access happens. Thus, this gives a best-case scenario that
covers all possible stream accesses.
--cacheuse=<yes|no> [default: no]
Specify whether cache line use should be collected. For every cache line, from the time it is loaded until it is evicted,
the number of accesses as well as the number of bytes actually used is determined. This behavior is attributed to the
code which triggered the loading of the cache line. In contrast to miss counters, which show the position where the
symptoms of bad cache behavior (i.e. latencies) happen, the use counters try to pinpoint the cause (i.e. the code with
the bad access behavior). The new counters are defined such that worse behavior results in higher cost. AcCost1 and
AcCost2 are counters showing bad temporal locality for L1 and LL caches, respectively. They are computed by summing
up the reciprocals of the number of accesses to each cache line, multiplied by 1000 (as only integer costs are allowed).
E.g. for a given source line with 5 read accesses, an AcCost value of 5000 means that for every access, a new cache line
was loaded and directly evicted afterwards without further accesses. Similarly, SpLoss1 and SpLoss2 show bad spatial
locality for L1 and LL caches, respectively. They give the spatial loss, i.e. the count of bytes which were loaded into
the cache but never accessed. This pinpoints code accessing data in a way such that cache space is wasted, which hints
at a bad layout of data structures in memory. Assuming a cache line size of 64 bytes and 100 L1 misses for a given
source line, the loading of 6400 bytes into L1 was triggered. If SpLoss1 shows a value of 3200 for this line, this means
that half of the loaded data was never used, or that with a better data layout, only half of the cache space would have
been needed. Please note that for cache line use counters, it is currently not possible to provide meaningful inclusive
costs. Therefore, the inclusive cost of these counters should be ignored.
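The two worked examples above can be reproduced with a few lines of arithmetic. This is a sketch rather than Callgrind's implementation: the function names are hypothetical, and it assumes the reciprocal is truncated to an integer per cache line, which the text does not specify.

```python
def ac_cost(accesses_per_line):
    """Temporal-locality cost: sum of 1000/n over cache lines.

    accesses_per_line holds, for each cache line loaded on behalf of a
    source line, the number of accesses before the line was evicted.
    Assumption: integer division per cache line.
    """
    return sum(1000 // n for n in accesses_per_line)


def spatial_loss_fraction(misses, sp_loss, line_size=64):
    """Fraction of loaded bytes that were never accessed."""
    loaded_bytes = misses * line_size
    return sp_loss / loaded_bytes


# 5 read accesses, each to a line that is evicted right after one use.
worst_case = ac_cost([1, 1, 1, 1, 1])
# The same 5 accesses all hitting one line before it is evicted.
best_case = ac_cost([5])
# 100 L1 misses with 64-byte lines load 6400 bytes; 3200 were never used.
wasted = spatial_loss_fraction(100, 3200)
print(worst_case, best_case, wasted)
```

The gap between the two AcCost results shows why a higher value indicates worse temporal locality for the same number of accesses.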
--I1=<size>,<associativity>,<line size>
Specify the size, associativity and line size of the level 1 instruction cache.
--D1=<size>,<associativity>,<line size>
Specify the size, associativity and line size of the level 1 data cache.
--LL=<size>,<associativity>,<line size>
Specify the size, associativity and line size of the last-level cache.
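As an illustration of how the three values relate, the following sketch parses an option string in the `<size>,<associativity>,<line size>` form and derives the number of cache sets. The parser and the `CacheConfig` type are hypothetical helpers, not part of Valgrind, and the example values (a 32 KiB, 8-way cache with 64-byte lines, sizes taken to be in bytes) are assumptions chosen for the example.

```python
from typing import NamedTuple


class CacheConfig(NamedTuple):
    size: int           # total capacity, assumed to be in bytes
    associativity: int  # number of ways per set
    line_size: int      # assumed to be in bytes

    @property
    def num_sets(self) -> int:
        # Each set holds `associativity` lines of `line_size` bytes.
        return self.size // (self.associativity * self.line_size)


def parse_cache_option(option: str):
    """Split e.g. '--D1=32768,8,64' into ('D1', CacheConfig(...))."""
    name, _, value = option.partition("=")
    size, assoc, line = (int(field) for field in value.split(","))
    return name.lstrip("-"), CacheConfig(size, assoc, line)


name, d1 = parse_cache_option("--D1=32768,8,64")
print(name, d1, d1.num_sets)
```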
6.4. Callgrind Monitor Commands
The Callgrind tool provides monitor commands handled by the Valgrind gdbserver (see Monitor command handling
by the Valgrind gdbserver).
• dump [<dump_hint>] requests to dump the profile data.
• zero requests to zero the profile data counters.
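One way to send these commands from a script is through the standalone `vgdb` utility that ships with Valgrind. The sketch below only builds the argument list and does not run anything; the helper name, the process ID, and the dump hint are hypothetical, and it assumes the profiled program was started with the gdbserver enabled.

```python
def vgdb_monitor_argv(pid, command, *args):
    """Build an argv list that asks vgdb to send one monitor command.

    Assumption: vgdb is on PATH and `pid` is a process running under
    Valgrind with the gdbserver enabled.
    """
    return ["vgdb", f"--pid={pid}", command, *args]


# Hypothetical PID and dump hint, for illustration only.
dump_cmd = vgdb_monitor_argv(12345, "dump", "after_init")
zero_cmd = vgdb_monitor_argv(12345, "zero")
print(dump_cmd)
print(zero_cmd)
```

The resulting lists could be passed to `subprocess.run` once a matching Valgrind process actually exists.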