5. Cachegrind: a cache and
branch-prediction profiler
To use this tool, you must specify
--tool=cachegrind
on the Valgrind command line.
5.1. Overview
Cachegrind simulates how your program interacts with a machine’s cache hierarchy and (optionally) branch predictor.
It simulates a machine with independent first-level instruction and data caches (I1 and D1), backed by a unified
second-level cache (L2). This exactly matches the configuration of many modern machines.
However, some modern machines have three levels of cache. For these machines (in the cases where Cachegrind can
auto-detect the cache configuration) Cachegrind simulates the first-level and third-level caches. The reason for this
choice is that the L3 cache has the most influence on runtime, as it masks accesses to main memory. Furthermore, the
L1 caches often have low associativity, so simulating them can detect cases where the code interacts badly with this
cache (eg. traversing a matrix column-wise with the row length being a power of 2).
Therefore, Cachegrind always refers to the I1, D1 and LL (last-level) caches.
Cachegrind gathers the following statistics (abbreviations used for each statistic is given in parentheses):
• I cache reads (
Ir
, which equals the number of instructions executed), I1 cache read misses (
I1mr
) and LL cache
instruction read misses (
ILmr
).
• D cache reads (
Dr
, which equals the number of memory reads), D1 cache read misses (
D1mr
), and LL cache data
read misses (
DLmr
).
• D cache writes (
Dw
, which equals the number of memory writes), D1 cache write misses (
D1mw
), and LL cache
data write misses (
DLmw
).
• Conditional branches executed (
Bc
) and conditional branches mispredicted (
Bcm
).
• Indirect branches executed (
Bi
) and indirect branches mispredicted (
Bim
).
Note that D1 total accesses is given by
D1mr
+
D1mw
, and that LL total accesses is given by
ILmr
+
DLmr
+
DLmw
.
These statistics are presented for the entire program and for each function in the program. You can also annotate each
line of source code in the program with the counts that were caused directly by it.
On a modern machine, an L1 miss will typically cost around 10 cycles, an LL miss can cost as much as 200 cycles,
and a mispredicted branch costs in the region of 10 to 30 cycles.
Detailed cache and branch profiling can be very
useful for understanding how your program interacts with the machine and thus how to make it faster.
Also, since one instruction cache read is performed per instruction executed, you can find out how many instructions
are executed per line, which can be useful for traditional profiling.
5.2. Using Cachegrind, cg_annotate and
cg_merge
77