Cachegrind: a cache and branch-prediction profiler
enum E { A, B, C };
enum E e;
int table[] = { 1, 2, 3 };
int i;
...
i += table[e];
This is obviously a contrived example, but the basic principle applies in a wide variety of situations.
In short, Cachegrind can tell you where some of the bottlenecks in your code are, but it can’t tell you how to fix them.
You have to work that out for yourself. But at least you have the information!
5.7. Simulation Details
This section covers details that you do not need to know in order to use Cachegrind, but that may be of interest to
some readers.
5.7.1. Cache Simulation Specifics
Specific characteristics of the cache simulation are as follows:
• Write-allocate: when a write miss occurs, the block written to is brought into the D1 cache. Most modern caches
have this property.
• Bit-selection hash function: the set of line(s) in the cache to which a memory block maps is chosen by the middle
bits M to (M+N-1) of the byte address, where:
  • line size = 2^M bytes
  • number of sets = (cache size / line size / associativity) = 2^N
• Inclusive LL cache: the LL cache typically replicates all the entries of the L1 caches, because fetching into L1
involves fetching into LL first (this does not guarantee strict inclusiveness, as lines evicted from LL still could
reside in L1). This is standard on Pentium chips, but AMD Opterons, Athlons and Durons use an exclusive LL
cache that only holds blocks evicted from L1. Ditto most modern VIA CPUs.
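The bit-selection mapping described in the list above can be sketched in C. This is a minimal illustration under stated assumptions, not Cachegrind's actual source: the helper names is_pow2, log2_pow2 and set_index are hypothetical, and the sketch assumes that both the line size and the number of sets are powers of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not taken from Cachegrind): true if x is a power of two. */
static int is_pow2(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Hypothetical helper: base-2 logarithm of a power of two. */
static unsigned log2_pow2(uint64_t x)
{
    unsigned n = 0;
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Select the set for a byte address using bits M..(M+N-1), where
   line size = 2^M bytes and number of sets = 2^N. */
static uint64_t set_index(uint64_t addr, uint64_t cache_size,
                          uint64_t assoc, uint64_t line_size)
{
    assert(assoc != 0 && is_pow2(line_size));
    uint64_t sets = cache_size / line_size / assoc;
    assert(is_pow2(sets));

    unsigned m = log2_pow2(line_size);  /* drop the M offset-within-line bits */
    return (addr >> m) & (sets - 1);    /* keep the next N bits */
}
```

For example, a 32768-byte, 8-way cache with 64-byte lines has 64 sets (M = 6, N = 6), so set_index(0x12345, 32768, 8, 64) selects set 13.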
The cache configuration simulated (cache size, associativity and line size) is determined automatically using the x86
CPUID instruction. If you have a machine that (a) doesn’t support the CPUID instruction, or (b) supports it in an early
incarnation that doesn’t give any cache information, then Cachegrind will fall back to using a default configuration
(that of a model 3/4 Athlon).
Cachegrind will tell you if this happens. You can manually specify one, two or all three levels (I1/D1/LL) of the
cache from the command line using the --I1, --D1 and --LL options. For cache parameters to be valid for simulation,
the number of sets (with associativity being the number of cache lines in each set) has to be a power of two.
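The power-of-two condition on the number of sets can be checked with a few lines of C. This is a sketch only: the function name sets_are_power_of_two is hypothetical, the requirement that the size divides evenly into sets is an added assumption, and Cachegrind itself may apply further checks that are not described here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check (not Cachegrind's own code): returns true if
   cache size / line size / associativity is a power of two. */
static bool sets_are_power_of_two(uint64_t size, uint64_t assoc,
                                  uint64_t line_size)
{
    if (assoc == 0 || line_size == 0)
        return false;

    /* Assumption of this sketch: the cache must divide evenly into sets. */
    if (size % (assoc * line_size) != 0)
        return false;

    uint64_t sets = size / line_size / assoc;
    return sets != 0 && (sets & (sets - 1)) == 0;
}
```

Note that the cache size itself need not be a power of two: a 49152-byte, 12-way cache with 64-byte lines has 64 sets and passes the check, whereas the same size with 8-way associativity gives 96 sets and does not.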
On PowerPC platforms Cachegrind cannot automatically determine the cache configuration, so you will need to specify
it with the --I1, --D1 and --LL options.
Other noteworthy behaviour: