Cachegrind: a cache and branch-prediction profiler
• References that straddle two cache lines are treated as follows:
• If both blocks hit --> counted as one hit
• If one block hits, the other misses --> counted as one miss.
• If both blocks miss --> counted as one miss (not two)
• Instructions that modify a memory location (e.g.
inc
and
dec
) are counted as doing just a read, i.e. a single data
reference. This may seem strange, but since the write can never cause a miss (the read guarantees the block is in
the cache) it’s not very interesting.
Thus it measures not the number of times the data cache is accessed, but the number of times a data cache miss
could occur.
If you are interested in simulating a cache with different properties, it is not particularly hard to write your own cache
simulator, or to modify the existing ones in
cg_sim.c
. We’d be interested to hear from anyone who does.
5.7.2. Branch Simulation Specifics
Cachegrind simulates branch predictors intended to be typical of mainstream desktop/server processors of around
2004.
Conditional branches are predicted using an array of 16384 2-bit saturating counters.
The array index used for a
branch instruction is computed partly from the low-order bits of the branch instruction’s address and partly using the
taken/not-taken behaviour of the last few conditional branches.
As a result the predictions for any specific branch
depend both on its own history and the behaviour of previous branches. This is a standard technique for improving
prediction accuracy.
For indirect branches (that is, jumps to unknown destinations) Cachegrind uses a simple branch target address
predictor. Targets are predicted using an array of 512 entries indexed by the low order 9 bits of the branch instruction’s
address.
Each branch is predicted to jump to the same address it did last time.
Any other behaviour causes a
mispredict.
More recent processors have better branch predictors, in particular better indirect branch predictors.
Cachegrind’s
predictor design is deliberately conservative so as to be representative of the large installed base of processors which
pre-date widespread deployment of more sophisticated indirect branch predictors. In particular, late model Pentium
4s (Prescott), Pentium M, Core and Core 2 have more sophisticated indirect branch predictors than modelled by
Cachegrind.
Cachegrind does not simulate a return stack predictor.
It assumes that processors perfectly predict function return
addresses, an assumption which is probably close to being true.
See Hennessy and Patterson’s classic text "Computer Architecture: A Quantitative Approach", 4th edition (2007),
Section 2.3 (pages 80-89) for background on modern branch predictors.
5.7.3. Accuracy
Valgrind’s cache profiling has a number of shortcomings:
• It doesn’t account for kernel activity -- the effect of system calls on the cache and branch predictor contents is
ignored.
• It doesn’t account for other process activity. This is probably desirable when considering a single program.
90