Callgrind: a call-graph generating cache and branch prediction profiler
Use --auto=yes to get annotated source code for all relevant functions for which the source can be found. In
addition to source annotation as produced by cg_annotate, you will see the annotated call sites with call counts.
For all other options, consult the (Cachegrind) documentation for cg_annotate.
For a better call graph browsing experience, it is highly recommended to use KCachegrind. If your code has a
significant fraction of its cost in cycles (sets of functions calling each other in a recursive manner), you have
to use KCachegrind, as callgrind_annotate currently does not do any cycle detection, which is important to get
correct results in this case.
If you are additionally interested in measuring the cache behavior of your program, use Callgrind with the option
--cache-sim=yes. For branch prediction simulation, use --branch-sim=yes. Expect a further slowdown by a factor of
approximately 2.
If the program section you want to profile is somewhere in the middle of the run, it is beneficial to fast forward
to this section without any profiling, and then enable profiling. This is achieved by using the command line option
--instr-atstart=no and running, in a shell:
callgrind_control -i on
just before the interesting code section is executed. To exactly specify the code position where profiling should
start, use the client request CALLGRIND_START_INSTRUMENTATION.
If you want to be able to see assembly code level annotation, specify --dump-instr=yes. This will produce
profile data at instruction granularity. Note that the resulting profile data can only be viewed with KCachegrind.
For assembly annotation, it is also interesting to see more details of the control flow inside functions, i.e.
(conditional) jumps. This will be collected by further specifying --collect-jumps=yes.
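If you script your profiling runs, the options discussed above can be assembled programmatically. The following is a minimal sketch, not part of Callgrind itself: the helper name callgrind_argv and its keyword parameters are hypothetical, and ./myprog is a placeholder for your own program. Only the option strings themselves come from the text above.

```python
from typing import List, Sequence


def callgrind_argv(
    program: Sequence[str],
    *,
    cache_sim: bool = False,
    branch_sim: bool = False,
    instr_atstart: bool = True,
    dump_instr: bool = False,
    collect_jumps: bool = False,
) -> List[str]:
    """Build a valgrind command line for a Callgrind run (hypothetical helper)."""
    argv = ["valgrind", "--tool=callgrind"]
    if cache_sim:
        argv.append("--cache-sim=yes")       # also simulate cache behavior
    if branch_sim:
        argv.append("--branch-sim=yes")      # also simulate branch prediction
    if not instr_atstart:
        argv.append("--instr-atstart=no")    # start with profiling switched off
    if dump_instr:
        argv.append("--dump-instr=yes")      # profile data at instruction granularity
    if collect_jumps:
        argv.append("--collect-jumps=yes")   # record (conditional) jumps
    # The program and its arguments always come last.
    return argv + list(program)


if __name__ == "__main__":
    # "./myprog" is a placeholder; substitute your own binary and arguments.
    print(" ".join(callgrind_argv(["./myprog"], dump_instr=True, collect_jumps=True)))
```

The resulting list can be passed to subprocess.run as is. Keeping the options as keyword flags makes it harder to request --collect-jumps=yes while forgetting --dump-instr=yes when both are wanted for assembly annotation.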
6.2. Advanced Usage
6.2.1. Multiple profiling dumps from one program run
Sometimes you are not interested in characteristics of a full program run, but only of a small part of it, for example
execution of one algorithm. If there are multiple algorithms, or one algorithm running with different input data, it
may even be useful to get different profile information for different parts of a single program run.
Profile data files have names of the form
callgrind.out.pid.part-threadID
where pid is the PID of the running program, part is a number incremented on each dump (".part" is skipped for the
dump at program termination), and threadID is a thread identification ("-threadID" is only used if you request dumps
of individual threads with --separate-threads=yes).
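To sort or group the dumps of one run, it can help to split a file name back into its components. The following is a minimal sketch under the assumption that the default naming scheme described above is in use; the names DumpName and parse_dump_name are hypothetical and not part of any Callgrind tool.

```python
import re
from typing import NamedTuple, Optional

# Default scheme: callgrind.out.<pid>[.<part>][-<threadID>]
_DUMP_NAME = re.compile(r"callgrind\.out\.(\d+)(?:\.(\d+))?(?:-(\d+))?")


class DumpName(NamedTuple):
    pid: int
    part: Optional[int]       # None for the dump written at program termination
    thread_id: Optional[int]  # None unless per-thread dumps were requested


def parse_dump_name(name: str) -> Optional[DumpName]:
    """Split a profile data file name into (pid, part, thread_id).

    Returns None if the name does not follow the default scheme.
    """
    match = _DUMP_NAME.fullmatch(name)
    if match is None:
        return None
    pid, part, thread_id = match.groups()
    return DumpName(
        int(pid),
        int(part) if part is not None else None,
        int(thread_id) if thread_id is not None else None,
    )


if __name__ == "__main__":
    for example in ("callgrind.out.1234", "callgrind.out.1234.2", "callgrind.out.1234.2-01"):
        print(example, "->", parse_dump_name(example))
```

Parsing rather than constructing names avoids guessing how Callgrind pads the numeric fields: any run of digits is accepted for each component.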
There are different ways to generate multiple profile dumps while a program is running under Callgrind’s supervision.
Nevertheless, all methods trigger the same action, which is "dump all profile information since the last dump or
program start, and zero cost counters afterwards".
To allow for zeroing cost counters without dumping, there is a
second action "zero all cost counters now". The different methods are:
• Dump on program termination. This method is the standard way and doesn’t need any special action on your part.