Callgrind: a call-graph generating cache and branch prediction profiler
misses
which would not have happened in reality. If you do not want to see these, start event collection a few million
instructions after you have enabled instrumentation.
6.2.3. Counting global bus events
When threads in multithreaded code access shared data, synchronization is required to avoid race
conditions. Synchronization primitives are usually implemented via atomic instructions. However, excessive use
of such instructions can lead to performance issues.
To enable analysis of this problem, Callgrind can optionally count the number of atomic instructions executed. More
precisely, for x86/x86_64, these are instructions using a lock prefix. For architectures supporting LL/SC
(load-linked/store-conditional), this is the number of SC instructions executed. For both, the term "global bus
events" is used.
The short name of the event type used for global bus events is "Ge".
To count global bus events, use --collect-bus=yes.
6.2.4. Avoiding cycles
Informally speaking, a cycle is a group of functions which call each other in a recursive way.
Formally speaking, a cycle is a nonempty set S of functions, such that for every pair of functions F and G in S, it
is possible to call from F to G (possibly via intermediate functions) and also from G to F. Furthermore, S must be
maximal -- that is, be the largest set of functions satisfying this property. For example, if a third function H is called
from inside S and calls back into S, then H is also part of the cycle and should be included in S.
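The formal definition above matches what graph theory calls a strongly connected component of the call graph. The following is a minimal sketch of that idea using Tarjan's algorithm; it is not KCachegrind's implementation. The function name find_cycles and the sample call graph are hypothetical, and treating a single function as a cycle only when it calls itself directly is an assumption made here for illustration.

```python
from collections import defaultdict


def find_cycles(calls):
    """Return the cycles of a call graph as a list of sets of function names.

    `calls` maps a caller name to an iterable of callee names. A cycle is a
    strongly connected component with more than one function, or a single
    function that calls itself directly (an assumption of this sketch).
    """
    graph = defaultdict(set)
    for caller, callees in calls.items():
        graph[caller].update(callees)
        for callee in callees:
            graph[callee]  # ensure every callee is also a node

    index = {}        # discovery order of each function
    lowlink = {}      # smallest index reachable from each function
    stack = []        # functions of the components still being built
    on_stack = set()
    cycles = []

    def visit(f):
        index[f] = lowlink[f] = len(index)
        stack.append(f)
        on_stack.add(f)
        for g in graph[f]:
            if g not in index:
                visit(g)
                lowlink[f] = min(lowlink[f], lowlink[g])
            elif g in on_stack:
                lowlink[f] = min(lowlink[f], index[g])
        if lowlink[f] == index[f]:
            # f is the root of a maximal set of mutually reachable functions
            component = set()
            while True:
                g = stack.pop()
                on_stack.discard(g)
                component.add(g)
                if g == f:
                    break
            if len(component) > 1 or f in graph[f]:
                cycles.append(component)

    for f in list(graph):
        if f not in index:
            visit(f)
    return cycles


# F and G call each other; H is called from inside the set and calls back into it.
calls = {"main": ["F", "idle"], "F": ["G"], "G": ["F", "H"], "H": ["G"], "idle": []}
print(find_cycles(calls))
```

Here H ends up in the same set as F and G, which is the maximality requirement from the definition.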
Recursion is common in programs, and therefore cycles sometimes appear in the call graph output of Callgrind.
However, the title of this section should raise two questions: What is bad about cycles that makes you want to avoid
them? And how can cycles be avoided without changing program code?
Cycles are not bad in themselves, but they tend to make performance analysis of your code harder. This is because
inclusive costs for calls inside a cycle are meaningless. The definition of inclusive cost, i.e. the self cost of a function
plus the inclusive cost of its callees, needs a topological order among functions. Within a cycle, no such order exists:
the callees of a function in a cycle include the function itself. Therefore, KCachegrind does cycle detection and skips
visualization of any inclusive cost for calls inside cycles. Further, all functions in a cycle are collapsed into artificial
functions with names like Cycle 1.
Now, when a program exposes really big cycles (as is true for some GUI code, or in general for code using an event or
callback based programming style), you lose the nice property that lets you pinpoint bottlenecks by following call
chains from main, guided by inclusive cost. In addition, KCachegrind loses its ability to show interesting parts of the
call graph, as it uses inclusive costs to cut off uninteresting areas.
Although inclusive costs in cycles are meaningless, this big drawback for visualization motivates the option to
temporarily switch off cycle detection in KCachegrind, which can lead to misleading visualization. However, cycles
often appear because of an unlucky superposition of independent call chains, which the profile result sees as a
cycle. Neglecting uninteresting calls with very small measured inclusive cost would break these cycles. In such cases,
handling cycles incorrectly by not detecting them still gives meaningful profiling visualization.
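As an illustration of how neglecting very cheap calls can break a cycle, the sketch below drops call edges whose inclusive cost is under a threshold and then checks whether a cycle remains. Everything in it is hypothetical: the function names, the costs, and the 1% threshold are made up for the example and do not describe how KCachegrind filters calls.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (caller, callee) pairs has a cycle."""
    graph = {}
    for caller, callee in edges:
        graph.setdefault(caller, set()).add(callee)
        graph.setdefault(callee, set())

    state = {}  # node -> "active" while on the current DFS path, "done" afterwards

    def visit(node):
        state[node] = "active"
        for nxt in graph[node]:
            if state.get(nxt) == "active":
                return True  # back edge: nxt is already on the current call path
            if nxt not in state and visit(nxt):
                return True
        state[node] = "done"
        return False

    return any(node not in state and visit(node) for node in graph)


# Hypothetical inclusive costs per call edge (caller, callee).
edge_costs = {
    ("main", "dispatch"): 1000,
    ("dispatch", "on_click"): 900,
    ("on_click", "redraw"): 850,
    ("redraw", "dispatch"): 3,  # rare, cheap call that closes the cycle
}

total = edge_costs[("main", "dispatch")]
threshold = 0.01 * total  # assumed cut-off: ignore calls below 1% of the total

significant = [edge for edge, cost in edge_costs.items() if cost >= threshold]

print(has_cycle(edge_costs))   # True
print(has_cycle(significant))  # False
```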
Note that callgrind_annotate currently does not do any cycle detection at all. For program executions
with function recursion, it can, for example, print nonsensical inclusive costs far above 100%.
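To see how recursion can push inclusive cost past 100%, consider the toy model below. It is a simplified sketch, not callgrind_annotate's actual algorithm: it assumes a profile stores a summed self cost per function and a summed inclusive cost per call edge, and that a tool without cycle detection adds the two together. The function name fact and the self cost of 10 per invocation are hypothetical.

```python
from collections import defaultdict


def profile_recursion(depth, cost_per_call=10):
    """Simulate a function `fact` recursing `depth` levels deep.

    Returns (total, naive_inclusive): the real total cost, and the inclusive
    cost obtained by adding the summed self cost to the summed cost of all
    calls made by the function.
    """
    self_cost = defaultdict(int)  # function -> summed self cost
    edge_cost = defaultdict(int)  # (caller, callee) -> summed inclusive cost of calls

    def run(name, level):
        cost = cost_per_call
        self_cost[name] += cost
        if level > 1:
            callee_cost = run(name, level - 1)
            # Each nested call is charged to the same fact -> fact edge, so
            # deeper invocations are counted once per enclosing invocation.
            edge_cost[(name, name)] += callee_cost
            cost += callee_cost
        return cost

    total = run("fact", depth)
    naive_inclusive = self_cost["fact"] + sum(
        cost for (caller, _), cost in edge_cost.items() if caller == "fact"
    )
    return total, naive_inclusive


total, naive = profile_recursion(3)
print(f"{total} {naive} {100 * naive / total:.0f}%")  # prints "30 60 200%"
```

With three levels of recursion the real total is 30, but the innermost invocation is counted on the recursive call edge twice, so the naive sum reports 60.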
Having described why cycles are bad for profiling, we now turn to cycle avoidance. The key insight here is that
symbols in the profile data do not have to exactly match the symbols found in the program. Instead, the symbol name
could encode additional information from the current execution context such as recursion level of the current function,
or even some part of the call chain leading to the function. While encoding of additional information into symbols is