Helgrind: a thread error detector
The first thing to do is examine the source locations referred to by each call stack. They should both show an access
to the same location, or variable.
Now figure out how that location should have been made thread-safe:
• Perhaps the location was intended to be protected by a mutex? If so, you need to lock and unlock the mutex at both
access points, even if one of the accesses is reported to be a read. Did you perhaps forget the locking at one or other
of the accesses? To help you do this, Helgrind shows the set of locks held by each thread at the time it accessed
the raced-on location.
• Alternatively, perhaps you intended to use some other scheme to make it safe, such as signalling on a condition
variable.
In all such cases, try to find a synchronisation event (or a chain thereof) which separates the earlier-
observed access (as shown in the second call stack) from the later-observed access (as shown in the first call stack).
In other words, try to find evidence that the earlier access "happens-before" the later access. See the previous
subsection for an explanation of the happens-before relation.
The fact that Helgrind is reporting a race means it did not observe any happens-before relation between the two
accesses. If Helgrind is working correctly, it should also be the case that you cannot find any such relation,
even on detailed inspection of the source code. Hopefully, though, your inspection of the code will show where the
missing synchronisation operation(s) should have been.
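As a minimal sketch of the mutex case described above, the following C fragment takes the same lock at every access
point, including the one that only reads. The names counter, counter_lock, increment_worker, read_counter and
run_counter_demo are hypothetical and chosen purely for illustration; they do not come from Helgrind or from any
particular program.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical shared state: one counter guarded by one mutex. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

/* Writer: every update is made with counter_lock held. */
static void *increment_worker(void *arg)
{
    long iterations = *(long *)arg;
    for (long i = 0; i < iterations; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Reader: takes the same lock even though it only reads. */
static long read_counter(void)
{
    pthread_mutex_lock(&counter_lock);
    long value = counter;
    pthread_mutex_unlock(&counter_lock);
    return value;
}

/* Resets the counter, starts two writers, reads the counter while they
   may still be running, joins them, and returns the final value. */
long run_counter_demo(long iterations)
{
    pthread_t workers[2];

    pthread_mutex_lock(&counter_lock);
    counter = 0;
    pthread_mutex_unlock(&counter_lock);

    for (int i = 0; i < 2; i++) {
        int rc = pthread_create(&workers[i], NULL, increment_worker, &iterations);
        assert(rc == 0);
        (void)rc;
    }

    /* Concurrent with the writers, but made under the lock. */
    long seen = read_counter();
    assert(seen >= 0 && seen <= 2 * iterations);

    for (int i = 0; i < 2; i++)
        pthread_join(workers[i], NULL);

    return read_counter();
}
```

Replacing the body of read_counter with a bare "return counter;" would leave the read unprotected, which is the kind
of forgotten locking the first bullet above asks you to look for.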
7.5. Hints and Tips for Effective Use of Helgrind
Helgrind can be very helpful in finding and resolving threading-related problems. Like all sophisticated tools, it is
most effective when you understand how to play to its strengths.
Helgrind will be less effective when you merely throw an existing threaded program at it and try to make sense of any
reported errors. It will be more effective if you design threaded programs from the start in a way that helps Helgrind
verify correctness. The same is true for finding memory errors with Memcheck, but applies more here, because thread
checking is a harder problem. Consequently it is much easier to write a correct program for which Helgrind falsely
reports (threading) errors than it is to write a correct program for which Memcheck falsely reports (memory) errors.
With that in mind, here are some tips, listed most important first, for getting reliable results and avoiding false errors.
The first two are critical. Any violations of them will swamp you with huge numbers of false data-race errors.
1. Make sure your application, and all the libraries it uses, use the POSIX threading primitives. Helgrind needs to
be able to see all events pertaining to thread creation, exit, locking and other synchronisation events. To do so it
intercepts many POSIX pthreads functions.
Do not roll your own threading primitives (mutexes, etc) from combinations of the Linux futex syscall, atomic
counters, etc. These throw Helgrind’s internal what’s-going-on models way off course and will give bogus results.
Also, do not reimplement existing POSIX abstractions using other POSIX abstractions. For example, don’t build
your own semaphore routines or reader-writer locks from POSIX mutexes and condition variables. Instead use
POSIX reader-writer locks and semaphores directly, since Helgrind supports them directly.
Helgrind directly supports the following POSIX threading abstractions: mutexes, reader-writer locks, condition
variables (but see below), semaphores and barriers. Currently spinlocks are not supported, although they could be
in future.
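As a rough illustration of using a POSIX abstraction directly, the sketch below hands a result from one thread to
another with an unnamed POSIX semaphore rather than a hand-built one. The names result_ready, result, producer and
run_semaphore_demo are hypothetical, and the sketch assumes a platform on which sem_init is available for unnamed
semaphores (as it is on Linux).

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

/* Hypothetical shared state: a result slot and the semaphore that
   announces it has been filled in. */
static sem_t result_ready;
static int result;

/* Producer: writes the result, then posts the semaphore. */
static void *producer(void *arg)
{
    result = *(int *)arg * 2;
    sem_post(&result_ready);
    return NULL;
}

/* Consumer side: waits on the semaphore before reading the result. */
int run_semaphore_demo(int input)
{
    pthread_t tid;

    int rc = sem_init(&result_ready, 0, 0);   /* thread-shared, count 0 */
    assert(rc == 0);
    rc = pthread_create(&tid, NULL, producer, &input);
    assert(rc == 0);
    (void)rc;

    /* Retry if the wait is interrupted by a signal. */
    while (sem_wait(&result_ready) == -1 && errno == EINTR)
        continue;

    int value = result;   /* read only after the wait has returned */

    pthread_join(tid, NULL);
    sem_destroy(&result_ready);
    return value;
}
```

The design point is only that the ordering between the write and the read is expressed through sem_post and
sem_wait themselves, not through a semaphore assembled from a mutex and a condition variable.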
At the time of writing, the following popular Linux packages are known to implement their own threading
primitives: