Helgrind: a thread error detector
2. Avoid memory recycling. If you can’t avoid it, you must tell Helgrind what is going on via the VALGRIND_HG_CLEAN_MEMORY client request (in helgrind.h).
Helgrind is aware of standard heap memory allocation and deallocation that occurs via malloc/free/new/delete and from entry and exit of stack frames. In particular, when memory is deallocated via free, delete, or function exit, Helgrind considers that memory clean, so when it is eventually reallocated, its history is irrelevant.
However, it is common practice to implement memory recycling schemes. In these, memory to be freed is not handed to free/delete, but instead put into a pool of free buffers to be handed out again as required. The problem is that Helgrind has no way to know that such memory is logically no longer in use, and that its history is therefore irrelevant. Hence you must make that explicit, using the VALGRIND_HG_CLEAN_MEMORY client request to specify the relevant address ranges. It’s easiest to put these requests into the pool manager code, and use them either when memory is returned to the pool, or when it is allocated from it.
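As a rough illustration of where such a request could go, here is a minimal sketch of a fixed-size buffer pool that issues VALGRIND_HG_CLEAN_MEMORY when a buffer is handed out again. The pool itself (pool_alloc, pool_free, BUF_SIZE, POOL_BUFS) is hypothetical and not part of Valgrind; only the client request comes from helgrind.h. The sketch assumes the header is installed as valgrind/helgrind.h and falls back to a no-op if it is not found.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Use the real client request when the Valgrind headers are available;
   otherwise define a no-op so the sketch still compiles. */
#if defined(__has_include)
#  if __has_include(<valgrind/helgrind.h>)
#    include <valgrind/helgrind.h>
#  endif
#endif
#ifndef VALGRIND_HG_CLEAN_MEMORY
#  define VALGRIND_HG_CLEAN_MEMORY(start, len) ((void)(start), (void)(len))
#endif

#define BUF_SIZE  256   /* hypothetical buffer size */
#define POOL_BUFS 4     /* hypothetical pool capacity */

typedef union pool_buf {
    union pool_buf *next;           /* link while the buffer is in the pool */
    unsigned char bytes[BUF_SIZE];  /* payload while the buffer is in use */
} pool_buf;

static pool_buf storage[POOL_BUFS];
static pool_buf *free_list = NULL;
static int pool_ready = 0;
static pthread_mutex_t pool_mx = PTHREAD_MUTEX_INITIALIZER;

/* Hand out a recycled buffer, or NULL if the pool is empty. */
void *pool_alloc(void)
{
    pool_buf *b;

    pthread_mutex_lock(&pool_mx);
    if (!pool_ready) {
        /* Lazily thread all buffers onto the free list. */
        for (size_t i = 0; i < POOL_BUFS; i++) {
            storage[i].next = free_list;
            free_list = &storage[i];
        }
        pool_ready = 1;
    }
    b = free_list;
    if (b != NULL)
        free_list = b->next;
    pthread_mutex_unlock(&pool_mx);

    /* The buffer is logically fresh from here on: tell Helgrind to
       forget whatever access history the previous owner left behind. */
    if (b != NULL)
        VALGRIND_HG_CLEAN_MEMORY(b, sizeof *b);
    return b;
}

/* Return a buffer to the pool instead of calling free(). */
void pool_free(void *p)
{
    pool_buf *b = p;

    pthread_mutex_lock(&pool_mx);
    b->next = free_list;
    free_list = b;
    pthread_mutex_unlock(&pool_mx);
}
```

The request is placed on the allocation path, after the buffer has been unlinked, so that the pool's own write to the link field is also discarded. Placing it in pool_free instead would be the other option the text mentions.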
3. Avoid POSIX condition variables. If you can, use POSIX semaphores (sem_t, sem_post, sem_wait) to do inter-thread event signalling. Semaphores with an initial value of zero are particularly useful for this.
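A minimal sketch of this kind of signalling, assuming an unnamed POSIX semaphore initialised to zero, might look as follows. The names ready, shared_data, signaller and wait_for_event are made up for the example.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

static sem_t ready;       /* initial value 0: the event has not happened yet */
static int shared_data;   /* written by the signaller, read by the waiter */

static void *signaller(void *arg)
{
    (void)arg;
    shared_data = 42;     /* produce the data ... */
    sem_post(&ready);     /* ... then announce that it is available */
    return NULL;
}

/* Start the signaller, wait for its event, and return the data it
   produced, or -1 if setup or waiting fails. */
int wait_for_event(void)
{
    pthread_t t;
    int seen = -1;

    if (sem_init(&ready, 0, 0) != 0)
        return -1;
    if (pthread_create(&t, NULL, signaller, NULL) != 0) {
        sem_destroy(&ready);
        return -1;
    }

    /* Blocks until the post; works whichever thread arrives first.
       Retry if the wait is interrupted by a signal handler. */
    int rc;
    while ((rc = sem_wait(&ready)) != 0 && errno == EINTR)
        ;
    if (rc == 0)
        seen = shared_data;

    pthread_join(t, NULL);
    sem_destroy(&ready);
    return seen;
}
```

Because the semaphore counts posts, the event is not lost if the signaller runs before the waiter reaches sem_wait.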
Helgrind handles POSIX condition variables only partially correctly. This is because Helgrind can see inter-thread dependencies between a pthread_cond_wait call and a pthread_cond_signal/pthread_cond_broadcast call only if the waiting thread actually gets to the rendezvous first (so that it actually calls pthread_cond_wait). It can’t see dependencies between the threads if the signaller arrives first. In the latter case, POSIX guidelines imply that the associated boolean condition still provides an inter-thread synchronisation event, but one which is invisible to Helgrind.
When Helgrind misses inter-thread synchronisation events in this way, the result is that it reports false positives.
The root cause of this synchronisation lossage is particularly hard to understand, so an example is helpful. It was
discussed at length by Arndt Muehlenfeld ("Runtime Race Detection in Multi-Threaded Programs", Dissertation,
TU Graz, Austria). The canonical POSIX-recommended usage scheme for condition variables is as follows:
b   is a Boolean condition, which is False most of the time
cv  is a condition variable
mx  is its associated mutex

Signaller:                             Waiter:

lock(mx)                               lock(mx)
b = True                               while (b == False)
signal(cv)                                wait(cv,mx)
unlock(mx)                             unlock(mx)
Assume b is False most of the time. If the waiter arrives at the rendezvous first, it enters its while-loop, waits for the signaller to signal, and eventually proceeds. Helgrind sees the signal, notes the dependency, and all is well.
If the signaller arrives first, b is set to True, and the signal disappears into nowhere. When the waiter later arrives, it does not enter its while-loop and simply carries on. But even in this case, the waiter code following the while-loop cannot execute until the signaller sets b to True. Hence there is still the same inter-thread dependency, but this time it is through an arbitrary in-memory condition, and Helgrind cannot see it.
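For concreteness, the scheme above could be written with pthreads roughly as follows. This is only a sketch: the payload variable and the run_cv_demo driver are additions for the example, and the code is meant to make the two arrival orders concrete, not to reproduce any particular Helgrind report.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t mx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static bool b = false;    /* the Boolean condition */
static int payload;       /* hypothetical data passed to the waiter */

static void *signaller(void *arg)
{
    (void)arg;
    payload = 7;                  /* work done before the rendezvous */

    pthread_mutex_lock(&mx);
    b = true;
    pthread_cond_signal(&cv);     /* lost if nobody is waiting yet */
    pthread_mutex_unlock(&mx);
    return NULL;
}

static int waiter(void)
{
    pthread_mutex_lock(&mx);
    while (!b)                    /* skipped entirely if b is already true */
        pthread_cond_wait(&cv, &mx);
    pthread_mutex_unlock(&mx);

    /* Reached only after the signaller has set b, whichever thread
       arrived at the rendezvous first. */
    return payload;
}

/* Run one signaller against one waiter and return what the waiter saw,
   or -1 if the thread could not be created. */
int run_cv_demo(void)
{
    pthread_t t;
    int seen;

    b = false;
    if (pthread_create(&t, NULL, signaller, NULL) != 0)
        return -1;
    seen = waiter();
    pthread_join(t, NULL);
    return seen;
}
```

If the signaller thread runs to completion before waiter() takes the lock, pthread_cond_wait is never called, which is the second case described above.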
By comparison, Helgrind’s detection of inter-thread dependencies caused by semaphore operations is believed to
be exactly correct.
As far as I know, a solution to this problem that does not require source-level annotation of condition-variable wait
loops is beyond the current state of the art.