BBV: an experimental basic block vector generation tool
The SimPoint program only processes lines that start with a "T". All other lines are ignored. Traditionally comments
are indicated by starting a line with a "#" character. Some other BBV generation tools, such as PinPoints, generate
lines beginning with letters other than "T" to indicate more information about the program being run.
We do not generate these, as the SimPoint utility ignores them.
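As a rough illustration of the filtering rule above, the following Python sketch keeps only lines that begin with "T" and skips everything else. The layout of whitespace-separated `:id:count` pairs that it parses is an assumption made for illustration (this section does not spell it out), and the name `parse_bbv_lines` is hypothetical; this is not SimPoint's own code.

```python
import re

# Assumed layout of a vector line: "T" followed by whitespace-separated
# ":block_id:count" pairs. This layout is an assumption for illustration.
PAIR = re.compile(r":(\d+):(\d+)")


def parse_bbv_lines(lines):
    """Return one {block_id: count} dict per line that starts with "T"."""
    vectors = []
    for line in lines:
        if not line.startswith("T"):
            # Comments ("#") and lines emitted by other tools are ignored.
            continue
        vectors.append({int(b): int(c) for b, c in PAIR.findall(line)})
    return vectors


sample = [
    "# a comment line",
    "T:45:1024 :189:99343",
    "I: extra information from another tool",
    "T:11:78573 :15:1353 :56:1",
]
vectors = parse_bbv_lines(sample)
```

Only the two "T" lines in `sample` contribute a vector; the comment and the line starting with "I" are dropped without error.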
12.5. Implementation
Valgrind provides all of the information necessary to create BBV files. In the current implementation, all instructions
are instrumented. This is slower (by approximately a factor of two) than a method that instruments at the basic block
level, but there are some complications (especially with rep prefix detection) that make that method more difficult.
Valgrind actually provides instrumentation at a superblock level. A superblock has one entry point but, unlike a basic
block, can have multiple exit points. Once a branch occurs into the middle of a block, it is split into a new basic
block. Because Valgrind cannot produce "true" basic blocks, the generated BBV vectors will differ from those
generated by other tools. In practice this does not seem to affect the accuracy of the SimPoint results. We do internally
force the --vex-guest-chase-thresh=0 option to Valgrind, which forces a more basic-block-like behavior.
When a superblock is run for the first time, it is instrumented with our BBV routine. A block info (bbInfo) structure is
allocated which holds the various information and statistics for the block. A unique block ID is assigned to the block,
and then the structure is placed into an ordered set. Then each native instruction in the block is instrumented to call an
instruction counting routine with a pointer to the block info structure as an argument.
At run-time, our instruction counting routines are called once per native instruction. The relevant block info structure
is accessed and the block count and total instruction count are updated. If the total instruction count overflows the
interval size then we walk the ordered set, writing out the statistics for any block that was accessed in the interval, then
resetting the block counters to zero.
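The instrumentation and counting steps described above can be modelled in a few lines of Python. This is a simplified sketch, not the tool's actual C code: the class and method names are hypothetical, a dictionary sorted by block ID stands in for the ordered set, and the output line layout is assumed.

```python
from dataclasses import dataclass


@dataclass
class BBInfo:
    """Hypothetical stand-in for the per-block bbInfo structure."""
    block_id: int
    count: int = 0  # instructions executed in this block in the current interval


class BBVCounter:
    def __init__(self, interval_size):
        self.interval_size = interval_size
        self.total_instructions = 0
        self.blocks = {}   # block address -> BBInfo
        self.next_id = 1
        self.output = []   # one "T" line per completed interval

    def instrument(self, address):
        """First time a block is seen: allocate its info and assign a unique ID."""
        if address not in self.blocks:
            self.blocks[address] = BBInfo(self.next_id)
            self.next_id += 1
        return self.blocks[address]

    def per_instruction(self, info):
        """Called once per native instruction with the block's info structure."""
        info.count += 1
        self.total_instructions += 1
        if self.total_instructions > self.interval_size:
            self.dump_interval()

    def dump_interval(self):
        """Emit every block touched in this interval, then reset its counter."""
        touched = sorted((b for b in self.blocks.values() if b.count),
                         key=lambda b: b.block_id)
        pairs = " ".join(f":{b.block_id}:{b.count}" for b in touched)
        self.output.append("T" + pairs)
        for b in touched:
            b.count = 0
        self.total_instructions -= self.interval_size


counter = BBVCounter(interval_size=5)
a = counter.instrument(0x1000)
b = counter.instrument(0x2000)
for info in (a, a, a, b, b, a):
    counter.per_instruction(info)
```

In this run the sixth instruction pushes the total past the interval size of 5, so one line is written and the per-block counters return to zero.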
On the x86 and amd64 architectures the counting code has extra code to handle rep-prefixed string instructions. This
is because actual hardware counts a rep-prefixed instruction as one instruction, while a naive Valgrind implementation
would count it as many (possibly hundreds, thousands or even millions) of instructions. We handle rep-prefixed
instructions specially, in order to make the results match those obtained with hardware performance counters.
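To show why the special handling matters, here is a minimal Python sketch that contrasts a naive count with one that treats consecutive iterations of the same rep-prefixed instruction as a single instruction. The trace representation and the function name `count_retired` are hypothetical, and the real tool's mechanism may differ; in particular, this simplification would also merge a rep instruction that is re-executed immediately after itself.

```python
def count_retired(trace):
    """Count instructions, collapsing repeats of one rep-prefixed instruction.

    trace: iterable of (address, is_rep) tuples, one per executed iteration.
    """
    total = 0
    previous_rep = None  # address of the rep instruction currently repeating
    for address, is_rep in trace:
        if is_rep and address == previous_rep:
            continue  # a further iteration of the same rep instruction
        total += 1
        previous_rep = address if is_rep else None
    return total


# One ordinary instruction, a rep-prefixed instruction that iterates
# 1000 times, then another ordinary instruction.
trace = [(0x10, False)] + [(0x14, True)] * 1000 + [(0x16, False)]
naive = len(trace)
adjusted = count_retired(trace)
```

Here `naive` is 1002 while `adjusted` is 3, which is the count a hardware retired-instruction counter would be expected to report for this sequence.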
BBV also counts the fldcw instruction. This instruction is used on x86 machines in various ways; it is most commonly
found when converting floating point values into integers. On Pentium 4 systems the retired instruction performance
counter counts this instruction as two instructions (all other known processors only count it as one). This can affect
results when using SimPoint on Pentium 4 systems. We provide the fldcw count so that users can evaluate whether it
will impact their results enough to avoid using Pentium 4 machines for their experiments. It would be possible to add
an option to this tool that mimics the double-counting so that the generated BBV files would be usable for experiments
using hardware performance counters on Pentium 4 systems.
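The double-counting described above amounts to simple arithmetic, sketched below in Python. Both function names and the sample numbers are hypothetical; the sketch only assumes that each fldcw is counted twice on Pentium 4 and once elsewhere, as stated in the text.

```python
def pentium4_retired_estimate(total_instructions, fldcw_count):
    """Estimate a Pentium 4 retired-instruction count: each fldcw counts twice."""
    return total_instructions + fldcw_count


def fldcw_fraction(total_instructions, fldcw_count):
    """Fraction of all instructions that are fldcw, as a rough impact measure."""
    return fldcw_count / total_instructions


# Made-up example figures, not measurements.
total, fldcw = 1_000_000, 2_500
estimate = pentium4_retired_estimate(total, fldcw)
fraction = fldcw_fraction(total, fldcw)
```

A small fraction suggests that interval boundaries would shift only slightly on Pentium 4 hardware; a large one suggests that the counts would diverge noticeably.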
12.6. Threaded Executable Support
BBV supports threaded programs. When a program has multiple threads, an additional basic block vector file is
created for each thread (each additional file is the specified filename with the thread number appended at the end).
There is no official method of using SimPoint with threaded workloads. The most common method is to run SimPoint
on each thread’s results independently, and use some method of deterministic execution to try to match the original
workload. This should be possible with the current BBV.
12.7. Validation
BBV has been tested on x86, amd64, and ppc32 platforms. An earlier version of BBV was tested in detail using
hardware performance counters; this work is described in a paper from the HiPEAC’08 conference, "Using Dynamic