Debugging benchmarks is something I've done for many years, and I've seen an amazing and comical variety of failure modes. The problem is that benchmark tools are often run without understanding what they are testing or checking that the results are valid. This can lead to poor development or architectural choices that haunt you later on. I previously summarized this situation as:
casual benchmarking: you benchmark A, but actually measure B, and conclude you've measured C.
On this page I'll introduce what I call active and passive benchmarking, where active benchmarking helps you test the true target of the benchmark, and properly understand its results. It requires more effort at the start, but can save much more time and money later on.
To see this in action, you can jump to the examples.
To perform active benchmarking:
- If possible, configure the benchmark to run for a long duration in a steady state: e.g., hours.
- While the benchmark is running, analyze the performance of all components involved using other tools, to identify the limiter of the benchmark.
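The two steps above can be sketched as a shell session. This is a hedged sketch only: `run_bench` is a placeholder (here a short sleep) standing in for whatever benchmark tool you are using, and the collectors are guarded in case they are not installed.

```shell
# Placeholder benchmark: substitute your real benchmark command here.
run_bench() { sleep 2; echo "benchmark complete" > bench.out; }

# Start stat collectors in the background BEFORE the benchmark,
# so they capture it from the start (skipped if not installed):
command -v vmstat >/dev/null && vmstat 1 > vmstat.log 2>&1 &
VMSTAT=$!
command -v iostat >/dev/null && iostat -x 1 > iostat.log 2>&1 &
IOSTAT=$!

run_bench             # the benchmark runs in the foreground

# Benchmark done: stop the collectors; their logs remain for analysis.
kill $VMSTAT $IOSTAT 2>/dev/null
wait 2>/dev/null
```

In practice you would also watch the collectors live, rather than only reading their logs afterwards: the point is to analyze while the benchmark is still running.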
The process of active benchmarking is similar to the performance analysis of any application. One difference, which can make this process easier, is that you have a known workload to begin analyzing: the benchmark applied.
Benchmarks are commonly executed and then ignored until they have completed. That is passive benchmarking, where the main objective is the collection of benchmark data. Data is not Information.
With active benchmarking, you analyze performance while the benchmark is still running (not just after it's done), using other tools. You can confirm that the benchmark tests what you intend it to, and that you understand what that is. Data becomes Information. This can also identify the true limiters of the system under test, or of the benchmark itself.
To perform active benchmarking, you may use any performance analysis tool that your OS provides: vmstat, iostat, mpstat, sar, top, tcpdump/snoop, perf, DTrace/SystemTap/ktap, strace/truss, etc. You can also follow a performance analysis methodology to guide your usage of these tools. The USE Method is especially suited for this, since it identifies typical limiters: hardware and software resources.
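As a sketch of USE Method spot checks on Linux, the loop below runs a few of the tools just listed against different resources. The sample counts ("1 2" = one-second interval, two samples) are bounded here only so the sketch terminates; during a real benchmark you would watch these live, and tool availability varies by distribution.

```shell
# USE Method spot checks: each command pairs a resource with a tool.
#   mpstat -P ALL : per-CPU utilization (look for a single hot CPU)
#   vmstat        : run-queue length "r" (CPU saturation)
#   iostat -xz    : per-disk utilization and latency
#   sar -n DEV    : network interface throughput vs line rate
for c in "mpstat -P ALL 1 2" "vmstat 1 2" "iostat -xz 1 2" "sar -n DEV 1 2"; do
    set -- $c                       # $1 becomes the tool name
    if command -v "$1" >/dev/null; then
        echo "### $c"; $c
    else
        echo "### $c (not installed, skipped)"
    fi
done > use_checks.log 2>&1

grep -c '^###' use_checks.log       # four checks attempted
```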
How to tell if someone else did active benchmarking:
- Did they run other tools while the benchmark was running? Can they provide screenshots?
- Can they explain why the benchmark result was X, and not 2X (twice as fast)? i.e., what is the limiting factor?
Ideally, include the limiting factor (or suspected limiting factor) along with the benchmark results. For example: "the file system returned 830,000 cached 4 Kbyte reads/sec, limited by the benchmark being single-threaded, and the CPU speed of the server". For evidence, this statement could include a screenshot showing that the benchmark was single-threaded and CPU-bound: for example, on Linux, using "pidstat -t 1"; on Solaris, using "prstat -mLc 1".
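A hypothetical demonstration of gathering that evidence on Linux: the `yes` command below is a stand-in for a single-threaded, CPU-bound benchmark (it burns one CPU on one thread), which is then inspected per-thread with pidstat from the sysstat package. The `demo.out` marker is only part of this sketch, not of the technique.

```shell
# Stand-in for a single-threaded, CPU-bound benchmark:
yes > /dev/null &
BURN_PID=$!

# Two one-second per-thread samples: one thread near 100% %CPU is
# the single-threaded, CPU-bound signature (skipped if no pidstat):
command -v pidstat >/dev/null && pidstat -t -p "$BURN_PID" 1 2

kill "$BURN_PID"
echo "demo finished" > demo.out
```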
The following are worked examples of active benchmarking, showing the tools used for analysis:
- Bonnie++: a detailed analysis on both SmartOS/illumos and Fedora/Linux.
Common pitfalls that can be identified using active benchmarking are when the benchmark is:
- Perturbed by other system events, including neighbors.
- Throttled by software-imposed resource controls.
- Throttled by the network between the benchmark client and the server.
- Limited by the benchmark software being single-threaded.
- Testing different client or server software versions, when doing comparative benchmarking.
- Testing disk I/O instead of file system I/O.
- Applying an unrealistic workload.
The most common case is where a benchmark is not really testing what it claims to test, which can be identified using active benchmarking. Sometimes the results are still useful, now that they can be interpreted correctly.
A related practice is statistical analysis: analyzing the numerical benchmark results after the benchmark has completed. While this can generate apparently useful information, if the benchmark results were wrong or misleading to begin with, it generates false information. It can also make the problem worse: a sound statistical method can make benchmark results seem trustworthy when, in fact, they are false.
Statistical analysis is useful after active benchmarking – when you have valid numbers to work with. iostat first, R later.
- I gave a lightning talk at Surge 2013 titled Benchmarking Gone Wrong, which provides a memorable anecdote for active benchmarking.