Debugging benchmarks is something I've done for many years, and I've seen an amazing and comical variety of failure modes. The problem is that benchmark tools are often run without understanding what they are testing or checking that the results are valid. This can lead to poor development or architectural choices that haunt you later on. I previously summarized this situation as:
casual benchmarking: you benchmark A, but actually measure B, and conclude you've measured C.
Accurate benchmarking is important. Your company may choose a different server platform, programming stack, database, or application vendor based on benchmark results. Accurate results enable good choices: picking the options that deliver the best price/performance and are most likely to scale under load.
On this page I'll introduce what I call active and passive benchmarking, where active benchmarking helps you accurately test the true target of the benchmark, and properly understand its results. It requires more effort at the start, but can save much more time and money later on.
To see this in action, you can jump to the examples.
To perform active benchmarking:
- If possible, configure the benchmark to run for a long duration in a steady state: eg, hours.
- While the benchmark is running, analyze the performance of all components involved using other tools, to identify the true limiter of the benchmark.
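The two steps above can be sketched as a small harness that starts the benchmark and keeps sampling the system until it exits. This is only an illustration, not a replacement for the analysis tools themselves: the function name `run_with_sampling`, the `sampler` callback, and the default interval are all assumptions made for this sketch.

```python
import subprocess
import time


def run_with_sampling(cmd, sampler, interval=1.0):
    """Run cmd and call sampler() every `interval` seconds while it runs.

    Returns (exit_code, samples), where samples is a list of
    (seconds_since_start, sampler_result) tuples.
    Hypothetical helper: the name and signature are illustrative only.
    """
    samples = []
    start = time.monotonic()
    proc = subprocess.Popen(cmd)
    try:
        # Keep sampling for as long as the benchmark process is alive.
        while proc.poll() is None:
            samples.append((time.monotonic() - start, sampler()))
            try:
                # Sleep for one interval, but wake early if the benchmark exits.
                proc.wait(timeout=interval)
            except subprocess.TimeoutExpired:
                pass
    finally:
        # If sampling failed part way through, do not leave the benchmark running.
        if proc.poll() is None:
            proc.kill()
            proc.wait()
    return proc.returncode, samples
```

On a Unix system, a call might look like `run_with_sampling(["./my_benchmark"], os.getloadavg, interval=5)`, where `./my_benchmark` is a placeholder for the real benchmark command. Load averages are a crude signal; in practice you would still run tools such as vmstat, iostat, and mpstat alongside the benchmark and read their output while it runs.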
The process of active benchmarking is similar to the performance analysis of any application. One difference, which can make this process easier, is that you have a known workload to begin analyzing: the benchmark applied.
Benchmarks are commonly executed and then ignored until they have completed. That is passive benchmarking, where the main objective is the collection of benchmark data. Data is not Information.
A telltale sign is when the only technical results presented are the benchmark results. I've seen countless slide decks, blog posts, and articles that present an impressive bar chart of comparative results, but no supporting technical evidence. It's been my job to get to the bottom of many of these, and I find that they are wrong or misleading almost every time. The primary reason is that they were run passively, "fire and forget" style, with no additional analysis, and all problems were overlooked.
With active benchmarking, you analyze performance while the benchmark is still running (not just after it's done), using other tools. You can confirm that the benchmark tests what you intend it to, and that you understand what that is. Data becomes Information. This can also identify the true limiters of the system under test, or of the benchmark itself.
To perform active benchmarking, you may use any performance analysis tool that your OS provides: vmstat, iostat, mpstat, sar, top, tcpdump/snoop, perf, bcc+eBPF/DTrace/SystemTap, strace/truss, etc. You can also follow a performance analysis methodology to guide your usage of these tools. The USE Method is especially suited for this, since it identifies typical limiters: hardware and software resources.
How to tell if someone else did active benchmarking:
- Did they run other tools while the benchmark was running? Can they provide screenshots?
- Can they explain why the benchmark result was X, and not 2X (twice as fast)? ie, what is the limiting factor?
Ideally, include the limiting factor (or suspected limiting factor) along with the benchmark results. For example: "the file system result was limited by the CPU speed of the server, and the benchmark being single-threaded". For evidence, this statement could include a screenshot showing that the benchmark was single-threaded and CPU-bound: for example, on Linux, using "pidstat -t 1"; on Solaris, using "prstat -mLc 1".
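As a rough illustration of the "single-threaded and CPU-bound" check, the sketch below classifies per-thread CPU percentages of the kind you might copy out of per-thread tool output. It does not parse any tool's output. The function name and the `busy` and `idle` thresholds are assumptions chosen for this example, not established cutoffs.

```python
def looks_single_threaded_cpu_bound(thread_cpu_pct, busy=90.0, idle=10.0):
    """Return True if exactly one thread is near 100% CPU and the rest are idle.

    thread_cpu_pct: mapping of thread ID -> average %CPU during the run.
    busy, idle: hypothetical thresholds (percent), chosen for illustration.
    """
    busy_threads = [tid for tid, pct in thread_cpu_pct.items() if pct >= busy]
    # Every thread that is not the busy one must be close to idle.
    others_idle = all(
        pct <= idle
        for tid, pct in thread_cpu_pct.items()
        if tid not in busy_threads
    )
    return len(busy_threads) == 1 and others_idle
```

For example, `looks_single_threaded_cpu_bound({4101: 99.0, 4102: 0.3})` returns `True`, while two threads at 50% each returns `False`. A `True` result supports the stated limiting factor but is not a substitute for showing the tool output itself.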
Apart from analysis while the benchmark runs, you should also analyze its configuration beforehand. Ideally, the benchmark is open source, allowing you to study the source code, as well as any Makefiles and compiler options.
The following are worked examples of active benchmarking, showing the tools used for analysis:
- Bonnie++: a detailed analysis on both SmartOS/illumos and Fedora/Linux.
Common pitfalls that can be identified using active benchmarking are when the benchmark is:
- Perturbed by other system events, including neighbors.
- Throttled by software-imposed resource controls.
- Throttled by the network between the benchmark client and the server.
- Limited by the benchmark software being single-threaded.
- Testing different client or server software versions, when doing comparative benchmarking.
- Testing disk I/O instead of file system I/O.
- Applying an unrealistic workload.
The most common case is where a benchmark is not really testing what it claims to test, which can be identified using active benchmarking. Sometimes the results are still useful, now that they can be interpreted correctly.
Statistical analysis, in this context, means analyzing the numerical benchmark results after the benchmark has completed. This is often considered a useful exercise to develop new information from raw benchmark data, for better understanding results, and for developing confidence. However, if the benchmark results were wrong or misleading to begin with, statistical analysis can make matters worse. A sound statistical method can make benchmark results seem trustworthy, when in fact, they are false. New information developed may also be false, compounding the problem.
The only good outcome, given bad results, is that statistical analysis deems them untrustworthy (eg, too high CoV), and analysis moves to understanding what went wrong with the actual benchmark. In practice, this doesn't happen as much as I'd like. Often, the wrong target has been benchmarked, but the results are statistically sound.
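For reference, the coefficient of variation (CoV) mentioned above is the sample standard deviation divided by the mean. A minimal sketch using Python's standard library follows; the function name is an assumption, and no particular cutoff for "too high" is implied.

```python
import statistics


def coefficient_of_variation(results):
    """Return stdev / mean for a list of repeated benchmark results.

    Needs at least two results (statistics.stdev raises otherwise).
    """
    mean = statistics.mean(results)
    if mean == 0:
        raise ValueError("CoV is undefined when the mean is zero")
    return statistics.stdev(results) / mean
```

For runs of 90, 100, and 110 operations per second, the mean is 100 and the sample standard deviation is 10, giving a CoV of 0.1. Note that a low CoV only says the runs were consistent. It says nothing about whether the benchmark measured the intended target.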
Statistical analysis is useful after active benchmarking – when you have valid numbers to work with. iostat first, R later.
- I gave a lightning talk at Surge 2013 titled Benchmarking Gone Wrong, which provides a memorable anecdote for active benchmarking.