I originally posted this at http://dtrace.org/blogs/brendan/2014/01/10/benchmarking-the-cloud.
Benchmarking, and benchmarking the cloud, is incredibly error prone. I provided guidance through this minefield in the benchmarking chapter of my book (Systems Performance: Enterprise and the Cloud); that chapter can be read online on the InformIT site. I also gave a lightning talk about benchmarking gone wrong at Surge last year. In this post, I'm going to cut to the chase and show you the tools I commonly use for basic cloud benchmarking.
As explained in the benchmarking chapter, I do not run these tools passively. I perform Active Benchmarking, where I use a variety of other observability tools while the benchmark is running, to confirm that it is measuring what it is supposed to. To perform reliable benchmarking, you need to be rigorous. For some suggestions of observability tools that you can use, try starting with the OS checklists (Linux, Solaris, etc.) from the USE Method.
Why These Tools
The aim here is to benchmark the performance of cloud instances, whether for evaluations, capacity planning, or troubleshooting performance issues. My approach is to use micro-benchmarks, where a single component or activity is tested, and to test on different resource dimensions: CPUs, networking, file systems. The results can then be mapped à la carte: I may be investigating a production application workload that has high network throughput, moderate CPU usage, and negligible file system usage, and so I can weigh the importance of each accordingly. Additional goals for testing these dimensions in the cloud environment are listed in the following sections.
CPUs: noploop, sysbench
For CPUs, this is what I'd like to test, and why:
- Single-threaded performance: this should map to my expectations for processor speed and type. If it doesn't, if there is noticeable jitter or throttling, then I'll investigate that separately.
- Multi-threaded performance: to see how much CPU capacity this instance has, including whether there are cloud limits in play.
For single-threaded performance, I start by hacking up an assembly program (noploop) to investigate instruction retire rates, and disassemble the binary to confirm what is being measured. That gives me a baseline result for how fast the CPUs really are. I'll write up that process when I get a chance.
sysbench can test single-threaded and multi-threaded performance by calculating prime numbers. This also brings memory I/O into play. You need to be running the same version of sysbench, with the same compilation options, to be able to compare results. Testing from 1 to 8 threads:
sysbench --num-threads=1 --test=cpu --cpu-max-prime=25000 run
sysbench --num-threads=2 --test=cpu --cpu-max-prime=25000 run
sysbench --num-threads=4 --test=cpu --cpu-max-prime=25000 run
sysbench --num-threads=8 --test=cpu --cpu-max-prime=25000 run
The value for cpu-max-prime should be chosen so that the benchmark runs for at least 10 seconds. I don't test for longer than 60 seconds, unless I'm looking for systemic perturbations like cronjobs.
I'll run the same multi-threaded sysbench invocation a number of times, to look for repeatability. This could vary based on scheduler placement, CPU affinity, and memory groups.
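A simple harness can make the repeatability check less tedious. The following is a sketch, not part of my usual toolkit: it times the same command several times and summarizes run-to-run variance. The "sleep 0.1" is a stand-in so the script runs anywhere; substitute the real sysbench invocation in practice.

```shell
#!/bin/sh
# Repeatability harness (sketch): time the same benchmark command several
# times, then summarize the mean and standard deviation of the run times.
CMD="sleep 0.1"   # stand-in; e.g.: sysbench --num-threads=8 --test=cpu --cpu-max-prime=25000 run
RUNS=5

summary=$(for i in $(seq 1 $RUNS); do
    start=$(date +%s.%N)        # wall-clock start, seconds.nanoseconds
    $CMD > /dev/null 2>&1
    end=$(date +%s.%N)
    echo "$start $end"
done | awk '
    { t[NR] = $2 - $1; sum += t[NR] }
    END {
        mean = sum / NR
        for (i = 1; i <= NR; i++) ss += (t[i] - mean) ^ 2
        printf "runs=%d mean=%.3fs stddev=%.3fs\n", NR, mean, sqrt(ss / NR)
    }')
echo "$summary"
```

A large standard deviation relative to the mean is the cue to investigate scheduler placement or neighbor interference before trusting any single result.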
The single-threaded results are important for single-threaded (or effectively single-threaded) applications, like node.js. The multi-threaded results are important for applications like MySQL server.
While sysbench is running, you'll want to analyze CPU usage. For example, on Linux, I'd use mpstat, sar, pidstat, and perf. On SmartOS, I'd use mpstat, prstat, and DTrace profiling.
Networking: iperf
For networking, this is what I'd like to test, and why:
- Throughput, single-threaded: find the maximum throughput of one connection.
- Throughput, multi-threaded: look for cloud instance limits (resource controls).
iperf works well for this. Example commands:
# server
iperf -s -l 128k

# client, 1 thread
iperf -c server_IP -l 128k -i 1 -t 30

# client, 2 threads
iperf -c server_IP -P 2 -l 128k -i 1 -t 30
Here I've included -i 1 to print per-second summaries, so I can watch for variance.
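Eyeballing the per-second lines works, but a quick awk filter can summarize them. This is a sketch: the three sample lines below are hypothetical iperf client output; in practice you'd pipe the real output into the filter instead.

```shell
#!/bin/sh
# Sketch: summarize per-second variance from "iperf -i 1" client output.
# The here-document holds hypothetical sample lines for demonstration.
summary=$(awk '/Mbits\/sec/ {
        v = $(NF-1); n++; sum += v          # throughput is the 2nd-to-last field
        if (n == 1 || v < min) min = v
        if (v > max) max = v
    }
    END { printf "intervals=%d min=%s max=%s avg=%.1f Mbits/sec\n", n, min, max, sum / n }' <<'EOF'
[  3]  0.0- 1.0 sec   112 MBytes   940 Mbits/sec
[  3]  1.0- 2.0 sec   109 MBytes   915 Mbits/sec
[  3]  2.0- 3.0 sec   113 MBytes   948 Mbits/sec
EOF
)
echo "$summary"
```

A wide min-to-max spread suggests throttling or contention worth analyzing while the test is still running.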
While iperf is running, you'll want to analyze network and CPU usage. On Linux, I'd use nicstat, sar, and pidstat. On SmartOS, I'd use nicstat, mpstat, and prstat.
File systems: fio
For file systems, this is what I'd like to test, and why:
- Medium working set size: as a realistic test of the file system and its cache. I'd test both single-threaded and multi-threaded performance, and analyze to confirm whether these hit CPU or file system limits.
By "medium", I mean a working set size somewhat larger than the instance memory size. E.g., for a 1 Gbyte instance, I'd create a total file set of 10 Gbytes, with a non-uniform access distribution so that it has a cache hit ratio in the 90%s. These characteristics are chosen to match what I've typically seen in the cloud. If you know your total file size, working set size, and access distribution, then by all means test that instead.
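The 10x rule of thumb above is easy to script. This hypothetical helper just scales instance memory to a file set size to pass as fio's --size:

```shell
#!/bin/sh
# Hypothetical sizing helper: a "medium" file set of about 10x instance
# memory, following the rule of thumb above (1 Gbyte instance -> 10 Gbytes).
mem_gb=1                       # instance memory, Gbytes
fileset_gb=$((mem_gb * 10))    # ~10x memory keeps the cache hit ratio in the 90%s
size="--size=${fileset_gb}g"
echo "$size"
```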
I've been impressed by fio by Jens Axboe. Here's how I'd use it:
# throw-away: 5 min warm-up
fio --runtime=300 --time_based --clocksource=clock_gettime --name=randread --numjobs=8 \
    --rw=randread --random_distribution=pareto:0.9 --bs=8k --size=10g --filename=fio.tmp

# file system random I/O, 10 Gbytes, 1 thread
fio --runtime=60 --time_based --clocksource=clock_gettime --name=randread --numjobs=1 \
    --rw=randread --random_distribution=pareto:0.9 --bs=8k --size=10g --filename=fio.tmp

# file system random I/O, 10 Gbytes, 8 threads
fio --runtime=60 --time_based --clocksource=clock_gettime --name=randread --numjobs=8 \
    --rw=randread --random_distribution=pareto:0.9 --bs=8k --size=10g --filename=fio.tmp
This is all about finding the "Goldilocks" working set size. People often test too small or too big:
- A small working set size, smaller than the instance memory size, may cache entirely in the instance file system cache. This can then become a test of kernel file system code paths, bounded by CPU speed. I sometimes do this to investigate file system pathologies, such as lock contention, and perturbations from cache flushing/shrinking.
- A large working set size, much larger than the instance memory size, can cause the file system cache to mostly miss. This becomes a test of storage I/O.
These tests are useful if and only if you can explain why the results are interesting: how they map to your production environment.
While fio is running, you'll want to analyze file system, disk, and CPU usage. On Linux, I'd use sar, iostat, pidstat, and a profiler (perf, ...). On SmartOS, I'd use vfsstat, iostat, prstat, and DTrace.
There are many more benchmarking tools for different targets. In my experience, it's best to assume that they are all broken or misleading until proven otherwise, by use of analysis and sanity checking.