CPU Flame Graphs


MySQL CPU Flame Graph

Determining why CPUs are busy is a routine task for performance analysis, which often involves profiling stack traces. Profiling by sampling at a fixed rate is a coarse but effective way to see which code-paths are hot (busy on-CPU). It usually works by creating a timed interrupt that collects the current program counter, function address, or entire stack back trace, and translates these to something human readable when printing a summary report.

Profiling data can be thousands of lines long, and difficult to comprehend. Flame graphs are a visualization for sampled stack traces, which, for CPU profiles, allows hot code-paths to be identified quickly and accurately. See the Flame Graphs main page for uses of this visualization other than CPU profiling.

Flame Graphs can work with any CPU profiler on any operating system. I'll introduce how Flame Graphs work using DTrace profile data, and then I'll describe how to generate them using various Linux profilers: perf_events, SystemTap, and ktap.

Problem

There are many tools for profiling applications and the kernel, including Linux perf_events and DTrace. Here is a profiling example using DTrace, where a production MySQL database was busy on-CPU:

# dtrace -x ustackframes=100 -n 'profile-997 /execname == "mysqld"/ {
    @[ustack()] = count(); } tick-60s { exit(0); }'
dtrace: description 'profile-997 ' matched 2 probes
CPU     ID                    FUNCTION:NAME
  1  75195                        :tick-60s 
[...]
              libc.so.1`__priocntlset+0xa
              libc.so.1`getparam+0x83
              libc.so.1`pthread_getschedparam+0x3c
              libc.so.1`pthread_setschedprio+0x1f
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x9ab
              mysqld`_Z10do_commandP3THD+0x198
              mysqld`handle_one_connection+0x1a6
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start
             4884

              mysqld`_Z13add_to_statusP17system_status_varS0_+0x47
              mysqld`_Z22calc_sum_of_all_statusP17system_status_var+0x67
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x1222
              mysqld`_Z10do_commandP3THD+0x198
              mysqld`handle_one_connection+0x1a6
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start
             5530

If you haven't seen this output before, it's showing two multi-line stacks with a count at the bottom. Each stack shows the code-path ancestry: the on-CPU function is on top, its parent is below, and so on. Many lines are C++ signatures and look a little mangled; these can be easily translated into a more human-readable form using a C++ demangler (eg, gc++filt).
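
For example, demangling the do_command frame from the output above (using the gc++filt just mentioned; on Linux the tool is usually named c++filt):

# echo '_Z10do_commandP3THD' | gc++filt
do_command(THD*)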

Only the two most frequent stacks are shown here, which appear last in the output. The very last was sampled on-CPU 5,530 times, and looks like it is MySQL doing some system status housekeeping. If that's the hottest, and we know we have a CPU issue, perhaps we should go hunting for tunables to disable system stats in MySQL.

The problem is that most of the output has been truncated (the "[...]"), and what we see here represents less than 3% of the time spent on-CPU. The total sample count in MySQL was 348,427, and the two stacks above account for around 10,000 samples. Even with the full output, it's hard to read through and comprehend it quickly - even if percentages were included instead of sample counts.

Too much data

The full output from the previous command was 591,622 lines long, and included 27,053 stacks like the two pictured above. It looks like this:

Click for a larger image, although be warned that it is a 7 Mbyte JPEG. Even then, you still can't read the text. I couldn't make the resolution any bigger without breaking the tools I was using to generate it. I've included this to provide a visual sense of the amount of data involved.

The Flame Graph

The same MySQL profile data shown above, rendered as a flame graph:

You can mouse over elements to see percentages, showing how frequently that element was present in the profiling data. You can also view the SVG separately to zoom in, or if you have problems with SVGs in your browser, try the PNG version.

The earlier truncated text output identified a MySQL status stack as the hottest. The flame graph shows reality: most of the time is really in JOIN::exec. This points the investigation in the right direction, and I can continue to learn more about JOIN::exec by navigating the frames above it in the flame graph.

Description

I'll explain this carefully: it may look similar to other visualizations from profilers, but it is different.

Each box represents a function in a stack (a "stack frame"). The y-axis shows stack depth: the function that was on-CPU when the sample was taken is at the top, with its ancestry beneath it. The x-axis spans the sample population; it does not show the passage of time, as the stacks are sorted alphabetically. The width of each box shows how often that function was present in the sampled stacks - the wider, the more CPU time.

The colors aren't significant, and are picked at random to be warm colors. It's called a "flame graph" as it shows what is hot on-CPU. And it's interactive: mouse over the SVGs to reveal details.

Instructions

The code for the FlameGraph tool, and instructions, are on github. It's a simple Perl program that outputs SVG. Flame graphs are generated in three steps:

  1. Capture stacks
  2. Fold stacks
  3. flamegraph.pl

The first step is to use the profiler of your choice. See below for some examples using DTrace, perf, SystemTap, and ktap.

The second step generates a line-based output for flamegraph.pl to read, which can also be grep'd to filter for functions of interest. There is a Perl program to do this, stackcollapse.pl, and variants of this to process the output from different profilers.
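
Each line of the folded output represents one unique stack: the frames are semicolon-separated, ordered from root (left) to leaf (right), followed by a space and the sample count. As a rough, hand-shortened sketch, the second MySQL stack shown earlier would fold into a single line like this:

libc.so.1`_lwp_start;libc.so.1`_thrp_setup;mysqld`handle_one_connection;[...];mysqld`_Z13add_to_statusP17system_status_varS0_ 5530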

DTrace

This example samples user-level stacks at 97 Hertz for processes named "mysqld", and generates the flame graph (this is similar to the steps used to create the earlier MySQL flame graph):

# dtrace -x ustackframes=100 -n 'profile-97 /execname == "mysqld" && arg1/ {
    @[ustack()] = count(); } tick-60s { exit(0); }' -o out.stacks
# ./stackcollapse.pl out.stacks > out.folded
# ./flamegraph.pl out.folded > out.svg

The out.folded intermediate file isn't necessary, as stackcollapse.pl could be piped directly to flamegraph.pl. However, I find it handy in case I'd like to edit the profile data a little using vi; for example, when sampling the kernel, to find and delete the idle thread stacks.
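
If you don't want the intermediate file, the same flame graph can be generated in one step (reusing the out.stacks file from the dtrace command above):

# ./stackcollapse.pl out.stacks | ./flamegraph.pl > out.svg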

To explain the "arg1" check: arg1 is the user-land program counter, so this checks that it is non-zero; arg0 is the kernel program counter, used in the next example. The following is an example DTrace command to sample the kernel:

# dtrace -x stackframes=100 -n 'profile-997 /arg0/ {
    @[stack()] = count(); } tick-60s { exit(0); }' -o out.stacks

This time, all threads are sampled, so the output will contain many idle thread samples (you can either use DTrace to filter them, or grep/vi to filter the folded output). The rate is also increased to 997 Hertz, as capturing kernel stacks is much less expensive than user-level stacks. The odd-numbered rates, 97 and 997, are used to avoid sampling in lockstep with other activity and producing misleading results.
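
As a rough sketch of filtering the folded output, assuming the idle thread stacks contain a frame whose name includes "idle" (the exact symbol depends on the kernel; on Linux it is cpu_idle, as used in the perf example below):

# ./stackcollapse.pl out.stacks | grep -v idle | ./flamegraph.pl > kernel-nonidle.svg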

perf

Linux perf_events has a variety of capabilities, including CPU sampling. Using it to sample all CPUs and generate a flame graph:

# perf record -a -g -F 99 sleep 60
# perf script | ./stackcollapse-perf.pl > out.perf-folded
# ./flamegraph.pl out.perf-folded > perf-kernel.svg

The perf command samples at 99 Hertz (-F 99) across all CPUs (-a), capturing stack traces so that a call graph (-g) of function ancestry can be generated later. The samples are saved in a perf.data file, which is read by perf script.

The perf report command will also read the perf.data file, and does an excellent job at presenting this information as text. However, with text there are limitations. The output often does not fit in one screen (you could say it doesn't need to, if the bulk of the samples are identified on the first page). Also, identifying the hottest code paths requires reading the percentages. With the flame graph, all the data is on screen at once, and the hottest code-paths are immediately obvious as the widest functions.
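
For reference, the text report described here can be printed to the terminal with:

# perf report --stdio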

I create the intermediate file, out.perf-folded, to make it a little quicker when creating multiple filtered flame graphs from the same data. Eg:

# perf script | ./stackcollapse-perf.pl > out.perf-folded
# grep -v cpu_idle out.perf-folded | ./flamegraph.pl > nonidle.svg
# grep ext4 out.perf-folded | ./flamegraph.pl > ext4internals.svg
# egrep 'system_call.*sys_(read|write)' out.perf-folded | ./flamegraph.pl > rw.svg

It would be a little more efficient to process the output of perf report; better still, perf report could have a report style (eg, "-g folded") that outputs folded stacks directly, obviating the need for stackcollapse-perf.pl. There could even be a perf mode that outputs the flame graph SVG directly, although that would miss the value of being able to grep the folded stacks.

See my perf page for more details on perf_events Flame Graphs.

SystemTap

SystemTap can also sample stack traces via the timer.profile probe, which fires at the system clock rate (CONFIG_HZ). Unlike perf, which dumps samples to a file for later aggregation and reporting, SystemTap can do the aggregation in-kernel and pass a (much smaller) report to user-land. The data collected and output generated can be customized much further via its scripting language.

Using SystemTap v1.7 on Fedora 16 to generate a flame graph:

# stap -s 32 -D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096 -D MAXMAPENTRIES=10240 \
    -D MAXACTION=10000 -D STP_OVERLOAD_THRESHOLD=5000000000 --all-modules \
    -ve 'global s; probe timer.profile { s[backtrace()] <<< 1; } 
    probe end { foreach (i in s+) { print_stack(i);
    printf("\t%d\n", @count(s[i])); } } probe timer.s(60) { exit(); }' \
    > out.stap-stacks
# ./stackcollapse-stap.pl out.stap-stacks > out.stap-folded
# cat out.stap-folded | ./flamegraph.pl > stap-kernel.svg

The six options used (-s 32, -D ...) increase various SystemTap limits. The only ones really necessary for flame graphs are "-D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096", so that stack traces aren't truncated; the others became necessary when sampling for long periods (in this case, 60 seconds) on busy workloads, to avoid various threshold and overflow errors.

The timer.profile probe was used, which samples all CPUs at 100 Hertz: a little coarse, and risking lockstep sampling. Last I checked, the timer.hz(997) probe fired at the correct rate, but couldn't read stack backtraces.

ktap

ktap already supports CPU flame graphs via the scripts/profiling/stack_profile.kp script in the ktap distribution, which captures the sample data. Here are the steps using a ktap one-liner instead of that script:

# ktap -e 's = ptable(); profile-1ms { s[backtrace(12, -1)] <<< 1 }
    trace_end { for (k, v in pairs(s)) { print(k, count(v), "\n") } }
    tick-30s { exit(0) }' -o out.kstacks
# sed 's/	//g' out.kstacks | stackcollapse.pl > out.kstacks.folded
# ./flamegraph.pl out.kstacks.folded > out.kstacks.svg

The sed(1) command removes tab characters (that's a tab, not a series of spaces), so that stackcollapse.pl can process the text. A stackcollapse-ktap.pl could be easily written to process ktap output directly, avoiding the need for sed(1).

This full example is on my ktap page under ktap Flame Graphs.

Other Profilers

See the Updates links on the Flame Graphs main page for other profilers. These include Dave Pacheco's flame graphs for node.js functions, Mark Probst's Flame Graphs for Instruments on OS X, and Bruce Dawson's Summarizing Xperf CPU Usage with Flame Graphs on Microsoft Windows.

Examples

The following two CPU flame graphs show the Linux kernel, profiled using perf_events. These were generated on Linux 3.2.9 (Fedora 16 guest) running under KVM (Ubuntu host), with one virtual CPU. Some code paths and sample ratios will be very different on bare-metal: networking won't be processed via the virtio-net driver, for example.

File Systems

As an example of a different workload, this shows the kernel CPU time while an ext4 file system was being archived (SVG, PNG):

This shows how the file system was being read and where kernel CPU time was spent. Most of the kernel time is in sys_newfstatat() and sys_getdents() - metadata work as the file system is walked. sys_openat() is on the right, as files are opened to be read, which are then mmap()d (look to the right of sys_getdents(); the frames are in alphabetical order), and finally page faulted into user-space (see the page_fault() mountain on the left). The actual work of moving bytes then happens in user-land on the mmap'd segments (and is not shown in this kernel flame graph). Had the archiver used the read() syscall instead, this flame graph would look very different, and have a large sys_read() component.
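
To isolate just the metadata paths from such a profile, the folded file can be filtered as shown earlier (assuming the same out.perf-folded file from the perf section):

# egrep 'sys_newfstatat|sys_getdents' out.perf-folded | ./flamegraph.pl > metadata.svg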

Short Lived Processes

For this flame graph, I executed a workload of short-lived processes to see where kernel time is spent creating them (SVG, PNG):

Apart from performance analysis, this is also a great tool for learning the internals of the Linux kernel.

User+Kernel Flame Graph

This example shows both user and kernel stacks on an illumos kernel hypervisor host (SVG, PNG):

You can also see the standalone SVG and PNG versions. This shows the CPU usage of qemu thread 3, a KVM virtual CPU. Both user and kernel stacks are included (DTrace can access both at the same time), with the syscall in-between colored gray.

The plateau of vcpu_enter_guest() is where that virtual CPU was executing code inside the virtual machine. I was more interested in the mountains on the right, to examine the performance of the KVM exit code paths.

Java

Language virtual machines, like the JVM, present a problem for all profilers. Fetching a native stack trace is easy, as this is the stack that the kernel and processor use to operate. However, the native stack will only show JVM functions and system libraries. The Java code being executed -- the classes and methods, which is the crucial context we need -- is not visible, or is present as hexadecimal numbers only.

Some Java profilers, like the bundled jstack tool, can reach into the JVM's brain and retrieve the actual methods and classes being executed, and show them along with the native stack. To generate meaningful Java flame graphs, we need such a profiler.

DTrace has this capability with Java, as the JVM has a "DTrace ustack helper" (search for that term to learn more about them). So, to generate a CPU flame graph for a Java program, the stacks can be collected using:

# dtrace -n 'profile-97 /execname == "java"/ { @[jstack(100, 8000)] = count(); }
    tick-30s { exit(0); }' -o out.javastacks_01

The output file can then be fed to stackcollapse.pl and flamegraph.pl as shown in earlier sections.
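
Concretely, reusing the commands from the DTrace section above:

# ./stackcollapse.pl out.javastacks_01 > out.javastacks_01.folded
# ./flamegraph.pl out.javastacks_01.folded > out.javastacks_01.svg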

The ustack helper action is jstack(), and if it works you'll have stacks that look like this:

              libjvm.so`jni_GetObjectField+0x10f
              libnet.so`Java_java_net_PlainSocketImpl_socketAvailable+0x39
              java/net/PlainSocketImpl.socketAvailable()I*
              java/net/PlainSocketImpl.available()I*
              java/lang/Thread.run()V
              0xfb60035e
              libjvm.so`__1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHa...
              libjvm.so`__1cCosUos_exception_wrapper6FpFpnJJavaValue_pnMmeth...
              libjvm.so`__1cJJavaCallsMcall_virtual6FpnJJavaValue_nGHandle_n...
              libjvm.so`__1cMthread_entry6FpnKJavaThread_pnGThread__v_+0x113
              libjvm.so`__1cKJavaThreadDrun6M_v_+0x2c6
              libjvm.so`java_start+0x1f2
              libc.so.1`_thrp_setup+0x88
              libc.so.1`_lwp_start
                9

Note that it includes both libjvm frames and classes/methods: java/net/PlainSocketImpl.available() etc. If there are no classes/methods visible, only hexadecimal numbers, you'll need to get jstack() to work first.

Unfortunately, jstack() doesn't work in many JVM versions. See bug JDK-7187999 for the list. Fortunately, there is a workaround on illumos systems, thanks to the awesome Adam Leventhal. It's described in Adam's email and illumos issue 3123, and the final usage (which was changed since his original proposal) involves setting an environment variable when launching your Java program, which ensures that the ustack helper is loaded. Eg:

illumos# LD_AUDIT_32=/usr/lib/dtrace/libdtrace_forceload.so java myprog

When I've tried it, this has added over 60 seconds of startup time. This, itself, needs performance analysis; I'd guess that it should be less than 1 second.

Here's a resulting flame graph; this is showing ttcp (test TCP), which is a TCP benchmark written in Java:

You can also see the standalone SVG and PNG versions. This shows that most CPU time is spent in socketWrite(), with 9.2% in releaseFD().

Node.js

As with Java, we need a profiler that can dig into V8's brain and retrieve stack traces that include the JavaScript functions. The node --perf-basic-prof option can be used with perf_events on Linux to do this: see the instructions by Trevor Norris, and his example.
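
As a rough sketch of that approach (assuming a single node process already running with --perf-basic-prof, and the stackcollapse-perf.pl workflow from the perf section above):

# perf record -F 99 -p $(pgrep -n node) -g -- sleep 30
# perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > node.svg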

There is also a DTrace ustack helper: see Dave Pacheco's post about it, where he also demonstrates flame graphs.

Other Uses

The flame graph visualization works for any stack trace plus value combination, not just stack traces with CPU sample counts like those above. For example, you could trace device I/O, syscalls, off-CPU events, specific function calls, and memory allocation. For any of these, the stack trace can be gathered, along with a relevant value: counts, bytes, or latency. See the other types on the Flame Graphs page.
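
For example, here is a rough DTrace sketch that counts user-level stacks for the read(2) syscall of a given process name (hypothetical; adjust the probe and predicate for your target), which can then be folded and rendered exactly as before:

# dtrace -n 'syscall::read:entry /execname == "mysqld"/ { @[ustack()] = count(); }
    tick-30s { exit(0); }' -o out.reads
# ./stackcollapse.pl out.reads | ./flamegraph.pl > reads.svg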

Background & Acknowledgements

I created this visualization out of necessity: I had huge amounts of stack sample data from a variety of different performance issues, and needed to quickly dig through it. I first tried creating some text-based tools to summarize the data, with limited success. Then I remembered a time-ordered visualization created by Neelakanth Nadgir (and another that Roch Bourbonnais had created and shown me), and thought stack samples could be presented in a similar way. Neel's visualization looked great, but the process of tracing every function entry and return for my workloads altered performance too much. In my case, what mattered more was to have accurate percentages for quantifying code-paths, not the time ordering. This meant I could sample stacks (low overhead) instead of tracing functions (high overhead).

The very first visualization worked, and immediately identified a performance improvement to our KVM code (some added functions were more costly than expected). I've since used it many times for both kernel and application analysis, and others have been using it on a wide range of targets, including node.js profiling. Happy hunting!

Last updated: 25-Jan-2014