<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
  xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>Brendan Gregg's Blog</title>
    <link>http://www.brendangregg.com/blog</link>
    <description>RSS feed for Brendan Gregg's Blog</description>
    <copyright>Brendan Gregg</copyright>
    <pubDate>Sat, 29 Apr 2017 00:00:00 -0700</pubDate>
    <item>
      <title>USENIX/LISA 2016 Linux bcc/BPF Tools</title>
      <link>http://www.brendangregg.com/blog/2017-04-29/usenix-lisa-2016-bcc-bpf-tools.html</link>
      <description><![CDATA[For USENIX LISA 2016 I gave a talk that was years in the making, on Linux bcc/BPF analysis tools.
]]></description>
      <pubDate>Sat, 29 Apr 2017 00:00:00 -0700</pubDate>
      <guid>http://www.brendangregg.com/blog/2017-04-29/usenix-lisa-2016-bcc-bpf-tools.html</guid>
      <content:encoded><![CDATA[For USENIX LISA 2016 I gave a talk that was years in the making, on Linux bcc/BPF analysis tools.

<blockquote>"Time to rethink the kernel" - Thomas Graf</blockquote>

Thomas has been using BPF to create new network and application security technologies (project [Cilium]), and build something that's starting to look like microservices in the kernel ([video]). I'm using it for advanced performance analysis tools that do tracing and profiling. Enhanced BPF might still be new, but it's already delivering new technologies, and making us rethink what we can do with the kernel.

My LISA 2016 talk begins with a 15 minute demo, showing the progression from ftrace, then perf\_events, to BPF (due to the audio/video settings, this demo is a little hard to follow in the full video, but there's a separate recording of just the demo here: [Linux tracing 15 min demo]). Below is the full talk video (<a href="https://www.youtube.com/watch?v=UmOU3I36T2U&t=1151s">youtube</a>):

<center><iframe style="padding-top:5px" width="595" height="335" src="https://www.youtube.com/embed/UmOU3I36T2U" frameborder="0" allowfullscreen></iframe></center>

The slides are on [slideshare] \([PDF]\):

<center><iframe src="//www.slideshare.net/slideshow/embed_code/key/2LUry5U7ho7tqe" width="595" height="375" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:1px; max-width: 100%;" allowfullscreen> </iframe></center>


## Installing bcc/BPF

To try out BPF for performance analysis you'll need to be on a newer kernel: at least 4.4, preferably 4.9. The main front end is currently [bcc] (BPF compiler collection), and there are [install instructions] on github, which keep getting improved. For Ubuntu, installation is:

<pre>
echo "deb [trusted=yes] https://repo.iovisor.org/apt/xenial xenial-nightly main" | sudo tee /etc/apt/sources.list.d/iovisor.list
sudo apt-get update
sudo apt-get install bcc-tools
</pre>
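Since BPF feature availability tracks the kernel version, it's worth checking what you're running before installing. A quick sketch of that check in Python (bcc's own front-end language; this helper is mine, not part of bcc):

```python
# Sketch: check whether the running kernel meets the versions discussed
# above (4.4 minimum, 4.9 preferred). My own helper, not a bcc API.
import platform

def kernel_version(release=None):
    """'4.10.0-35-generic' -> (4, 10)"""
    release = release or platform.release()
    major, minor = release.replace('-', '.').split('.')[:2]
    return int(major), int(minor)

def bcc_ready(release=None, minimum=(4, 4)):
    return kernel_version(release) >= minimum
```

On a 4.10 kernel, `bcc_ready()` returns True; on 3.13 it returns False, and you'd want a newer kernel before continuing.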

There's currently a pull request to add snap instructions, as there are nightly builds for snappy as well.

## Listing bcc/BPF Tools

This install will add various performance analysis and debugging tools to /usr/share/bcc/tools. Since some require a very recent kernel (4.6, 4.7, or 4.9), there's a subdirectory, /usr/share/bcc/tools/old, which has some older versions of the same tools that work on Linux 4.4 (albeit with some caveats).

<pre>
# <b>ls /usr/share/bcc/tools</b>
argdist       cpudist            filetop         offcputime   solisten    tcptop    vfsstat
bashreadline  cpuunclaimed       funccount       offwaketime  sslsniff    tplist    wakeuptime
biolatency    dcsnoop            funclatency     old          stackcount  trace     xfsdist
biosnoop      dcstat             gethostlatency  oomkill      stacksnoop  ttysnoop  xfsslower
biotop        deadlock_detector  hardirqs        opensnoop    statsnoop   ucalls    zfsdist
bitesize      doc                killsnoop       pidpersec    syncsnoop   uflow     zfsslower
btrfsdist     execsnoop          llcstat         profile      tcpaccept   ugc
btrfsslower   ext4dist           mdflush         runqlat      tcpconnect  uobjnew
cachestat     ext4slower         memleak         runqlen      tcpconnlat  ustat
cachetop      filelife           mountsnoop      slabratetop  tcplife     uthreads
capable       fileslower         mysqld_qslower  softirqs     tcpretrans  vfscount
</pre>

Just by listing the tools, you might spot something you want to start with (ext4*, tcp*, etc). Or you can browse the following diagram:

<center><a href="/Perf/bcc_tracing_tools.png"><img src="http://www.brendangregg.com/Perf/bcc_tracing_tools.png" width="600" border="0"></a></center>

## Using bcc/BPF

If you don't have a good starting point, in the [bcc Tutorial] I included a generic checklist of the first tools to try. I also included this in my LISA talk:

<ol>
<li>execsnoop</li>
<li>opensnoop</li>
<li>ext4slower (or btrfs*, xfs*, zfs*)</li>
<li>biolatency</li>
<li>biosnoop</li>
<li>cachestat</li>
<li>tcpconnect</li>
<li>tcpaccept</li>
<li>tcpretrans</li>
<li>runqlat</li>
<li>profile</li>
</ol>
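One way to work through this checklist is a small wrapper that runs each tool for a bounded interval. Here's a sketch (the wrapper and its function names are hypothetical, and the paths assume the Ubuntu install location above):

```python
# Hypothetical wrapper: run each checklist tool briefly, in order, for a
# first-pass health check. Must be run as root; paths assume the Ubuntu
# package location used earlier in this post.
import subprocess

CHECKLIST = ['execsnoop', 'opensnoop', 'ext4slower', 'biolatency',
             'biosnoop', 'cachestat', 'tcpconnect', 'tcpaccept',
             'tcpretrans', 'runqlat', 'profile']

def checklist_cmds(basedir='/usr/share/bcc/tools', secs=10):
    # timeout(1) bounds each tracer, since most run until Ctrl-C
    return [['timeout', str(secs), '%s/%s' % (basedir, tool)]
            for tool in CHECKLIST]

def run_checklist(secs=10):
    for cmd in checklist_cmds(secs=secs):
        subprocess.run(cmd)
```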

Most of these have usage messages, and are easy to use. They'll need to be run as root. For example, <tt>execsnoop</tt> to trace new processes:

<pre>
# <b>/usr/share/bcc/tools/execsnoop</b>
PCOMM            PID    PPID   RET ARGS
grep             69460  69458    0 /bin/grep -q g2.
grep             69462  69458    0 /bin/grep -q p2.
ps               69464  58610    0 /bin/ps -p 308
ps               69465  100871   0 /bin/ps -p 301
sleep            69466  58610    0 /bin/sleep 1
sleep            69467  100871   0 /bin/sleep 1
run              69468  5160     0 ./run
[...]
</pre>
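Output like this is plain text, so it post-processes easily. For example, a quick sketch that tallies new processes by command name (my own helper, not part of bcc):

```python
# Sketch: count execsnoop output lines by command name (PCOMM, the
# first column), skipping the header.
from collections import Counter

def count_execs(text):
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] != 'PCOMM':
            counts[fields[0]] += 1
    return counts
```

Piping a few minutes of execsnoop through this quickly shows which commands dominate process creation (short-lived processes are a classic source of hidden load).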

And <tt>biolatency</tt> to record an in-kernel histogram of disk I/O latency:

<pre>
# <b>/usr/share/bcc/tools/biolatency </b>
Tracing block device I/O... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 64       |**********                              |
       512 -> 1023       : 248      |****************************************|
      1024 -> 2047       : 29       |****                                    |
      2048 -> 4095       : 18       |**                                      |
      4096 -> 8191       : 42       |******                                  |
      8192 -> 16383      : 20       |***                                     |
     16384 -> 32767      : 3        |                                        |
</pre>
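The power-of-two rows come from how this histogram is summarized in kernel context: each latency increments a log2 bucket counter in a BPF map, so only a small array of counters is copied to user space, not every event. A sketch of the bucketing (my own illustration, not bcc's code):

```python
# Sketch of the log2 bucketing used by biolatency's in-kernel histogram:
# a latency falls in the bucket [2^n, 2^(n+1) - 1] for its highest set bit.
def log2_bucket(usecs):
    if usecs <= 1:
        return (0, 1)
    n = usecs.bit_length() - 1
    return (1 << n, (1 << (n + 1)) - 1)
```

A 700 microsecond I/O, for example, lands in the 512 -> 1023 row of the output above.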

Here's its USAGE message:

<pre>
# <b>/usr/share/bcc/tools/biolatency -h</b>
usage: biolatency [-h] [-T] [-Q] [-m] [-D] [interval] [count]

Summarize block device I/O latency as a histogram

positional arguments:
  interval            output interval, in seconds
  count               number of outputs

optional arguments:
  -h, --help          show this help message and exit
  -T, --timestamp     include timestamp on output
  -Q, --queued        include OS queued time in I/O time
  -m, --milliseconds  millisecond histogram
  -D, --disks         print a histogram per disk device

examples:
    ./biolatency            # summarize block I/O latency as a histogram
    ./biolatency 1 10       # print 1 second summaries, 10 times
    ./biolatency -mT 1      # 1s summaries, milliseconds, and timestamps
    ./biolatency -Q         # include OS queued time in I/O time
    ./biolatency -D         # show each disk device separately
</pre>


In /usr/share/bcc/tools/doc, or the [tools subdirectory] on github, you'll find \_example.txt files for every tool, with example output and discussion. Check them out! There are also man pages under man/man8.

For more information, please watch my LISA talk at the top of this post when you get a chance, where I explain Linux tracing, BPF, bcc, and tour various tools.

## What's Next?

My prior talk at LISA 2014 was [New Tools and Old Secrets (perf-tools)], where I showed similar performance analysis tools using ftrace, an older tracing framework in Linux. I'm still using ftrace, not just for older kernels, but for times where it's more efficient (eg, kernel function counting using the <tt>funccount</tt> tool). BPF is programmatic, and can do things that ftrace can't.

Doing ftrace at LISA 2014, then BPF at LISA 2016, you might wonder what I'll propose for LISA 2018. We'll see. I could be covering a higher-level BPF front-end (eg, [ply], if it gets finished), or I could be focused on something else entirely. Tracing was my priority when Linux lacked various capabilities, but now that's done, there are other important technologies to work on...

[youtube]: https://www.youtube.com/watch?v=GsMs3n8CB6g
[Linux tracing 15 min demo]: https://www.youtube.com/watch?v=GsMs3n8CB6g
[Linux tracing in 15 minutes]: /blog/2016-12-27/linux-tracing-in-15-minutes.html
[PDF]: /Slides/LISA2016_BPF_tools_16_9.pdf
[slideshare]: http://www.slideshare.net/brendangregg/linux-4x-tracing-tools-using-bpf-superpowers
[previous post]: /blog/2017-04-23/usenix-lisa-2013-flame-graphs.html
[video]: https://www.youtube.com/watch?v=ilKlmTDdFgk
[Cilium]: https://github.com/cilium/cilium
[New Tools and Old Secrets (perf-tools)]: /blog/2015-03-17/usenix-lisa-2014-linux-ftrace-perf-tools.html
[ply]: https://github.com/iovisor/ply
[install instructions]: https://github.com/iovisor/bcc/blob/master/INSTALL.md
[tools subdirectory]: https://github.com/iovisor/bcc/tree/master/tools
[bcc Tutorial]: https://github.com/iovisor/bcc/blob/master/docs/tutorial.md
[bcc]: https://github.com/iovisor/bcc
]]></content:encoded>
      <dc:date>2017-04-29T00:00:00-07:00</dc:date>
    </item>
    <item>
      <title>USENIX/LISA 2013 Blazing Performance with Flame Graphs</title>
      <link>http://www.brendangregg.com/blog/2017-04-23/usenix-lisa-2013-flame-graphs.html</link>
      <description><![CDATA[In 2013 I gave a plenary at USENIX/LISA on flame graphs: my visualization for profiled stack traces, which is now used by many companies (including Netflix, Facebook, and Linkedin) to identify which code paths consume CPU. The talk is more relevant today, now that flame graphs are widely adopted.
]]></description>
      <pubDate>Sun, 23 Apr 2017 00:00:00 -0700</pubDate>
      <guid>http://www.brendangregg.com/blog/2017-04-23/usenix-lisa-2013-flame-graphs.html</guid>
      <content:encoded><![CDATA[In 2013 I gave a plenary at USENIX/LISA on flame graphs: my visualization for profiled stack traces, which is now used by many companies (including Netflix, Facebook, and Linkedin) to identify which code paths consume CPU. The talk is more relevant today, now that flame graphs are widely adopted.

Slides are on [slideshare] ([PDF]):

<p><center><iframe src="http://www.slideshare.net/slideshow/embed_code/28010650" width="425" height="355" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:0px solid #CCC;border-width:1px 1px 0;margin-bottom:0px" allowfullscreen> </iframe></center></p>

Video is on [youtube]:

<p><center><iframe width="560" height="315" src="//www.youtube.com/embed/nZfNehCzGdw?rel=0" frameborder="0" marginwidth="0" allowfullscreen></iframe></center></p>

The talk explains the origin of flame graphs, how to interpret them, and then tours different profile and trace event types that can be visualized. It predates some flame graph features that were added later: zoom, search, mixed-mode color highlights (--colors=java), and differential flame graphs.
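Under the hood, flame graphs are built from "folded" stacks: one line per unique stack, with frames semicolon-joined plus a sample count, which the renderer turns into rectangles. A minimal sketch of that folding step (the real stackcollapse-* scripts in my FlameGraph repo handle each profiler's output format):

```python
# Sketch of the "fold" step that precedes flame graph rendering: merge
# identical stacks and count how many samples each one received.
from collections import Counter

def fold(stacks):
    """stacks: iterable of frame lists, root first -> {'root;child;leaf': n}"""
    return Counter(';'.join(frames) for frames in stacks)
```

The rectangle widths in the final graph are proportional to these counts, which is why wide frames identify the hot code paths.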

I used DTrace to create different types of flame graphs in the talk, but since then I've developed ways to do them on Linux, using [perf] for CPU flame graphs, and [bcc/BPF] for advanced flame graphs: off-CPU and more. My [BPF off-CPU flame graphs] post used my stack trace hack, but since then we've added stack trace support to BPF in Linux (4.6), and these can now be implemented without hacks. The tool <tt>offcputime</tt> in bcc has already been updated to do this (thanks Vicent Marti and others for getting it working well, and Alexei Starovoitov for adding stack trace support to BPF).

This talk was 170 slides in 90 minutes, which may have been too much in 2013 when flame graphs were new. There's a reason for this: I'd planned to do a 45 minute talk on CPU flame graphs, ending on slide 98, followed by a different talk. For reasons beyond my control, I was told the night before that I couldn't give that second talk. My plan B, as I'd already discussed with the conference organizers, was to extend the flame graphs talk and add an advanced section. I was up to 5am doing this, and was then woken at 8am by the conference organizers: the plenary speaker had shellfish poisoning, and could I come down and give my flame graphs talk at 9am, instead of later that day? That's how this ended up as a 90 minute plenary!

At that LISA I also worked more with USENIX staff, and co-delivered a metrics workshop, as well as another talk. I was proud to be involved with USENIX/LISA and contribute in these ways. And you can too: the call for proposals for LISA 2017 ends tomorrow (April 24).

Since 2013, I've also written about flame graphs in [ACMQ] and [CACM]. For the latest on flame graphs, see the [updates] section of my flame graphs page.

[ACMQ]: http://queue.acm.org/detail.cfm?id=2927301
[CACM]: http://cacm.acm.org/magazines/2016/6/202665-the-flame-graph/abstract
[PDF]: /Slides/LISA13_Flame_Graphs.pdf
[slideshare]: http://www.slideshare.net/brendangregg/blazing-performance-with-flame-graphs
[youtube]: http://www.youtube.com/watch?v=nZfNehCzGdw
[updates]: /flamegraphs.html#Updates
[perf]: /perf.html#FlameGraphs
[bcc/BPF]: https://github.com/iovisor/bcc
[BPF off-CPU flame graphs]: /blog/2016-01-20/ebpf-offcpu-flame-graph.html
]]></content:encoded>
      <dc:date>2017-04-23T00:00:00-07:00</dc:date>
    </item>
    <item>
      <title>perf sched for Linux CPU scheduler analysis</title>
      <link>http://www.brendangregg.com/blog/2017-03-16/perf-sched.html</link>
      <description><![CDATA[Linux perf gained a new CPU scheduler analysis view in Linux 4.10: perf sched timehist. As I haven&#39;t talked about perf sched before, I&#39;ll summarize its capabilities here. If you&#39;re in a hurry, it may be helpful to just browse the following screenshots so that you are aware of what is available. (I&#39;ve also added this content to my perf examples page.)
]]></description>
      <pubDate>Thu, 16 Mar 2017 00:00:00 -0700</pubDate>
      <guid>http://www.brendangregg.com/blog/2017-03-16/perf-sched.html</guid>
      <content:encoded><![CDATA[Linux perf gained a new CPU scheduler analysis view in Linux 4.10: <tt>perf sched timehist</tt>. As I haven't talked about perf sched before, I'll summarize its capabilities here. If you're in a hurry, it may be helpful to just browse the following screenshots so that you are aware of what is available. (I've also added this content to my [perf examples] page.)

<tt>perf sched</tt> uses a dump-and-post-process approach for analyzing scheduler events, which can be a problem as these events can be very frequent &ndash; millions per second &ndash; costing CPU, memory, and disk overhead to record. I've recently been writing scheduler analysis tools using [eBPF/bcc] \(including [runqlat]\), which lets me greatly reduce overhead by using in-kernel summaries. But there are cases where you might want to capture every event using <tt>perf sched</tt> instead, despite the higher overhead. Imagine having five minutes to analyze a bad cloud instance before it is auto-terminated, and you want to capture everything for later analysis.

I'll start by recording one second of events:

<pre>
# <b>perf sched record -- sleep 1</b>
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.886 MB perf.data (13502 samples) ]
</pre>

That's 1.9 Mbytes for _one second_, including 13,502 samples. The size and rate will depend on your workload and number of CPUs (this example is an 8 CPU server running a software build). How this is written to the file system has been optimized: perf woke up only once to read the event buffers and write them to disk, which reduces overhead. That said, there are still significant overheads in instrumenting all scheduler events and writing the event data to the file system. The recorded events can be dumped with <tt>perf script</tt>:

<pre>
# <b>perf script --header</b>
# ========
# captured on: Sun Feb 26 19:40:00 2017
# hostname : bgregg-xenial
# os release : 4.10-virtual
# perf version : 4.10
# arch : x86_64
# nrcpus online : 8
# nrcpus avail : 8
# cpudesc : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
# cpuid : GenuineIntel,6,62,4
# total memory : 15401700 kB
# cmdline : /usr/bin/perf sched record -- sleep 1 
# event : name = <b>sched:sched_switch</b>, , id = { 2752, 2753, 2754, 2755, 2756, 2757, 2758, 2759...
# event : name = <b>sched:sched_stat_wait</b>, , id = { 2760, 2761, 2762, 2763, 2764, 2765, 2766, 2...
# event : name = <b>sched:sched_stat_sleep</b>, , id = { 2768, 2769, 2770, 2771, 2772, 2773, 2774, ...
# event : name = <b>sched:sched_stat_iowait</b>, , id = { 2776, 2777, 2778, 2779, 2780, 2781, 2782,...
# event : name = <b>sched:sched_stat_runtime</b>, , id = { 2784, 2785, 2786, 2787, 2788, 2789, 2790...
# event : name = <b>sched:sched_process_fork</b>, , id = { 2792, 2793, 2794, 2795, 2796, 2797, 2798...
# event : name = <b>sched:sched_wakeup</b>, , id = { 2800, 2801, 2802, 2803, 2804, 2805, 2806, 2807...
# event : name = <b>sched:sched_wakeup_new</b>, , id = { 2808, 2809, 2810, 2811, 2812, 2813, 2814, ...
# event : name = <b>sched:sched_migrate_task</b>, , id = { 2816, 2817, 2818, 2819, 2820, 2821, 2822...
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: breakpoint = 5, power = 7, software = 1, tracepoint = 2, msr = 6
# HEADER_CACHE info available, use -I to display
# missing features: HEADER_BRANCH_STACK HEADER_GROUP_DESC HEADER_AUXTRACE HEADER_STAT 
# ========
#
    perf 16984 [005] 991962.879966:   sched:sched_wakeup: comm=perf pid=16999 prio=120 target_cpu=005
[...]
</pre>
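Those numbers give a rough way to budget a capture. For instance, estimating the cost of the five-minute scenario mentioned earlier, assuming the same event rate (back-of-envelope only):

```python
# Back-of-envelope from the capture above: ~1.886 MB and 13,502 events in
# one second, so estimate bytes per event and the cost of a longer capture.
bytes_captured = 1.886 * 1024 * 1024
events = 13502

bytes_per_event = bytes_captured / events                         # ~146 bytes
five_minutes_mb = events * 300 * bytes_per_event / (1024 * 1024)  # ~566 MB
```

So a five-minute capture of this workload would be over half a Gbyte of perf.data, which is why in-kernel summaries (the eBPF/bcc approach) are preferable when you don't need every event.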

<p>The captured trace file can be reported in a number of ways, summarized by the help message:</p>

<pre>
# <b>perf sched -h</b>

 Usage: perf sched [<options>] {record|<b>latency|map|replay|script|timehist</b>}

    -D, --dump-raw-trace  dump raw trace in ASCII
    -f, --force           don't complain, do it
    -i, --input <file>    input file name
    -v, --verbose         be more verbose (show symbol address, etc)
</pre>

<p><b>perf sched latency</b> will summarize scheduler latencies by task, including average and maximum delay:</p>

<pre class="ten">
# <b>perf sched latency</b>

 -----------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | <b>Maximum delay ms</b> | Maximum delay at       |
 -----------------------------------------------------------------------------------------------------------------
  cat:(6)               |     12.002 ms |        6 | avg:   17.541 ms | max:   29.702 ms | max at: 991962.948070 s
  ar:17043              |      3.191 ms |        1 | avg:   13.638 ms | max:   13.638 ms | max at: 991963.048070 s
  rm:(10)               |     20.955 ms |       10 | avg:   11.212 ms | max:   19.598 ms | max at: 991963.404069 s
  objdump:(6)           |     35.870 ms |        8 | avg:   10.969 ms | max:   16.509 ms | max at: 991963.424443 s
  :17008:17008          |    462.213 ms |       50 | avg:   10.464 ms | max:   35.999 ms | max at: 991963.120069 s
  grep:(7)              |     21.655 ms |       11 | avg:    9.465 ms | max:   24.502 ms | max at: 991963.464082 s
  fixdep:(6)            |     81.066 ms |        8 | avg:    9.023 ms | max:   19.521 ms | max at: 991963.120068 s
  mv:(10)               |     30.249 ms |       14 | avg:    8.380 ms | max:   21.688 ms | max at: 991963.200073 s
  ld:(3)                |     14.353 ms |        6 | avg:    7.376 ms | max:   15.498 ms | max at: 991963.452070 s
  recordmcount:(7)      |     14.629 ms |        9 | avg:    7.155 ms | max:   18.964 ms | max at: 991963.292100 s
  svstat:17067          |      1.862 ms |        1 | avg:    6.142 ms | max:    6.142 ms | max at: 991963.280069 s
  cc1:(21)              |   6013.457 ms |     1138 | avg:    5.305 ms | max:   44.001 ms | max at: 991963.436070 s
  gcc:(18)              |     43.596 ms |       40 | avg:    3.905 ms | max:   26.994 ms | max at: 991963.380069 s
  ps:17073              |     27.158 ms |        4 | avg:    3.751 ms | max:    8.000 ms | max at: 991963.332070 s
[...]
</pre>

<p>To shed some light on how this is instrumented and calculated, I'll show the events that led to the top entry's "Maximum delay at" of 29.702 ms. Here are the raw events from <tt>perf sched script</tt>:</p>

<pre class="ten">
      sh 17028 [001] 991962.918368:   sched:sched_wakeup_new: comm=sh pid=17030 prio=120 target_cpu=002
[...]
     cc1 16819 [002] 991962.948070:       sched:sched_switch: prev_comm=cc1 prev_pid=16819 prev_prio=120
                                                            prev_state=R ==> next_comm=sh next_pid=17030 next_prio=120
[...]
</pre>

<p>The time from the wakeup (991962.918368, which is in seconds) to the context switch (991962.948070) is 29.702 ms. This process is listed as "sh" (shell) in the raw events, but execs "cat" soon after, so is shown as "cat" in the <tt>perf sched latency</tt> output.</p>
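The arithmetic, spelled out with the timestamps from the two events above (in seconds):

```python
# Scheduler delay = context-switch time minus wakeup time, from the two
# perf sched script events shown above.
wakeup = 991962.918368      # sched:sched_wakeup_new for pid 17030
switched = 991962.948070    # sched:sched_switch to next_pid=17030
delay_ms = (switched - wakeup) * 1000.0   # 29.702 ms
```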

<p><b>perf sched map</b> shows all CPUs and context-switch events, with columns representing what each CPU was doing and when. It's the kind of data you see visualized in scheduler analysis GUIs (including <tt>perf timechart</tt>, with the layout rotated 90 degrees). Example output:</p>

<pre>
# <b>perf sched map</b>
                      *A0           991962.879971 secs A0 => perf:16999
                       A0     *B0   991962.880070 secs B0 => cc1:16863
          *C0          A0      B0   991962.880070 secs C0 => :17023:17023
  *D0      C0          A0      B0   991962.880078 secs D0 => ksoftirqd/0:6
   D0      C0 *E0      A0      B0   991962.880081 secs E0 => ksoftirqd/3:28
   D0      C0 *F0      A0      B0   991962.880093 secs F0 => :17022:17022
  *G0      C0  F0      A0      B0   991962.880108 secs G0 => :17016:17016
   G0      C0  F0     *H0      B0   991962.880256 secs H0 => migration/5:39
   G0      C0  F0     *I0      B0   991962.880276 secs I0 => perf:16984
   G0      C0  F0     *J0      B0   991962.880687 secs J0 => cc1:16996
   G0      C0 *K0      J0      B0   991962.881839 secs K0 => cc1:16945
   G0      C0  K0      J0 *L0  B0   991962.881841 secs L0 => :17020:17020
   G0      C0  K0      J0 *M0  B0   991962.882289 secs M0 => make:16637
   G0      C0  K0      J0 *N0  B0   991962.883102 secs N0 => make:16545
   G0     *O0  K0      J0  N0  B0   991962.883880 secs O0 => cc1:16819
   G0 *A0  O0  K0      J0  N0  B0   991962.884069 secs 
   G0  A0  O0  K0 *P0  J0  N0  B0   991962.884076 secs P0 => rcu_sched:7
   G0  A0  O0  K0 *Q0  J0  N0  B0   991962.884084 secs Q0 => cc1:16831
   G0  A0  O0  K0  Q0  J0 *R0  B0   991962.884843 secs R0 => cc1:16825
   G0 *S0  O0  K0  Q0  J0  R0  B0   991962.885636 secs S0 => cc1:16900
   G0  S0  O0 *T0  Q0  J0  R0  B0   991962.886893 secs T0 => :17014:17014
   G0  S0  O0 *K0  Q0  J0  R0  B0   991962.886917 secs 
[...]
</pre>

<p>This is an 8 CPU system, and you can see the 8 columns for each CPU starting from the left. Some CPU columns begin blank, as we've yet to trace an event on that CPU at the start of the profile. They quickly become populated.</p>

<p>The two-character codes you see ("A0", "C0") are identifiers for tasks, which are mapped on the right ("=>"). This is more compact than using process (task) IDs. The "*" shows which CPU had the context-switch event, and which task was switched in. For example, the very last line shows that at 991962.886917 (seconds) CPU 4 context-switched to K0 (a "cc1" process, PID 16945).</p>

<p>That example was from a busy system. Here's an idle system:</p>

<pre>
# <b>perf sched map</b>
                      *A0           993552.887633 secs A0 => perf:26596
  *.                   A0           993552.887781 secs .  => swapper:0
   .                  *B0           993552.887843 secs B0 => migration/5:39
   .                  *.            993552.887858 secs 
   .                   .  *A0       993552.887861 secs 
   .                  *C0  A0       993552.887903 secs C0 => bash:26622
   .                  *.   A0       993552.888020 secs 
   .          *D0      .   A0       993552.888074 secs D0 => rcu_sched:7
   .          *.       .   A0       993552.888082 secs 
   .           .      *C0  A0       993552.888143 secs 
   .      *.   .       C0  A0       993552.888173 secs 
   .       .   .      *B0  A0       993552.888439 secs 
   .       .   .      *.   A0       993552.888454 secs 
   .      *C0  .       .   A0       993552.888457 secs 
   .       C0  .       .  *.        993552.889257 secs 
   .      *.   .       .   .        993552.889764 secs 
   .       .  *E0      .   .        993552.889767 secs E0 => bash:7902
[...]
</pre>

<p>Idle CPUs are shown as ".".</p>

<p>Remember to examine the timestamp column to make sense of this visualization (GUIs use that as a dimension, which is easier to comprehend, but here the numbers are just listed). It's also only showing context switch events, and not scheduler latency. The newer <tt>timehist</tt> command has a visualization (-V) that can include wakeup events.</p>

<p><b>perf sched timehist</b> was added in Linux 4.10, and shows the scheduler latency by event, including the time the task was waiting to be woken up (<tt>wait time</tt>) and the scheduler latency after wakeup to running (<tt>sch delay</tt>). It's the scheduler latency that we're more interested in tuning. Example output:</p>

<pre>
# <b>perf sched timehist</b>
Samples do not have callchains.
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
  991962.879971 [0005]  perf[16984]                         0.000      0.000      0.000 
  991962.880070 [0007]  :17008[17008]                       0.000      0.000      0.000 
  991962.880070 [0002]  cc1[16880]                          0.000      0.000      0.000 
  991962.880078 [0000]  cc1[16881]                          0.000      0.000      0.000 
  991962.880081 [0003]  cc1[16945]                          0.000      0.000      0.000 
  991962.880093 [0003]  ksoftirqd/3[28]                     0.000      0.007      0.012 
  991962.880108 [0000]  ksoftirqd/0[6]                      0.000      0.007      0.030 
  991962.880256 [0005]  perf[16999]                         0.000      0.005      0.285 
  991962.880276 [0005]  migration/5[39]                     0.000      0.007      0.019 
  991962.880687 [0005]  perf[16984]                         0.304      0.000      0.411 
  991962.881839 [0003]  cat[17022]                          0.000      0.000      1.746 
  991962.881841 [0006]  cc1[16825]                          0.000      0.000      0.000 
[...]
  991963.885740 [0001]  :17008[17008]                      25.613      0.000      0.057 
  991963.886009 [0001]  sleep[16999]                     1000.104      0.006      0.269 
  991963.886018 [0005]  cc1[17083]                         19.998      0.000      9.948 
</pre>

<p>This output includes the <tt>sleep</tt> command run to set the duration of perf itself to one second. Note that <tt>sleep</tt>'s wait time is 1000.104 milliseconds because I had run "sleep 1": that's the time it was asleep waiting for its timer wakeup event. Its scheduler latency was only 0.006 milliseconds, and its time on-CPU was 0.269 milliseconds.</p>

<p>There are a number of options to timehist, including -V to add a CPU visualization column, -M to add migration events, and -w for wakeup events. For example:</p>

<pre class="ten">
# <b>perf sched timehist -MVw</b>
Samples do not have callchains.
           time    cpu  012345678  task name           wait time  sch delay   run time
                                   [tid/pid]              (msec)     (msec)     (msec)
--------------- ------  ---------  ------------------  ---------  ---------  ---------
  991962.879966 [0005]             perf[16984]                                          awakened: perf[16999]
  991962.879971 [0005]       s     perf[16984]             0.000      0.000      0.000                                 
  991962.880070 [0007]         s   :17008[17008]           0.000      0.000      0.000                                 
  991962.880070 [0002]    s        cc1[16880]              0.000      0.000      0.000                                 
  991962.880071 [0000]             cc1[16881]                                           awakened: ksoftirqd/0[6]
  991962.880073 [0003]             cc1[16945]                                           awakened: ksoftirqd/3[28]
  991962.880078 [0000]  s          cc1[16881]              0.000      0.000      0.000                                 
  991962.880081 [0003]     s       cc1[16945]              0.000      0.000      0.000                                 
  991962.880093 [0003]     s       ksoftirqd/3[28]         0.000      0.007      0.012                                 
  991962.880108 [0000]  s          ksoftirqd/0[6]          0.000      0.007      0.030                                 
  991962.880249 [0005]             perf[16999]                                          awakened: migration/5[39]
  991962.880256 [0005]       s     perf[16999]             0.000      0.005      0.285                                 
  991962.880264 [0005]        m      migration/5[39]                                      migrated: perf[16999] cpu 5 => 1
  991962.880276 [0005]       s     migration/5[39]         0.000      0.007      0.019                                 
  991962.880682 [0005]        m      perf[16984]                                          migrated: cc1[16996] cpu 0 => 5
  991962.880687 [0005]       s     perf[16984]             0.304      0.000      0.411                                 
  991962.881834 [0003]             cat[17022]                                           awakened: :17020
[...]
  991963.885734 [0001]             :17008[17008]                                        awakened: sleep[16999]
  991963.885740 [0001]   s         :17008[17008]          25.613      0.000      0.057                           
  991963.886005 [0001]             sleep[16999]                                         awakened: perf[16984]
  991963.886009 [0001]   s         sleep[16999]         1000.104      0.006      0.269
  991963.886018 [0005]       s     cc1[17083]             19.998      0.000      9.948 
</pre>

<p>The CPU visualization column ("012345678") has "s" for context-switch events, and "m" for migration events, showing the CPU of the event.</p>

<p>The last events in that output include those related to the "sleep 1" command used to time perf. The wakeup happened at 991963.885734, and at 991963.885740 (6 microseconds later) CPU 1 begins to context-switch to the sleep process. The column for that event still shows ":17008[17008]" for what was on-CPU, but the target of the context switch (sleep) is not shown. It is in the raw events:</p>

<pre class="ten">
  :17008 17008 [001] 991963.885740:       sched:sched_switch: prev_comm=cc1 prev_pid=17008 prev_prio=120
                                                             prev_state=R ==> next_comm=<b>sleep</b> next_pid=16999 next_prio=120
</pre>

<p>The 991963.886005 event shows that the perf command received a wakeup while sleep was running (almost certainly sleep waking up its parent process because it terminated), and then we have the context switch on 991963.886009 where sleep stops running, and a summary is printed out: 1000.104 ms waiting (the "sleep 1"), with 0.006 ms scheduler latency, and 0.269 ms of CPU runtime.</p>
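<p>As a quick sanity check on reading these timestamps, the wakeup-to-switch delta works out to exactly that scheduler latency column (a small calculation using the values from the output above):</p>

```python
# sleep[16999] was awakened at 991963.885734 and began running at the
# 991963.885740 context switch; the delta should match the 0.006 ms
# "sch delay" shown in sleep's summary line.
wakeup = 991963.885734   # seconds
on_cpu = 991963.885740   # seconds

latency_ms = round((on_cpu - wakeup) * 1000, 3)
print(latency_ms)  # 0.006
```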

<p>Here I've decorated the timehist output with the details of the context switch destination in red:</p>

<pre class="ten">
  991963.885734 [0001]             :17008[17008]                                        awakened: sleep[16999]
  991963.885740 [0001]   s         :17008[17008]          25.613      0.000      0.057  <font color="#a00000">next: sleep[16999]</font>
  991963.886005 [0001]             sleep[16999]                                         awakened: perf[16984]
  991963.886009 [0001]   s         sleep[16999]         1000.104      0.006      0.269  <font color="#a00000">next: cc1[17008]</font>
  991963.886018 [0005]       s     cc1[17083]             19.998      0.000      9.948  <font color="#a00000">next: perf[16984]</font>
</pre>

<p>When sleep finished, a waiting "cc1" process then executed. perf ran on the following context switch, and is the last event in the profile (perf terminated). I've submitted a patch to add this info when a -n option is used.</p>

<p><b>perf sched script</b> dumps all events (similar to <tt>perf script</tt>):</p>

<pre class="ten">
# <b>perf sched script</b>
    perf 16984 [005] 991962.879960: sched:sched_stat_runtime: comm=perf pid=16984 runtime=3901506 [ns] vruntime=165...
    perf 16984 [005] 991962.879966:       sched:sched_wakeup: comm=perf pid=16999 prio=120 target_cpu=005
    perf 16984 [005] 991962.879971:       sched:sched_switch: prev_comm=perf prev_pid=16984 prev_prio=120 prev_stat...
    perf 16999 [005] 991962.880058: sched:sched_stat_runtime: comm=perf pid=16999 runtime=98309 [ns] vruntime=16405...
     cc1 16881 [000] 991962.880058: sched:sched_stat_runtime: comm=cc1 pid=16881 runtime=3999231 [ns] vruntime=7897...
  :17024 17024 [004] 991962.880058: sched:sched_stat_runtime: comm=cc1 pid=17024 runtime=3866637 [ns] vruntime=7810...
     cc1 16900 [001] 991962.880058: sched:sched_stat_runtime: comm=cc1 pid=16900 runtime=3006028 [ns] vruntime=7772...
     cc1 16825 [006] 991962.880058: sched:sched_stat_runtime: comm=cc1 pid=16825 runtime=3999423 [ns] vruntime=7876...
</pre>

Each of these events ("sched:sched_stat_runtime", etc.) is a tracepoint you can instrument directly using perf record. As I've shown earlier, this raw output can be useful for digging further than the summary commands.

That's it for now. Happy hunting.

[eBPF/bcc]: https://github.com/iovisor/bcc
[runqlat]: /blog/2016-10-08/linux-bcc-runqlat.html
[perf examples]: /perf.html#SchedulerAnalysis
]]></content:encoded>
      <dc:date>2017-03-16T00:00:00-07:00</dc:date>
    </item>
    <item>
      <title>Flame Graphs vs Tree Maps vs Sunburst</title>
      <link>http://www.brendangregg.com/blog/2017-02-06/flamegraphs-vs-treemaps-vs-sunburst.html</link>
<description><![CDATA[Yesterday I posted about flame graphs for file systems, showing how they can visualize where disk space is consumed. Many people have responded, citing other tools and visualizations they prefer: du, ncdu, treemaps, and the sunburst layout. Since there&#39;s so much interest in this subject, I&#39;ve visualized the same files here (the source for Linux 4.9-rc5) in different ways for comparison.
]]></description>
      <pubDate>Mon, 06 Feb 2017 00:00:00 -0800</pubDate>
      <guid>http://www.brendangregg.com/blog/2017-02-06/flamegraphs-vs-treemaps-vs-sunburst.html</guid>
<content:encoded><![CDATA[Yesterday I posted about [flame graphs for file systems], showing how they can visualize where disk space is consumed. Many people have [responded], citing other tools and visualizations they prefer: du, ncdu, treemaps, and the sunburst layout. Since there's so much interest in this subject, I've visualized the same files here (the source for Linux 4.9-rc5) in different ways for comparison.

## Flame Graphs

Using [FlameGraph] \(<a href="/blog/images/2017/files_linux49.svg">SVG</a>\):

<p><object data="/blog/images/2017/files_linux49.svg" type="image/svg+xml" width=720 height=218>
<img src="/blog/images/2017/files_linux49.svg" width=720 height=218 />
</object></p>

While you can mouse-over and click to zoom, at first glance the long labeled rectangles tell the big picture, by comparing their lengths and looking at the longest first:

<p><a href="/blog/images/2017/linux_flamegraph02.png"><img src="/blog/images/2017/linux_flamegraph02.png" width=720></a></p>

The drivers directory looks like it's over 50%, with drivers/net about 15% of the total. Many small rectangles are too thin to label, but they also matter less overall. You can imagine printing the flame graph on paper, or including a screen shot in a slide deck, and it will still convey many high level details in not much space. Here's an <a href="https://twitter.com/ppcelery/status/828779560376299521">example</a> someone just posted to twitter.

## Tree Map

Using [GrandPerspective] \(on OSX\):

<p><a href="/blog/images/2017/linux_treemap01.png"><img src="/blog/images/2017/linux_treemap01.png" width=720></a></p>

What can you tell on first glance? Not those big picture details (drivers 50%, etc). You can mouse over tree map boxes to get more details, which this screenshot doesn't convey. It is, however, easier to see that there are a handful of large files with those boxes in the top left, which are under drivers/gpu/drm/amd.

Using [Baobab] on Linux:

<p><a href="/blog/images/2017/linux_treemap03.png"><img src="/blog/images/2017/linux_treemap03.png" width=720></a></p>

You can see that the drivers directory is large from the tree list on the left, which includes mini bar graphs for a visual line length comparison (good). You can't see into subdirectories without clicking to expand.

<p><a href="/blog/images/2017/linux_treemap05.png"><img src="/blog/images/2017/linux_treemap05.png" width=277></a></p>

Here I've highlighted the drivers/net box. What percentage of the total is that, at first glance? It's a little more difficult to compare areas than lengths (compare with the flame graph earlier).
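The difficulty has a simple geometric basis. As a rough illustration (my own numbers, not taken from the tools): a size difference shows up fully in a bar's length, but only as its square root in the sides of a box of proportional area:

```python
import math

# Illustration: two items, one 21% bigger than the other.
small, big = 100.0, 121.0

# Flame graph: widths are proportional to size, so the difference is direct.
length_ratio = big / small             # 1.21: a 21% longer bar

# Tree map: areas are proportional to size, so each side of a square box
# only grows by the square root of the size ratio.
side_ratio = math.sqrt(big / small)    # 1.1: sides only 10% longer

print(length_ratio, side_ratio)
```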

This is also missing labels, when compared to the flame graph, although other tree maps like [SpaceMonger] do have them. An advantage to all tree maps is that we can more easily use vertical space.

## Sunburst

Using [Baobab] on Linux:

<p><a href="/blog/images/2017/linux_sunburst01.png"><img src="/blog/images/2017/linux_sunburst01.png" width=720></a></p>

This is a flame graph (which is an adjacency diagram with an inverted icicle layout), using polar coordinates. It is very pretty, and as someone said "it always wows". Sunbursts are the new pie chart.

<p><a href="/blog/images/2017/linux_sunburst03.png"><img src="/blog/images/2017/linux_sunburst03.png"></a></p>

Deeper slices exaggerate their size, and look visually larger, since their arcs are drawn at a greater radius. The problem is that this visualization requires comparing angles instead of lengths, which has been evaluated as a more difficult perceptual task. The slice I highlighted that looks larger is actually 25.6 Mbytes, and the smaller-looking one is 27.8 Mbytes.
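Here's the distortion in numbers (a sketch with made-up radii and total; the slice sizes are from the screenshot above): a slice's angle is proportional to its size, but the arc you actually see is angle times radius, so a smaller slice drawn deeper can present a longer arc than a larger slice drawn nearer the center.

```python
import math

TOTAL_MB = 800.0                   # assumed total, for illustration only

def arc_length(size_mb, radius):
    """Visible arc of a sunburst slice: angle (proportional to size) x radius."""
    angle = 2 * math.pi * size_mb / TOTAL_MB
    return angle * radius

deeper = arc_length(25.6, radius=5)   # the smaller slice, drawn deeper
nearer = arc_length(27.8, radius=3)   # the larger slice, nearer the center

print(deeper > nearer)   # the smaller slice presents the longer arc
```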

## ncdu

<pre>
--- /home/bgregg/linux-4.9-rc5 -----------------------------------------------
                         /..          
  405.1 MiB [##########] /drivers
  139.1 MiB [###       ] /arch
   37.5 MiB [          ] /fs
   36.0 MiB [          ] /include
   35.8 MiB [          ] /Documentation
   32.6 MiB [          ] /sound
   27.8 MiB [          ] /net
   14.7 MiB [          ] /tools
    7.5 MiB [          ] /kernel
    6.0 MiB [          ] /firmware
    3.7 MiB [          ] /lib
    3.4 MiB [          ] /scripts
    3.3 MiB [          ] /mm
    3.2 MiB [          ] /crypto
    2.4 MiB [          ] /security
    1.1 MiB [          ] /block
  968.0 KiB [          ] /samples
[...]
</pre>

This does have ASCII bar charts for line length comparisons, but it's only showing one directory level at a time.

## du

<pre>
$ du -hs * | sort -hr
406M	drivers
140M	arch
38M	fs
36M	include
36M	Documentation
33M	sound
28M	net
15M	tools
7.5M	kernel
6.1M	firmware
3.7M	lib
3.5M	scripts
3.4M	mm
3.2M	crypto
2.4M	security
1.2M	block
968K	samples
[...]
</pre>

Requires reading the numbers rather than comparing lengths visually, although it's so quick that it's my usual starting point.

## Which to use

If you're designing a file system usage tool, which should you use? Ideally, I'd make flame graphs, tree maps, and sunbursts all available as different ways to understand the same data set. For the default view, I'd probably use the flame graph, but I'd want to check with many sample file systems to ensure it really works best with the data it's visualizing.

For more about flame graphs see my ACMQ article [The Flame Graph] \(<a href="http://dl.acm.org/citation.cfm?id=2909476">CACM</a>\), and for more about different visualizations see the ACMQ article [A Tour through the Visualization Zoo] \(<a href="http://dl.acm.org/citation.cfm?id=1743546.1743567">CACM</a>\), by Jeffrey Heer, Michael Bostock, and Vadim Ogievetsky, especially the section on Hierarchies.

[flame graphs]: http://www.brendangregg.com/flamegraphs.html
[The Flame Graph]: http://queue.acm.org/detail.cfm?id=2927301
[A Tour through the Visualization Zoo]: http://queue.acm.org/detail.cfm?id=1805128
[files.pl]: https://raw.githubusercontent.com/brendangregg/FlameGraph/master/files.pl
[flamegraph.pl]: https://raw.githubusercontent.com/brendangregg/FlameGraph/master/flamegraph.pl
[FlameGraph]: https://github.com/brendangregg/FlameGraph
[SVG]: /blog/images/2017/files_linux49.svg
[GrandPerspective]: https://sourceforge.net/projects/grandperspectiv/?source=typ_redirect
[Baobab]: http://www.marzocca.net/linux/baobab/index.html
[previous post]: /blog/2017-02-05/file-system-flame-graph.html
[flame graphs for file systems]: /blog/2017-02-05/file-system-flame-graph.html
[responded]: https://news.ycombinator.com/item?id=13574825
[SpaceMonger]: http://www.aplusfreeware.com/categories/LFWV/SpaceMonger.html
]]></content:encoded>
      <dc:date>2017-02-06T00:00:00-08:00</dc:date>
    </item>
    <item>
      <title>Where has my disk space gone? Flame graphs for file systems</title>
      <link>http://www.brendangregg.com/blog/2017-02-05/file-system-flame-graph.html</link>
      <description><![CDATA[My laptop was recently running low on available disk space, and it was a mystery as to why. I have different tools to explore the file system, including running the &quot;find / -ls&quot; command from a terminal, but they can be time consuming to use. I wanted a big picture view of space by directories, subdirectories, and so on.
]]></description>
      <pubDate>Sun, 05 Feb 2017 00:00:00 -0800</pubDate>
      <guid>http://www.brendangregg.com/blog/2017-02-05/file-system-flame-graph.html</guid>
      <content:encoded><![CDATA[My laptop was recently running low on available disk space, and it was a mystery as to why. I have different tools to explore the file system, including running the "find / -ls" command from a terminal, but they can be time consuming to use. I wanted a big picture view of space by directories, subdirectories, and so on.

I've created a simple open source tool to do this, using flame graphs as the final visualization. To demonstrate, here's the space consumed by the Linux 4.9-rc5 source code. Click to zoom, and Ctrl-F to search ([SVG]):

<p><object data="/blog/images/2017/files_linux49.svg" type="image/svg+xml" width=720 height=218>
<img src="/blog/images/2017/files_linux49.svg" width=720 height=218 />
</object></p>

If you are new to flame graphs, see my [flame graphs] page. In this case, width corresponds to total size. I created them for visualizing stack traces, but since they are a generic hierarchical visualization (technically an [adjacency diagram with an inverted icicle layout]), they are suited for the hierarchy of directories as well.

I've also used this to diagnose a similar problem with a friend's laptop, which turned out to be due to a backup application consuming space in a directory completely unknown to them.

The following sections show how to create one yourself. Start by opening a terminal session so you can use the command line.

### Using git

If you have the "git" command, you can fetch the [FlameGraph] repository and run the commands from it:

<pre>
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./files.pl /Users | ./flamegraph.pl --hash --countname=bytes > out.svg
</pre>

Then open out.svg in a browser. Change "/Users" to be the directory you want to visualize. This could be "/" for everything, provided you don't have removable storage or network file systems mounted; if you do, they will be included in the report as well.

### Without git

If you don't have git, you can download the two Perl programs straight from github: [files.pl] and [flamegraph.pl], either using wget or by saving them from your browser. The steps can then be:

<pre>
chmod 755 files.pl flamegraph.pl
./files.pl /Users | ./flamegraph.pl --hash --countname=bytes > out.svg
</pre>

Again, change "/Users" to be the directory you want to visualize, then open out.svg in a browser.
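For the curious, the intermediate format these two programs pass between them is simple: files.pl emits one "folded" line per file, a semicolon-delimited path plus a byte count, which flamegraph.pl then merges into the hierarchical SVG. A rough Python equivalent (my own sketch, not the actual files.pl code) might look like:

```python
import os, sys

def folded_lines(root):
    """Walk root and yield one 'a;b;c.txt SIZE' folded line per file,
    the input format flamegraph.pl aggregates into the SVG."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size
            except OSError:
                continue  # file vanished mid-walk; skip it
            yield "%s %d" % (path.replace(os.sep, ";"), size)

if __name__ == "__main__":
    for line in folded_lines(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(line)
```

Piping that output into flamegraph.pl --countname=bytes should produce a similar result.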

### Linux source example

For reference, the Linux source example I included above was created using:

<pre>
files.pl linux-4.9-rc5 | flamegraph.pl --hash --countname=bytes \
    --title="Linux source tree by file size" --width=800 > files_linux49.svg
</pre>

You can customize the flame graph using options:

<pre>
$ ./flamegraph.pl -h
Option h is ambiguous (hash, height, help)
USAGE: ./flamegraph.pl [options] infile > outfile.svg

	--title       # change title text
	--width       # width of image (default 1200)
	--height      # height of each frame (default 16)
	--minwidth    # omit smaller functions (default 0.1 pixels)
	--fonttype    # font type (default "Verdana")
	--fontsize    # font size (default 12)
	--countname   # count type label (default "samples")
	--nametype    # name type label (default "Function:")
	--colors      # set color palette. choices are: hot (default), mem, io,
	              # wakeup, chain, java, js, perl, red, green, blue, aqua,
	              # yellow, purple, orange
	--hash        # colors are keyed by function name hash
	--cp          # use consistent palette (palette.map)
	--reverse     # generate stack-reversed flame graph
	--inverted    # icicle graph
	--negate      # switch differential hues (blue<->red)
	--help        # this message

	eg,
	./flamegraph.pl --title="Flame Graph: malloc()" trace.txt > graph.svg
</pre>

You might need to know about the --minwidth option: rectangles thinner than this (1/10th of a pixel when zoomed out) will be elided, to conserve space in the SVG. But that can mean things are missing when you zoom in. If it's a problem, you can set minwidth to 0.

Update: see my follow-on post [Flame Graphs vs Tree Maps vs Sunburst].

[flame graphs]: http://www.brendangregg.com/flamegraphs.html
[adjacency diagram with an inverted icicle layout]: http://queue.acm.org/detail.cfm?id=2927301
[files.pl]: https://raw.githubusercontent.com/brendangregg/FlameGraph/master/files.pl
[flamegraph.pl]: https://raw.githubusercontent.com/brendangregg/FlameGraph/master/flamegraph.pl
[FlameGraph]: https://github.com/brendangregg/FlameGraph
[SVG]: /blog/images/2017/files_linux49.svg
[Flame Graphs vs Tree Maps vs Sunburst]: http://www.brendangregg.com/blog/2017-02-06/flamegraphs-vs-treemaps-vs-sunburst.html
]]></content:encoded>
      <dc:date>2017-02-05T00:00:00-08:00</dc:date>
    </item>
    <item>
      <title>Golang bcc/BPF Function Tracing</title>
      <link>http://www.brendangregg.com/blog/2017-01-31/golang-bcc-bpf-function-tracing.html</link>
      <description><![CDATA[In this post I&#39;ll quickly investigate a new way to trace a Go program: dynamic tracing with Linux 4.x enhanced BPF (aka eBPF). If you search for Go and BPF, you&#39;ll find Go interfaces for using BPF (eg, gobpf). That&#39;s not what I&#39;m exploring here: I&#39;m using BPF to instrument a Go program for performance analysis and debugging. If you&#39;re new to BPF, I just summarized it at linux.conf.au a couple of weeks ago (youtube, slideshare).
]]></description>
      <pubDate>Tue, 31 Jan 2017 00:00:00 -0800</pubDate>
      <guid>http://www.brendangregg.com/blog/2017-01-31/golang-bcc-bpf-function-tracing.html</guid>
      <content:encoded><![CDATA[In this post I'll quickly investigate a new way to trace a Go program: dynamic tracing with Linux 4.x enhanced BPF (aka eBPF). If you search for Go and BPF, you'll find Go interfaces for using BPF (eg, [gobpf]). That's not what I'm exploring here: I'm using BPF to instrument a Go program for performance analysis and debugging. If you're new to BPF, I just summarized it at linux.conf.au a couple of weeks ago ([youtube], [slideshare]).

There's a number of ways so far to debug and trace Go already, including (and not limited to):

- [Debugging with gdb] and Go runtime support.
- The [go execution tracer] for high level execution and blocking events.
- GODEBUG with gctrace and schedtrace.

BPF tracing can do a lot more, but has its own pros and cons. I'll demonstrate, starting with a simple Go program, hello.go:

<pre>
package main

import "fmt"

func main() {
        fmt.Println("Hello, BPF!")
}
</pre>

I'll begin with a gccgo compilation, then do Go gc. (If you don't know the difference, try this [summary] by VonC: in short, gccgo can produce more optimized binaries, but for older versions of Go.)

## gccgo Function Counting

Compiling:

<pre>
$ <b>gccgo -o hello hello.go</b>
$ <b>./hello</b>
Hello, BPF!
</pre>

Now I'll use my [bcc] tool funccount to dynamically trace and count all Go library functions that begin with "fmt.", while I reran the hello program in another terminal session:

<pre>
# <b>funccount 'go:fmt.*'</b>
Tracing 160 functions for "go:fmt.*"... Hit Ctrl-C to end.
^C
FUNC                                    COUNT
fmt..import                                 1
fmt.padString.pN7_fmt.fmt                   1
fmt.fmt_s.pN7_fmt.fmt                       1
fmt.WriteString.pN10_fmt.buffer             1
fmt.free.pN6_fmt.pp                         1
fmt.fmtString.pN6_fmt.pp                    1
fmt.doPrint.pN6_fmt.pp                      1
fmt.init.pN7_fmt.fmt                        1
fmt.printArg.pN6_fmt.pp                     1
fmt.WriteByte.pN10_fmt.buffer               1
fmt.Println                                 1
fmt.truncate.pN7_fmt.fmt                    1
fmt.Fprintln                                1
fmt.$nested1                                1
fmt.newPrinter                              1
fmt.clearflags.pN7_fmt.fmt                  2
Detaching...
</pre>

Neat! The output contains the fmt.Println() called by the program, along with other calls.

I didn't need to run Go under any special mode to do this, and I can walk up to an already running Go process and begin doing this instrumentation, without restarting it. So how does it even work?

- It uses [Linux uprobes: User-Level Dynamic Tracing], added in Linux 3.5. It overwrites instructions with a soft interrupt to kernel instrumentation, and reverses the process when tracing has ended.
- The gccgo compiled output has a standard symbol table for function lookup.
- In this case, I'm instrumenting libgo (there's an assumed "lib" before this "go:"), as gccgo emits a dynamically linked binary. libgo has the fmt package.
- uprobes can attach to already running processes, or as I did here, instrument a binary and catch all processes that use it.
- For efficiency, I'm frequency counting the function calls in kernel context, and only emitting the counts to user space.

To the system, the binary looks like this:

<pre>
$ <b>file hello</b>
hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=4dc45f1eb023f44ddb32c15bbe0fb4f933e61815, not stripped
$ <b>ls -lh hello</b>
-rwxr-xr-x 1 bgregg root 29K Jan 12 21:18 hello
$ <b>ldd hello</b>
	linux-vdso.so.1 =>  (0x00007ffc4cb1a000)
	libgo.so.9 => /usr/lib/x86_64-linux-gnu/libgo.so.9 (0x00007f25f2407000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f25f21f1000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f25f1e27000)
	/lib64/ld-linux-x86-64.so.2 (0x0000560b44960000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f25f1c0a000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f25f1901000)
$ <b>objdump -tT /usr/lib/x86_64-linux-gnu/libgo.so.9 | grep fmt.Println</b>
0000000001221070 g     O .data.rel.ro	0000000000000008              fmt.Println$descriptor
0000000000978090 g     F .text	0000000000000075              fmt.Println
0000000001221070 g    DO .data.rel.ro	0000000000000008  Base        fmt.Println$descriptor
0000000000978090 g    DF .text	0000000000000075  Base        fmt.Println
</pre>

That looks a lot like a compiled C binary, which you can instrument using many existing debuggers and tracers, including bcc/BPF. It's a lot easier to instrument than runtimes that compile on the fly, like Java and Node.js. The only hitch so far is that function names can contain non-standard characters, like "." in this example.

funccount also has options like -p to match a PID, and -i to emit output every interval. It currently can only handle up to 1000 probes at a time, so "fmt.\*" was ok, but matching everything in libgo:

<pre>
# <b>funccount 'go:*'</b>
maximum of 1000 probes allowed, attempted 21178
</pre>

... doesn't work yet. Like many things in bcc/BPF, when this limitation becomes too much of a nuisance we'll find a way to fix it.

## Go gc Function Counting

Compiling using Go's gc compiler:

<pre>
$ <b>go build hello.go</b>
$ <b>./hello</b>
Hello, BPF!
</pre>

Now counting the fmt functions:

<pre>
# <b>funccount '/home/bgregg/hello:fmt.*'</b>
Tracing 78 functions for "/home/bgregg/hello:fmt.*"... Hit Ctrl-C to end.
^C
FUNC                                    COUNT
fmt.init.1                                  1
fmt.(*fmt).padString                        1
fmt.(*fmt).truncate                         1
fmt.(*fmt).fmt_s                            1
fmt.newPrinter                              1
fmt.(*pp).free                              1
fmt.Fprintln                                1
fmt.Println                                 1
fmt.(*pp).fmtString                         1
fmt.(*pp).printArg                          1
fmt.(*pp).doPrint                           1
fmt.glob.func1                              1
fmt.init                                    1
Detaching...
</pre>

You can still trace fmt.Println(), but funccount is now finding it in the binary itself rather than in libgo, because:

<pre>
$ <b>file hello</b>
hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped
$ <b>ls -lh hello</b>
-rwxr-xr-x 1 bgregg root 2.2M Jan 12 05:16 hello
$ <b>ldd hello</b>
	not a dynamic executable
$ <b>objdump -t hello | grep fmt.Println</b>
000000000045a680 g     F .text	00000000000000e0 fmt.Println
</pre>

It's a 2 Mbyte static binary that contains the function.

Another difference is that the function names contain more unusual symbols: "\*", "(", etc, which I suspect will trip up other debuggers until they are fixed to handle them (like bcc's trace was).

## gccgo Function Tracing

Now I'll try Sasha Goldshtein's trace tool, also from [bcc], to see per-event invocations of a function. Back using gccgo, and I'll start with this simple program from the [go tour], functions.go:

<pre>
package main

import "fmt"

func add(x int, y int) int {
	return x + y
}

func main() {
	fmt.Println(add(42, 13))
}
</pre>

Now tracing the add() function:

<pre>
# <b>trace '/home/bgregg/functions:main.add'</b>
PID    TID    COMM         FUNC             
14424  14424  functions    main.add  
</pre>

... and with both its arguments:

<pre>
# <b>trace '/home/bgregg/functions:main.add "%d %d" arg1, arg2'</b>
PID    TID    COMM         FUNC             -
14390  14390  functions    main.add         42 13
</pre>

Awesome, that worked. Both arguments are printed on the right.

trace has other options (try -h), such as for including timestamps and stack traces with the output.

## Go gc Function Tracing

Now the wheels start to come off the tracks... Same program, compiled with go build:

<pre>
$ <b>go build functions.go</b>

# <b>trace '/home/bgregg/functions:main.add "%d %d" arg1, arg2'</b>
could not determine address of symbol main.add

$ <b>objdump -t functions | grep main.add</b>
$
</pre>

No main.add()? Was it inlined? Disabling inlining:

<pre>
$ <b>go build -gcflags '-l' functions.go</b>
$ <b>objdump -t functions | grep main.add</b>
0000000000401000 g     F .text	0000000000000020 main.add
</pre>

Now it's back. Well that was easy. Tracing it and its arguments:

<pre>
# <b>trace '/home/bgregg/functions:main.add "%d %d" arg1, arg2'</b>
PID    TID    COMM         FUNC             -
16061  16061  functions    main.add         536912504 16
</pre>

That's wrong. The arguments should be 42 and 13, not 536912504 and 16.

Taking a peek with gdb:

<pre>
$ <b>gdb ./functions</b>
[...]
warning: File "/usr/share/go-1.6/src/runtime/runtime-gdb.py" auto-loading has been declined
 by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
[...]
(gdb) <b>b main.add</b>
Breakpoint 1 at 0x401000: file /home/bgregg/functions.go, line 6.
(gdb) <b>r</b>
Starting program: /home/bgregg/functions 
[New LWP 16082]
[New LWP 16083]
[New LWP 16084]

Thread 1 "functions" hit Breakpoint 1, main.add (x=42, y=13, ~r2=4300314240) at
 /home/bgregg/functions.go:6
6	        return x + y
(gdb) <b>i r</b>
rax            0xc820000180	859530330496
rbx            0x584ea0	5787296
rcx            0xc820000180	859530330496
rdx            0xc82005a048	859530698824
rsi            0x10	16
rdi            0xc82000a2a0	859530371744
rbp            0x0	0x0
rsp            0xc82003fed0	0xc82003fed0
r8             0x41	65
r9             0x41	65
r10            0x4d8ba0	5082016
r11            0x0	0
r12            0x10	16
r13            0x52a3c4	5415876
r14            0xa	10
r15            0x8	8
rip            0x401000	0x401000 <main.add>
eflags         0x206	[ PF IF ]
cs             0xe033	57395
ss             0xe02b	57387
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
</pre>

I included the startup warning about runtime-gdb.py, since it's helpful: if I want to dig deeper into Go context, I'll want to fix or source that. Even without it, gdb has shown the arguments as the variables "x=42, y=13".

I also dumped the registers to compare them to the x86_64 ABI, which is how bcc's trace reads them. From the syscall(2) man page:

<pre>
       arch/ABI      arg1  arg2  arg3  arg4  arg5  arg6  arg7  Notes
       ──────────────────────────────────────────────────────────────────
[...]
       x86_64        rdi   rsi   rdx   r10   r8    r9    -
</pre>

42 and 13 don't appear in rdi, rsi, or any of the other registers. The reason is that Go's gc compiler is not following the standard [AMD64 ABI] function calling convention, which causes problems with this and other debuggers. This is pretty annoying. (I've also heard this complained <a href="http://dtrace.org/blogs/wesolows/2014/12/29/golang-is-trash/">about</a> <a href="http://dtrace.org/blogs/ahl/2016/08/02/i-love-go-i-hate-go/">before</a>, coincidentally, by my former colleagues). I guess Go needed to use a different ABI for return values, since it can return multiple values, so even if the entry arguments were standard we'd still run into differences.

I've browsed the [Quick Guide to Go's Assembler] and the [Plan 9 assembly manual], and it looks like functions are passed on the stack. Here's our 42 and 13:

<pre>
(gdb) <b>x/3dg $rsp</b>
0xc82003fed0:	4198477	42
0xc82003fee0:	13
</pre>

BPF can dig these out too. As a proof of concept, I just hacked in a couple of new aliases, "go1" and "go2" for those entry arguments:

<pre>
# <b>trace '/home/bgregg/functions:main.add "%d %d" go1, go2'</b>
PID    TID    COMM         FUNC             -
17555  17555  functions    main.add         42 13
</pre>

Works. Hopefully by the time you read this post, I (or someone) will have finished this work and added it to the bcc trace tool. Preferably as "goarg1", "goarg2", etc.
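For the curious, here's a sketch of what such a hack involves (hypothetical code, not the actual bcc patch; the binary path and the SP+8/SP+16 offsets inferred from the gdb session above are assumptions):

```python
# Hypothetical bcc sketch: read main.add's x and y off the Go stack.
# Assumed layout at function entry: return address at SP+0, x at SP+8,
# y at SP+16, since Go's gc compiler passes arguments on the stack.
BPF_TEXT = r"""
#include <uapi/linux/ptrace.h>

int trace_add(struct pt_regs *ctx) {
    long x = 0, y = 0;
    bpf_probe_read(&x, sizeof(x), (void *)(PT_REGS_SP(ctx) + 8));
    bpf_probe_read(&y, sizeof(y), (void *)(PT_REGS_SP(ctx) + 16));
    bpf_trace_printk("main.add(%ld, %ld)\n", x, y);
    return 0;
}
"""

def trace_main_add(binary):
    from bcc import BPF          # requires bcc installed, and root to run
    b = BPF(text=BPF_TEXT)
    b.attach_uprobe(name=binary, sym="main.add", fn_name="trace_add")
    b.trace_print()              # stream the bpf_trace_printk output

# e.g. (as root): trace_main_add("/home/bgregg/functions")
```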

## Interface Arguments

I was going to trace the string argument to fmt.Println() as another example, but its argument is actually an "interface". From go's src/fmt/print.go:

<pre>
func Println(a ...interface{}) (n int, err error) {
    return Fprintln(os.Stdout, a...)
</pre>

With gdb you can dig out the string, eg, back to gccgo:

<pre>
$ <b>gdb ./hello</b>
[...]
(gdb) <b>b fmt.Println</b>
Breakpoint 1 at 0x401c50
(gdb) <b>r</b>
Starting program: /home/bgregg/hello 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff449c700 (LWP 16836)]
[New Thread 0x7ffff3098700 (LWP 16837)]
[Switching to Thread 0x7ffff3098700 (LWP 16837)]

Thread 3 "hello" hit Breakpoint 1, fmt.Println (a=...) at ../../../src/libgo/go/fmt/print.go:263
263	../../../src/libgo/go/fmt/print.go: No such file or directory.
(gdb) <b>p a</b>
$1 = {__values = 0xc208000240, __count = 1, __capacity = 1}
(gdb) <b>p a.__values</b>
$18 = (struct {...} *) 0xc208000240
(gdb) <b>p a.__values[0]</b>
$20 = {__type_descriptor = 0x4037c0 <__go_tdn_string>, __object = 0xc208000210}
(gdb) <b>x/s *0xc208000210</b>
0x403483:	"Hello, BPF!"
</pre>

So it can be read (and I'm sure there's an easier way with gdb, too). You could write a custom bcc/BPF program to dig this out, and we can add more aliases to bcc's trace program to deal with interface arguments.

## Function Latency

(Update) Here's a quick demo of function latency tracing:

<pre>
# <b>funclatency 'go:fmt.Println'</b>
Tracing 1 functions for "go:fmt.Println"... Hit Ctrl-C to end.
^C

Function = fmt.Println [3041]
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 27       |****************************************|
     16384 -> 32767      : 3        |****                                    |
Detaching...
</pre>

That's showing a histogram of latency (in nanoseconds) for fmt.Println(), which I was calling in a loop.
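
The power-of-two bucketing behind that histogram can be modeled in plain Python (a simplified user-space sketch of what the BPF program computes in-kernel with bpf\_log2l() and a histogram map; the function names here are mine):

```python
# Model of the power-of-two bucketing used by funclatency's histogram.
# In-kernel this is bpf_log2l() updating a BPF histogram map; here it's
# a plain dict, purely to illustrate the math.
from collections import defaultdict

def log2_slot(ns):
    """Return the histogram slot index for a latency in nanoseconds."""
    slot = 0
    while ns > 1:
        ns >>= 1
        slot += 1
    return slot

hist = defaultdict(int)
for latency_ns in [9000, 12000, 15000, 20000]:  # sample latencies
    hist[log2_slot(latency_ns)] += 1

# Slot 13 covers 8192 -> 16383 ns, matching the "8192 -> 16383" row above.
for slot in sorted(hist):
    print(f"{2**slot:>8} -> {2**(slot + 1) - 1:<8} : {hist[slot]}")
```

Keeping only a slot counter per bucket is why the in-kernel summary stays cheap: the per-event work is a handful of shifts and one map increment.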

WARNING: There are some unfortunate problems with this: if Go switches a goroutine to a different OS thread during the function call, then funclatency won't match the entry to the return. We'll need a new tool, <tt>gofunclatency</tt>, that uses Go's internal GOID for latency tracking instead of the OS's TID. There may also be problems with uretprobes modifying Go in a way that causes it to crash, which we'll need to debug and work around. See the comment by Suresh for details.
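
To see why TID keying breaks, here's a toy model in Python (an entirely hypothetical event stream, not real tool code): entry and return events are matched by a key, and when a goroutine migrates OS threads mid-call, a TID key fails where a GOID key succeeds:

```python
# Toy model: matching function entry/return events by key.
# Events: (kind, tid, goid, timestamp_ns). Hypothetical stream in which
# goroutine 7 enters on OS thread 101 and returns on thread 102.
events = [
    ("entry",  101, 7, 1000),
    ("return", 102, 7, 4000),   # goroutine migrated to another thread
]

def match_latencies(events, keyfunc):
    start, latencies, unmatched = {}, [], 0
    for kind, tid, goid, ts in events:
        key = keyfunc(tid, goid)
        if kind == "entry":
            start[key] = ts
        elif key in start:
            latencies.append(ts - start.pop(key))
        else:
            unmatched += 1
    return latencies, unmatched

# Keying by TID (what funclatency does): the return can't find its entry.
print(match_latencies(events, lambda tid, goid: tid))    # ([], 1)
# Keying by GOID (what a gofunclatency would do): matched, 3000 ns.
print(match_latencies(events, lambda tid, goid: goid))   # ([3000], 0)
```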

## Next Steps

I took a quick look at Golang with dynamic tracing and Linux enhanced BPF, via [bcc]'s funccount and trace tools, with some successes and some challenges. Counting function calls already works. Tracing function arguments also works when compiled with gccgo, whereas Go's gc compiler doesn't follow the standard ABI calling convention, so the tools need to be updated to support it. As a proof of concept I modified the bcc trace tool to show it could be done, but that feature needs to be coded properly and integrated. Processing interface objects and multi-return values will also be challenges: more areas where we can improve the tools, and add C macros for writing other custom Go observability tools.

Hopefully there will be a follow up post (not necessarily by me, feel free to take up the baton if this interests you) that shows improvements to bcc/BPF Go gc argument tracing, interfaces, and return values.

Another important tracing topic, which again can be a follow up post, is stack traces. Thankfully Go made frame pointer-based stack traces the default in 1.7.

Lastly, another important topic that could be a post by itself is tracing Go functions along with kernel context. BPF and bcc can instrument kernel functions, as well as user space, and I can imagine custom new tools that combine information from both.

[Debugging with gdb]: https://golang.org/doc/gdb
[go execution tracer]: https://golang.org/pkg/runtime/trace/
[gcvis]: https://github.com/davecheney/gcvis
[gotrace]: https://github.com/divan/gotrace
[summary]: http://stackoverflow.com/a/25811505
[Linux uprobes: User-Level Dynamic Tracing]: /blog/2015-06-28/linux-ftrace-uprobe.html
[bcc]: https://github.com/iovisor/bcc
[go tour]: https://tour.golang.org/basics/4
[AMD64 ABI]: https://en.wikipedia.org/wiki/X86_calling_conventions#System_V_AMD64_ABI
[Quick Guide to Go's Assembler]: https://golang.org/doc/asm
[Plan 9 assembly manual]: https://9p.io/sys/doc/asm.html
[gobpf]: https://github.com/iovisor/gobpf
[youtube]: https://www.youtube.com/watch?v=JRFNIKUROPE
[slideshare]: http://www.slideshare.net/brendangregg/bpf-tracing-and-more
]]></content:encoded>
      <dc:date>2017-01-31T00:00:00-08:00</dc:date>
    </item>
    <item>
      <title>Give me 15 minutes and I'll change your view of Linux tracing</title>
      <link>http://www.brendangregg.com/blog/2016-12-27/linux-tracing-in-15-minutes.html</link>
      <description><![CDATA[I gave this demo recently at USENIX/LISA 2016, showing ftrace, perf, and bcc/BPF. A video is on youtube:
]]></description>
      <pubDate>Tue, 27 Dec 2016 00:00:00 -0800</pubDate>
      <guid>http://www.brendangregg.com/blog/2016-12-27/linux-tracing-in-15-minutes.html</guid>
      <content:encoded><![CDATA[I gave this demo recently at USENIX/LISA 2016, showing ftrace, perf, and bcc/BPF. A video is on [youtube]:

<center><iframe width="595" height="335" src="https://www.youtube.com/embed/GsMs3n8CB6g" frameborder="0" allowfullscreen></iframe></center>

It was part of a larger talk on Linux 4.x Tracing Tools using BPF Superpowers. The slides are on [slideshare]:

<center><iframe src="//www.slideshare.net/slideshow/embed_code/key/2LUry5U7ho7tqe" width="595" height="375" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:1px; max-width: 100%;" allowfullscreen> </iframe></center>

The full talk video should be posted on the [usenix website] at some point.

My 15 (18) minute demo stepped through the evolution of recent built-in Linux tracers: ftrace (2008+) and its many capabilities, perf (2009+), and bcc/BPF (2015+), which provides the final programmatic abilities for advanced tracing, via enhanced BPF (aka eBPF). I suspect I might change people's view of Linux tracing, as these tracers &ndash; despite being built in to the Linux kernel &ndash; are still not widely known.

<div style="float:right;padding-left:10px;padding-right:5px;padding-top:1px;padding-bottom:0px"><a href="/blog/images/2016/LISA_BPF_lab.jpg"><img src="/blog/images/2016/LISA_BPF_lab.jpg" width=384 border=0></a><br><center><i><font size=-1>perf & BPF tutorial at LISA 2016</font></i></center></div>
Earlier at the conference, Sasha Goldshtein and I ran a half day perf & bcc/BPF tutorial. Both Sasha and I are not only bcc contributors, but also experienced classroom instructors, and it was a pleasure to collaborate with him on this project. It wasn't videoed, but the lab files are on <a href="https://github.com/goldshtn/linux-tracing-workshop">github</a>. If you are interested in learning bcc/BPF, there are also two tutorials I wrote in <a href="https://github.com/iovisor/bcc/tree/master/docs">bcc/docs</a> for using and developing bcc tools.

There was a lot of interest in both our tutorial and my talk &ndash; I imagine this interest will grow over time as more people deploy on Linux 4.x series kernels and can make use of BPF.

For more about Linux tracers, here are some resources:

- ftrace: <a href="https://lwn.net/Articles/608497/">The hidden light switch (lwn.net)</a>, <a href="https://github.com/brendangregg/perf-tools">perf-tools (github)</a>, [ftrace.txt]
- perf: <a href="http://www.brendangregg.com/perf.html">perf Examples</a>, [perf wiki]
- bcc: <a href="https://github.com/iovisor/bcc#tools">bcc/BPF tools</a>, and many posts here (listed on <a href="http://www.brendangregg.com/linuxperf.html">Linux Performance</a>).

Then there are also the add-on tracers, like SystemTap, LTTng, sysdig, etc., which I didn't cover in 15 minutes.

My 15 minute tracing demo was inspired by Greg Law's excellent cppcon talk <a href="https://www.youtube.com/watch?v=PorfLSr3DDI">Give me 15 minutes & I'll change your view of GDB</a>. Since then, I've also written about GDB here, with a <a href="/blog/2016-08-09/gdb-example-ncurses.html">full GDB example (tutorial)</a>.

LISA was a lot of fun. Thanks to those who were able to attend our events, and USENIX for putting on another great conference!

[youtube]: https://www.youtube.com/watch?v=GsMs3n8CB6g
[slideshare]: http://www.slideshare.net/brendangregg/linux-4x-tracing-tools-using-bpf-superpowers
[usenix website]: https://www.usenix.org/conference/lisa16/conference-program/presentation/linux-4x-tracing-tools-using-bpf-superpowers
[ftrace.txt]: https://www.kernel.org/doc/Documentation/trace/ftrace.txt
[perf wiki]: https://perf.wiki.kernel.org/index.php/Main_Page
]]></content:encoded>
      <dc:date>2016-12-27T00:00:00-08:00</dc:date>
    </item>
    <item>
      <title>Linux bcc/BPF tcplife: TCP Lifespans</title>
      <link>http://www.brendangregg.com/blog/2016-11-30/linux-bcc-tcplife.html</link>
      <description><![CDATA[&quot;i really wish i had a command line tool that would give me stats on TCP connection lengths on a given port&quot;
]]></description>
      <pubDate>Wed, 30 Nov 2016 00:00:00 -0800</pubDate>
      <guid>http://www.brendangregg.com/blog/2016-11-30/linux-bcc-tcplife.html</guid>
      <content:encoded><![CDATA["i really wish i had a command line tool that would give me stats on TCP connection lengths on a given port"

<pre>
# <b>./tcplife -D 80</b>
PID   COMM       LADDR           LPORT RADDR           RPORT TX_KB RX_KB MS
27448 curl       100.66.11.247   54146 54.154.224.174  80        0     1 263.85
27450 curl       100.66.11.247   20618 54.154.164.22   80        0     1 243.62
27452 curl       100.66.11.247   11480 54.154.43.103   80        0     1 231.16
27454 curl       100.66.11.247   31382 54.154.15.7     80        0     1 249.95
27456 curl       100.66.11.247   33416 52.210.59.223   80        0     1 545.72
27458 curl       100.66.11.247   16406 52.30.140.35    80        0     1 222.29
27460 curl       100.66.11.247   11634 52.30.133.135   80        0     1 217.52
27462 curl       100.66.11.247   25660 52.30.126.182   80        0     1 250.81
[...]
</pre>

That's tracing destination port 80. You can also trace a local port (-L), or trace all ports (the default behavior).

The quote and good idea is from Julia Evans on [twitter], and I added it as a tool to the Linux BPF-based [bcc] open source collection.

The output of tcplife, short for TCP lifespan, shows not just the duration (MS == milliseconds) but also throughput statistics: TX\_KB for Kbytes transmitted, and RX\_KB for Kbytes received. It should be useful for performance and security analysis, and network debugging.

## I am NOT tracing packets

Examining every packet can cost too much in overhead, especially on servers that process millions of packets per second. So I'm not using tcpdump, libpcap, tethereal, or any network sniffer.
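
Some back-of-envelope arithmetic shows why (my illustrative numbers, not measurements): even a cheap per-packet probe adds up at millions of packets per second, while per-session state changes are orders of magnitude rarer:

```python
# Rough overhead math (illustrative numbers, not measurements).
packets_per_sec = 1_000_000       # a busy server
probe_cost_ns   = 1_000           # assumed cost of one traced event (1 us)

# Tracing every packet: fraction of one CPU consumed by probes alone.
per_packet_cpu = packets_per_sec * probe_cost_ns / 1e9
print(f"per-packet tracing: {per_packet_cpu:.0%} of one CPU")   # 100%

# Tracing state changes: a handful per TCP session, not per packet.
sessions_per_sec = 1_000
changes_per_sess = 6              # rough count of state transitions
per_state_cpu = sessions_per_sec * changes_per_sess * probe_cost_ns / 1e9
print(f"state-change tracing: {per_state_cpu:.1%} of one CPU")  # 0.6%
```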

So how'd I do it? There were four challenges:

### 1. Measuring lifespans

<ul>
The current version of tcplife uses kernel dynamic tracing (kprobes) of tcp_set_state(), and looks for the duration from an early state (eg, TCP_ESTABLISHED) to TCP_CLOSE. State changes have a much lower frequency than packets, so this approach greatly reduces overhead. But it ends up trickier than it sounds, since it's tracing the Linux implementation of TCP, which isn't guaranteed to use tcp_set_state() for every state transition. Still, it works well enough for now, and we can add a stable tracepoint for TCP state transitions to future kernels so that this will always work. (Or, if that proves untenable, tracepoints for the creation and destruction of TCP sessions or sockets.)
</ul>
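
The core logic can be modeled in a few lines of Python (a user-space sketch; the real tool keeps this state in a BPF map keyed by the struct sock pointer, inside a kprobe handler):

```python
# User-space model of tcplife's lifespan logic: on the first state change
# seen for a socket, stamp the time; on TCP_CLOSE, emit the delta.
# The real state lives in a BPF map keyed by the struct sock * address.
TCP_ESTABLISHED, TCP_SYN_SENT, TCP_CLOSE = 1, 2, 7  # kernel tcp_states.h values

birth = {}   # sock id -> first-seen timestamp (ns)

def tcp_set_state(sk, state, now_ns):
    """Model of the traced state transition; returns lifespan in ms on close."""
    if state == TCP_CLOSE:
        start = birth.pop(sk, None)
        if start is not None:
            return (now_ns - start) / 1e6   # milliseconds
        return None                          # never saw this socket's birth
    birth.setdefault(sk, now_ns)             # stamp on the earliest state seen
    return None

# A connection that lives 250 ms:
assert tcp_set_state(0xdead, TCP_SYN_SENT, 1_000_000) is None
assert tcp_set_state(0xdead, TCP_ESTABLISHED, 2_000_000) is None
print(tcp_set_state(0xdead, TCP_CLOSE, 251_000_000))   # 250.0
```

Note the setdefault(): stamping only the earliest state seen is what lets the tool time from whichever early state it happens to catch first.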

### 2. Fetching addresses and ports

<ul>
tcp_set_state() has <tt>struct sock *sk</tt> as an argument, and we can dig the details from that.
</ul>
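
One wrinkle worth noting: the remote port in struct sock is stored in network byte order, so a tool has to convert it before printing. A tiny standalone Python illustration (not bcc code):

```python
import struct

# A remote port of 80, as the 16-bit big-endian (network order) value
# it appears as in kernel socket structs:
wire = b"\x00\x50"

# A naive native read on x86 gives nonsense; a network-order read gives 80.
assert struct.unpack("<H", wire)[0] == 20480   # wrong: raw little-endian read
assert struct.unpack("!H", wire)[0] == 80      # right: ntohs-equivalent
print(struct.unpack("!H", wire)[0])            # 80
```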

### 3. Fetching throughput statistics

<ul>
Those TX_KB and RX_KB columns. This was only possible in the last couple of years thanks to the RFC-4898 additions to <tt>struct tcp_info</tt> in the Linux kernel: tcpi_bytes_acked and tcpi_bytes_received. I discussed these in my <a href="http://www.brendangregg.com/blog/2016-10-15/linux-bcc-tcptop.html">tcptop</a> blog post.
</ul>

### 4. Showing task context

<ul>
It's useful to show the PID and COMM (process name) with these connections, but TCP state changes aren't guaranteed to happen in the correct task context, so we can't just fetch the currently running task information. Nor is there a cached PID and comm in <tt>struct sock</tt> to print out. So I'm caching the task context on TCP state changes where it's usually valid, by virtue of implementation. It works, but is another area where we can do better, and can be addressed if and when we add stable TCP tracepoints.
</ul>

That's how tcplife works right now. In the future it may change and improve, especially if more TCP tracepoints become available. I'm also not the first to try this kind of TCP event tracing: Facebook already have their own BPF TCP event tool, although I think their implementation is a little different.

Here's the full USAGE message for tcplife:

<pre>
# <b>./tcplife -h</b>
usage: tcplife [-h] [-T] [-t] [-w] [-s] [-p PID] [-L LOCALPORT]
               [-D REMOTEPORT]

Trace the lifespan of TCP sessions and summarize

optional arguments:
  -h, --help            show this help message and exit
  -T, --time            include time column on output (HH:MM:SS)
  -t, --timestamp       include timestamp on output (seconds)
  -w, --wide            wide column output (fits IPv6 addresses)
  -s, --csv             comma seperated values output
  -p PID, --pid PID     trace this PID only
  -L LOCALPORT, --localport LOCALPORT
                        comma-separated list of local ports to trace.
  -D REMOTEPORT, --remoteport REMOTEPORT
                        comma-separated list of remote ports to trace.

examples:
    ./tcplife           # trace all TCP connect()s
    ./tcplife -t        # include time column (HH:MM:SS)
    ./tcplife -w        # wider colums (fit IPv6)
    ./tcplife -stT      # csv output, with times & timestamps
    ./tcplife -p 181    # only trace PID 181
    ./tcplife -L 80     # only trace local port 80
    ./tcplife -L 80,81  # only trace local ports 80 and 81
    ./tcplife -D 80     # only trace remote port 80
</pre>

And there is more [example output] in bcc, along with a [man page]. For more about bcc and BPF (eBPF), see the collection of documents on my [Linux performance page].

The only catch is that tcplife does need newer kernels (say, 4.4).

[twitter]: https://twitter.com/b0rk/status/765666624968003584
[bcc]: https://github.com/iovisor/bcc
[example output]: https://github.com/iovisor/bcc/blob/master/tools/tcplife_example.txt
[tcptop]: http://www.brendangregg.com/blog/2016-10-15/linux-bcc-tcptop.html
[man page]: https://github.com/iovisor/bcc/blob/master/man/man8/tcplife.8
[Linux performance page]: http://www.brendangregg.com/linuxperf.html#Documentation
]]></content:encoded>
      <dc:date>2016-11-30T00:00:00-08:00</dc:date>
    </item>
    <item>
      <title>DTrace for Linux 2016</title>
      <link>http://www.brendangregg.com/blog/2016-10-27/dtrace-for-linux-2016.html</link>
      <description><![CDATA[
]]></description>
      <pubDate>Thu, 27 Oct 2016 00:00:00 -0700</pubDate>
      <guid>http://www.brendangregg.com/blog/2016-10-27/dtrace-for-linux-2016.html</guid>
      <content:encoded><![CDATA[<div style="float:right;padding-left:10px;padding-right:5px;padding-top:0px;padding-bottom:1px"><a href="https://raw.githubusercontent.com/brendangregg/bcc/master/images/bcc_tracing_tools_2016.png"><img src="https://raw.githubusercontent.com/brendangregg/bcc/master/images/bcc_tracing_tools_2016.png" width=360 border=0></a></div>

With the final major capability for BPF tracing (timed sampling) merging in Linux 4.9-rc1, the Linux kernel now has raw capabilities similar to those provided by DTrace, the advanced tracer from Solaris. As a long time DTrace user and expert, this is an exciting milestone! On Linux, you can now analyze the performance of applications and the kernel using production-safe low-overhead custom tracing, with latency histograms, frequency counts, and more.

There have been many tracing projects for Linux, but the technology that finally merged didn’t start out as a tracing project at all: it began as enhancements to Berkeley Packet Filter (BPF), aka eBPF. At first, these enhancements let BPF redirect packets to create software-defined networks. Later on, support for tracing events was added, enabling programmatic tracing in Linux.

While BPF currently lacks a high-level language like DTrace, the front-ends available have been enough for me to create many BPF tools, some based on my older [DTraceToolkit]. In this post I'll describe how you can use these tools, the front-ends available, and discuss where the technology is going next.

## Screenshots

I've been adding BPF-based tracing tools to the open source [bcc] project (thanks to Brenden Blanco, of PLUMgrid, for leading bcc development). See the [bcc install] instructions. It will add a collection of tools under /usr/share/bcc/tools, including the following.

Tracing new processes:

<pre>
# <b>execsnoop</b>
PCOMM            PID    RET ARGS
bash             15887    0 /usr/bin/man ls
preconv          15894    0 /usr/bin/preconv -e UTF-8
man              15896    0 /usr/bin/tbl
man              15897    0 /usr/bin/nroff -mandoc -rLL=169n -rLT=169n -Tutf8
man              15898    0 /usr/bin/pager -s
nroff            15900    0 /usr/bin/locale charmap
nroff            15901    0 /usr/bin/groff -mtty-char -Tutf8 -mandoc -rLL=169n -rLT=169n
groff            15902    0 /usr/bin/troff -mtty-char -mandoc -rLL=169n -rLT=169n -Tutf8
groff            15903    0 /usr/bin/grotty
</pre>

Histogram of disk I/O latency:

<pre>
# <b>biolatency -m</b>
Tracing block device I/O... Hit Ctrl-C to end.
^C
     msecs           : count     distribution
       0 -> 1        : 96       |************************************  |
       2 -> 3        : 25       |*********                             |
       4 -> 7        : 29       |***********                           |
       8 -> 15       : 62       |***********************               |
      16 -> 31       : 100      |**************************************|
      32 -> 63       : 62       |***********************               |
      64 -> 127      : 18       |******                                |
</pre>

Tracing common ext4 operations slower than 5 milliseconds:

<pre>
# <b>ext4slower 5</b>
Tracing ext4 operations slower than 5 ms
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
21:49:45 supervise      3570   W 18      0           5.48 status.new
21:49:48 supervise      12770  R 128     0           7.55 run
21:49:48 run            12770  R 497     0          16.46 nsswitch.conf
21:49:48 run            12770  R 1680    0          17.42 netflix_environment.sh
21:49:48 run            12770  R 1079    0           9.53 service_functions.sh
21:49:48 run            12772  R 128     0          17.74 svstat
21:49:48 svstat         12772  R 18      0           8.67 status
21:49:48 run            12774  R 128     0          15.76 stat
21:49:48 run            12777  R 128     0           7.89 grep
21:49:48 run            12776  R 128     0           8.25 ps
21:49:48 run            12780  R 128     0          11.07 xargs
21:49:48 ps             12776  R 832     0          12.02 libprocps.so.4.0.0
21:49:48 run            12779  R 128     0          13.21 cut
[...]
</pre>

Tracing new active TCP connections (connect()):

<pre>
# <b>tcpconnect</b>
PID    COMM         IP SADDR            DADDR            DPORT
1479   telnet       4  127.0.0.1        127.0.0.1        23
1469   curl         4  10.201.219.236   54.245.105.25    80
1469   curl         4  10.201.219.236   54.67.101.145    80
1991   telnet       6  ::1              ::1              23
2015   ssh          6  fe80::2000:bff:fe82:3ac fe80::2000:bff:fe82:3ac 22
</pre>

Tracing DNS latency by tracing getaddrinfo()/gethostbyname() library calls:

<pre>
# <b>gethostlatency</b>
TIME      PID    COMM          LATms HOST
06:10:24  28011  wget          90.00 www.iovisor.org
06:10:28  28127  wget           0.00 www.iovisor.org
06:10:41  28404  wget           9.00 www.netflix.com
06:10:48  28544  curl          35.00 www.netflix.com.au
06:11:10  29054  curl          31.00 www.plumgrid.com
06:11:16  29195  curl           3.00 www.facebook.com
06:11:25  29404  curl          72.00 foo
06:11:28  29475  curl           1.00 foo
</pre>

Interval summaries of VFS operations by type:

<pre>
# <b>vfsstat</b>
TIME         READ/s  WRITE/s CREATE/s   OPEN/s  FSYNC/s
18:35:32:       231       12        4       98        0
18:35:33:       274       13        4      106        0
18:35:34:       586       86        4      251        0
18:35:35:       241       15        4       99        0
</pre>

Tracing off-CPU time with kernel and user stack traces (summarized in kernel), for a given PID:

<pre>
# <b>offcputime -d -p 24347</b>
Tracing off-CPU time (us) of PID 24347 by user + kernel stack... Hit Ctrl-C to end.
^C
[...]
    ffffffff810a9581 finish_task_switch
    ffffffff8185d385 schedule
    ffffffff81085672 do_wait
    ffffffff8108687b sys_wait4
    ffffffff81861bf6 entry_SYSCALL_64_fastpath
    --
    00007f6733a6b64a waitpid
    -                bash (24347)
        4952

    ffffffff810a9581 finish_task_switch
    ffffffff8185d385 schedule
    ffffffff81860c48 schedule_timeout
    ffffffff810c5672 wait_woken
    ffffffff8150715a n_tty_read
    ffffffff815010f2 tty_read
    ffffffff8122cd67 __vfs_read
    ffffffff8122df65 vfs_read
    ffffffff8122f465 sys_read
    ffffffff81861bf6 entry_SYSCALL_64_fastpath
    --
    00007f6733a969b0 read
    -                bash (24347)
        1450908
</pre>

Tracing MySQL query latency (via a USDT probe):

<pre>
# <b>mysqld_qslower `pgrep -n mysqld`</b>
Tracing MySQL server queries for PID 14371 slower than 1 ms...
TIME(s)        PID          MS QUERY
0.000000       18608   130.751 SELECT * FROM words WHERE word REGEXP '^bre.*n$'
2.921535       18608   130.590 SELECT * FROM words WHERE word REGEXP '^alex.*$'
4.603549       18608    24.164 SELECT COUNT(*) FROM words
9.733847       18608   130.936 SELECT count(*) AS count FROM words WHERE word REGEXP '^bre.*n$'
17.864776      18608   130.298 SELECT * FROM words WHERE word REGEXP '^bre.*n$' ORDER BY word
</pre>

Using the trace multi-tool to watch login requests, by instrumenting the pam library:

<pre>
# <b>trace 'pam:pam_start "%s: %s", arg1, arg2'</b>
TIME     PID    COMM         FUNC             -
17:49:45 5558   sshd         pam_start        sshd: root
17:49:47 5662   sudo         pam_start        sudo: root
17:49:49 5727   login        pam_start        login: bgregg
</pre>

Many tools have usage messages (-h), and all should have man pages and text files of example output in the bcc project.

## Out of necessity

In 2014, Linux tracing had some kernel summary features (from ftrace and perf\_events), but outside those we still had to dump-and-post-process data &ndash; a decades-old technique that has high overhead at scale. You couldn't frequency count process names, function names, stack traces, or other arbitrary data in the kernel. You couldn't save variables in one probe event, and then retrieve them in another, which meant that you couldn't measure latency (or time deltas) in custom places, and you couldn't create in-kernel latency histograms. You couldn't trace USDT probes. You couldn't even write custom programs. DTrace could do all these, but only on Solaris or BSD. On Linux, some out-of-tree tracers like SystemTap could serve these needs, but brought their own challenges. (For the sake of completeness: yes, you _could_ write kprobe-based kernel modules &ndash; but practically no one did.)
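
The dump-and-post-process cost is easy to ballpark (illustrative numbers of my choosing): shipping every event to user space moves megabytes per second, while an in-kernel summary transfers only a small map of counters:

```python
# Why dump-and-post-process hurts at scale (illustrative numbers).
events_per_sec = 1_000_000
record_bytes   = 64                       # one assumed trace record
dump_mb_per_sec = events_per_sec * record_bytes / 1e6
print(f"dump-and-post-process: {dump_mb_per_sec:.0f} MB/s of trace data")  # 64 MB/s

# In-kernel frequency counting transfers only the summary: one counter
# per unique key (eg, process name), read once at the end.
unique_keys = 500
summary_bytes = unique_keys * record_bytes
print(f"in-kernel summary: {summary_bytes} bytes total")   # 32000 bytes
```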

In 2014 I joined the Netflix cloud performance team. Having spent years as a DTrace expert, it might have seemed crazy for me to move to Linux. But I had some motivations, in particular seeking a greater challenge: performance tuning the Netflix cloud, with its rapid application changes, microservice architecture, and distributed systems. Sometimes this job involves systems tracing, for which I'd previously used DTrace. Without DTrace on Linux, I began by using what was built in to the Linux kernel, ftrace and perf\_events, and from them made a toolkit of tracing tools ([perf-tools]). They have been invaluable. But I couldn't do some tasks, particularly latency histograms and stack trace counting. We needed kernel tracing to be programmatic.

## What happened?

BPF adds programmatic capabilities to the existing kernel tracing facilities (tracepoints, kprobes, uprobes). It has been enhanced rapidly in the Linux 4.x series.

Timed sampling was the final major piece, and it landed in Linux 4.9-rc1 (<a href="https://lkml.org/lkml/2016/9/1/831">patchset</a>). Many thanks to Alexei Starovoitov (now working on BPF at Facebook), the lead developer behind these BPF enhancements.

The Linux kernel now has the following features built in (added between 2.6 and 4.9):

- Dynamic tracing, kernel-level (BPF support for kprobes)
- Dynamic tracing, user-level (BPF support for uprobes)
- Static tracing, kernel-level (BPF support for tracepoints)
- Timed sampling events (BPF with perf\_event\_open)
- PMC events (BPF with perf\_event\_open)
- Filtering (via BPF programs)
- Debug output (bpf\_trace\_printk())
- Per-event output (bpf\_perf\_event\_output())
- Basic variables (global & per-thread variables, via BPF maps)
- Associative arrays (via BPF maps)
- Frequency counting (via BPF maps)
- Histograms (power-of-2, linear, and custom, via BPF maps)
- Timestamps and time deltas (bpf\_ktime\_get\_ns(), and BPF programs)
- Stack traces, kernel (BPF stackmap)
- Stack traces, user (BPF stackmap)
- Overwrite ring buffers (perf\_event\_attr.write\_backward)

The front-end we are using is bcc, which provides both Python and lua interfaces. bcc adds:

- Static tracing, user-level (USDT probes via uprobes)
- Debug output (Python with BPF.trace\_pipe() and BPF.trace\_fields())
- Per-event output (BPF\_PERF\_OUTPUT macro and BPF.open\_perf\_buffer())
- Interval output (BPF.get\_table() and table.clear())
- Histogram printing (table.print\_log2\_hist())
- C struct navigation, kernel-level (bcc rewriter maps to bpf\_probe\_read())
- Symbol resolution, kernel-level (ksym(), ksymaddr())
- Symbol resolution, user-level (usymaddr())
- BPF tracepoint support (via TRACEPOINT\_PROBE)
- BPF stack trace support (incl. walk method for stack frames)
- Various other helper macros and functions
- Examples (under /examples)
- Many tools (under /tools)
- Tutorials (/docs/tutorial\*.md)
- Reference guide (/docs/reference\_guide.md)

I'd been holding off on this post until the last major feature was integrated, and now it has been in 4.9-rc1. There are still some minor missing things we have workarounds for, and additional things we might do, but what we have right now is worth celebrating. Linux now has advanced tracing capabilities built in.

## Safety

BPF and its enhancements are designed to be production safe, and it is used today in large scale production environments. But if you're determined, you may still be able to find a way to hang the kernel. That experience should be the exception rather than the rule, and such bugs will be fixed fast, especially since BPF is part of Linux. All eyes are on Linux.

We did hit a couple of non-BPF bugs during development that needed to be fixed: an RCU reentrancy issue, which could cause kernel hangs with funccount, and was fixed by the "bpf: map pre-alloc" patchset in 4.6, with a workaround in bcc for older kernels; and a uprobe memory accounting issue, which caused uprobe allocations to fail, and was fixed by the "uprobes: Fix the memcg accounting" patch in 4.8 and backported to earlier kernels (eg, it's in the current 4.4.27 and 4.4.0-45.66).

## Why did Linux tracing take so long?

Prior work had been split among several other tracers: there was never a consolidated effort on any single one. For more about this and other issues, see my 2014 [tracing summit talk]. One thing I didn't note there was the counter effect of partial solutions: some companies had found another tracer (SystemTap or LTTng) was sufficient for their specific needs, and while they have been happy to hear about BPF, contributing to its development wasn't a priority given their existing solution.

BPF has only been enhanced to do tracing in the last two years. This process could have gone faster, but early on there were zero full-time engineers working on BPF tracing. Alexei Starovoitov (BPF lead), Brenden Blanco (bcc lead), myself, and others, all had other priorities. I tracked my hours on this at Netflix (voluntarily), and I've spent around 7% of my time on BPF/bcc. It wasn't that much of a priority, in part because we had our own workarounds (including my perf-tools, which work on older kernels).

Now that BPF tracing has arrived, there are already tech companies on the lookout for BPF skills. I can still highly recommend [Netflix]. (If you're trying to hire _me_ for BPF skills, then I'm still very happy at Netflix!)

## Ease of use

What might appear to be the largest remaining difference between DTrace and bcc/BPF is ease of use. But it depends on what you're doing with BPF tracing. You may be:

- **Using BPF tools/metrics**: There should be no difference. Tools behave the same, GUIs can access similar metrics. Most people will use BPF in this way.
- **Developing tools/metrics**: bcc right now is much harder. DTrace has its own concise language, D, similar to awk, whereas bcc uses existing languages (C and Python or lua) with libraries. A bcc tool in C+Python may be a _lot_ more code than a D-only tool: 10x the lines, or more. However, many DTrace tools used shell wrapping to provide arguments and error checking, inflating the code to a much bigger size. The coding difficulty is also different: the rewriter in bcc can get fiddly, which makes some scripts much more complicated to develop (extra bpf\_probe\_read()s, requiring more knowledge of BPF internals). This should improve over time as planned improvements land.
- **Running common one-liners**: Fairly similar. DTrace could do many with the "dtrace" command, whereas bcc has a variety of multitools: trace, argdist, funccount, funclatency, etc.
- **Writing custom ad hoc one-liners**: With DTrace this was trivial, and accelerated advanced analysis by allowing rapid custom questions to be posed and answered by the system. bcc is currently limited by its multitools and their scope.

In short, if you're an end user of BPF tools, you shouldn't notice these differences. If you're an advanced user and tool developer (like me), bcc is a lot more difficult right now.

To show a current example of the bcc Python front-end, here's the code for tracing disk I/O and printing I/O size as a histogram:

<pre>
from bcc import BPF
from time import sleep

# load BPF program
b = BPF(text="""
<font color="#880000">#include &lt;uapi/linux/ptrace.h>
#include &lt;linux/blkdev.h>

BPF_HISTOGRAM(dist);

int kprobe__blk_account_io_completion(struct pt_regs *ctx, struct request *req)
{
	dist.increment(bpf_log2l(req->__data_len / 1024));
	return 0;
}</font>
""")

# header
print("Tracing... Hit Ctrl-C to end.")

# trace until Ctrl-C
try:
	sleep(99999999)
except KeyboardInterrupt:
	print

# output
b["dist"].print_log2_hist("kbytes")
</pre>

Note the embedded C (text=) in the Python code.

This gets the job done, but there's also room for improvement. Fortunately, we have time to do so: it will take many months before people are on Linux 4.9 and can use BPF, so we have time to create tools and front-ends.

## A higher-level language

An easier front-end, such as a higher-level language, may not improve adoption as much as you might imagine. Most people will use the canned tools (and GUIs), and only some of us will actually write them. But I'm not opposed to a higher-level language either, and some already exist, like SystemTap:

<pre>
#!/usr/bin/stap
/*
 * opensnoop.stp	Trace file open()s.  Basic version of opensnoop.
 */

probe begin
{
	printf("\n%6s %6s %16s %s\n", "UID", "PID", "COMM", "PATH");
}

probe syscall.open
{
	printf("%6d %6d %16s %s\n", uid(), pid(), execname(), filename);
}
</pre>

Wouldn't it be nice if we could have the SystemTap front-end with all its language integration and tapsets, with the high-performance kernel built in BPF back-end? Richard Henderson of Red Hat has already begun work on this, and has released an [initial version]!

There's also [ply], an entirely new higher-level language for BPF:

<pre>
#!/usr/bin/env ply

kprobe:SyS_*
{
    $syscalls[func].count()
}
</pre>

This is also promising.

Although, I think the real challenge for tool developers won't be the language: it will be knowing what to do with these new superpowers.

## How you can contribute

- **Promotion**: There are currently no marketing efforts for BPF tracing. Some companies know it and are using it (Facebook, Netflix, Github, and more), but it'll take years to become widely known. You can help by sharing articles and resources with others in the industry.
- **Education**: You can write articles, give meetup talks, and contribute to bcc documentation. Share case studies of how BPF has solved real issues, and provided value to your company.
- **Fix bcc issues**: See the [bcc issue list], which includes bugs and feature requests.
- **File bugs**: Use bcc/BPF, and file bugs as you find them.
- **New tools**: There are more observability tools to develop, but please don't be hasty: people are going to spend hours learning and using your tool, so make it as intuitive and excellent as possible (see my [docs]). As Mike Muuss has said about his [ping] program: "If I'd known then that it would be my most famous accomplishment in life, I might have worked on it another day or two and added some more options."
- **High-level language**: If the existing bcc front-end languages really bother you, maybe you can come up with something much better. If you build it in bcc you can leverage libbcc. Or, you could help the SystemTap BPF or ply efforts.
- **GUI integration**: Apart from the bcc CLI observability tools, how can this new information be visualized? Latency heat maps, flame graphs, and more.

## Other Tracers

What about SystemTap, ktap, sysdig, LTTng, etc? It's possible that they all have a future, either by using BPF, or by becoming better at what they specifically do. Explaining each will be a blog post by itself.

And DTrace itself? We're still using it at Netflix, on our FreeBSD-based CDN.

## Further bcc/BPF Reading

I've written a <a href="https://github.com/iovisor/bcc/blob/master/docs/tutorial.md">bcc/BPF Tool End-User Tutorial</a>, a <a href="https://github.com/iovisor/bcc/blob/master/docs/tutorial_bcc_python_developer.md">bcc Python Developer's Tutorial</a>, a <a href="https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md">bcc/BPF Reference Guide</a>, and contributed useful <a href="https://github.com/iovisor/bcc/tree/master/tools">/tools</a>, each with an <a href="https://github.com/iovisor/bcc/tree/master/tools">example.txt</a> file and <a href="https://github.com/iovisor/bcc/tree/master/man/man8">man page</a>. My prior posts about bcc & BPF include:

- <a href="http://www.brendangregg.com/blog/2015-05-15/ebpf-one-small-step.html">eBPF: One Small Step</a> (we later just called it BPF)
- <a href="http://www.brendangregg.com/blog/2015-09-22/bcc-linux-4.3-tracing.html">bcc: Taming Linux 4.3+ Tracing Superpowers</a>
- <a href="http://www.brendangregg.com/blog/2016-01-18/ebpf-stack-trace-hack.html">Linux eBPF Stack Trace Hack</a> (stack traces are now officially supported)
- <a href="http://www.brendangregg.com/blog/2016-01-20/ebpf-offcpu-flame-graph.html">Linux eBPF Off-CPU Flame Graph</a> ("  "  ")
- <a href="http://www.brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html">Linux Wakeup and Off-Wake Profiling</a> (" " ")
- <a href="http://www.brendangregg.com/blog/2016-02-05/ebpf-chaingraph-prototype.html">Linux Chain Graph Prototype</a> (" " ")
- <a href="http://www.brendangregg.com/blog/2016-02-08/linux-ebpf-bcc-uprobes.html">Linux eBPF/bcc uprobes</a>
- <a href="http://www.brendangregg.com/blog/2016-03-05/linux-bpf-superpowers.html">Linux BPF Superpowers</a>
- <a href="http://www.brendangregg.com/blog/2016-06-14/ubuntu-xenial-bcc-bpf.html">Ubuntu Xenial bcc/BPF</a>
- <a href="http://www.brendangregg.com/blog/2016-10-01/linux-bcc-security-capabilities.html">Linux bcc Tracing Security Capabilities</a>
- <a href="http://www.brendangregg.com/blog/2016-10-04/linux-bcc-mysqld-qslower.html">Linux MySQL Slow Query Tracing with bcc/BPF</a>
- <a href="http://www.brendangregg.com/blog/2016-10-06/linux-bcc-ext4dist-ext4slower.html">Linux bcc ext4 Latency Tracing</a>
- <a href="http://www.brendangregg.com/blog/2016-10-08/linux-bcc-runqlat.html">Linux bcc/BPF Run Queue (Scheduler) Latency</a>
- <a href="http://www.brendangregg.com/blog/2016-10-12/linux-bcc-nodejs-usdt.html">Linux bcc/BPF Node.js USDT Tracing</a>
- <a href="http://www.brendangregg.com/blog/2016-10-15/linux-bcc-tcptop.html">Linux bcc tcptop</a>
- <a href="http://www.brendangregg.com/blog/2016-10-21/linux-efficient-profiler.html">Linux 4.9's Efficient BPF-based Profiler</a>

I've also given a talk about bcc/BPF at Facebook's Performance@Scale event: <a href="http://www.brendangregg.com/blog/2016-03-05/linux-bpf-superpowers.html">Linux BPF Superpowers</a>. In December, I'm giving a tutorial and talk on BPF/bcc at <a href="https://www.usenix.org/conference/lisa16">USENIX LISA</a> in Boston.

(Update: I also have a new website, <a href="/ebpf.html">Linux eBPF Tools</a>.)

## Acknowledgements

- Van Jacobson and Steve McCanne, who created the original BPF as a packet filter.
- Barton P. Miller, Jeffrey K. Hollingsworth, and Jon Cargille, for inventing dynamic tracing, and publishing the paper: "Dynamic Program Instrumentation for Scalable Performance Tools", Scalable High-Performance Computing Conference (SHPCC), Knoxville, Tennessee, May 1994.
- kerninst (ParaDyn, UW-Madison), an early dynamic tracing tool that showed the value of dynamic tracing (late 1990's).
- Mathieu Desnoyers (of LTTng), the lead developer of kernel markers that led to tracepoints.
- IBM developed kprobes as part of DProbes. DProbes was combined with LTT to provide Linux dynamic tracing in 2000, but wasn't integrated.
- Bryan Cantrill, Mike Shapiro, and Adam Leventhal (Sun Microsystems), the core developers of DTrace, an awesome tool which proved that dynamic tracing could be production safe and easy to use (2004). Given the mechanics of dynamic tracing, this was a crucial turning point for the technology: that it became safe enough to be shipped _by default in Solaris_, an OS known for reliability.
- The many Sun Microsystems staff in marketing, sales, training, and other roles, for promoting DTrace and creating the awareness and desire for advanced system tracing.
- Roland McGrath (at Red Hat), the lead developer of utrace, which became uprobes.
- Alexei Starovoitov (PLUMgrid, then Facebook), the lead developer of enhanced BPF: the programmatic kernel components necessary.
- Many other Linux kernel engineers who contributed feedback, code, testing, and their own patchsets for the development of enhanced BPF (search lkml for BPF): Wang Nan, Daniel Borkmann, David S. Miller, Peter Zijlstra, and many others.
- Brenden Blanco (PLUMgrid), the lead developer of bcc.
- Sasha Goldshtein (Sela) developed tracepoint support in bcc, developed the most powerful bcc multitools trace and argdist, and contributed to USDT support.
- Vicent Mart&iacute; and others at Github engineering, for developing the lua front-end for bcc, and contributing parts of USDT.
- Allan McAleavy, Mark Drayton, and other bcc contributors for various improvements.

Thanks to Netflix for providing the environment and support where I've been able to contribute to BPF and bcc tracing, and help get them done. I've also contributed to tracing in general over the years by developing tracing tools (using TNF/prex, DTrace, SystemTap, ktap, ftrace, perf, and now bcc/BPF), and books, blogs, and talks.

Finally, thanks to [Deirdré] for editing another post.

## Conclusion

Linux doesn't have DTrace (the language), but it now does, in a way, have the DTraceToolkit (the tools).

The Linux 4.9 kernel has the final capabilities needed to support modern tracing, via enhancements to its built-in BPF engine. The hardest part is now done: kernel support. Future work now includes more performance CLI tools, alternate higher-level languages, and GUIs.

For customers of performance analysis products, this is also good news: you can now ask for latency histograms and heatmaps, CPU and off-CPU flame graphs, better latency breakdowns, and lower-cost instrumentation. Per-packet tracing and processing in user space is now the old inefficient way.

So when are you going to upgrade to Linux 4.9? Once it is officially released, new performance tools await: <tt>apt-get install bcc-tools</tt>. For updates on bcc/BPF tools, see <a href="https://github.com/iovisor/bcc">bcc (github)</a> and my <a href="/ebpf.html">eBPF Tools</a> page.

Enjoy!

Brendan

[eBPF]: http://www.brendangregg.com/blog/2015-05-15/ebpf-one-small-step.html
[perf-tools]: https://github.com/brendangregg/perf-tools
[DTraceToolkit]: https://github.com/opendtrace/toolkit
[bcc]: https://github.com/iovisor/bcc
[tracing summit talk]: http://www.slideshare.net/brendangregg/from-dtrace-to-linux
[bcc install]: https://github.com/iovisor/bcc/blob/master/INSTALL.md
[bcc#tools]: https://github.com/iovisor/bcc#tools
[hist triggers]: /blog/2016-06-08/linux-hist-triggers.html
[bcc issue list]: https://github.com/iovisor/bcc/issues
[ping]: http://ftp.arl.army.mil/~mike/ping.html
[docs]: https://github.com/iovisor/bcc/blob/master/CONTRIBUTING-SCRIPTS.md
[initial version]: https://lkml.org/lkml/2016/6/14/749
[ply]: https://wkz.github.io/ply/
[Deirdré]: /blog/2016-07-23/deirdre.html
[Netflix]: http://www.brendangregg.com/blog/2016-03-30/working-at-netflix-2016.html
]]></content:encoded>
      <dc:date>2016-10-27T00:00:00-07:00</dc:date>
    </item>
    <item>
      <title>Linux 4.9's Efficient BPF-based Profiler</title>
      <link>http://www.brendangregg.com/blog/2016-10-21/linux-efficient-profiler.html</link>
      <description><![CDATA[BPF-optimized profiling arrived in Linux 4.9-rc1 (patchset), allowing the kernel to profile via timed sampling and summarize stack traces. Here&#39;s how I explained it recently in a talk alongside other prior perf methods for CPU flame graph generation:
]]></description>
      <pubDate>Fri, 21 Oct 2016 00:00:00 -0700</pubDate>
      <guid>http://www.brendangregg.com/blog/2016-10-21/linux-efficient-profiler.html</guid>
      <content:encoded><![CDATA[BPF-optimized profiling arrived in Linux 4.9-rc1 ([patchset]), allowing the kernel to profile via timed sampling and summarize stack traces. Here's how I explained it recently in a talk alongside other prior perf methods for CPU [flame graph] generation:

<center><img src="/blog/images/2016/linux-profiling-perf-bpf.png" width=539 border=1></center>

Linux 4.9 skips needing the perf.data file entirely, and its associated overheads. I wrote about this as a missing BPF feature [in March]. It is now done.

In detail:

- **Linux 2.6**: <tt>perf record</tt> dumps all stack samples to a binary perf.data file for post-processing in user space. This is copied via a ring buffer, and perf wakes up an optimal number of times to read that buffer, so this has already been optimized somewhat. With low frequency sampling (&lt;50 Hertz), it would be hard to measure the runtime overhead. The post-processing, however, could take up to several seconds of CPU time (and some disk I/O for perf.data) for a busy application and many CPUs.
- **Linux 4.5**: The prior use of stackcollapse-perf.pl was to frequency count stack traces, but really the perf command already had this capability; it just couldn't emit output in the folded format (stack traces on single lines, with frames delimited by ";"). Linux 4.5 added a "-g folded" option to perf report, making this workflow more efficient. I blogged about this [previously].
- **Linux 4.9**: BPF was enhanced (eBPF) to attach to perf\_events, which are created by perf\_event\_open(), using a PERF\_EVENT\_IOC\_SET\_BPF ioctl() to attach a BPF program. Voilà! Timed samples can now run BPF programs. BPF already had the capability to walk stack traces and frequency count them (added in Linux 4.6); we just needed the ability to attach it to timed samples.
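The folded format mentioned above (Linux 4.5) is easy to generate once you have the samples: join each stack's frames with ";", then frequency count identical lines. A pure-Python sketch of that step (illustrative only; not perf's implementation):

```python
from collections import Counter

def fold(samples):
    """Frequency count stack samples into folded lines for flamegraph.pl.
    Each sample is a list of frames, ordered root to leaf."""
    counts = Counter(";".join(stack) for stack in samples)
    return ["%s %d" % (stack, n) for stack, n in sorted(counts.items())]

samples = [
    ["iperf", "entry_SYSCALL_64_fastpath", "sys_write", "vfs_write"],
    ["iperf", "entry_SYSCALL_64_fastpath", "sys_write", "vfs_write"],
    ["iperf", "entry_SYSCALL_64_fastpath", "sys_recvfrom"],
]
for line in fold(samples):
    print(line)
# iperf;entry_SYSCALL_64_fastpath;sys_recvfrom 1
# iperf;entry_SYSCALL_64_fastpath;sys_write;vfs_write 2
```

With Linux 4.9 BPF, this frequency counting happens in the kernel as samples arrive, so no per-sample post-processing like this is needed at all.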

I developed profile.py for [bcc tools] as a BPF profiler (see [example output]), which uses this new Linux 4.9 support. Here's a truncated screenshot:

<pre>
# <b>./profile.py 10</b>
Sampling at 49 Hertz of all threads by user + kernel stack for 10 secs.
[...]
    ffffffff81414025 copy_user_enhanced_fast_string
    ffffffff817b0774 tcp_sendmsg
    ffffffff817dd145 inet_sendmsg
    ffffffff8173ea48 sock_sendmsg
    ffffffff8173eae5 sock_write_iter
    ffffffff81232db3 __vfs_write
    ffffffff81233438 vfs_write
    ffffffff812348a5 sys_write
    ffffffff818737bb entry_SYSCALL_64_fastpath
    00007fae0a2614fd [unknown]
    0000000000020000 [unknown]
    -                iperf (21262)
        132

    ffffffff81414025 copy_user_enhanced_fast_string
    ffffffff8174e76b skb_copy_datagram_iter
    ffffffff817addb3 tcp_recvmsg
    ffffffff817dd0ae inet_recvmsg
    ffffffff8173e65d sock_recvmsg
    ffffffff8173e89a SYSC_recvfrom
    ffffffff8173fb9e sys_recvfrom
    ffffffff818737bb entry_SYSCALL_64_fastpath
    00007f9c16cf58bf recv
    -                iperf (21266)
        200
</pre>

The output is stack traces, process details, and a single number: the number of times each stack trace was sampled. The entire stack trace is used as the key in an in-kernel hash, so that only the summary is emitted to user space.
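That design is worth dwelling on: the per-sample work stays in the kernel, and only unique stacks with their counts cross the kernel->user boundary. A toy Python illustration of the saving (hypothetical stack strings):

```python
from collections import Counter

# 10,000 timed samples, but only a few unique stacks: with the in-kernel
# hash, just the unique (stack -> count) entries are copied to user space.
samples = ["a;b;c"] * 6000 + ["a;b;d"] * 3000 + ["a;e"] * 1000
summary = Counter(samples)
print(len(samples), "samples ->", len(summary), "hash entries")
# -> 10000 samples -> 3 hash entries
```

Compare that to the Linux 2.6 approach, where all 10,000 samples would be written to perf.data for post-processing.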

You can use my profile tool to output folded format directly, for flame graph generation. Here I'm also including -d, for stack delimiters, which the flame graph software will color grey:

<pre>
# <b>./profile.py -df 10</b>
[...]
iperf;[unknown];[unknown];-;entry_SYSCALL_64_fastpath;sys_write;vfs_write;__vfs_write;sock_write_iter;sock_sendmsg;inet_sendmsg;tcp_sendmsg;copy_user_enhanced_fast_string 140
iperf;recv;-;entry_SYSCALL_64_fastpath;sys_recvfrom;SYSC_recvfrom;sock_recvmsg;inet_recvmsg;tcp_recvmsg;skb_copy_datagram_iter;copy_user_enhanced_fast_string 177
</pre>

This can be read directly by flamegraph.pl -- it doesn't need stackcollapse-perf.pl (as pictured in the slide).

Here's the USAGE message for profile, which shows other capabilities:

<pre>
# <b>./profile.py -h</b>
usage: profile.py [-h] [-p PID] [-U | -K] [-F FREQUENCY] [-d] [-a] [-f]
                  [--stack-storage-size STACK_STORAGE_SIZE]
                  [duration]

Profile CPU stack traces at a timed interval

positional arguments:
  duration              duration of trace, in seconds

optional arguments:
  -h, --help            show this help message and exit
  -p PID, --pid PID     profile this PID only
  -U, --user-stacks-only
                        show stacks from user space only (no kernel space
                        stacks)
  -K, --kernel-stacks-only
                        show stacks from kernel space only (no user space
                        stacks)
  -F FREQUENCY, --frequency FREQUENCY
                        sample frequency, Hertz (default 49)
  -d, --delimited       insert delimiter between kernel/user stacks
  -a, --annotations     add _[k] annotations to kernel frames
  -f, --folded          output folded format, one line per stack (for flame
                        graphs)
  --stack-storage-size STACK_STORAGE_SIZE
                        the number of unique stack traces that can be stored
                        and displayed (default 2048)

examples:
    ./profile             # profile stack traces at 49 Hertz until Ctrl-C
    ./profile -F 99       # profile stack traces at 99 Hertz
    ./profile 5           # profile at 49 Hertz for 5 seconds only
    ./profile -f 5        # output in folded format for flame graphs
    ./profile -p 185      # only profile threads for PID 185
    ./profile -U          # only show user space stacks (no kernel)
    ./profile -K          # only show kernel space stacks (no user)
</pre>

I have an older version of this tool in the bcc tools/old directory, which works on Linux 4.6 to 4.8 using an old kprobe trick, although one that had caveats.

Thanks to Alexei Starovoitov (Facebook) for getting the kernel BPF profiling support done, and Teng Qin (Facebook) for adding bcc support yesterday.

We could have developed BPF profiling sooner, but since the 2.6 style of profiling was somewhat sufficient, BPF effort went to other, higher-priority areas. Now that BPF profiling has arrived, it means more than just saving CPU (and disk I/O) resources: imagine real-time flame graphs!

[patchset]: https://lkml.org/lkml/2016/9/1/831
[previously]: /blog/2016-04-30/linux-perf-folded.html
[in March]: /blog/2016-03-28/linux-bpf-bcc-road-ahead-2016.html
[flame graph]: http://www.brendangregg.com/flamegraphs.html
[bcc tools]: https://github.com/iovisor/bcc
[example output]: https://github.com/iovisor/bcc/blob/master/tools/profile_example.txt
]]></content:encoded>
      <dc:date>2016-10-21T00:00:00-07:00</dc:date>
    </item>
    <item>
      <title>Linux bcc tcptop</title>
      <link>http://www.brendangregg.com/blog/2016-10-15/linux-bcc-tcptop.html</link>
      <description><![CDATA[I recently wrote a tcptop tool using the new Linux BPF capabilities, which summarizes top active TCP sessions:
]]></description>
      <pubDate>Sat, 15 Oct 2016 00:00:00 -0700</pubDate>
      <guid>http://www.brendangregg.com/blog/2016-10-15/linux-bcc-tcptop.html</guid>
      <content:encoded><![CDATA[I recently wrote a tcptop tool using the new Linux BPF capabilities, which summarizes top active TCP sessions:

<pre>
# <b>tcptop</b>
Tracing... Output every 1 secs. Hit Ctrl-C to end
&lt;screen clears&gt;
19:46:24 loadavg: 1.86 2.67 2.91 3/362 16681

PID    COMM         LADDR                 RADDR                  RX_KB  TX_KB
16648  16648        100.66.3.172:22       100.127.69.165:6684        1      0
16647  sshd         100.66.3.172:22       100.127.69.165:6684        0   2149
14374  sshd         100.66.3.172:22       100.127.69.165:25219       0      0
14458  sshd         100.66.3.172:22       100.127.69.165:7165        0      0

PID    COMM         LADDR6                           RADDR6                            RX_KB  TX_KB
16681  sshd         fe80::8a3:9dff:fed5:6b19:22      fe80::8a3:9dff:fed5:6b19:16606        1      1
16679  ssh          fe80::8a3:9dff:fed5:6b19:16606   fe80::8a3:9dff:fed5:6b19:22           1      1
16680  sshd         fe80::8a3:9dff:fed5:6b19:22      fe80::8a3:9dff:fed5:6b19:16606        0      0
</pre>

The output has a single line system summary, then groups for IPv4 and IPv6 traffic, if present. tcptop is in the open source [bcc] project, along with many other BPF tools that work on the Linux 4.x series.

This information can be useful for performance analysis and general troubleshooting: who is this server talking to, and how much. You might discover unexpected traffic that can be eliminated with application changes, improving overall performance.

The current version has these options:

<pre>
# <b>tcptop -h</b>
usage: tcptop [-h] [-C] [-S] [-p PID] [interval] [count]

Summarize TCP send/recv throughput by host

positional arguments:
  interval           output interval, in seconds (default 1)
  count              number of outputs

optional arguments:
  -h, --help         show this help message and exit
  -C, --noclear      don't clear the screen
  -S, --nosummary    skip system summary line
  -p PID, --pid PID  trace this PID only

examples:
    ./tcptop           # trace TCP send/recv by host
    ./tcptop -C        # don't clear the screen
    ./tcptop -p 181    # only trace PID 181
</pre>

I'm fond of using -C, so that it prints rolling output without clearing the screen. That way I can examine it for time patterns, or copy interesting output to share with others. I'm tempted to make -C the default behavior, but instead I follow the expected screen-clearing behavior of the original top by William LeFebvre.

Other options and behavior may be added later. The current version doesn't truncate output to the screen size, but should (unless -C is used).

## Overhead

tcptop currently works by tracing send/receive at the TCP level, and summarizing session data in kernel context. The user-level bcc program wakes up each interval, fetches this summary, and prints it out.
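The summary step is just a map keyed by session, incremented on each traced send or receive. A pure-Python sketch of that aggregation (hypothetical event tuples and function names, not the bcc implementation):

```python
from collections import defaultdict

def top_sessions(events):
    """Summarize (direction, pid, comm, laddr, raddr, bytes) events into
    per-session RX/TX Kbytes, sorted by total throughput, busiest first."""
    sessions = defaultdict(lambda: [0, 0])   # key -> [rx_bytes, tx_bytes]
    for direction, pid, comm, laddr, raddr, size in events:
        sessions[(pid, comm, laddr, raddr)][0 if direction == "rx" else 1] += size
    rows = []
    for (pid, comm, laddr, raddr), (rx, tx) in sorted(
            sessions.items(), key=lambda kv: -(kv[1][0] + kv[1][1])):
        rows.append((pid, comm, laddr, raddr, rx // 1024, tx // 1024))
    return rows

# simulated traced events for one interval:
events = [
    ("tx", 16647, "sshd", "100.66.3.172:22", "100.127.69.165:6684", 1200000),
    ("tx", 16647, "sshd", "100.66.3.172:22", "100.127.69.165:6684", 950000),
    ("rx", 16648, "sshd", "100.66.3.172:22", "100.127.69.165:6684", 1024),
]
for row in top_sessions(events):
    print("%-6d %-12s %-21s %-21s %6d %6d" % row)
```

In the real tool the map lives in the kernel as a BPF map, and only the summarized rows are fetched each interval.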

You should be extremely cautious about anything that instruments the networking send/receive path due to the event rates possible: over 1 million per second. There are several reasons this bcc/BPF approach keeps overhead low:

- The kernel->user transfer is the summary only, and not a dump of every packet.
- The kernel->user transfer is infrequent: once per interval.
- Tracing at the TCP level may have a lower event rate than at the packet level, due to TCP buffering (I've seen a 3x difference, although by using jumbo frames the difference can be much smaller).
- The BPF instrumentation is JIT compiled.

The overhead is relative to TCP event rate (the rate of the TCP functions traced by the program: tcp\_sendmsg() and tcp\_recvmsg() or tcp\_cleanup\_rbuf()). Due to TCP buffering, this should be lower than the packet rate. You can measure the rate of these kernel functions using funccount, also in bcc.

I tested some sample production servers and found total rates of 4k to 15k TCP events per second. The CPU overhead for tcptop at these rates would range from 0.5% to 2.0% of one CPU. Your workloads may have higher rates and therefore higher overhead, or lower. The most extreme production case I've seen so far was a repository server under a load test of 5 Gbits/sec, where the TCP event rate was around 300k per second. I'd estimate the overhead of tcptop for that workload to be 40% of one CPU, or 1.3% overall (across 32 CPUs). Not very big, but not negligible either.
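These estimates follow a simple linear model: overhead ≈ event rate × per-event cost. A quick back-of-envelope check of the numbers above (a rough model only; it assumes per-event cost stays constant across rates):

```python
def overhead_pct(events_per_sec, cost_us_per_event, ncpus=1):
    """Percent of total CPU capacity consumed by the instrumentation."""
    busy_cpus = events_per_sec * cost_us_per_event / 1e6
    return 100.0 * busy_cpus / ncpus

# ~1.33 us/event is implied by 40% of one CPU at 300k events/sec:
cost_us = 0.40 * 1e6 / 300e3
print("implied per-event cost: %.2f us" % cost_us)
print("300k ev/s, 32 CPUs: %.2f%% overall" % overhead_pct(300e3, cost_us, ncpus=32))
print("15k ev/s: %.2f%% of one CPU" % overhead_pct(15e3, cost_us))
```

The same per-event cost puts 15k events/sec at about 2% of one CPU, consistent with the measured range.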

Can overhead be lowered further? I was wondering if we could add counters to struct sock or struct tcp, when Alexei Starovoitov (BPF lead developer) suggested I look at the new tcp\_info enhancements.

## tcp\_info and RFC-4898

Here's the latest tcp\_info from <a href="https://github.com/torvalds/linux/blob/master/include/uapi/linux/tcp.h">include/uapi/linux/tcp.h</a>:

<pre>
struct tcp_info {
	__u8	tcpi_state;
[...]
	__u64	<b>tcpi_bytes_acked;    /* RFC4898 tcpEStatsAppHCThruOctetsAcked */</b>
	__u64	<b>tcpi_bytes_received; /* RFC4898 tcpEStatsAppHCThruOctetsReceived */</b>
	__u32	tcpi_segs_out;	     /* RFC4898 tcpEStatsPerfSegsOut */
	__u32	tcpi_segs_in;	     /* RFC4898 tcpEStatsPerfSegsIn */

	__u32	tcpi_notsent_bytes;
	__u32	tcpi_min_rtt;
	__u32	tcpi_data_segs_in;	/* RFC4898 tcpEStatsDataSegsIn */
	__u32	tcpi_data_segs_out;	/* RFC4898 tcpEStatsDataSegsOut */

	__u64   tcpi_delivery_rate;
};
</pre>

These counters are **new**, with Eric Dumazet (Google) beginning to add the RFC4898 counters in 2015, and Martin KaFai Lau (Facebook) adding more this year.

Most interesting are tcpi\_bytes\_acked and tcpi\_bytes\_received, which made it into Linux 4.1. They are printed by "ss -nti":

<pre>
# <b>ss -nti</b>
State      Recv-Q Send-Q        Local Address:Port                       Peer Address:Port
ESTAB      0      0              100.66.3.172:22                        10.16.213.254:55277
	 cubic wscale:5,9 rto:264 rtt:63.487/0.067 ato:48 mss:1448 cwnd:58 <b>bytes_acked:175897
    bytes_received:15933</b> segs_out:618 segs_in:917 send 10.6Mbps lastsnd:176 lastrcv:196
    lastack:112 pacing_rate 21.2Mbps rcv_rtt:64 rcv_space:28960
ESTAB      0      0              100.66.3.172:22                        10.16.213.254:52066
	 cubic wscale:5,9 rto:264 rtt:63.509/0.064 ato:40 mss:1448 cwnd:54 <b>bytes_acked:47461
    bytes_received:28861</b> segs_out:746 segs_in:1285 send 9.8Mbps lastsnd:10732 lastrcv:10732
    lastack:10668 pacing_rate 19.7Mbps rcv_rtt:76 rcv_space:28960
[...]
</pre>

From [RFC4898]:

<pre>
   tcpEStatsAppHCThruOctetsAcked  OBJECT-TYPE
       SYNTAX          ZeroBasedCounter64
       UNITS           "octets"
       MAX-ACCESS      read-only
       STATUS          current
       DESCRIPTION
          "The number of <b>octets</b> for which cumulative <b>acknowledgments</b>
           have been <b>received</b>, on systems that can receive more than
           10 million bits per second.  Note that this will be the sum
           of changes in tcpEStatsAppSndUna."
       ::= { tcpEStatsAppEntry 5 }

   tcpEStatsAppHCThruOctetsReceived  OBJECT-TYPE
       SYNTAX          ZeroBasedCounter64
       UNITS           "octets"
       MAX-ACCESS      read-only
       STATUS          current
       DESCRIPTION
          "The number of <b>octets</b> for which cumulative <b>acknowledgments</b>
           have been <b>sent</b>, on systems that can transmit more than 10
           million bits per second.  Note that this will be the sum of
           changes in tcpEStatsAppRcvNxt."
       ::= { tcpEStatsAppEntry 8 }
</pre>

These counters are acknowledged octets (bytes) sent and received. Couldn't we fashion a tcptop from these that polled tcp\_info, similar to "ss -nti"? That way we could avoid instrumenting send & receive, potentially lowering overhead further.

There are at least two challenges with the tcp\_info polling approach:

1. **Short-lived and partial sessions**: Just like how top can miss short-lived processes and those that finish before the interval snapshot, a polling approach would miss short-lived sessions. Unfortunately, tcp\_info doesn't stay around for TIME-WAIT (likely as part of DoS mitigation), and applications can make many short-lived TCP connections (eg, to dependency services) that would be missed by polling. The solution would be to cache the previously polled tcp\_info session state by tcptop, and also instrument the TCP close paths with BPF, including fetching out the RFC-4898 counters. That way, short-lived sessions could be seen, and sessions that closed during an interval could also be measured as the difference between the previous poll cache and the TCP close counters.
1. **Overhead of polling tcp\_info**: "ss -nti" on some servers, with over 15k sessions, can take over 100 milliseconds of CPU time. This can be optimized down somewhat, but it's possible that for some workloads the overhead of tcp\_info polling (plus caching, and TCP close tracing) is higher than just TCP send/receive tracing.
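The caching scheme from point 1 amounts to diffing cumulative counters each interval, with close-time counters covering sessions that vanished. A sketch of that bookkeeping (hypothetical: session keys and counter pairs stand in for real tcp\_info fields):

```python
def interval_bytes(prev, curr, closed):
    """Per-interval byte deltas from cumulative tcp_info-style counters.
    prev/curr map session -> (bytes_acked, bytes_received) from the last
    and current polls; closed maps sessions that ended this interval to
    their final counters (captured by tracing the TCP close path)."""
    deltas = {}
    for sess, (acked, recvd) in curr.items():
        pa, pr = prev.get(sess, (0, 0))
        deltas[sess] = (acked - pa, recvd - pr)
    for sess, (acked, recvd) in closed.items():
        pa, pr = prev.get(sess, (0, 0))
        deltas[sess] = (acked - pa, recvd - pr)
    return deltas

prev = {"s1": (1000, 500)}
curr = {"s1": (4000, 700)}    # still open: delta vs the last poll
closed = {"s2": (90, 10)}     # short-lived: closed before it was ever polled
print(interval_bytes(prev, curr, closed))
# -> {'s1': (3000, 200), 's2': (90, 10)}
```

Even with this, the polling cost itself (point 2) remains, which is why send/receive tracing may still win for some workloads.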

It's possible the current implementation, involving TCP send/receive tracing, ends up hard to beat due to these other complications. I'm still investigating.

## Prior tcptop

I wrote the [original tcptop] years ago using DTrace, based on my earlier work with tcpsnoop. The lesson I learned from these was to minimize the number of dynamic probes used, since the TCP/IP stack will change between kernel versions, and the more probes you use the more brittle the program will be. Better still, use static tracepoints, which won't change.

Linux needs static tracepoints for TCP send & receive, after which tools like tcptop won't break between kernels. That's a topic for another post.

Thanks to Coburn (Netflix) for suggesting I try building this tool with BPF.

[bcc]: https://github.com/iovisor/bcc
[example output]: https://github.com/iovisor/bcc/blob/master/tools/profile_example.txt
[RFC4898]: http://www.ietf.org/rfc/rfc4898.txt
[original tcptop]: http://www.brendangregg.com/DTrace/tcptop
]]></content:encoded>
      <dc:date>2016-10-15T00:00:00-07:00</dc:date>
    </item>
    <item>
      <title>Linux bcc/BPF Node.js USDT Tracing</title>
      <link>http://www.brendangregg.com/blog/2016-10-12/linux-bcc-nodejs-usdt.html</link>
      <description><![CDATA[You may know that Node.js has built-in USDT (user statically-defined tracing) probes for performance analysis and debugging, but did you know that Linux now supports using them? And now that V8 has tracing, is this too late to matter? In this post I&#39;ll explain things a little with a basic USDT example.
]]></description>
      <pubDate>Wed, 12 Oct 2016 00:00:00 -0700</pubDate>
      <guid>http://www.brendangregg.com/blog/2016-10-12/linux-bcc-nodejs-usdt.html</guid>
      <content:encoded><![CDATA[You may know that Node.js has built-in USDT (user statically-defined tracing) probes for performance analysis and debugging, but did you know that Linux now supports using them? And now that [V8 has tracing], is this too late to matter? In this post I'll explain things a little with a basic USDT example.

The Linux 4.x series has been adding enhancements to BPF (Berkeley Packet Filter), originally created for software-defined networking, that now let it be used for programmatic tracing: aka [BPF superpowers]. These are built into Linux, so sooner or later, this is coming to everyone who runs Linux.

I wrote an example of instrumenting the node http-server-request USDT probe with BPF:

<pre>
# <b>./nodejs_http_server.py 24728</b>
TIME(s)            COMM             PID    ARGS
24653324.561322998 node             24728  path:/index.html
24653335.343401998 node             24728  path:/images/welcome.png
24653340.510164998 node             24728  path:/images/favicon.png
</pre>

The source is in [bcc] under [examples/tracing/nodejs_http_server.py]:

<div style="padding-left:20px;padding-right:10px"><script src="https://gist.github.com/brendangregg/592e2a927833fc6754d9192938186781.js"></script></div>

bcc uses C for the kernel instrumentation (which it compiles into BPF bytecode) and Python (or lua) for user-level reporting. It gets the job done, but is verbose. This example should really be even more verbose: I used a debug shortcut, bpf\_trace\_printk(), but a tool intended for concurrent use would need BPF\_PERF\_OUTPUT() instead (I explained how in the [bcc Python Tutorial]), which would inflate the code further.

This code ultimately runs the do\_trace() function when the http\_\_server\_\_request probe is hit; do\_trace() reads the 6th argument, the URL. You can see the argument definitions in src/node\_provider.d, eg:

<pre>
<b>probe http__server__request</b>(node_dtrace_http_server_request_t *h,
    node_dtrace_connection_t *c, const char *a, int p, const char *m,
    <b>const char *u</b>, int fd) : (node_http_request_t *h, node_connection_t *c,
</pre>

Let's check those other strings (char *'s). The bcc trace program can print them out, which allows some powerful ad hoc one-liners to be developed:

<pre>
# <b>trace -p `pgrep -n node` 'u:node:http__server__request "%s %s %s", arg3, arg5, arg6'</b>
TIME     PID    COMM         FUNC             -
21:28:14 827    node         http__server__request 127.0.0.1 GET /
21:28:15 827    node         http__server__request 127.0.0.1 GET /
21:28:16 827    node         http__server__request 127.0.0.1 GET /
^C
</pre>

You can also use bcc's tplist to list probes, eg, on a file:

<pre>
# <b>tplist -l /mnt/src/node-v6.7.0/node</b>
/mnt/src/node-v6.7.0/node node:gc__start
/mnt/src/node-v6.7.0/node node:gc__done
/mnt/src/node-v6.7.0/node node:http__server__response
/mnt/src/node-v6.7.0/node node:net__server__connection
/mnt/src/node-v6.7.0/node node:net__stream__end
/mnt/src/node-v6.7.0/node node:http__client__response
/mnt/src/node-v6.7.0/node node:http__client__request
/mnt/src/node-v6.7.0/node node:http__server__request
</pre>

... or on a running process:

<pre>
# <b>tplist -p `pgrep node`</b>
/mnt/src/node-v6.7.0/out/Release/node node:gc__start
/mnt/src/node-v6.7.0/out/Release/node node:gc__done
/mnt/src/node-v6.7.0/out/Release/node node:http__server__response
/mnt/src/node-v6.7.0/out/Release/node node:net__server__connection
/mnt/src/node-v6.7.0/out/Release/node node:net__stream__end
/mnt/src/node-v6.7.0/out/Release/node node:http__client__response
/mnt/src/node-v6.7.0/out/Release/node node:http__client__request
/mnt/src/node-v6.7.0/out/Release/node node:http__server__request
/lib/x86_64-linux-gnu/libc-2.23.so libc:setjmp
/lib/x86_64-linux-gnu/libc-2.23.so libc:longjmp
/lib/x86_64-linux-gnu/libc-2.23.so libc:longjmp_target
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_heap_new
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_sbrk_less
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_arena_reuse_free_list
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_arena_reuse_wait
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_arena_reuse
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_arena_new
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_arena_retry
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_heap_free
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_heap_less
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_heap_more
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_sbrk_more
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_malloc_retry
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_mallopt_free_dyn_thresholds
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_realloc_retry
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_memalign_retry
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_calloc_retry
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_mallopt
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_mallopt_mxfast
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_mallopt_arena_max
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_mallopt_arena_test
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_mallopt_mmap_max
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_mallopt_mmap_threshold
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_mallopt_top_pad
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_mallopt_trim_threshold
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_mallopt_perturb
/lib/x86_64-linux-gnu/libc-2.23.so libc:memory_mallopt_check_action
/lib/x86_64-linux-gnu/libc-2.23.so libc:lll_lock_wait_private
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:pthread_start
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:pthread_create
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:pthread_join
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:pthread_join_ret
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:mutex_init
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:mutex_destroy
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:mutex_acquired
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:mutex_entry
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:mutex_timedlock_entry
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:mutex_timedlock_acquired
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:mutex_release
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:rwlock_destroy
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:rdlock_acquire_read
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:rdlock_entry
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:wrlock_acquire_write
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:wrlock_entry
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:rwlock_unlock
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:cond_init
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:cond_destroy
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:cond_wait
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:cond_timedwait
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:cond_signal
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:cond_broadcast
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:lll_lock_wait_private
/lib/x86_64-linux-gnu/libpthread-2.23.so libpthread:lll_lock_wait
/lib/x86_64-linux-gnu/libm-2.23.so libm:slowatan2_inexact
/lib/x86_64-linux-gnu/libm-2.23.so libm:slowatan2
/lib/x86_64-linux-gnu/libm-2.23.so libm:slowlog_inexact
/lib/x86_64-linux-gnu/libm-2.23.so libm:slowlog
/lib/x86_64-linux-gnu/libm-2.23.so libm:slowatan_inexact
/lib/x86_64-linux-gnu/libm-2.23.so libm:slowatan
/lib/x86_64-linux-gnu/libm-2.23.so libm:slowtan
/lib/x86_64-linux-gnu/libm-2.23.so libm:slowasin
/lib/x86_64-linux-gnu/libm-2.23.so libm:slowacos
/lib/x86_64-linux-gnu/libm-2.23.so libm:slowsin
/lib/x86_64-linux-gnu/libm-2.23.so libm:slowcos
/lib/x86_64-linux-gnu/libm-2.23.so libm:slowexp_p6
/lib/x86_64-linux-gnu/libm-2.23.so libm:slowexp_p32
/lib/x86_64-linux-gnu/libm-2.23.so libm:slowpow_p10
/lib/x86_64-linux-gnu/libm-2.23.so libm:slowpow_p32
/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21 libstdcxx:catch
/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21 libstdcxx:throw
/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21 libstdcxx:rethrow
/lib/x86_64-linux-gnu/ld-2.23.so rtld:init_start
/lib/x86_64-linux-gnu/ld-2.23.so rtld:init_complete
/lib/x86_64-linux-gnu/ld-2.23.so rtld:map_failed
/lib/x86_64-linux-gnu/ld-2.23.so rtld:map_start
/lib/x86_64-linux-gnu/ld-2.23.so rtld:map_complete
/lib/x86_64-linux-gnu/ld-2.23.so rtld:reloc_start
/lib/x86_64-linux-gnu/ld-2.23.so rtld:reloc_complete
/lib/x86_64-linux-gnu/ld-2.23.so rtld:unmap_start
/lib/x86_64-linux-gnu/ld-2.23.so rtld:unmap_complete
/lib/x86_64-linux-gnu/ld-2.23.so rtld:setjmp
/lib/x86_64-linux-gnu/ld-2.23.so rtld:longjmp
/lib/x86_64-linux-gnu/ld-2.23.so rtld:longjmp_target
</pre>

This has picked up many other USDT probes (wow), from libc, libpthread, libm, libstdcxx, and rtld. Nice!

## Node.js and USDT

Last time I built Node.js with these USDT probes, I used the following steps:

<pre>
$ <b>sudo apt-get install systemtap-sdt-dev</b>   # adds "dtrace", used by node build
$ <b>wget https://nodejs.org/dist/v6.7.0/node-v6.7.0.tar.gz</b>
$ <b>tar xvf node-v6.7.0.tar.gz</b>
$ <b>cd node-v6.7.0</b>
$ <b>./configure --with-dtrace</b>
$ <b>make -j8</b>
</pre>

If you don't have bcc set up, you can use readelf to check that the USDT probes are built into the binary; they show up as "SystemTap probe descriptors":

<pre>
# <b>readelf -n /mnt/src/node-v6.7.0/node</b>
[...]
Displaying notes found at file offset 0x01814014 with length 0x000003c4:
  Owner                 Data size	Description
  stapsdt              0x0000003c	NT_STAPSDT (SystemTap probe descriptors)
    Provider: node
    Name: gc__start
    Location: 0x00000000011552f4, Base: 0x0000000001a444e4, Semaphore: 0x0000000001e13fdc
    Arguments: 4@%esi 4@%edx 8@%rdi
[...]
</pre>
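
As an aside, those Provider and Name lines are easy to scrape if you want a quick probe list without bcc. Here's my own throwaway sketch (it assumes the readelf output format shown above; parse_stapsdt_probes is a hypothetical helper, not part of bcc):

```python
import re

def parse_stapsdt_probes(readelf_output):
    """Extract provider:name pairs from `readelf -n` output, by scraping
    the Provider and Name lines printed for each NT_STAPSDT note."""
    providers = re.findall(r"Provider:\s*(\S+)", readelf_output)
    names = re.findall(r"Name:\s*(\S+)", readelf_output)
    return ["%s:%s" % (p, n) for p, n in zip(providers, names)]

sample = """
  stapsdt              0x0000003c	NT_STAPSDT (SystemTap probe descriptors)
    Provider: node
    Name: gc__start
    Location: 0x00000000011552f4, Base: 0x0000000001a444e4, Semaphore: 0x0000000001e13fdc
"""
print(parse_stapsdt_probes(sample))  # → ['node:gc__start']
```

The tplist output shown earlier is the more robust way to do this; this is just for systems without bcc installed.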

bcc supports both plain USDT probes and IS-ENABLED USDT probes; the latter are gated by a semaphore (visible in the readelf output above), which bcc increments in the target process while the probe is being traced.

There is another way to create USDT probes: the [dtrace-provider] library, which allows your Node.js code to dynamically declare new USDT probes. Last I checked, that library did not compile on Linux; however, with the new bcc/BPF support, it should be fixable.

In order to use USDT probes with bcc, you'll need a Linux kernel that's new and shiny. By Linux 4.4 (which is used by Ubuntu 16.04 LTS), there's enough BPF to do USDT event tracing, latency measurements, and histograms. Linux 4.6 adds stack trace support.

## V8 Tracing & Future Use

V8 has recently added an --enable-tracing option that generates a v8\_trace.json file for loading in Google Chrome's [trace viewer] (chrome://tracing/). Once this is widely available in node, many common tracing needs may be solved.

The long-term value of USDT support with bcc may be the instrumentation of other subsystems: node internals, system libraries, and the kernel, exposing these with Node.js context fetched from the USDT probes. BPF/bcc can also instrument kernel functions, kernel tracepoints, and user-level functions. Counting the available targets:

<pre>
# <b>objdump -j .text -tT node | head</b>

node:     file format elf64-x86-64

SYMBOL TABLE:
000000000079bd00 l    d  .text	0000000000000000              .text
000000000079e070 l     F .text	0000000000000000              deregister_tm_clones
000000000079e0b0 l     F .text	0000000000000000              register_tm_clones
000000000079e0f0 l     F .text	0000000000000000              __do_global_dtors_aux
000000000079e110 l     F .text	0000000000000000              frame_dummy
000000000079e140 l     F .text	000000000000002b              ssl_callback_ctrl
# <b>objdump -j .text -tT node | wc</b>
  76170  492908 9373788
# <b>wc /proc/kallsyms</b>
108919 339047 4659123 /proc/kallsyms
</pre>

That's over 75 thousand user-level probe points, plus over 108 thousand kernel probe points, plus those extra USDT probes I listed previously.
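
Not every symbol is a safe or useful probe target (some functions are inlined or blacklisted), but for a rough sketch of where the kernel count comes from, text symbols can be tallied from /proc/kallsyms like this (my own illustration, not a bcc tool):

```python
def count_kernel_text_symbols(kallsyms_text):
    """Count kernel text symbols (nm types t/T) from /proc/kallsyms-style
    lines -- a rough upper bound on kprobe-able kernel functions."""
    count = 0
    for line in kallsyms_text.splitlines():
        parts = line.split()
        # /proc/kallsyms format: address type name [module]
        if len(parts) >= 3 and parts[1] in ("t", "T"):
            count += 1
    return count

sample = """\
ffffffff81000000 T _text
ffffffff810000c1 t do_one_initcall
ffffffff81001000 D some_data
"""
print(count_kernel_text_symbols(sample))  # → 2
```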

If you can solve your performance issues using v8/node-specific capabilities like V8 tracing, then great! But for times when you really need to dig into the depths of your application and its OS interactions: bcc/BPF can do it, and you can make custom tools to automate it.

Adding USDT support to bcc is just the beginning, and I've shared some details on how it works in this post. The next step is to build useful tools and GUIs that make use of it.

[bcc]: https://github.com/iovisor/bcc
[example output]: https://github.com/iovisor/bcc/blob/master/tools/mysqld_qslower_example.txt
[here]: https://github.com/iovisor/bcc/blob/master/tools/mysqld_qslower.py
[BPF superpowers]: /blog/2016-03-05/linux-bpf-superpowers.html
[examples/tracing/nodejs_http_server.py]: https://github.com/iovisor/bcc/blob/master/examples/tracing/nodejs_http_server.py
[V8 has tracing]: https://github.com/v8/v8/wiki/Tracing-V8
[bcc Python Tutorial]: https://github.com/iovisor/bcc/blob/master/docs/tutorial_bcc_python_developer.md
[dtrace-provider]: https://github.com/chrisa/node-dtrace-provider
[trace viewer]: https://github.com/catapult-project/catapult/blob/master/tracing/README.md
]]></content:encoded>
      <dc:date>2016-10-12T00:00:00-07:00</dc:date>
    </item>
    <item>
      <title>Linux bcc/BPF Run Queue (Scheduler) Latency</title>
      <link>http://www.brendangregg.com/blog/2016-10-08/linux-bcc-runqlat.html</link>
      <description><![CDATA[I added this program to bcc earlier this year, and wanted to summarize it here as it fulfills an important need: examining scheduler run queue latency. It may not actually be a queue these days, but this metric has been called &quot;run queue latency&quot; for years: it&#39;s measuring the time from when a thread becomes runnable (eg, receives an interrupt, prompting it to process more work), to when it actually begins running on a CPU. Under CPU saturation, you can imagine threads have to wait their turn. But it can also happen for other weird scenarios, and there are cases where it can be tuned and reduced, improving overall system performance.
]]></description>
      <pubDate>Sat, 08 Oct 2016 00:00:00 -0700</pubDate>
      <guid>http://www.brendangregg.com/blog/2016-10-08/linux-bcc-runqlat.html</guid>
      <content:encoded><![CDATA[I added this program to bcc earlier this year, and wanted to summarize it here as it fulfills an important need: examining scheduler run queue latency. It may not actually be a queue these days, but this metric has been called "run queue latency" for years: it's measuring the time from when a thread becomes runnable (eg, receives an interrupt, prompting it to process more work), to when it actually begins running on a CPU. Under CPU saturation, you can imagine threads have to wait their turn. But it can also happen for other weird scenarios, and there are cases where it can be tuned and reduced, improving overall system performance.

The program is runqlat, and it summarizes run queue latency as a histogram. Here is a heavily loaded system:

<pre>
# <b>runqlat</b>
Tracing run queue latency... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -> 1          : 233      |***********                             |
         2 -> 3          : 742      |************************************    |
         4 -> 7          : 203      |**********                              |
         8 -> 15         : 173      |********                                |
        16 -> 31         : 24       |*                                       |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 30       |*                                       |
       128 -> 255        : 6        |                                        |
       256 -> 511        : 3        |                                        |
       512 -> 1023       : 5        |                                        |
      1024 -> 2047       : 27       |*                                       |
      2048 -> 4095       : 30       |*                                       |
      4096 -> 8191       : 20       |                                        |
      8192 -> 16383      : 29       |*                                       |
     16384 -> 32767      : 809      |****************************************|
     32768 -> 65535      : 64       |***                                     |
</pre>

The distribution is bimodal, with one mode between 0 and 15 microseconds, and another between 16 and 65 milliseconds. These modes are visible as the spikes in the ASCII distribution (which is merely a visual representation of the "count" column). As an example of reading one line: 809 events fell into the 16384 to 32767 microsecond range (16 to 32 ms) while tracing.
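
The power-of-2 bucketing behind this output is simple. Here's the idea in plain Python &ndash; a sketch of the concept only, since the real runqlat computes the log2 and increments a BPF histogram map in kernel context:

```python
def log2_bucket(usecs):
    """Return the (low, high) power-of-2 range a latency falls into,
    matching the histogram rows above: 0-1, 2-3, 4-7, 8-15, ..."""
    if usecs <= 1:
        return (0, 1)
    low = 1 << (usecs.bit_length() - 1)  # largest power of 2 <= usecs
    return (low, 2 * low - 1)

def histogram(latencies_us, width=40):
    """Count latencies per bucket and render the distribution column,
    scaling the '*' bars to the largest bucket, as runqlat does."""
    counts = {}
    for us in latencies_us:
        bucket = log2_bucket(us)
        counts[bucket] = counts.get(bucket, 0) + 1
    peak = max(counts.values())
    for (low, high), n in sorted(counts.items()):
        bar = "*" * (n * width // peak)
        print("%10d -> %-10d : %-8d |%-40s|" % (low, high, n, bar))

histogram([1, 2, 3, 3, 6, 20000])
```

For example, log2_bucket(20000) returns (16384, 32767): the same row the 809 events above fell into.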

A -m option can be used to show milliseconds instead, along with an interval and a count. For example, printing three five-second summaries in milliseconds:

<pre>
# <b>runqlat -m 5 3</b>
Tracing run queue latency... Hit Ctrl-C to end.

     msecs               : count     distribution
         0 -> 1          : 3818     |****************************************|
         2 -> 3          : 39       |                                        |
         4 -> 7          : 39       |                                        |
         8 -> 15         : 62       |                                        |
        16 -> 31         : 2214     |***********************                 |
        32 -> 63         : 226      |**                                      |

     msecs               : count     distribution
         0 -> 1          : 3775     |****************************************|
         2 -> 3          : 52       |                                        |
         4 -> 7          : 37       |                                        |
         8 -> 15         : 65       |                                        |
        16 -> 31         : 2230     |***********************                 |
        32 -> 63         : 212      |**                                      |

     msecs               : count     distribution
         0 -> 1          : 3816     |****************************************|
         2 -> 3          : 49       |                                        |
         4 -> 7          : 40       |                                        |
         8 -> 15         : 53       |                                        |
        16 -> 31         : 2228     |***********************                 |
        32 -> 63         : 221      |**                                      |
</pre>

This shows a similar distribution across the three summaries.

Here is the same system, but when it is CPU idle:

<pre>
# <b>runqlat 5 1</b>
Tracing run queue latency... Hit Ctrl-C to end.

     usecs               : count     distribution
         0 -> 1          : 2250     |********************************        |
         2 -> 3          : 2340     |**********************************      |
         4 -> 7          : 2746     |****************************************|
         8 -> 15         : 418      |******                                  |
        16 -> 31         : 93       |*                                       |
        32 -> 63         : 28       |                                        |
        64 -> 127        : 119      |*                                       |
       128 -> 255        : 9        |                                        |
       256 -> 511        : 4        |                                        |
       512 -> 1023       : 20       |                                        |
      1024 -> 2047       : 22       |                                        |
      2048 -> 4095       : 5        |                                        |
      4096 -> 8191       : 2        |                                        |
</pre>

Back to a microsecond scale, this time there is little run queue latency past 1 millisecond, as would be expected.

This tool can filter for specific pids or tids. Here is the USAGE message:

<pre>
# <b>runqlat -h</b>
usage: runqlat [-h] [-T] [-m] [-P] [-L] [-p PID] [interval] [count]

Summarize run queue (scheduler) latency as a histogram

positional arguments:
  interval            output interval, in seconds
  count               number of outputs

optional arguments:
  -h, --help          show this help message and exit
  -T, --timestamp     include timestamp on output
  -m, --milliseconds  millisecond histogram
  -P, --pids          print a histogram per process ID
  -L, --tids          print a histogram per thread ID
  -p PID, --pid PID   trace this PID only

examples:
    ./runqlat            # summarize run queue latency as a histogram
    ./runqlat 1 10       # print 1 second summaries, 10 times
    ./runqlat -mT 1      # 1s summaries, milliseconds, and timestamps
    ./runqlat -P         # show each PID separately
    ./runqlat -p 185     # trace PID 185 only
</pre>

Also in [bcc] is cpudist, written by Sasha Goldshtein, which shows the time threads spent running on-CPU, rather than the time waiting for a turn.

[bcc]: https://github.com/iovisor/bcc
[example output]: https://github.com/iovisor/bcc/blob/master/tools/capable_example.txt
]]></content:encoded>
      <dc:date>2016-10-08T00:00:00-07:00</dc:date>
    </item>
    <item>
      <title>Linux bcc ext4 Latency Tracing</title>
      <link>http://www.brendangregg.com/blog/2016-10-06/linux-bcc-ext4dist-ext4slower.html</link>
      <description><![CDATA[Storage I/O performance issues are often studied at the block device layer, but instrumenting the file system instead provides more relevant metrics for understanding how applications are affected.
]]></description>
      <pubDate>Thu, 06 Oct 2016 00:00:00 -0700</pubDate>
      <guid>http://www.brendangregg.com/blog/2016-10-06/linux-bcc-ext4dist-ext4slower.html</guid>
      <content:encoded><![CDATA[Storage I/O performance issues are often studied at the block device layer, but instrumenting the file system instead provides more relevant metrics for understanding how applications are affected.

My ext4dist tool does this for the ext4 file system, and traces reads, writes, opens, and fsyncs, and summarizes their latency as a power-of-2 histogram. For example:

<pre>
# <b>ext4dist</b>
Tracing ext4 operation latency... Hit Ctrl-C to end.
^C

operation = 'read'
     usecs               : count     distribution
         0 -> 1          : 1210     |****************************************|
         2 -> 3          : 126      |****                                    |
         4 -> 7          : 376      |************                            |
         8 -> 15         : 86       |**                                      |
        16 -> 31         : 9        |                                        |
        32 -> 63         : 47       |*                                       |
        64 -> 127        : 6        |                                        |
       128 -> 255        : 24       |                                        |
       256 -> 511        : 137      |****                                    |
       512 -> 1023       : 66       |**                                      |
      1024 -> 2047       : 13       |                                        |
      2048 -> 4095       : 7        |                                        |
      4096 -> 8191       : 13       |                                        |
      8192 -> 16383      : 3        |                                        |

operation = 'write'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 75       |****************************************|
        16 -> 31         : 5        |**                                      |

operation = 'open'
     usecs               : count     distribution
         0 -> 1          : 1278     |****************************************|
         2 -> 3          : 40       |*                                       |
         4 -> 7          : 4        |                                        |
         8 -> 15         : 1        |                                        |
        16 -> 31         : 1        |                                        |
</pre>

This output shows a bimodal distribution for read latency, with a faster mode of less than 7 microseconds and a slower mode between 256 and 1023 microseconds. The count column shows how many events fell into each latency range. It's likely that the faster mode is a hit from the in-memory file system cache, and the slower mode a read from a storage device (disk).

This "latency" is measured from when the operation was issued from the VFS interface to the file system, to when it completed. This spans everything: block device I/O (disk I/O), file system CPU cycles, file system locks, run queue latency, etc. This is a better measure of the latency suffered by applications reading from the file system than measuring down at the block device interface. Measuring at the block device level is better suited to other uses, such as resource capacity planning.

Note that this tool only traces the common file system operations previously listed: other file system operations (eg, inode operations including getattr()) are not traced.

ext4dist is a [bcc] tool that uses kernel dynamic tracing (via kprobes) with BPF. bcc is a front-end and a collection of tools that use new Linux enhanced BPF tracing capabilities.

I also wrote ext4slower, to trace these ext4 operations that are slower than a custom threshold. Eg, 1 millisecond:

<pre>
# <b>ext4slower 1</b>
Tracing ext4 operations slower than 1 ms
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
06:49:17 bash           3616   R 128     0           7.75 cksum
06:49:17 cksum          3616   R 39552   0           1.34 [
06:49:17 cksum          3616   R 96      0           5.36 2to3-2.7
06:49:17 cksum          3616   R 96      0          14.94 2to3-3.4
06:49:17 cksum          3616   R 10320   0           6.82 411toppm
06:49:17 cksum          3616   R 65536   0           4.01 a2p
06:49:17 cksum          3616   R 55400   0           8.77 ab
06:49:17 cksum          3616   R 36792   0          16.34 aclocal-1.14
06:49:17 cksum          3616   R 15008   0          19.31 acpi_listen
06:49:17 cksum          3616   R 6123    0          17.23 add-apt-repository
06:49:17 cksum          3616   R 6280    0          18.40 addpart
06:49:17 cksum          3616   R 27696   0           2.16 addr2line
06:49:17 cksum          3616   R 58080   0          10.11 ag
06:49:17 cksum          3616   R 906     0           6.30 ec2-meta-data
06:49:17 cksum          3616   R 6320    0          10.00 animate.im6
[...]
</pre>

This is great for proving or exonerating the storage subsystem (file systems, volume managers, and disks) as a source of high latency events. Let's say you had occasional 100 ms application request outliers, and suspected the cause may be a single slow I/O (of 100 ms) or several (adding up to 100 ms). Running "ext4slower 10" would print everything beyond 10 ms, proving or exonerating these theories. (Note that the cause could also be thousands of sub-1 ms I/O, which "ext4slower 10" would not catch.)
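
To make that reasoning concrete, here's a toy model (my own illustration, not part of bcc): given the file system I/O latencies behind a single application request, it separates what an "ext4slower 10" invocation would print from the time hidden below the threshold.

```python
def attribute_latency(io_ms, threshold_ms=10):
    """Split a request's file system I/O latencies into slow events
    (what `ext4slower <threshold>` would print) and the combined time
    spent in I/O faster than the threshold (which it would not)."""
    slow = [ms for ms in io_ms if ms > threshold_ms]
    fast_total = sum(ms for ms in io_ms if ms <= threshold_ms)
    return slow, fast_total

# One 100 ms outlier: ext4slower 10 catches it.
print(attribute_latency([0.25, 0.25, 100.0]))  # → ([100.0], 0.5)
# 200 x 0.5 ms I/O also add up to 100 ms, but none would be printed.
print(attribute_latency([0.5] * 200))          # → ([], 100.0)
```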

You can also use a threshold of "0" to dump all events, although the output will likely be far too verbose:

<pre>
# <b>ext4slower 0</b>
Tracing ext4 operations
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
06:58:05 supervise      1884   O 0       0           0.00 status.new
06:58:05 supervise      1884   W 18      0           0.02 status.new
06:58:05 supervise      1884   O 0       0           0.00 status.new
06:58:05 supervise      1884   W 18      0           0.01 status.new
06:58:05 supervise      15817  O 0       0           0.00 run
06:58:05 supervise      15817  R 92      0           0.00 run
06:58:05 supervise      15817  O 0       0           0.00 bash
06:58:05 supervise      15817  R 128     0           0.00 bash
06:58:05 supervise      15817  R 504     0           0.00 bash
06:58:05 supervise      15817  R 28      0           0.00 bash
06:58:05 supervise      15817  O 0       0           0.00 ld-2.19.so
[...]
</pre>

Do run ext4dist first, to check the rate of file system operations. If it's millions per second, then running "ext4slower 0" will try to print millions of lines of output per second &ndash; probably not what you want, and the tracing will add overhead to the system.

These tools are in [bcc], which has man pages (under /man/man8) and text files with more example screenshots (under /tools/\*\_example.txt). Each tool also has a USAGE message, eg:

<pre>
# <b>ext4slower -h</b>
usage: ext4slower [-h] [-j] [-p PID] [min_ms]

Trace common ext4 file operations slower than a threshold

positional arguments:
  min_ms             minimum I/O duration to trace, in ms (default 10)

optional arguments:
  -h, --help         show this help message and exit
  -j, --csv          just print fields: comma-separated values
  -p PID, --pid PID  trace this PID only

examples:
    ./ext4slower             # trace operations slower than 10 ms (default)
    ./ext4slower 1           # trace operations slower than 1 ms
    ./ext4slower -j 1        # ... 1 ms, parsable output (csv)
    ./ext4slower 0           # trace all operations (warning: verbose)
    ./ext4slower -p 185      # trace PID 185 only
</pre>

There are also equivalent tools in bcc for btrfs, xfs, and zfs (so far).

[bcc]: https://github.com/iovisor/bcc
[example output]: https://github.com/iovisor/bcc/blob/master/tools/capable_example.txt
]]></content:encoded>
      <dc:date>2016-10-06T00:00:00-07:00</dc:date>
    </item>
    <item>
      <title>Linux MySQL Slow Query Tracing with bcc/BPF</title>
      <link>http://www.brendangregg.com/blog/2016-10-04/linux-bcc-mysqld-qslower.html</link>
      <description><![CDATA[My mysqld_qslower tool prints MySQL queries slower than a given threshold, and is run on the MySQL server. By default, it prints queries slower than 1 millisecond:
]]></description>
      <pubDate>Tue, 04 Oct 2016 00:00:00 -0700</pubDate>
      <guid>http://www.brendangregg.com/blog/2016-10-04/linux-bcc-mysqld-qslower.html</guid>
      <content:encoded><![CDATA[My mysqld\_qslower tool prints MySQL queries slower than a given threshold, and is run on the MySQL server. By default, it prints queries slower than 1 millisecond:

<pre>
# <b>mysqld_qslower `pgrep -n mysqld`</b>
Tracing MySQL server queries for PID 14371 slower than 1 ms...
TIME(s)        PID          MS QUERY
0.000000       18608   130.751 SELECT * FROM words WHERE word REGEXP '^bre.*n$'
2.921535       18608   130.590 SELECT * FROM words WHERE word REGEXP '^alex.*$'
4.603549       18608    24.164 SELECT COUNT(*) FROM words
9.733847       18608   130.936 SELECT count(*) AS count FROM words WHERE word REGEXP '^bre.*n$'
17.864776      18608   130.298 SELECT * FROM words WHERE word REGEXP '^bre.*n$' ORDER BY word
</pre>

This is a bit like having a custom slow queries log, where the threshold can be picked on the fly.

It is a [bcc] tool that uses the MySQL USDT probes (user statically defined tracing) that were introduced for DTrace. bcc is a front-end and a collection of tools that use new Linux enhanced BPF tracing capabilities.

USDT support in bcc/BPF is new, and lets BPF programs be attached to USDT probes. For example, from mysqld\_qslower:

<pre>
# enable USDT probe from given PID
u = USDT(pid=pid)
u.enable_probe(probe="query__start", fn_name="do_start")
u.enable_probe(probe="query__done", fn_name="do_done")
</pre>

... and then fetching arguments from those USDT probes. This BPF code stores the timestamp and the query string pointer (read from arg1) in a hash keyed by the current thread ID, for lookup when the query completes:

<pre>
struct start_t {
    u64 ts;
    char *query;
};
BPF_HASH(start_tmp, u32, struct start_t);

int do_start(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid();
    struct start_t start = {};
    start.ts = bpf_ktime_get_ns();
    bpf_usdt_readarg(1, ctx, &start.query);
    start_tmp.update(&pid, &start);
    return 0;
};
</pre>

The full source to mysqld\_qslower is [here], and more [example output].
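
The query__done probe then looks up this record by thread ID to compute the latency. Stripped of the BPF details, the pairing logic amounts to the following (a plain-Python sketch that mirrors the snippet above, not the tool's actual code):

```python
import time

start_tmp = {}  # thread id -> (start timestamp in ns, query text)

def do_start(tid, query):
    # query__start: stash a timestamp and the query, keyed by thread
    start_tmp[tid] = (time.monotonic_ns(), query)

def do_done(tid, min_ns):
    # query__done: look up the start record, emit if slower than threshold
    rec = start_tmp.pop(tid, None)
    if rec is None:
        return None                  # missed the start event
    start_ns, query = rec
    delta_ns = time.monotonic_ns() - start_ns
    if delta_ns < min_ns:
        return None                  # faster than threshold: discard
    return (delta_ns / 1e6, query)   # latency in ms, as in the output above
```

In the real tool, the threshold comparison also happens in BPF context, so fast queries are discarded in the kernel at low overhead.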

The tplist tool from bcc can be used to list USDT probes from a pid or binary. Eg:

<pre>
# <b>tplist -l /usr/local/mysql/bin/mysqld</b>
/usr/local/mysql/bin/mysqld mysql:filesort__start
/usr/local/mysql/bin/mysqld mysql:filesort__done
/usr/local/mysql/bin/mysqld mysql:handler__rdlock__start
/usr/local/mysql/bin/mysqld mysql:handler__rdlock__done
/usr/local/mysql/bin/mysqld mysql:handler__unlock__done
/usr/local/mysql/bin/mysqld mysql:handler__unlock__start
/usr/local/mysql/bin/mysqld mysql:handler__wrlock__start
/usr/local/mysql/bin/mysqld mysql:handler__wrlock__done
/usr/local/mysql/bin/mysqld mysql:insert__row__start
/usr/local/mysql/bin/mysqld mysql:insert__row__done
/usr/local/mysql/bin/mysqld mysql:update__row__start
/usr/local/mysql/bin/mysqld mysql:update__row__done
/usr/local/mysql/bin/mysqld mysql:delete__row__start
/usr/local/mysql/bin/mysqld mysql:delete__row__done
/usr/local/mysql/bin/mysqld mysql:net__write__start
/usr/local/mysql/bin/mysqld mysql:net__write__done
/usr/local/mysql/bin/mysqld mysql:net__read__start
/usr/local/mysql/bin/mysqld mysql:net__read__done
/usr/local/mysql/bin/mysqld mysql:query__exec__start
/usr/local/mysql/bin/mysqld mysql:query__exec__done
/usr/local/mysql/bin/mysqld mysql:query__cache__miss
/usr/local/mysql/bin/mysqld mysql:query__cache__hit
/usr/local/mysql/bin/mysqld mysql:connection__start
/usr/local/mysql/bin/mysqld mysql:connection__done
/usr/local/mysql/bin/mysqld mysql:select__start
/usr/local/mysql/bin/mysqld mysql:select__done
/usr/local/mysql/bin/mysqld mysql:query__parse__start
/usr/local/mysql/bin/mysqld mysql:query__parse__done
/usr/local/mysql/bin/mysqld mysql:command__start
/usr/local/mysql/bin/mysqld mysql:command__done
/usr/local/mysql/bin/mysqld mysql:query__start
/usr/local/mysql/bin/mysqld mysql:query__done
/usr/local/mysql/bin/mysqld mysql:update__start
/usr/local/mysql/bin/mysqld mysql:update__done
/usr/local/mysql/bin/mysqld mysql:multi__update__start
/usr/local/mysql/bin/mysqld mysql:multi__update__done
/usr/local/mysql/bin/mysqld mysql:delete__start
/usr/local/mysql/bin/mysqld mysql:delete__done
/usr/local/mysql/bin/mysqld mysql:multi__delete__start
/usr/local/mysql/bin/mysqld mysql:multi__delete__done
/usr/local/mysql/bin/mysqld mysql:insert__start
/usr/local/mysql/bin/mysqld mysql:insert__done
/usr/local/mysql/bin/mysqld mysql:insert__select__start
/usr/local/mysql/bin/mysqld mysql:insert__select__done
/usr/local/mysql/bin/mysqld mysql:keycache__read__block
/usr/local/mysql/bin/mysqld mysql:keycache__read__miss
/usr/local/mysql/bin/mysqld mysql:keycache__read__done
/usr/local/mysql/bin/mysqld mysql:keycache__read__hit
/usr/local/mysql/bin/mysqld mysql:keycache__read__start
/usr/local/mysql/bin/mysqld mysql:keycache__write__block
/usr/local/mysql/bin/mysqld mysql:keycache__write__done
/usr/local/mysql/bin/mysqld mysql:keycache__write__start
/usr/local/mysql/bin/mysqld mysql:index__read__row__start
/usr/local/mysql/bin/mysqld mysql:index__read__row__done
/usr/local/mysql/bin/mysqld mysql:read__row__start
/usr/local/mysql/bin/mysqld mysql:read__row__done
</pre>

You can also use "readelf -n .../mysqld" to double-check. Applications that have these probes typically need to be compiled with --with-dtrace or --enable-dtrace on Linux for them to be included in the binary.

[bcc]: https://github.com/iovisor/bcc
[example output]: https://github.com/iovisor/bcc/blob/master/tools/mysqld_qslower_example.txt
[here]: https://github.com/iovisor/bcc/blob/master/tools/mysqld_qslower.py
[711]: https://github.com/iovisor/bcc/issues/711
]]></content:encoded>
      <dc:date>2016-10-04T00:00:00-07:00</dc:date>
    </item>
    <item>
      <title>Linux bcc Tracing Security Capabilities</title>
      <link>http://www.brendangregg.com/blog/2016-10-01/linux-bcc-security-capabilities.html</link>
      <description><![CDATA[Which Linux security capabilities are your applications using? I recently developed a new tool, capable, to print out capability checks live:
]]></description>
      <pubDate>Sat, 01 Oct 2016 00:00:00 -0700</pubDate>
      <guid>http://www.brendangregg.com/blog/2016-10-01/linux-bcc-security-capabilities.html</guid>
      <content:encoded><![CDATA[Which Linux security capabilities are your applications using? I recently developed a new tool, capable, to print out capability checks live:

<pre>
# <b>capable</b>
TIME      UID    PID    COMM             CAP  NAME                 AUDIT
22:11:23  114    2676   snmpd            12   CAP_NET_ADMIN        1
22:11:23  0      6990   run              24   CAP_SYS_RESOURCE     1
22:11:23  0      7003   chmod            3    CAP_FOWNER           1
22:11:23  0      7003   chmod            4    CAP_FSETID           1
22:11:23  0      7005   chmod            4    CAP_FSETID           1
22:11:23  0      7005   chmod            4    CAP_FSETID           1
22:11:23  0      7006   chown            4    CAP_FSETID           1
22:11:23  0      7006   chown            4    CAP_FSETID           1
22:11:23  0      6990   setuidgid        6    CAP_SETGID           1
22:11:23  0      6990   setuidgid        6    CAP_SETGID           1
22:11:23  0      6990   setuidgid        7    CAP_SETUID           1
22:11:24  0      7013   run              24   CAP_SYS_RESOURCE     1
22:11:24  0      7026   chmod            3    CAP_FOWNER           1
[...]
</pre>

capable uses [bcc], a front-end and a collection of tools that use new Linux enhanced BPF tracing capabilities. capable works by using BPF with kprobes to dynamically trace the kernel cap\_capable() function, and then uses a table to map the capability index to the name seen in the output. Here's the [source code]: it's pretty straightforward.
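That index-to-name table can be sketched as a plain Python dict. This is a subset only, with values matching the kernel's include/uapi/linux/capability.h (and the output above):

```python
# Subset of the capability index-to-name table, from the kernel's
# include/uapi/linux/capability.h:
CAPABILITIES = {
    3: "CAP_FOWNER",
    4: "CAP_FSETID",
    6: "CAP_SETGID",
    7: "CAP_SETUID",
    12: "CAP_NET_ADMIN",
    24: "CAP_SYS_RESOURCE",
}

def cap_name(index):
    """Map a capability index to its name, with a fallback."""
    return CAPABILITIES.get(index, "CAP_?%d" % index)

print(cap_name(12))   # CAP_NET_ADMIN
```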

I wrote it as a colleague, Michael Wardrop, asked what security capabilities our applications were actually using. Given a list, we could use setcap(8) (or other software) to improve the security of applications by only allowing the necessary capabilities.
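To build such a list, the trace output can be reduced to the distinct capabilities per command. A minimal sketch, assuming capable's column layout (TIME UID PID COMM CAP NAME AUDIT) and using made-up sample lines:

```python
from collections import defaultdict

# Hypothetical sample lines in capable's output format:
sample = """\
22:11:23 0 7003 chmod 3 CAP_FOWNER 1
22:11:23 0 7003 chmod 4 CAP_FSETID 1
22:11:23 0 6990 setuidgid 6 CAP_SETGID 1
22:11:23 0 6990 setuidgid 7 CAP_SETUID 1
"""

def caps_by_command(text):
    """Reduce a capable trace to the distinct capability names per command."""
    seen = defaultdict(set)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 7:                 # TIME UID PID COMM CAP NAME AUDIT
            seen[fields[3]].add(fields[5])   # COMM -> set of NAMEs
    return seen

for comm, caps in sorted(caps_by_command(sample).items()):
    print(comm, ",".join(sorted(caps)))
```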

## Non-audit Checks

The cap\_capable() function has an audit argument, which controls whether the capability check should write an audit message (if auditing is configured). By default, capable only prints capability checks where audit is true, but it can also trace all checks with the -v option:

<pre>
# <b>capable -h</b>
usage: capable [-h] [-v] [-p PID]

Trace security capability checks

optional arguments:
  -h, --help         show this help message and exit
  -v, --verbose      include non-audit checks
  -p PID, --pid PID  trace this PID only

examples:
    ./capable             # trace capability checks
    ./capable -v          # verbose: include non-audit checks
    ./capable -p 181      # only trace PID 181
</pre>

Here are some non-audit events:

<pre>
# <b>capable -v</b>
TIME      UID    PID    COMM             CAP  NAME                 AUDIT
20:53:45  60004  22061  lsb_release      21   CAP_SYS_ADMIN        0
20:53:45  60004  22061  lsb_release      21   CAP_SYS_ADMIN        0
20:53:45  60004  22061  lsb_release      21   CAP_SYS_ADMIN        0
20:53:45  60004  22061  lsb_release      21   CAP_SYS_ADMIN        0
20:53:45  60004  22061  lsb_release      21   CAP_SYS_ADMIN        0
20:53:45  60004  22061  lsb_release      21   CAP_SYS_ADMIN        0
[...]
</pre>

What are all those?

I'll start by showing the cap\_capable() function prototype, from security/commoncap.c:

<pre>
int cap_capable(const struct cred *cred, struct user_namespace *targ_ns,
                int cap, int audit)
</pre>

Now I can use bcc's trace program to inspect these calls (bear with me), given that cap will be arg3, and audit arg4 (from the above prototype):

<pre>
# <b>trace 'cap_capable "cap: %d, audit: %d", arg3, arg4'</b>
TIME     PID    COMM         FUNC             -
20:56:18 25535  lsb_release  cap_capable      cap: 21, audit: 0
20:56:18 25535  lsb_release  cap_capable      cap: 21, audit: 0
20:56:18 25535  lsb_release  cap_capable      cap: 21, audit: 0
20:56:18 25535  lsb_release  cap_capable      cap: 21, audit: 0
20:56:18 25535  lsb_release  cap_capable      cap: 21, audit: 0
[...]
</pre>

That one-liner is pretty similar to my capable program, except it lacks the "NAME" column with human-readable translations.

I'm really doing this so I can use the (newly added) -K and -U options, which print kernel- and user-level stack traces. I'll just use -K:

<pre>
# <b>trace -K 'cap_capable "cap: %d, audit: %d", arg3, arg4'</b>
TIME     PID    COMM         FUNC             -
[...]
20:59:58 30607  lsb_release  cap_capable      cap: 21, audit: 0
    Kernel Stack Trace:
        ffffffff813659d1 cap_capable
        ffffffff813684bb security_vm_enough_memory_mm
        ffffffff811deda6 expand_downwards
        ffffffff811def64 expand_stack
        ffffffff81234321 setup_arg_pages
        ffffffff8128c10b load_elf_binary
        ffffffff81234cee search_binary_handler
        ffffffff8128b7ff load_script
        ffffffff81234cee search_binary_handler
        ffffffff8123635a do_execveat_common.isra.35
        ffffffff812367da sys_execve
        ffffffff81003bae do_syscall_64
        ffffffff81861ca5 return_from_SYSCALL_64

20:59:58 30607  lsb_release  cap_capable      cap: 21, audit: 0
    Kernel Stack Trace:
        ffffffff813659d1 cap_capable
        ffffffff813684bb security_vm_enough_memory_mm
        ffffffff811df623 mmap_region
        ffffffff811dff4b do_mmap
        ffffffff811c122a vm_mmap_pgoff
        ffffffff811c1295 vm_mmap
        ffffffff8128bb93 elf_map
        ffffffff8128c271 load_elf_binary
        ffffffff81234cee search_binary_handler
        ffffffff8128b7ff load_script
        ffffffff81234cee search_binary_handler
        ffffffff8123635a do_execveat_common.isra.35
        ffffffff812367da sys_execve
        ffffffff81003bae do_syscall_64
        ffffffff81861ca5 return_from_SYSCALL_64
[...]
</pre>

Awesome. So these are coming from security\_vm\_enough\_memory\_mm(). By reading the source, I see it's used to reserve some memory for root. It's not a hard failure if the capability is missing. And it's not really a security check, which is why it disables audit.

I should add a -K option to the capable tool, so it can print stack traces too.

## Older Kernels

To use capable, you'll need a 4.4 kernel. To use the -K option, 4.6.
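A tool could check those minimums up front. Here's a small sketch (a hypothetical helper, not part of capable) that compares the running kernel release string against a minimum version:

```python
import platform

def kernel_at_least(major, minor, release=None):
    """True if the given (or running) kernel release meets a minimum,
    e.g. kernel_at_least(4, 4) for capable, (4, 6) for its -K option."""
    release = release or platform.release()   # e.g. "4.6.0-27-generic"
    base = release.split("-")[0]              # drop the distro suffix
    parts = [int(x) for x in base.split(".")[:2]]
    return tuple(parts) >= (major, minor)

print(kernel_at_least(4, 4, "4.6.0-27-generic"))
```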

Here's a version using my older [perf-tools] collection, which uses ftrace and should work on much older kernels including the 3.x series:

<pre>
# <b>./perf-tools/bin/kprobe -s 'p:cap_capable cap=%dx audit=%cx' 'audit != 0'</b>
Tracing kprobe cap_capable. Ctrl-C to end.
             run-4440  [003] d... 6417394.367486: cap_capable: (cap_capable+0x0/0x70) cap=0x18 audit=0x1
             run-4440  [003] d... 6417394.367492: <stack trace>
 => ns_capable_common
 => capable
 => do_prlimit
 => SyS_setrlimit
 => entry_SYSCALL_64_fastpath
           chmod-4453  [006] d... 6417394.399020: cap_capable: (cap_capable+0x0/0x70) cap=0x3 audit=0x1
           chmod-4453  [006] d... 6417394.399027: <stack trace>
 => ns_capable_common
 => ns_capable
 => inode_owner_or_capable
 => inode_change_ok
 => xfs_setattr_nonsize
 => xfs_vn_setattr
 => notify_change
 => chmod_common
 => SyS_fchmodat
 => entry_SYSCALL_64_fastpath
           chmod-4453  [006] d... 6417394.399035: cap_capable: (cap_capable+0x0/0x70) cap=0x4 audit=0x1
           chmod-4453  [006] d... 6417394.399037: <stack trace>
 => ns_capable_common
 => capable_wrt_inode_uidgid
 => inode_change_ok
 => xfs_setattr_nonsize
 => xfs_vn_setattr
 => notify_change
 => chmod_common
 => SyS_fchmodat
 => entry_SYSCALL_64_fastpath
           chmod-4455  [007] d... 6417394.402524: cap_capable: (cap_capable+0x0/0x70) cap=0x4 audit=0x1
           chmod-4455  [007] d... 6417394.402529: <stack trace>
 => ns_capable_common
 => capable_wrt_inode_uidgid
 => inode_change_ok
 => xfs_setattr_nonsize
 => xfs_vn_setattr
 => notify_change
 => chmod_common
 => SyS_fchmodat
 => entry_SYSCALL_64_fastpath
[...]
</pre>

It's a one-liner using my kprobe tool. It's also (currently) a bit harder to use: I need to know which registers hold those arguments, and the example above is for x86\_64 only.
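The register-to-argument mapping follows the x86\_64 System V calling convention, which is why the one-liner above reads cap (arg3) from %dx and audit (arg4) from %cx. A small lookup sketch:

```python
# x86_64 System V calling convention: integer/pointer function
# arguments arrive in these registers (as named in kprobe syntax):
X86_64_ARG_REGS = ["%di", "%si", "%dx", "%cx", "%r8", "%r9"]

def arg_reg(n):
    """Register holding function argument n (1-based), x86_64 only."""
    return X86_64_ARG_REGS[n - 1]

print(arg_reg(3), arg_reg(4))   # %dx %cx
```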

That's all for now. Happy hacking.

[bcc]: https://github.com/iovisor/bcc
[perf-tools]: https://github.com/brendangregg/perf-tools
[example output]: https://github.com/iovisor/bcc/blob/master/tools/capable_example.txt
[source code]: https://github.com/iovisor/bcc/blob/master/tools/capable.py
]]></content:encoded>
      <dc:date>2016-10-01T00:00:00-07:00</dc:date>
    </item>
    <item>
      <title>Java Warmup</title>
      <link>http://www.brendangregg.com/blog/2016-09-28/java-warmup.html</link>
      <description><![CDATA[I gave a talk at JavaOne last week on flame graphs (slides here), and in Q&amp;A was asked about Java warmup. I wanted to show some flame graphs that illustrated it, but didn&#39;t have them on hand. Here they are.
]]></description>
      <pubDate>Wed, 28 Sep 2016 00:00:00 -0700</pubDate>
      <guid>http://www.brendangregg.com/blog/2016-09-28/java-warmup.html</guid>
      <content:encoded><![CDATA[I gave a talk at JavaOne last week on flame graphs ([slides here]), and in Q&A was asked about Java warmup. I wanted to show some flame graphs that illustrated it, but didn't have them on hand. Here they are.

These are 30 second flame graphs from a Java microservice warming up and beginning to process load. The following 30 flame graphs (3 rows of 10) have been shrunk too small to read the frames, but at this level you can see how processing changes in the first 10 minutes then settles down. Click to zoom (a little):

<p><center><a href="/blog/images/2016/java_flamegraph_warmup_montage.png"><img src="/blog/images/2016/java_flamegraph_warmup_montage.png" border=0 width=700></a></center></p>

The hues are: yellow is C++, green is Java methods, orange is kernel code, and red is other user-level code. I'm using the Linux perf profiler, which can see everything, including kernel code execution with Java context. JVM profilers (which are what almost everyone is using) cannot do this, and have blind spots.
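As a rough sketch of how those hues get assigned, here's a simplified version of the heuristic used by the flame graph software's Java palette (the real script picks randomized shades per hue and handles more cases):

```python
def java_frame_hue(name):
    """Simplified sketch of a Java-palette hue heuristic for frame names."""
    if name.endswith("_[k]"):   # kernel frames carry a _[k] annotation
        return "orange"
    if "::" in name:            # C++ symbols use :: scoping
        return "yellow"
    if "/" in name:             # Java methods keep their class path
        return "green"
    return "red"                # other user-level code

print(java_frame_hue("Universe::initialize_heap"))   # yellow
```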

Now I'll zoom in to the first two:

<h3>Time = 0 - 30s</h3>

<center><p><object data="/blog/images/2016/java_flamegraph_warmup_000.svg" type="image/svg+xml" width=700 height=772>
<img src="/blog/images/2016/java_flamegraph_warmup_000.svg" width=700 height=772 />
</object></p></center>

This is the first 30 seconds of Java after launch.

On the left, the Java heap is being pre-allocated by the JVM code (C++, yellow). The os::pretouch\_memory() function is triggering page faults (kernel code, orange). Click Universe::initialize\_heap to get a breakdown of where time (CPU cycles) is spent. Neat, huh?

Because you're probably wondering, here's clear\_page\_c\_e() from arch/x86/lib/clear\_page\_64.S:

<pre>
ENTRY(clear_page_c_e)
    movl $4096,%ecx
    xorl %eax,%eax
    rep stosb
    ret
ENDPROC(clear_page_c_e)
</pre>

It zeroes a register with XORL, then uses REP STOSB to wipe a page (4096 bytes) of memory. And that's a performance optimization added to Linux <a href="https://github.com/torvalds/linux/commit/e365c9df2f2f001450decf9512412d2d5bd1cdef#diff-e66daa94d10db3737c64b090770a16c2">by Intel</a> \(Linux gets such micro-optimizations frequently\). Now zoom back out using "Reset Zoom" (top left) or click a lower frame.

The tower on the right has many "Interpreter" frames in red (user-mode): these are Java methods that likely haven't hit the CompileThreshold tunable yet (by default: 10000). If these methods keep being called, they'll switch from being interpreted (executed by the JVM's "Interpreter" function) to being compiled and executed natively. At that point, the compiled methods will be running from their own addresses which can be profiled and translated: in the Java flame graph, they'll be colored green.
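The threshold behavior described above can be modeled as a toy invocation counter (a sketch only; real HotSpot tiering is far more involved than a single counter):

```python
COMPILE_THRESHOLD = 10000   # default -XX:CompileThreshold

class Method:
    """Toy model: a method interprets until it gets hot, then compiles."""
    def __init__(self, name):
        self.name, self.calls, self.compiled = name, 0, False

    def invoke(self):
        self.calls += 1
        if not self.compiled and self.calls >= COMPILE_THRESHOLD:
            self.compiled = True   # now runs natively: green, not "Interpreter"

m = Method("hot_method")
for _ in range(10000):
    m.invoke()
print(m.compiled)   # True
```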

The tower in the middle around C2Compiler::compile\_method() shows the C2 compiler working on compiling Java just-in-time.

<h3>Time = 30 - 60s</h3>

<center><p><object data="/blog/images/2016/java_flamegraph_warmup_001.svg" type="image/svg+xml" width=700 height=1066>
<img src="/blog/images/2016/java_flamegraph_warmup_001.svg" width=700 height=1066 />
</object></p></center>

We're done initializing the heap (that tower is now gone), but we're still compiling: about 55% of the CPU cycles are in the C2 compiler, and lots of Interpreter frames are still present. Click CompileBroker::compiler\_thread\_loop to zoom in to the compiler. I can't help but want to dig into these functions further, to look for optimizations in the compiler itself. Zooming back out...

There's a new tower, on the left, containing ParNewGenTask::work. Yes, GC has kicked in by the 30-60s profile. (Not all frames there are colored yellow, but they should be; that's a bug in the flame graph software.)

See the orange frames in the top right? That's the kernel deleting files, called from java/io/UnixFileSystem::delete0(). It's called so much that its CPU footprint is showing up in the profile. How interesting. So what's going on? The answer should be down the stack, but it's still Interpreter frames. I could do another warmup run and use my tracing tools ([perf-tools] or [bcc]). But right now I haven't done that, so I don't know -- a mystery. I'd guess log rotation.

The frames with "\[perf-79379.map\]": these used to say "unknown", but a recent flame graph change defaults to the library name if the function symbol couldn't be found. In this case, it's not a library, but a map file which should have Java methods. So why weren't these symbols in it? The way I'm currently doing Java symbol translation is to take snapshots at the end of the profile, and it's possible those symbols were recompiled and moved by that point.
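Those map files are simple to process: each line is "STARTADDR SIZE symbolname" with hex address fields. Here's a sketch of resolving an address against one, using made-up sample entries:

```python
# Hypothetical sample of a /tmp/perf-PID.map file
# ("STARTADDR SIZE symbolname", hex fields):
sample_map = """\
7f2a00000000 40 Ljava/lang/String;::hashCode
7f2a00000040 80 Ljava/util/HashMap;::get
"""

def resolve(addr, map_text):
    """Translate an address to a symbol name via map file lines."""
    for line in map_text.splitlines():
        start, size, name = line.split(None, 2)
        start, size = int(start, 16), int(size, 16)
        if start <= addr < start + size:
            return name
    return "[unknown]"

print(resolve(0x7f2a00000050, sample_map))
```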

See the "Interpreter" tower on the far left? Notice how it doesn't begin with start\_thread(), java\_start(), etc.? This is likely a truncated stack, due to the default setting of Linux perf to only capture the top 127 frames. Truncated stacks break flame graphs: it doesn't know where to merge them. This will be tunable in Linux 4.8 (kernel.perf\_event\_max\_stack). I'd bet this tower really belongs as part of the tallest Interpreter tower on the right -- if you can imagine moving the frames over there and making it wider to fit.

<h3>Time = 90+s</h3>

Going back to the montage, you can see different phases of warmup come and go, and different patterns of Java methods in green for each of them:

<p><center><a href="/blog/images/2016/java_flamegraph_warmup_montage.png"><img src="/blog/images/2016/java_flamegraph_warmup_montage.png" border=0 width=700></a></center></p>

I was hoping to document and include more of the warmup, as each stage can be clearly seen and inspected (and there are more interesting things to describe, including a lock contention issue in CFLS\_LAB::alloc() that showed up), but this post would get very long. I'll skip ahead to the warmed-up point:

<h3>Time = 600 - 630s</h3>

<p><center><a href="/blog/images/2016/java_flamegraph_warmup_020.png"><img src="/blog/images/2016/java_flamegraph_warmup_020.png" border=0 width=598></a></center></p>

At the 10 minute mark, the microservice is running its production workload. Over time, the compiler tower shrinks further and we spend more time running Java code. (This image is a PNG, not an SVG.)

The small green frames on the left are more evidence of truncated stacks.

<h3>Generation</h3>

To generate these flame graphs, I ran the following at the same moment I began applying load:

<pre>
for i in `seq 100 130`; do perf record -F 99 -a -g -- sleep 30; ./jmaps
    perf script > out.perf_warmup_$i.txt; echo $i done; done
</pre>

The [jmaps] program is an unsupported helper script for automating a symbol dump using Java perf-map-agent.

Then I post processed with the [FlameGraph] software:

<pre>
for i in `seq 100 130`; do ./FlameGraph/stackcollapse-perf.pl --kernel \
    < out.perf_warmup_$i.txt | grep '^java' | grep -v cpu_idle | \
    ./FlameGraph/flamegraph.pl --color=java --width=800 --minwidth=0.5 \
    --title="at $(((i - 100) * 30)) seconds" > warmup_$i.svg; echo $i done; done
</pre>
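The two greps in that pipeline operate on collapsed stacks, where each line is "process;frame1;frame2;... count". A Python equivalent of that filtering step (the sample lines here are illustrative):

```python
# Illustrative collapsed-stack lines, as produced by stackcollapse-perf.pl:
collapsed = [
    "java;start_thread;JavaMain;Interpreter 52",
    "java;start_thread;cpu_idle 10",
    "swapper;cpu_idle 997",
]

def keep(line):
    """Mimic: grep '^java' | grep -v cpu_idle"""
    return line.startswith("java") and "cpu_idle" not in line

print([l for l in collapsed if keep(l)])
```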

This is using the older way of profiling with Linux perf: via a perf.data file and "perf script". A newer way has been in development that uses BPF for efficient in-kernel summaries, but that's a topic for another post...

[FlameGraph]: https://github.com/brendangregg/FlameGraph
[slides here]: http://www.slideshare.net/brendangregg/java-performance-analysis-on-linux-with-flame-graphs
[perf-tools]: https://github.com/brendangregg/perf-tools
[bcc]: https://github.com/iovisor/bcc
[montage]: /blog/images/2016/java_flamegraph_warmup_montage.png
[jmaps]: https://raw.githubusercontent.com/brendangregg/Misc/master/java/jmaps
]]></content:encoded>
      <dc:date>2016-09-28T00:00:00-07:00</dc:date>
    </item>
    <item>
      <title>gdb Debugging Full Example (Tutorial): ncurses</title>
      <link>http://www.brendangregg.com/blog/2016-08-09/gdb-example-ncurses.html</link>
      <description><![CDATA[I&#39;m a little frustrated with finding &quot;gdb examples&quot; online that show the commands but not their output. gdb is the GNU Debugger, the standard debugger on Linux. I was reminded of the lack of example output when watching the Give me 15 minutes and I&#39;ll change your view of GDB talk by Greg Law at CppCon 2015, which, thankfully, includes output! It&#39;s well worth the 15 minutes.
]]></description>
      <pubDate>Tue, 09 Aug 2016 00:00:00 -0700</pubDate>
      <guid>http://www.brendangregg.com/blog/2016-08-09/gdb-example-ncurses.html</guid>
      <content:encoded><![CDATA[I'm a little frustrated with finding "gdb examples" online that show the commands but not their output. gdb is the GNU Debugger, the standard debugger on Linux. I was reminded of the lack of example output when watching the [Give me 15 minutes and I'll change your view of GDB] talk by Greg Law at CppCon 2015, which, thankfully, includes output! It's well worth the 15 minutes.

It also inspired me to share a full gdb debugging example, with output and every step involved, including dead ends. This isn't a particularly interesting or exotic issue, it's just a routine gdb debugging session. But it covers the basics and could serve as a tutorial of sorts, bearing in mind there's a lot more to gdb than I used here.

I'll be running the following commands as root, since I'm debugging a tool that needs root access (for now). Substitute non-root and sudo as desired. You also aren't expected to read through all this: I've enumerated each step so you can browse them and find ones of interest.

## 1. The Problem

The [bcc] collection of BPF tools had a pull request for [cachetop], which uses a top-like display to show page cache statistics by process. Great! However, when I tested it, it hit a segfault:

<pre>
# <b>./cachetop.py</b>
Segmentation fault
</pre>

Note that it says "Segmentation fault" and not "Segmentation fault (core dumped)". I'd like a core dump to debug this. (A core dump is a copy of process memory &ndash; the name coming from the era of magnetic core memory &ndash; and can be investigated using a debugger.)

Core dump analysis is one approach for debugging, but not the only one. I could run the program live in gdb to inspect the issue. I could also use an external tracer to grab data and stack traces on segfault events. We'll start with core dumps.

## 2. Fixing Core Dumps

I'll check the core dump settings:

<pre>
# <b>ulimit -c</b>
0
# <b>cat /proc/sys/kernel/core_pattern</b>
core
</pre>

<tt>ulimit -c</tt> shows the maximum size of core dumps created, and it's set to zero: disabling core dumps (for this process and its children).
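The same limit can be adjusted programmatically via setrlimit(2). A sketch using Python's resource module, raising the soft core-file limit up to the hard limit (an unprivileged process may not exceed its hard limit):

```python
import resource

# Equivalent in spirit to `ulimit -c`: read the core-file size limit,
# then raise the soft limit to the current hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
print(resource.getrlimit(resource.RLIMIT_CORE))
```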

The <tt>/proc/.../core\_pattern</tt> is set to just "core", which will drop a core dump file called "core" in the current directory. That will be ok for now, but I'll show how to set this up for a global location:

<pre>
# <b>ulimit -c unlimited</b>
# <b>mkdir /var/cores</b>
# <b>echo "/var/cores/core.%e.%p" > /proc/sys/kernel/core_pattern</b>
</pre>

You can customize that core\_pattern further; eg, <tt>%h</tt> for hostname and <tt>%t</tt> for time of dump. The options are documented in the Linux kernel source, under Documentation/sysctl/[kernel.txt].

To make the core\_pattern permanent, and survive reboots, you can set it via "kernel.core\_pattern" in /etc/sysctl.conf.

Trying again:

<pre>
# <b>./cachetop.py</b>
Segmentation fault (core dumped)
# <b>ls -lh /var/cores</b>
total 19M
-rw------- 1 root root 20M Aug  7 22:15 core.python.30520
# <b>file /var/cores/core.python.30520 </b>
/var/cores/core.python.30520: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'python ./cachetop.py'
</pre>

That's better: we have our core dump.

## 3. Starting GDB

Now I'll run gdb with the target program location (using shell command substitution with backticks, "`", although you should specify the full path unless you're sure that will resolve to the right binary), and the core dump file:

<pre>
# <b>gdb `which python` /var/cores/core.python.30520</b>
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/python...(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 30520]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

warning: JITed object file architecture unknown is not compatible with target architecture i386:x86-64.
Core was generated by `python ./cachetop.py'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f0a37aac40d in doupdate () from /lib/x86_64-linux-gnu/libncursesw.so.5
</pre>

The last two lines are especially interesting: they tell us it's a segmentation fault in the <tt>doupdate()</tt> function from the libncursesw library. That's worth a quick web search in case it's a well-known issue. I took a quick look but didn't find a single common cause.

I already can guess what libncursesw is for, but if that were foreign to you, then being under "/lib" and ending in ".so.\*" shows it's a shared library, which might have a man page, website, package description, etc.

<pre>
# <b>dpkg -l | grep libncursesw</b>
ii  libncursesw5:amd64                  6.0+20160213-1ubuntu1                    amd64
     shared libraries for terminal handling (wide character support)
</pre>

I happen to be debugging this on Ubuntu, but the Linux distro shouldn't matter for gdb usage.

## 4. Back Trace

Stack back traces show how we arrived at the point of failure, and are often enough to help identify a common problem. It's usually the first command I use in a gdb session: <tt>bt</tt> (short for <tt>backtrace</tt>):

<pre>
(gdb) <b>bt</b>
#0  0x00007f0a37aac40d in doupdate () from /lib/x86_64-linux-gnu/libncursesw.so.5
#1  0x00007f0a37aa07e6 in wrefresh () from /lib/x86_64-linux-gnu/libncursesw.so.5
#2  0x00007f0a37a99616 in ?? () from /lib/x86_64-linux-gnu/libncursesw.so.5
#3  0x00007f0a37a9a325 in wgetch () from /lib/x86_64-linux-gnu/libncursesw.so.5
#4  0x00007f0a37cc6ec3 in ?? () from /usr/lib/python2.7/lib-dynload/_curses.x86_64-linux-gnu.so
#5  0x00000000004c4d5a in PyEval_EvalFrameEx ()
#6  0x00000000004c2e05 in PyEval_EvalCodeEx ()
#7  0x00000000004def08 in ?? ()
#8  0x00000000004b1153 in PyObject_Call ()
#9  0x00000000004c73ec in PyEval_EvalFrameEx ()
#10 0x00000000004c2e05 in PyEval_EvalCodeEx ()
#11 0x00000000004caf42 in PyEval_EvalFrameEx ()
#12 0x00000000004c2e05 in PyEval_EvalCodeEx ()
#13 0x00000000004c2ba9 in PyEval_EvalCode ()
#14 0x00000000004f20ef in ?? ()
#15 0x00000000004eca72 in PyRun_FileExFlags ()
#16 0x00000000004eb1f1 in PyRun_SimpleFileExFlags ()
#17 0x000000000049e18a in Py_Main ()
#18 0x00007f0a3be10830 in __libc_start_main (main=0x49daf0 &lt;main&gt;, argc=2, argv=0x7ffd33d94838, init=&lt;optimized out&gt;, fini=&lt;optimized out&gt;, rtld_fini=&lt;optimized out&gt;, 
    stack_end=0x7ffd33d94828) at ../csu/libc-start.c:291
#19 0x000000000049da19 in _start ()
</pre>

Read from bottom up, to go from parent to child. The "<tt>??</tt>" entries are where symbol translation failed. Stack walking &ndash; which produces the stack trace &ndash; can also fail. In that case you'll likely see a single valid frame, then a small number of bogus addresses. If symbols or stacks are too badly broken to make sense of the stack trace, then there are usually ways to fix it: installing debug info packages (giving gdb more symbols, and letting it do DWARF-based stack walks), or recompiling the software from source with frame pointers and debugging information (<tt>-fno-omit-frame-pointer -g</tt>). Many of the above "<tt>??</tt>" entries can be fixed by adding the python-dbg package.

This particular stack doesn't look very helpful: frames 5 to 17 (indexed on the left) are Python internals, although we can't see the Python methods (yet). Then frame 4 is the \_curses library, then we're in libncursesw. Looks like <tt>wgetch()</tt>-><tt>wrefresh()</tt>-><tt>doupdate()</tt>. Just based on the names, I'd guess a window refresh. Why would that core dump?

## 5. Disassembly

I'll start by disassembling the function we segfaulted in, <tt>doupdate()</tt>:

<pre>
(gdb) <b>disas doupdate</b>
Dump of assembler code for function doupdate:
   0x00007f0a37aac2e0 &lt;+0&gt;:	push   %r15
   0x00007f0a37aac2e2 &lt;+2&gt;:	push   %r14
   0x00007f0a37aac2e4 &lt;+4&gt;:	push   %r13
   0x00007f0a37aac2e6 &lt;+6&gt;:	push   %r12
   0x00007f0a37aac2e8 &lt;+8&gt;:	push   %rbp
   0x00007f0a37aac2e9 &lt;+9&gt;:	push   %rbx
   0x00007f0a37aac2ea &lt;+10&gt;:	sub    $0xc8,%rsp
[...]
---Type &lt;return&gt; to continue, or q &lt;return&gt; to quit---
[...]
   0x00007f0a37aac3f7 &lt;+279&gt;:	cmpb   $0x0,0x21(%rcx)
   0x00007f0a37aac3fb &lt;+283&gt;:	je     0x7f0a37aacc3b &lt;doupdate+2395&gt;
   0x00007f0a37aac401 &lt;+289&gt;:	mov    0x20cb68(%rip),%rax        # 0x7f0a37cb8f70
   0x00007f0a37aac408 &lt;+296&gt;:	mov    (%rax),%rsi
   0x00007f0a37aac40b &lt;+299&gt;:	xor    %eax,%eax
=> 0x00007f0a37aac40d &lt;+301&gt;:	mov    0x10(%rsi),%rdi
   0x00007f0a37aac411 &lt;+305&gt;:	cmpb   $0x0,0x1c(%rdi)
   0x00007f0a37aac415 &lt;+309&gt;:	jne    0x7f0a37aac6f7 &lt;doupdate+1047&gt;
   0x00007f0a37aac41b &lt;+315&gt;:	movswl 0x4(%rcx),%ecx
   0x00007f0a37aac41f &lt;+319&gt;:	movswl 0x74(%rdx),%edi
   0x00007f0a37aac423 &lt;+323&gt;:	mov    %rax,0x40(%rsp)
[...]
</pre>

Output truncated. (I could also have just typed "<tt>disas</tt>" and it would have defaulted to <tt>doupdate</tt>.)

The arrow "=>" is pointing to our segfault address, which is doing a <tt>mov 0x10(%rsi),%rdi</tt>: a load from the memory address in the %rsi register plus an offset of 0x10, into the %rdi register. I'll check the state of the registers next.

## 6. Check Registers

Printing register state using <tt>i r</tt> (short for <tt>info registers</tt>):

<pre>
(gdb) <b>i r</b>
rax            0x0	0
rbx            0x1993060	26816608
rcx            0x19902a0	26804896
rdx            0x19ce7d0	27060176
rsi            0x0	0
rdi            0x19ce7d0	27060176
rbp            0x7f0a3848eb10	0x7f0a3848eb10 &lt;SP&gt;
rsp            0x7ffd33d93c00	0x7ffd33d93c00
r8             0x7f0a37cb93e0	139681862489056
r9             0x0	0
r10            0x8	8
r11            0x202	514
r12            0x0	0
r13            0x0	0
r14            0x7f0a3848eb10	139681870703376
r15            0x19ce7d0	27060176
rip            0x7f0a37aac40d	0x7f0a37aac40d &lt;doupdate+301&gt;
eflags         0x10246	[ PF ZF IF RF ]
cs             0x33	51
ss             0x2b	43
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
</pre>

Well, %rsi is zero. There's our problem! Zero is unlikely to be a valid address, and this type of segfault is a common software bug: dereferencing an uninitialized or NULL pointer.

## 7. Memory Mappings

You can double check whether zero is a valid address using <tt>i proc m</tt> (short for <tt>info proc mappings</tt>):

<pre>
(gdb) <b>i proc m</b>
Mapped address spaces:

      Start Addr           End Addr       Size     Offset objfile
        0x400000           0x6e7000   0x2e7000        0x0 /usr/bin/python2.7
        0x8e6000           0x8e8000     0x2000   0x2e6000 /usr/bin/python2.7
        0x8e8000           0x95f000    0x77000   0x2e8000 /usr/bin/python2.7
  0x7f0a37a8b000     0x7f0a37ab8000    0x2d000        0x0 /lib/x86_64-linux-gnu/libncursesw.so.5.9
  0x7f0a37ab8000     0x7f0a37cb8000   0x200000    0x2d000 /lib/x86_64-linux-gnu/libncursesw.so.5.9
  0x7f0a37cb8000     0x7f0a37cb9000     0x1000    0x2d000 /lib/x86_64-linux-gnu/libncursesw.so.5.9
  0x7f0a37cb9000     0x7f0a37cba000     0x1000    0x2e000 /lib/x86_64-linux-gnu/libncursesw.so.5.9
  0x7f0a37cba000     0x7f0a37ccd000    0x13000        0x0 /usr/lib/python2.7/lib-dynload/_curses.x86_64-linux-gnu.so
  0x7f0a37ccd000     0x7f0a37ecc000   0x1ff000    0x13000 /usr/lib/python2.7/lib-dynload/_curses.x86_64-linux-gnu.so
  0x7f0a37ecc000     0x7f0a37ecd000     0x1000    0x12000 /usr/lib/python2.7/lib-dynload/_curses.x86_64-linux-gnu.so
  0x7f0a37ecd000     0x7f0a37ecf000     0x2000    0x13000 /usr/lib/python2.7/lib-dynload/_curses.x86_64-linux-gnu.so
  0x7f0a38050000     0x7f0a38066000    0x16000        0x0 /lib/x86_64-linux-gnu/libgcc_s.so.1
  0x7f0a38066000     0x7f0a38265000   0x1ff000    0x16000 /lib/x86_64-linux-gnu/libgcc_s.so.1
  0x7f0a38265000     0x7f0a38266000     0x1000    0x15000 /lib/x86_64-linux-gnu/libgcc_s.so.1
  0x7f0a38266000     0x7f0a3828b000    0x25000        0x0 /lib/x86_64-linux-gnu/libtinfo.so.5.9
  0x7f0a3828b000     0x7f0a3848a000   0x1ff000    0x25000 /lib/x86_64-linux-gnu/libtinfo.so.5.9
[...]
</pre>

The first valid virtual address is 0x400000. Anything below that is invalid, and if referenced, will trigger a segmentation fault.
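That check is easy to express in code. A sketch: given (start, end) pairs taken from the "info proc mappings" output above (abbreviated here), decide whether an address is mapped:

```python
# (start, end) pairs abbreviated from the gdb mappings output:
mappings = [
    (0x400000, 0x6e7000),              # /usr/bin/python2.7 text
    (0x7f0a37a8b000, 0x7f0a37ab8000),  # libncursesw text
]

def is_mapped(addr, maps=mappings):
    """True if addr falls inside any mapped region."""
    return any(start <= addr < end for start, end in maps)

print(is_mapped(0x0), is_mapped(0x400000))   # False True
```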

At this point there are several different ways to dig further. I'll start with some instruction stepping.

## 8. Breakpoints

Back to the disassembly:

<pre>
   0x00007f0a37aac401 <+289>:   mov    0x20cb68(%rip),%rax        # 0x7f0a37cb8f70
   0x00007f0a37aac408 <+296>:   mov    (%rax),%rsi
   0x00007f0a37aac40b <+299>:   xor    %eax,%eax
=> 0x00007f0a37aac40d <+301>:   mov    0x10(%rsi),%rdi
</pre>

Reading these four instructions: it looks like it's loading a pointer from a global (the RIP-relative address) into %rax, then dereferencing %rax into %rsi, then setting %eax to zero (the xor is an optimization, instead of doing a mov of $0), and then we dereference %rsi with an offset, although we know %rsi is zero. This sequence looks like walking data structures. Maybe %rax would be interesting, but it's been set to zero by the prior instruction, so we can't see it in the core dump register state.
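Here's a toy model of that load chain, with made-up values: a global slot holds a struct pointer whose field is NULL, so the final +0x10 load lands in unmapped low memory:

```python
# Toy "memory": global pointer slot -> struct, whose field is NULL.
# All addresses and values are hypothetical.
memory = {0x7f0a37cb8f70: 0x19902a0,   # global -> struct
          0x19902a0: 0x0}              # struct field: NULL pointer

def load(addr):
    """Simulated load: addresses below the first mapping segfault."""
    if addr < 0x400000:
        raise MemoryError("SIGSEGV at 0x%x" % addr)
    return memory.get(addr, 0)

rax = load(0x7f0a37cb8f70)   # mov 0x20cb68(%rip),%rax
rsi = load(rax)              # mov (%rax),%rsi   -> 0 (NULL)
try:
    rdi = load(rsi + 0x10)   # mov 0x10(%rsi),%rdi -> faults
except MemoryError as e:
    print(e)                 # SIGSEGV at 0x10
```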

I can set a breakpoint on doupdate+289, then single-step through each instruction to see how the registers are set and change. First, I need to launch gdb so that we're executing the program live:

<pre>
# <b>gdb `which python`</b>
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/python...(no debugging symbols found)...done.
</pre>

Now to set the breakpoint using <tt>b</tt> (short for <tt>break</tt>):

<pre>
(gdb) <b>b *doupdate + 289</b>
No symbol table is loaded.  Use the "file" command.
</pre>

Oops. I wanted to show this error to explain why we often start out with a breakpoint on <tt>main</tt>, at which point the symbols are likely loaded, and then set the real breakpoint of interest. I'll go straight to <tt>doupdate</tt> function entry, run the program, then set the offset breakpoint once it hits the function:

<pre>
(gdb) <b>b doupdate</b>
Function "doupdate" not defined.
Make breakpoint pending on future shared library load? (y or [n]) <b>y</b>
Breakpoint 1 (doupdate) pending.
(gdb) <b>r cachetop.py</b>
Starting program: /usr/bin/python cachetop.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
warning: JITed object file architecture unknown is not compatible with target architecture i386:x86-64.

Breakpoint 1, 0x00007ffff34ad2e0 in doupdate () from /lib/x86_64-linux-gnu/libncursesw.so.5
(gdb) <b>b *doupdate + 289</b>
Breakpoint 2 at 0x7ffff34ad401
(gdb) <b>c</b>
Continuing.

Breakpoint 2, 0x00007ffff34ad401 in doupdate () from /lib/x86_64-linux-gnu/libncursesw.so.5
</pre>

We've arrived at our breakpoint.

If you haven't done this before, the <tt>r</tt> (<tt>run</tt>) command takes arguments that will be passed to the gdb target we specified earlier on the command line (python). So this ends up running "python cachetop.py".
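Equivalently, gdb's <tt>--args</tt> option lets you put the target and its arguments on one command line:

<pre>
# <b>gdb --args python cachetop.py</b>
[...]
(gdb) <b>r</b>
</pre>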

## 9. Stepping

I'll step one instruction (<tt>si</tt>, short for <tt>stepi</tt>) then inspect registers:

<pre>
(gdb) <b>si</b>
0x00007ffff34ad408 in doupdate () from /lib/x86_64-linux-gnu/libncursesw.so.5
(gdb) <b>i r</b>
rax            0x7ffff3e8f948	140737285519688
rbx            0xaea060	11444320
rcx            0xae72a0	11432608
rdx            0xa403d0	10748880
rsi            0x7ffff7ea8e10	140737352732176
rdi            0xa403d0	10748880
rbp            0x7ffff3e8fb10	0x7ffff3e8fb10 &lt;SP&gt;
rsp            0x7fffffffd390	0x7fffffffd390
r8             0x7ffff36ba3e0	140737277305824
r9             0x0	0
r10            0x8	8
r11            0x202	514
r12            0x0	0
r13            0x0	0
r14            0x7ffff3e8fb10	140737285520144
r15            0xa403d0	10748880
rip            0x7ffff34ad408	0x7ffff34ad408 &lt;doupdate+296&gt;
eflags         0x202	[ IF ]
cs             0x33	51
ss             0x2b	43
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
(gdb) <b>p/a 0x7ffff3e8f948</b>
$1 = 0x7ffff3e8f948 &lt;cur_term&gt;
</pre>

Another clue. So the NULL pointer we're dereferencing looks like it's in a symbol called "<tt>cur\_term</tt>" (<tt>p/a</tt> is short for <tt>print/a</tt>, where "<tt>/a</tt>" means format as an address). Given this is ncurses, is our TERM environment set to something odd?

<pre>
# <b>echo $TERM</b>
xterm-256color
</pre>

I tried setting that to vt100 and running the program, but it hit the same segfault.

Note that I've inspected just the first invocation of <tt>doupdate()</tt>, but it could be called multiple times, and the issue may be with a later invocation. I can step through each by running <tt>c</tt> (short for <tt>continue</tt>). That's fine if it's only called a few times, but if it's called a few thousand times I'll want a different approach. (I'll get back to this in section 15.)
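One gdb-only option for the many-invocations case is a conditional breakpoint or an ignore count, so execution only stops when the interesting state appears. A sketch (untested here; it assumes the faulting instruction is at <tt>doupdate+301</tt> and that the new breakpoint is number 2):

<pre>
(gdb) <b>b *doupdate + 301 if $rsi == 0</b>
(gdb) <b>ignore 2 1000</b>
</pre>

The first stops only when %rsi is zero, just before the bad dereference; the second skips the next 1000 hits of breakpoint 2.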

## 10. Reverse Stepping

gdb has a great feature called reverse stepping, which Greg Law included in his talk. Here's an example.

I'll start a python session again, to show this from the beginning:

<pre>
# <b>gdb `which python`</b>
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later &lt;http://gnu.org/licenses/gpl.html&gt;
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
&lt;http://www.gnu.org/software/gdb/bugs/&gt;.
Find the GDB manual and other documentation resources online at:
&lt;http://www.gnu.org/software/gdb/documentation/&gt;.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/python...(no debugging symbols found)...done.
</pre>

Now I'll set a breakpoint on <tt>doupdate</tt> as before, but once it's hit, I'll enable recording, then continue the program and let it crash. Recording adds considerable overhead, so I don't want to start it as early as <tt>main</tt>.

<pre>
(gdb) <b>b doupdate</b>
Function "doupdate" not defined.
Make breakpoint pending on future shared library load? (y or [n]) <b>y</b>
Breakpoint 1 (doupdate) pending.
(gdb) <b>r cachetop.py</b>
Starting program: /usr/bin/python cachetop.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
warning: JITed object file architecture unknown is not compatible with target architecture i386:x86-64.

Breakpoint 1, 0x00007ffff34ad2e0 in doupdate () from /lib/x86_64-linux-gnu/libncursesw.so.5
(gdb) <b>record</b>
(gdb) <b>c</b>
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff34ad40d in doupdate () from /lib/x86_64-linux-gnu/libncursesw.so.5
</pre>

At this point I can reverse-step through lines or instructions. It works by playing back register state from our recording. I'll move back in time two instructions, then print registers:

<pre>
(gdb) <b>reverse-stepi</b>
0x00007ffff34ad40d in doupdate () from /lib/x86_64-linux-gnu/libncursesw.so.5
(gdb) <b>reverse-stepi</b>
0x00007ffff34ad40b in doupdate () from /lib/x86_64-linux-gnu/libncursesw.so.5
(gdb) <b>i r</b>
rax            0x7ffff3e8f948	140737285519688
rbx            0xaea060	11444320
rcx            0xae72a0	11432608
rdx            0xa403d0	10748880
rsi            0x0	0
rdi            0xa403d0	10748880
rbp            0x7ffff3e8fb10	0x7ffff3e8fb10 &lt;SP&gt;
rsp            0x7fffffffd390	0x7fffffffd390
r8             0x7ffff36ba3e0	140737277305824
r9             0x0	0
r10            0x8	8
r11            0x302	770
r12            0x0	0
r13            0x0	0
r14            0x7ffff3e8fb10	140737285520144
r15            0xa403d0	10748880
rip            0x7ffff34ad40b	0x7ffff34ad40b &lt;doupdate+299&gt;
eflags         0x202	[ IF ]
cs             0x33	51
ss             0x2b	43
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
(gdb) <b>p/a 0x7ffff3e8f948</b>
$1 = 0x7ffff3e8f948 &lt;cur_term&gt;
</pre>
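While the recording is active, there's another trick worth knowing: set a watchpoint on the memory and run backwards to whoever last wrote it. A sketch (untested here, and assuming the <tt>cur\_term</tt> symbol resolves; <tt>-l</tt> watches the variable's address):

<pre>
(gdb) <b>watch -l cur_term</b>
(gdb) <b>reverse-continue</b>
</pre>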

So, back to finding the "cur\_term" clue. I really want to read the source code at this point, but I'll start with debug info.

## 11. Debug Info

This is libncursesw, and I don't have debug info installed (Ubuntu):

<pre>
# <b>apt-cache search libncursesw</b>
libncursesw5 - shared libraries for terminal handling (wide character support)
libncursesw5-dbg - debugging/profiling libraries for ncursesw
libncursesw5-dev - developer's libraries for ncursesw
# <b>dpkg -l | grep libncursesw</b>
ii  libncursesw5:amd64                  6.0+20160213-1ubuntu1                    amd64        shared libraries for terminal handling (wide character support)
</pre>

I'll add that:

<pre>
# <b>apt-get install -y libncursesw5-dbg</b>
Reading package lists... Done
Building dependency tree       
Reading state information... Done
[...]
After this operation, 2,488 kB of additional disk space will be used.
Get:1 http://us-west-1.ec2.archive.ubuntu.com/ubuntu xenial/main amd64 libncursesw5-dbg amd64 6.0+20160213-1ubuntu1 [729 kB]
Fetched 729 kB in 0s (865 kB/s)          
Selecting previously unselected package libncursesw5-dbg.
(Reading database ... 200094 files and directories currently installed.)
Preparing to unpack .../libncursesw5-dbg_6.0+20160213-1ubuntu1_amd64.deb ...
Unpacking libncursesw5-dbg (6.0+20160213-1ubuntu1) ...
Setting up libncursesw5-dbg (6.0+20160213-1ubuntu1) ...
# <b>dpkg -l | grep libncursesw</b>
ii  libncursesw5:amd64                  6.0+20160213-1ubuntu1                    amd64        shared libraries for terminal handling (wide character support)
ii  libncursesw5-dbg                    6.0+20160213-1ubuntu1                    amd64        debugging/profiling libraries for ncursesw
</pre>

Good, those versions match. So how does our segfault look now?

<pre>
# <b>gdb `which python` /var/cores/core.python.30520</b>
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
[...]
warning: JITed object file architecture unknown is not compatible with target architecture i386:x86-64.
Core was generated by `python ./cachetop.py'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  ClrBlank (win=0x1993060) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1129
1129	    if (back_color_erase)
(gdb) <b>bt</b>
#0  ClrBlank (win=0x1993060) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1129
#1  ClrUpdate () at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1147
#2  doupdate () at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1010
#3  0x00007f0a37aa07e6 in wrefresh (win=win@entry=0x1993060) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/base/lib_refresh.c:65
#4  0x00007f0a37a99499 in recur_wrefresh (win=win@entry=0x1993060) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/base/lib_getch.c:384
#5  0x00007f0a37a99616 in _nc_wgetch (win=win@entry=0x1993060, result=result@entry=0x7ffd33d93e24, use_meta=1)
    at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/base/lib_getch.c:491
#6  0x00007f0a37a9a325 in wgetch (win=0x1993060) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/base/lib_getch.c:672
#7  0x00007f0a37cc6ec3 in ?? () from /usr/lib/python2.7/lib-dynload/_curses.x86_64-linux-gnu.so
#8  0x00000000004c4d5a in PyEval_EvalFrameEx ()
#9  0x00000000004c2e05 in PyEval_EvalCodeEx ()
#10 0x00000000004def08 in ?? ()
#11 0x00000000004b1153 in PyObject_Call ()
#12 0x00000000004c73ec in PyEval_EvalFrameEx ()
#13 0x00000000004c2e05 in PyEval_EvalCodeEx ()
#14 0x00000000004caf42 in PyEval_EvalFrameEx ()
#15 0x00000000004c2e05 in PyEval_EvalCodeEx ()
#16 0x00000000004c2ba9 in PyEval_EvalCode ()
#17 0x00000000004f20ef in ?? ()
#18 0x00000000004eca72 in PyRun_FileExFlags ()
#19 0x00000000004eb1f1 in PyRun_SimpleFileExFlags ()
#20 0x000000000049e18a in Py_Main ()
#21 0x00007f0a3be10830 in __libc_start_main (main=0x49daf0 &lt;main&gt;, argc=2, argv=0x7ffd33d94838, init=&lt;optimized out&gt;, fini=&lt;optimized out&gt;, rtld_fini=&lt;optimized out&gt;, 
    stack_end=0x7ffd33d94828) at ../csu/libc-start.c:291
#22 0x000000000049da19 in _start ()
</pre>

The stack trace looks a bit different: we aren't really in <tt>doupdate()</tt>, but in <tt>ClrBlank()</tt>, which has been inlined into <tt>ClrUpdate()</tt>, which in turn has been inlined into <tt>doupdate()</tt>.

Now I really want to see source.

## 12. Source Code

With the debug info package installed, gdb can list the source along with the assembly:

<pre>
(gdb) <b>disas/s</b>
Dump of assembler code for function doupdate:
/build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:
759	{
   0x00007f0a37aac2e0 &lt;+0&gt;:	push   %r15
   0x00007f0a37aac2e2 &lt;+2&gt;:	push   %r14
   0x00007f0a37aac2e4 &lt;+4&gt;:	push   %r13
   0x00007f0a37aac2e6 &lt;+6&gt;:	push   %r12
[...]
   0x00007f0a37aac3dd &lt;+253&gt;:	jne    0x7f0a37aac6ca &lt;doupdate+1002&gt;

1009	    if (CurScreen(SP_PARM)-&gt;_clear || NewScreen(SP_PARM)-&gt;_clear) {	/* force refresh ? */
   0x00007f0a37aac3e3 &lt;+259&gt;:	mov    0x80(%rdx),%rax
   0x00007f0a37aac3ea &lt;+266&gt;:	mov    0x88(%rdx),%rcx
   0x00007f0a37aac3f1 &lt;+273&gt;:	cmpb   $0x0,0x21(%rax)
   0x00007f0a37aac3f5 &lt;+277&gt;:	jne    0x7f0a37aac401 &lt;doupdate+289&gt;
   0x00007f0a37aac3f7 &lt;+279&gt;:	cmpb   $0x0,0x21(%rcx)
   0x00007f0a37aac3fb &lt;+283&gt;:	je     0x7f0a37aacc3b &lt;doupdate+2395&gt;

1129	    if (back_color_erase)
   0x00007f0a37aac401 &lt;+289&gt;:	mov    0x20cb68(%rip),%rax        # 0x7f0a37cb8f70
   0x00007f0a37aac408 &lt;+296&gt;:	mov    (%rax),%rsi

1128	    NCURSES_CH_T blank = blankchar;
   0x00007f0a37aac40b &lt;+299&gt;:	xor    %eax,%eax

1129	    if (back_color_erase)
=> 0x00007f0a37aac40d &lt;+301&gt;:	mov    0x10(%rsi),%rdi
   0x00007f0a37aac411 &lt;+305&gt;:	cmpb   $0x0,0x1c(%rdi)
   0x00007f0a37aac415 &lt;+309&gt;:	jne    0x7f0a37aac6f7 &lt;doupdate+1047&gt;
</pre>

Great! See the arrow "=>" and the line of code above it. So we're segfaulting on "<tt>if (back\_color\_erase)</tt>"? That doesn't seem possible. (A segfault would be due to a memory dereference, which in C would be <tt>a-&gt;b</tt> or <tt>\*a</tt>, but in this case it's just "<tt>back\_color\_erase</tt>", which looks like it's accessing an ordinary variable and not dereferencing memory.)

At this point I double checked that I had the right debug info version, and re-ran the application to segfault it in a live gdb session. Same place.

Is there something special about <tt>back_color_erase</tt>? We're in <tt>ClrBlank()</tt>, so I'll list that source code:

<pre>
(gdb) <b>list ClrBlank</b>
1124	
1125	static NCURSES_INLINE NCURSES_CH_T
1126	ClrBlank(NCURSES_SP_DCLx WINDOW *win)
1127	{
1128	    NCURSES_CH_T blank = blankchar;
1129	    if (back_color_erase)
1130		AddAttr(blank, (AttrOf(BCE_BKGD(SP_PARM, win)) & BCE_ATTRS));
1131	    return blank;
1132	}
1133	
</pre>

Ah, that's not defined in the function, so it's a global?

## 13. TUI

It's worth showing how this looks in the gdb text user interface (TUI), which I haven't used much but was inspired to try after seeing Greg's talk.

You can launch it using <tt>--tui</tt>:

<pre>
# <b>gdb --tui `which python` /var/cores/core.python.30520</b>
   ┌───────────────────────────────────────────────────────────────────────────┐
   │                                                                           │
   │                                                                           │
   │                                                                           │
   │                                                                           │
   │                                                                           │
   │                                                                           │
   │             [ No Source Available ]                                       │
   │                                                                           │
   │                                                                           │
   │                                                                           │
   │                                                                           │
   │                                                                           │
   │                                                                           │
   └───────────────────────────────────────────────────────────────────────────┘
None No process In:                                                L??   PC: ?? 
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later &lt;http://gnu.org/licenses/gpl.html&gt;
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
---Type &lt;return&gt; to continue, or q &lt;return&gt; to quit---
</pre>

It's complaining about no Python source. I could fix that, but we're crashing in libncursesw. Hitting enter lets it finish loading, at which point it loads the libncursesw debug info source code:

<pre>
   ┌──/build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c──────┐
   │1124                                                                       │
   │1125    static NCURSES_INLINE NCURSES_CH_T                                 │
   │1126    ClrBlank(NCURSES_SP_DCLx WINDOW *win)                              │
   │1127    {                                                                  │
   │1128        NCURSES_CH_T blank = blankchar;                                │
  >│1129        if (back_color_erase)                                          │
   │1130            AddAttr(blank, (AttrOf(BCE_BKGD(SP_PARM, win)) & BCE_ATTRS)│
   │1131        return blank;                                                  │
   │1132    }                                                                  │
   │1133                                                                       │
   │1134    /*                                                                 │
   │1135    **      ClrUpdate()                                                │
   │1136    **                                                                 │
   └───────────────────────────────────────────────────────────────────────────┘
multi-thre Thread 0x7f0a3c5e87 In: doupdate            L1129 PC: 0x7f0a37aac40d 
warning: JITed object file architecture unknown is not compatible with target ar
chitecture i386:x86-64.
---Type &lt;return&gt; to continue, or q &lt;return&gt; to quit---
Core was generated by `python ./cachetop.py'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  ClrBlank (win=0x1993060)
    at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1129
(gdb) 
</pre>

Awesome!

The arrow ">" shows the line of code where we crashed. It gets even better: with the <tt>layout split</tt> command we can follow the source and the disassembly in separate windows:

<pre>
   ┌──/build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c──────┐
  >│1129        if (back_color_erase)                                          │
   │1130            AddAttr(blank, (AttrOf(BCE_BKGD(SP_PARM, win)) & BCE_ATTRS)│
   │1131        return blank;                                                  │
   │1132    }                                                                  │
   │1133                                                                       │
   │1134    /*                                                                 │
   │1135    **      ClrUpdate()                                                │
   └───────────────────────────────────────────────────────────────────────────┘
  >│0x7f0a37aac40d &lt;doupdate+301&gt;   mov    0x10(%rsi),%rdi                     │
   │0x7f0a37aac411 &lt;doupdate+305&gt;   cmpb   $0x0,0x1c(%rdi)                     │
   │0x7f0a37aac415 &lt;doupdate+309&gt;   jne    0x7f0a37aac6f7 &lt;doupdate+1047&gt;      │
   │0x7f0a37aac41b &lt;doupdate+315&gt;   movswl 0x4(%rcx),%ecx                      │
   │0x7f0a37aac41f &lt;doupdate+319&gt;   movswl 0x74(%rdx),%edi                     │
   │0x7f0a37aac423 &lt;doupdate+323&gt;   mov    %rax,0x40(%rsp)                     │
   │0x7f0a37aac428 &lt;doupdate+328&gt;   movl   $0x20,0x48(%rsp)                    │
   │0x7f0a37aac430 &lt;doupdate+336&gt;   movl   $0x0,0x4c(%rsp)                     │
   └───────────────────────────────────────────────────────────────────────────┘
multi-thre Thread 0x7f0a3c5e87 In: doupdate            L1129 PC: 0x7f0a37aac40d 

chitecture i386:x86-64.
Core was generated by `python ./cachetop.py'.
Program terminated with signal SIGSEGV, Segmentation fault.
---Type &lt;return&gt; to continue, or q &lt;return&gt; to quit---
#0  ClrBlank (win=0x1993060)
    at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1129
(gdb) <b>layout split</b>
</pre>

Greg demonstrated this with reverse stepping, so you can imagine following both code and assembly execution at the same time (I'd need a video to demonstrate that here).

## 14. External: cscope

I still want to learn more about <tt>back\_color\_erase</tt>, and I could try gdb's <tt>search</tt> command, but I've found I'm quicker with an external tool: cscope, a text-based source code browser from Bell Labs in the 1980s. If you have a modern IDE that you prefer, use that instead.

Setting up cscope:

<pre>
# <b>apt-get install -y cscope</b>
# <b>wget http://archive.ubuntu.com/ubuntu/pool/main/n/ncurses/ncurses_6.0+20160213.orig.tar.gz</b>
# <b>tar xvf ncurses_6.0+20160213.orig.tar.gz</b>
# <b>cd ncurses-6.0-20160213</b>
# <b>cscope -bqR</b>
# <b>cscope -dq</b>
</pre>

<tt>cscope -bqR</tt> builds the lookup database. <tt>cscope -dq</tt> then launches cscope.

Searching for <tt>back\_color\_erase</tt> definition:

<pre>
Cscope version 15.8b                                   Press the ? key for help












Find this C symbol:
Find this global definition: <b>back_color_erase</b>
Find functions called by this function:
Find functions calling this function:
Find this text string:
Change this text string:
Find this egrep pattern:
Find this file:
Find files #including this file:
Find assignments to this symbol:
</pre>

Hitting enter:

<pre>
[...]
#define non_dest_scroll_region         CUR Booleans[26]
#define can_change                     CUR Booleans[27]
<b>#define back_color_erase               CUR Booleans[28]</b>
#define hue_lightness_saturation       CUR Booleans[29]
#define col_addr_glitch                CUR Booleans[30]
#define cr_cancels_micro_mode          CUR Booleans[31]
[...]
</pre>

Oh, a <tt>#define</tt>. (They could have at least capitalized it, as is a common style with <tt>#define</tt>'s.)

Ok, so what's <tt>CUR</tt>? Looking up definitions in cscope is a breeze.

<pre>
#define CUR cur_term->type.                                                     
</pre>

At least that <tt>#define</tt> is capitalized!

We'd found <tt>cur\_term</tt> earlier, by stepping instructions and examining registers. What is it?

<pre>
#if 0 && !0
extern NCURSES_EXPORT_VAR(TERMINAL *) cur_term;
#elif 0
NCURSES_WRAPPED_VAR(TERMINAL *, cur_term);
#define cur_term   NCURSES_PUBLIC_VAR(cur_term())
#else
<b>extern NCURSES_EXPORT_VAR(TERMINAL *) cur_term;</b>
#endif
</pre>

cscope read /usr/include/term.h for this. So, more macros. I've highlighted in bold the line of code I think is taking effect here. Why is there an "<tt>if 0 && !0 ... elif 0</tt>"? I don't know (I'd need to read more source). Sometimes programmers use "<tt>#if 0</tt>" around debug code they want to disable in production; however, this looks auto-generated.

Searching for <tt>NCURSES_EXPORT_VAR</tt> finds:

<pre>
#  define NCURSES_EXPORT_VAR(type) NCURSES_IMPEXP type
</pre>

... and <tt>NCURSES_IMPEXP</tt>:

<pre>
/* Take care of non-cygwin platforms */
#if !defined(NCURSES_IMPEXP)          
#  define NCURSES_IMPEXP /* nothing */
#endif                                
#if !defined(NCURSES_API)             
#  define NCURSES_API /* nothing */   
#endif                                
#if !defined(NCURSES_EXPORT)          
#  define NCURSES_EXPORT(type) NCURSES_IMPEXP type NCURSES_API
#endif                                
#if !defined(NCURSES_EXPORT_VAR)      
#  define NCURSES_EXPORT_VAR(type) NCURSES_IMPEXP type
#endif  
</pre>

... and <tt>TERMINAL</tt> was:

<pre>
typedef struct term {       /* describe an actual terminal */
    TERMTYPE    type;       /* terminal type description */
    short   Filedes;    /* file description being written to */
    TTY     Ottyb,      /* original state of the terminal */
        Nttyb;      /* current state of the terminal */
    int     _baudrate;  /* used to compute padding */
    char *      _termname;      /* used for termname() */
} TERMINAL;
</pre>

Gah! Now <tt>TERMINAL</tt> is capitalized. Along with the macros, this code is not that easy to follow...

Ok, who actually sets <tt>cur\_term</tt>? Remember, our problem is that it's zero, maybe because it was never initialized or because something explicitly set it to zero. Browsing the code paths that set it might provide more clues, helping answer why it isn't being set, or why it is set to zero. Using the first option in cscope:

<pre>
Find this C symbol: <b>cur_term</b>
Find this global definition:
Find functions called by this function:
Find functions calling this function:
[...]
</pre>

And browsing the entries quickly finds:

<pre>
NCURSES_EXPORT(TERMINAL *)
NCURSES_SP_NAME(<b>set_curterm</b>) (NCURSES_SP_DCLx TERMINAL * termp)
{
    TERMINAL *oldterm;

    T((T_CALLED("set_curterm(%p)"), (void *) termp));

    _nc_lock_global(curses);
    oldterm = cur_term;
    if (SP_PARM)
    SP_PARM->_term = termp;
#if USE_REENTRANT
    CurTerm = termp;
#else
    <b>cur_term = termp;</b>
#endif
</pre>

I added the highlighting. Even the function name is wrapped in a macro. But at least we've found how <tt>cur\_term</tt> is set: via <tt>set\_curterm()</tt>. Maybe that isn't being called?

## 15. External: perf-tools/ftrace/uprobes

I'll cover using gdb for this in a moment, but I can't help trying the uprobe tool from my [perf-tools] collection, which uses Linux ftrace and uprobes. One advantage of tracers is that, unlike gdb, they don't pause the target process (although that doesn't matter for this cachetop.py example). Another advantage is that I can trace a few events or a few thousand just as easily.

I should be able to trace calls to <tt>set\_curterm()</tt> in libncursesw, and even print the first argument:

<pre>
# <b>/apps/perf-tools/bin/uprobe 'p:/lib/x86_64-linux-gnu/libncursesw.so.5:set_curterm %di'</b>
ERROR: missing symbol "set_curterm" in /lib/x86_64-linux-gnu/libncursesw.so.5
</pre>

Well, that didn't work. Where is <tt>set\_curterm()</tt>? There are lots of ways to find it, like gdb or objdump:

<pre>
(gdb) <b>info symbol set_curterm</b>
set_curterm in section .text of /lib/x86_64-linux-gnu/libtinfo.so.5

# <b>objdump -tT /lib/x86_64-linux-gnu/libncursesw.so.5 | grep cur_term</b>
0000000000000000      DO *UND*	0000000000000000  NCURSES_TINFO_5.0.19991023 cur_term
# <b>objdump -tT /lib/x86_64-linux-gnu/libtinfo.so.5 | grep cur_term</b>
0000000000228948 g    DO .bss	0000000000000008  NCURSES_TINFO_5.0.19991023 cur_term
</pre>

gdb works better. Plus, had I taken a closer look at the source earlier, I would have noticed it was built into libtinfo.

Trying to trace <tt>set\_curterm()</tt> in libtinfo:

<pre>
# <b>/apps/perf-tools/bin/uprobe 'p:/lib/x86_64-linux-gnu/libtinfo.so.5:set_curterm %di'</b>
Tracing uprobe set_curterm (p:set_curterm /lib/x86_64-linux-gnu/libtinfo.so.5:0xfa80 %di). Ctrl-C to end.
          python-31617 [007] d... 24236402.719959: set_curterm: (0x7f116fcc2a80) arg1=0x1345d70
          python-31617 [007] d... 24236402.720033: set_curterm: (0x7f116fcc2a80) arg1=0x13a22e0
          python-31617 [007] d... 24236402.723804: set_curterm: (0x7f116fcc2a80) arg1=0x14cdfa0
          python-31617 [007] d... 24236402.723838: set_curterm: (0x7f116fcc2a80) arg1=0x0
^C
</pre>

That works. So <tt>set\_curterm()</tt> _is_ called, and has been called four times. The last time it was passed zero, which sounds like it could be the problem.

If you're wondering how I knew the %di register held the first argument, it comes from the AMD64/x86\_64 ABI (and the assumption that this compiled library is ABI compliant). Here's a reminder from the syscall(2) man page (note that this table shows the system call convention; the user-level function calling convention differs for arg4, using %rcx instead of %r10, but arg1 is %rdi in both):

<pre>
# <b>man syscall</b>
[...]
       arch/ABI      arg1  arg2  arg3  arg4  arg5  arg6  arg7  Notes
       ──────────────────────────────────────────────────────────────────
       arm/OABI      a1    a2    a3    a4    v1    v2    v3
       arm/EABI      r0    r1    r2    r3    r4    r5    r6
       arm64         x0    x1    x2    x3    x4    x5    -
       blackfin      R0    R1    R2    R3    R4    R5    -
       i386          ebx   ecx   edx   esi   edi   ebp   -
       ia64          out0  out1  out2  out3  out4  out5  -
       mips/o32      a0    a1    a2    a3    -     -     -     See below
       mips/n32,64   a0    a1    a2    a3    a4    a5    -
       parisc        r26   r25   r24   r23   r22   r21   -
       s390          r2    r3    r4    r5    r6    r7    -
       s390x         r2    r3    r4    r5    r6    r7    -
       sparc/32      o0    o1    o2    o3    o4    o5    -
       sparc/64      o0    o1    o2    o3    o4    o5    -
       x86_64        rdi   rsi   rdx   r10   r8    r9    -
[...]
</pre>

I'd also like to see a stack trace for the arg1=0x0 invocation, but this ftrace tool doesn't support stack traces yet.

## 16. External: bcc/BPF

Since we're debugging a bcc tool, cachetop.py, it's worth noting that bcc's trace.py has capabilities like my older uprobe tool:

<pre>
# <b>./trace.py 'p:tinfo:set_curterm "%d", arg1'</b>
TIME     PID    COMM         FUNC             -
01:00:20 31698  python       set_curterm      38018416
01:00:20 31698  python       set_curterm      38396640
01:00:20 31698  python       set_curterm      39624608
01:00:20 31698  python       set_curterm      0
</pre>

Yes, we're using bcc to debug bcc!

If you are new to [bcc], it's worth checking out. It provides Python and Lua interfaces for the new BPF tracing features in the Linux 4.x series. In short, it enables many performance tools that were previously impossible or prohibitively expensive to run. I've posted instructions for running it on [Ubuntu Xenial].

The bcc trace.py tool should have a switch for printing user stack traces, since the kernel has had BPF stack trace capabilities since Linux 4.6, although at the time of writing we haven't added that switch yet.

## 17. More Breakpoints

I should really have used gdb breakpoints on <tt>set\_curterm()</tt> to start with, but I hope that was an interesting detour through ftrace and BPF.

Back to live running mode:

<pre>
# <b>gdb `which python`</b>
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
[...]
(gdb) <b>b set_curterm</b>
Function "set_curterm" not defined.
Make breakpoint pending on future shared library load? (y or [n]) <b>y</b>
Breakpoint 1 (set_curterm) pending.
(gdb) <b>r cachetop.py</b>
Starting program: /usr/bin/python cachetop.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, set_curterm (termp=termp@entry=0xa43150) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80	{
(gdb) <b>c</b>
Continuing.

Breakpoint 1, set_curterm (termp=termp@entry=0xab5870) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80	{
(gdb) <b>c</b>
Continuing.

Breakpoint 1, set_curterm (termp=termp@entry=0xbecb90) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80	{
(gdb) <b>c</b>
Continuing.

Breakpoint 1, set_curterm (termp=0x0) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80	{
</pre>

Ok, at this breakpoint we can see that <tt>set\_curterm()</tt> is being invoked with a termp=0x0 argument, thanks to the debug info providing that detail. If I didn't have debug info, I could just print the registers on each breakpoint.
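For example, gdb can print a register automatically by attaching commands to the breakpoint (a sketch; in the x86\_64 function-call ABI, %rdi holds the first argument):

<pre>
(gdb) <b>commands 1</b>
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
><b>info registers rdi</b>
><b>continue</b>
><b>end</b>
</pre>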

I'll print the stack trace so that we can see _who_ was setting <tt>cur\_term</tt> to 0.

<pre>
(gdb) <b>bt</b>
#0  set_curterm (termp=0x0) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
#1  0x00007ffff5a44e75 in llvm::sys::Process::FileDescriptorHasColors(int) () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
#2  0x00007ffff45cabb8 in clang::driver::tools::Clang::ConstructJob(clang::driver::Compilation&, clang::driver::JobAction const&, clang::driver::InputInfo const&, llvm::SmallVector&lt;clang::driver::InputInfo, 4u&gt; const&, llvm::opt::ArgList const&, char const*) const () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
#3  0x00007ffff456ffa5 in clang::driver::Driver::BuildJobsForAction(clang::driver::Compilation&, clang::driver::Action const*, clang::driver::ToolChain const*, char const*, bool, bool, char const*, clang::driver::InputInfo&) const () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
#4  0x00007ffff4570501 in clang::driver::Driver::BuildJobs(clang::driver::Compilation&) const () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
#5  0x00007ffff457224a in clang::driver::Driver::BuildCompilation(llvm::ArrayRef&lt;char const*&gt;) () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
#6  0x00007ffff4396cda in ebpf::ClangLoader::parse(std::unique_ptr&lt;llvm::Module, std::default_delete&lt;llvm::Module&gt; &gt;*, std::unique_ptr&lt;std::vector&lt;ebpf::TableDesc, std::allocator&lt;ebpf::TableDesc&gt; &gt;, std::default_delete&lt;std::vector&lt;ebpf::TableDesc, std::allocator&lt;ebpf::TableDesc&gt; &gt; &gt; &gt;*, std::__cxx11::basic_string&lt;char, std::char_traits&lt;char&gt;, std::allocator&lt;char&gt; &gt; const&, bool, char const**, int) () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
#7  0x00007ffff4344314 in ebpf::BPFModule::load_cfile(std::__cxx11::basic_string&lt;char, std::char_traits&lt;char&gt;, std::allocator&lt;char&gt; &gt; const&, bool, char const**, int) ()
   from /usr/lib/x86_64-linux-gnu/libbcc.so.0
#8  0x00007ffff4349e5e in ebpf::BPFModule::load_string(std::__cxx11::basic_string&lt;char, std::char_traits&lt;char&gt;, std::allocator&lt;char&gt; &gt; const&, char const**, int) ()
   from /usr/lib/x86_64-linux-gnu/libbcc.so.0
#9  0x00007ffff43430c8 in bpf_module_create_c_from_string () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
#10 0x00007ffff690ae40 in ffi_call_unix64 () from /usr/lib/x86_64-linux-gnu/libffi.so.6
#11 0x00007ffff690a8ab in ffi_call () from /usr/lib/x86_64-linux-gnu/libffi.so.6
#12 0x00007ffff6b1a68c in _ctypes_callproc () from /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so
#13 0x00007ffff6b1ed82 in ?? () from /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so
#14 0x00000000004b1153 in PyObject_Call ()
#15 0x00000000004ca5ca in PyEval_EvalFrameEx ()
#16 0x00000000004c2e05 in PyEval_EvalCodeEx ()
#17 0x00000000004def08 in ?? ()
#18 0x00000000004b1153 in PyObject_Call ()
#19 0x00000000004f4c3e in ?? ()
#20 0x00000000004b1153 in PyObject_Call ()
#21 0x00000000004f49b7 in ?? ()
#22 0x00000000004b6e2c in ?? ()
#23 0x00000000004b1153 in PyObject_Call ()
#24 0x00000000004ca5ca in PyEval_EvalFrameEx ()
#25 0x00000000004c2e05 in PyEval_EvalCodeEx ()
#26 0x00000000004def08 in ?? ()
#27 0x00000000004b1153 in PyObject_Call ()
#28 0x00000000004c73ec in PyEval_EvalFrameEx ()
#29 0x00000000004c2e05 in PyEval_EvalCodeEx ()
#30 0x00000000004caf42 in PyEval_EvalFrameEx ()
#31 0x00000000004c2e05 in PyEval_EvalCodeEx ()
#32 0x00000000004c2ba9 in PyEval_EvalCode ()
#33 0x00000000004f20ef in ?? ()
#34 0x00000000004eca72 in PyRun_FileExFlags ()
#35 0x00000000004eb1f1 in PyRun_SimpleFileExFlags ()
#36 0x000000000049e18a in Py_Main ()
#37 0x00007ffff7811830 in __libc_start_main (main=0x49daf0 &lt;main&gt;, argc=2, argv=0x7fffffffdfb8, init=&lt;optimized out&gt;, fini=&lt;optimized out&gt;, rtld_fini=&lt;optimized out&gt;, 
    stack_end=0x7fffffffdfa8) at ../csu/libc-start.c:291
#38 0x000000000049da19 in _start ()
</pre>

Ok, more clues...I think. We're in <tt>llvm::sys::Process::FileDescriptorHasColors()</tt>. The llvm compiler?

## 18. External: cscope, take 2

More source code browsing using cscope, this time in llvm. The <tt>FileDescriptorHasColors()</tt> code calls <tt>terminalHasColors()</tt>, which has:

<pre>
static bool terminalHasColors(int fd) {
[...]
  // Now extract the structure allocated by setupterm and free its memory
  // through a really silly dance.
  struct term *termp = set_curterm((struct term *)nullptr);
  (void)del_curterm(termp); // Drop any errors here.
</pre>

Here's what that code used to be in an earlier version:

<pre>
static bool terminalHasColors() {
  if (const char *term = std::getenv("TERM")) {
    // Most modern terminals support ANSI escape sequences for colors.
    // We could check terminfo, or have a list of known terms that support
    // colors, but that would be overkill.
    // The user can always ask for no colors by setting TERM to dumb, or
    // using a commandline flag.
    return strcmp(term, "dumb") != 0;
  }
  return false;
}
</pre>

It [became] a "silly dance" involving calling <tt>set\_curterm()</tt> with a null pointer.
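
That older logic can be sketched in Python (my own translation for illustration, not LLVM code): any TERM value other than "dumb" was assumed to support colors, and an unset TERM meant no colors.

```python
import os

def terminal_has_colors():
    # Python sketch of the old LLVM logic (not the actual implementation):
    # most modern terminals support ANSI color escape sequences, so any
    # TERM other than "dumb" is assumed to support colors, and an unset
    # TERM means no colors.
    term = os.environ.get("TERM")
    if term is not None:
        return term != "dumb"
    return False
```

No terminfo, no ncurses, and so no <tt>set\_curterm()</tt> — which is why the older version couldn't have hit this bug.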

## 19. Writing Memory

As an experiment and to explore a possible workaround, I'll modify memory of the running process to avoid the <tt>set\_curterm()</tt> of zero.

I'll run gdb, set a breakpoint on <tt>set\_curterm()</tt>, and take it to the zero invocation:

<pre>
# <b>gdb `which python`</b>
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1                                  
[...]
(gdb) <b>b set_curterm</b>
Function "set_curterm" not defined.
Make breakpoint pending on future shared library load? (y or [n]) <b>y</b>
Breakpoint 1 (set_curterm) pending.
(gdb) <b>r cachetop.py</b>
Starting program: /usr/bin/python cachetop.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, set_curterm (termp=termp@entry=0xa43150) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80      {
(gdb) <b>c</b>
Continuing.

Breakpoint 1, set_curterm (termp=termp@entry=0xab5870) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80      {
(gdb) <b>c</b>
Continuing.

Breakpoint 1, set_curterm (termp=termp@entry=0xbecb90) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80      {
(gdb) <b>c</b>
Continuing.                                                                    
                                                                               
Breakpoint 1, set_curterm (termp=0x0) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80      { 
</pre>

At this point I'll use the <tt>set</tt> command to overwrite memory and replace zero with the previous argument of <tt>set\_curterm()</tt>, 0xbecb90, seen above, in the hope that it's still valid.

<b>WARNING: Writing memory is not safe!</b> gdb won't ask "are you sure?". If you get it wrong or make a typo, you will corrupt the application. Best case, your application crashes immediately, and you realize your mistake. Worst case, your application continues with silently corrupted data that is only discovered years later.

In this case, I'm experimenting on a lab machine with no production data, so I'll continue. I'll print the value of the %rdi register as hex (<tt>p/x</tt>), then <tt>set</tt> it to the previous address, print it again, then print all registers:

<pre>
(gdb) <b>p/x $rdi</b>
$1 = 0x0
(gdb) <b>set $rdi=0xbecb90</b>
(gdb) <b>p/x $rdi</b>
$2 = 0xbecb90
(gdb) <b>i r</b>
rax            0x100	256
rbx            0x1	1
rcx            0xe71	3697
rdx            0x0	0
rsi            0x7ffff5dd45d3	140737318307283
rdi            0xbecb90	12503952
rbp            0x100	0x100
rsp            0x7fffffffa5b8	0x7fffffffa5b8
r8             0xbf0050	12517456
r9             0x1999999999999999	1844674407370955161
r10            0xbf0040	12517440
r11            0x7ffff7bb4b78	140737349634936
r12            0xbecb70	12503920
r13            0xbeaea0	12496544
r14            0x7fffffffa9a0	140737488333216
r15            0x7fffffffa8a0	140737488332960
rip            0x7ffff3c76a80	0x7ffff3c76a80 &lt;set_curterm&gt;
eflags         0x246	[ PF ZF IF ]
cs             0x33	51
ss             0x2b	43
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
</pre>

(Since at this point I have debuginfo installed, I didn't actually need to refer to registers: I could have called <tt>set</tt> on "<tt>termp</tt>", the named argument to <tt>set\_curterm()</tt>, instead of <tt>$rdi</tt>.)

%rdi is now populated, so those registers look ok to continue.

<pre>
(gdb) <b>c</b>
Continuing.

Breakpoint 1, set_curterm (termp=termp@entry=0x0) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80	{
</pre>

Ok, we survived a call to <tt>set\_curterm()</tt>! However, we've hit another, also with an argument of zero. Trying our write trick again:

<pre>
(gdb) <b>set $rdi=0xbecb90</b>
(gdb) <b>c</b>
Continuing.
warning: JITed object file architecture unknown is not compatible with target architecture i386:x86-64.

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff34ad411 in ClrBlank (win=0xaea060) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1129
1129	    if (back_color_erase)
</pre>

Ahhh. That's what I get for writing memory. So this experiment ended in another segfault.

## 20. Conditional Breakpoints

In the previous section, I had to use three continues to reach the right invocation of a breakpoint. If that were hundreds of invocations, then I'd use a conditional breakpoint. Here's an example.

I'll run the program and break on <tt>set\_curterm()</tt> as usual:

<pre>
# <b>gdb `which python`</b>
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1                                  
[...]
(gdb) <b>b set_curterm</b>
Function "set_curterm" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (set_curterm) pending.
(gdb) <b>r cachetop.py</b>
Starting program: /usr/bin/python cachetop.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, set_curterm (termp=termp@entry=0xa43150) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80	{
</pre>

Now I'll turn breakpoint 1 into a conditional breakpoint, so that it only fires when the %rdi register is zero:

<pre>
(gdb) <b>cond 1 $rdi==0x0</b>
(gdb) <b>i b</b>
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x00007ffff3c76a80 in set_curterm at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
	stop only if $rdi==0x0
	breakpoint already hit 1 time
(gdb) <b>c</b>
Continuing.

Breakpoint 1, set_curterm (termp=0x0) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
(gdb)
</pre>

Neat! <tt>cond</tt> is short for <tt>condition</tt>. So why didn't I set it right away, when I first created the "pending" breakpoint? I've found conditions don't work on pending breakpoints, at least on this gdb version. (Either that or I'm doing it wrong.) I also used <tt>i b</tt> here (<tt>info breakpoints</tt>) to list the breakpoints with their details.
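
The idea behind a conditional breakpoint — only stop when a predicate on the call is true — can be sketched in pure Python using its trace hook (my own illustration, unrelated to gdb internals; the <tt>set_curterm</tt> here is a stand-in, not the ncurses function):

```python
import sys

def trace_calls_matching(func_name, condition, hits):
    # Record a call only when func_name is entered and condition(frame)
    # is true -- analogous to gdb's "cond 1 $rdi==0x0".
    def tracer(frame, event, arg):
        if event == "call" and frame.f_code.co_name == func_name:
            if condition(frame):
                hits.append(dict(frame.f_locals))
        return None  # no per-line tracing needed
    return tracer

def set_curterm(termp):  # stand-in for the real ncurses function
    return termp

hits = []
sys.settrace(trace_calls_matching("set_curterm",
                                  lambda f: f.f_locals.get("termp") == 0, hits))
for termp in (0xa43150, 0xab5870, 0xbecb90, 0):
    set_curterm(termp)
sys.settrace(None)
# hits now contains only the one call where termp was 0
```

gdb does the same filtering, just at the machine level: the breakpoint still fires on every call, but execution only stops for you when the condition evaluates true.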

## 21. Returns

I did try another write-like hack, but this time changing the instruction path rather than the data.

<b>WARNING: see previous warning</b>, which also applies here.

I'll take us to the <tt>set\_curterm()</tt> 0x0 breakpoint as before, and then issue a <tt>ret</tt> (short for <tt>return</tt>), which will return from the function immediately and not execute it. My hope is that by not executing it, it won't set the global <tt>curterm</tt> to 0x0.

<pre>
[...]
(gdb) <b>c</b>
Continuing.

Breakpoint 1, set_curterm (termp=0x0) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80

(gdb) <b>ret</b>
Make set_curterm return now? (y or n) <b>y</b>
#0  0x00007ffff5a44e75 in llvm::sys::Process::FileDescriptorHasColors(int) () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
(gdb) <b>c</b>
Continuing.

Program received signal SIGSEGV, Segmentation fault.
                                                    _nc_free_termtype (ptr=ptr@entry=0x100) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/free_ttype.c:52
52	    FreeIfNeeded(ptr->str_table);
</pre>

Another crash. Again, that's what I get for messing in this way.

One more try. After browsing the code a bit more, I want to try doing a <tt>ret</tt> twice, in case the parent function is also involved. Again, this is just a hacky experiment:

<pre>
[...]
(gdb) <b>c</b>
Continuing.

Breakpoint 1, set_curterm (termp=0x0) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80	{
(gdb) <b>ret</b>
Make set_curterm return now? (y or n) <b>y</b>
#0  0x00007ffff5a44e75 in llvm::sys::Process::FileDescriptorHasColors(int) () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
(gdb) <b>ret</b>
Make selected stack frame return now? (y or n) <b>y</b>
#0  0x00007ffff45cabb8 in clang::driver::tools::Clang::ConstructJob(clang::driver::Compilation&, clang::driver::JobAction const&, clang::driver::InputInfo const&, llvm::SmallVector<clang::driver::InputInfo, 4u> const&, llvm::opt::ArgList const&, char const*) const () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
(gdb) <b>c</b>
</pre>

The screen goes blank and pauses...then redraws:

<pre>
07:44:22 Buffers MB: 61 / Cached MB: 1246
PID      UID      CMD              HITS     MISSES   DIRTIES  READ_HIT%  WRITE_HIT%
    2742 root     systemd-logind          3       66        2       1.4%      95.7%
   15836 root     kworker/u30:1           7        0        1      85.7%       0.0%
    2736 messageb dbus-daemon             8       66        2       8.1%      89.2%
       1 root     systemd                15        0        0     100.0%       0.0%
    2812 syslog   rs:main Q:Reg          16       66        8       9.8%      80.5%
     435 root     systemd-journal        32       66        8      24.5%      67.3%
    2740 root     accounts-daemon       113       66        2      62.0%      36.9%
   15847 root     bash                  160        0        1      99.4%       0.0%
   15864 root     lesspipe              306        0        2      99.3%       0.0%
   15854 root     bash                  309        0        2      99.4%       0.0%
   15856 root     bash                  309        0        2      99.4%       0.0%
   15866 root     bash                  309        0        2      99.4%       0.0%
   15867 root     bash                  309        0        2      99.4%       0.0%
   15860 root     bash                  313        0        2      99.4%       0.0%
   15868 root     bash                  341        0        2      99.4%       0.0%
   15858 root     uname                 452        0        2      99.6%       0.0%
   15858 root     bash                  453        0        2      99.6%       0.0%
   15866 root     dircolors             464        0        2      99.6%       0.0%
   15861 root     basename              465        0        2      99.6%       0.0%
   15864 root     dirname               468        0        2      99.6%       0.0%
   15856 root     ls                    476        0        2      99.6%       0.0%
[...]
</pre>

Wow! It's working!

## 22. A Better Workaround

I'd been posting debugging output to [github], especially since the lead BPF engineer, Alexei Starovoitov, is also well versed in llvm internals, and the root cause seemed to be a bug in llvm. While I was messing with writes and returns, he suggested adding the llvm option <tt>-fno-color-diagnostics</tt> to bcc, to avoid this problem code path. It worked! It was added to bcc as a workaround. (And we should get that llvm bug fixed.)

## 23. Python Context

At this point we've fixed the problem, but you might be curious to see the stack trace fully resolved.

Adding python-dbg:

<pre>
# <b>apt-get install -y python-dbg</b>
Reading package lists... Done
[...]
The following additional packages will be installed:
  libpython-dbg libpython2.7-dbg python2.7-dbg
Suggested packages:
  python2.7-gdbm-dbg python2.7-tk-dbg python-gdbm-dbg python-tk-dbg
The following NEW packages will be installed:
  libpython-dbg libpython2.7-dbg python-dbg python2.7-dbg
0 upgraded, 4 newly installed, 0 to remove and 20 not upgraded.
Need to get 11.9 MB of archives.
After this operation, 36.4 MB of additional disk space will be used.
[...]
</pre>

Now I'll rerun gdb and view the stack trace:

<pre>
# <b>gdb `which python` /var/cores/core.python.30520</b>
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
[...]
Reading symbols from /usr/bin/python...Reading symbols from /usr/lib/debug/.build-id/4e/a0539215b2a9e32602f81c90240874132c1a54.debug...done.
[...]
(gdb) <b>bt</b>
#0  ClrBlank (win=0x1993060) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1129
#1  ClrUpdate () at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1147
#2  doupdate () at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1010
#3  0x00007f0a37aa07e6 in wrefresh (win=win@entry=0x1993060) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/base/lib_refresh.c:65
#4  0x00007f0a37a99499 in recur_wrefresh (win=win@entry=0x1993060) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/base/lib_getch.c:384
#5  0x00007f0a37a99616 in _nc_wgetch (win=win@entry=0x1993060, result=result@entry=0x7ffd33d93e24, use_meta=1)
    at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/base/lib_getch.c:491
#6  0x00007f0a37a9a325 in wgetch (win=0x1993060) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/base/lib_getch.c:672
#7  0x00007f0a37cc6ec3 in PyCursesWindow_GetCh.lto_priv.109 (self=0x7f0a3c57b198, args=()) at /build/python2.7-HpIZBG/python2.7-2.7.11/Modules/_cursesmodule.c:853
#8  0x00000000004c4d5a in call_function (oparg=&lt;optimized out&gt;, pp_stack=0x7ffd33d93f30) at ../Python/ceval.c:4350
#9  PyEval_EvalFrameEx () at ../Python/ceval.c:2987
#10 0x00000000004c2e05 in PyEval_EvalCodeEx () at ../Python/ceval.c:3582
#11 0x00000000004def08 in function_call.lto_priv () at ../Objects/funcobject.c:523
#12 0x00000000004b1153 in PyObject_Call () at ../Objects/abstract.c:2546
#13 0x00000000004c73ec in ext_do_call (nk=0, na=&lt;optimized out&gt;, flags=&lt;optimized out&gt;, pp_stack=0x7ffd33d941e8, func=&lt;function at remote 0x7f0a37edcc80&gt;)
    at ../Python/ceval.c:4662
#14 PyEval_EvalFrameEx () at ../Python/ceval.c:3026
#15 0x00000000004c2e05 in PyEval_EvalCodeEx () at ../Python/ceval.c:3582
#16 0x00000000004caf42 in fast_function (nk=0, na=&lt;optimized out&gt;, n=&lt;optimized out&gt;, pp_stack=0x7ffd33d943f0, func=&lt;function at remote 0x7f0a38039140&gt;)
    at ../Python/ceval.c:4445
#17 call_function (oparg=&lt;optimized out&gt;, pp_stack=0x7ffd33d943f0) at ../Python/ceval.c:4370
#18 PyEval_EvalFrameEx () at ../Python/ceval.c:2987
#19 0x00000000004c2e05 in PyEval_EvalCodeEx () at ../Python/ceval.c:3582
#20 0x00000000004c2ba9 in PyEval_EvalCode (co=&lt;optimized out&gt;, globals=&lt;optimized out&gt;, locals=&lt;optimized out&gt;) at ../Python/ceval.c:669
#21 0x00000000004f20ef in run_mod.lto_priv () at ../Python/pythonrun.c:1376
#22 0x00000000004eca72 in PyRun_FileExFlags () at ../Python/pythonrun.c:1362
#23 0x00000000004eb1f1 in PyRun_SimpleFileExFlags () at ../Python/pythonrun.c:948
#24 0x000000000049e18a in Py_Main () at ../Modules/main.c:640
#25 0x00007f0a3be10830 in __libc_start_main (main=0x49daf0 &lt;main&gt;, argc=2, argv=0x7ffd33d94838, init=&lt;optimized out&gt;, fini=&lt;optimized out&gt;, rtld_fini=&lt;optimized out&gt;, 
    stack_end=0x7ffd33d94828) at ../csu/libc-start.c:291
#26 0x000000000049da19 in _start ()
</pre>

No more "<tt>??</tt>"s, but not hugely more helpful yet.

The python debug packages have added other capabilities to gdb. Now we can look at the python backtrace:

<pre>
(gdb) <b>py-bt</b>
Traceback (most recent call first):
  File "./cachetop.py", line 188, in handle_loop
    s = stdscr.getch()
  File "/usr/lib/python2.7/curses/wrapper.py", line 43, in wrapper
    return func(stdscr, *args, **kwds)
  File "./cachetop.py", line 260, in <module>
    curses.wrapper(handle_loop, args)
</pre>

... and Python source list:

<pre>
(gdb) <b>py-list</b>
 183        b.attach_kprobe(event="mark_buffer_dirty", fn_name="do_count")
 184    
 185        exiting = 0
 186    
 187        while 1:
>188            s = stdscr.getch()
 189            if s == ord('q'):
 190                exiting = 1
 191            elif s == ord('r'):
 192                sort_reverse = not sort_reverse
 193            elif s == ord('&lt;'):
</pre>

It identifies where in our Python code we were executing when we hit the segfault. That's really nice!

The problem with the initial stack trace is that we were seeing the Python internals that execute the methods, but not the methods themselves. If you're debugging another language, it's up to its compiler/runtime how it ends up executing code. If you do a web search for "_language name_" and "gdb" you might find it has gdb debugging extensions like Python does. If it doesn't, the bad news is you'll need to write your own. The good news is that this is even possible! Search for documentation on "adding new GDB commands in Python", as such extensions can be written in Python.
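
As a taste of that, here's a sketch of a custom gdb command written in Python. This only runs inside gdb (the <tt>gdb</tt> module doesn't exist outside it), and the command name <tt>bnull</tt> and the filename are my own invention:

```python
# bnull.py -- a sketch only; load inside gdb with: (gdb) source bnull.py
import gdb  # this module only exists inside a gdb process

class BreakOnNullArg(gdb.Command):
    """bnull FUNC: break on FUNC only when its first integer argument (%rdi) is zero."""

    def __init__(self):
        super(BreakOnNullArg, self).__init__("bnull", gdb.COMMAND_BREAKPOINTS)

    def invoke(self, arg, from_tty):
        # Create the breakpoint, then attach the condition, just like the
        # interactive "b" and "cond" commands used in section 20.
        bp = gdb.Breakpoint(arg)
        bp.condition = "$rdi == 0x0"

BreakOnNullArg()
```

With that loaded, <tt>bnull set_curterm</tt> would do in one step what the <tt>b</tt> plus <tt>cond</tt> pair did earlier.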

## 24. And More

While it might look like I've written a comprehensive tour of gdb, I really haven't: there's a lot more to gdb. The <tt>help</tt> command will list the major sections:

<pre>
(gdb) <b>help</b>
List of classes of commands:

aliases -- Aliases of other commands
breakpoints -- Making program stop at certain points
data -- Examining data
files -- Specifying and examining files
internals -- Maintenance commands
obscure -- Obscure features
running -- Running the program
stack -- Examining the stack
status -- Status inquiries
support -- Support facilities
tracepoints -- Tracing of program execution without stopping the program
user-defined -- User-defined commands

Type "help" followed by a class name for a list of commands in that class.
Type "help all" for the list of all commands.
Type "help" followed by command name for full documentation.
Type "apropos word" to search for commands related to "word".
Command name abbreviations are allowed if unambiguous.
</pre>

You can then run <tt>help</tt> on each command class. For example, here's the full listing for breakpoints:

<pre>
(gdb) <b>help breakpoints</b>
Making program stop at certain points.

List of commands:

awatch -- Set a watchpoint for an expression
break -- Set breakpoint at specified location
break-range -- Set a breakpoint for an address range
catch -- Set catchpoints to catch events
catch assert -- Catch failed Ada assertions
catch catch -- Catch an exception
catch exception -- Catch Ada exceptions
catch exec -- Catch calls to exec
catch fork -- Catch calls to fork
catch load -- Catch loads of shared libraries
catch rethrow -- Catch an exception
catch signal -- Catch signals by their names and/or numbers
catch syscall -- Catch system calls by their names and/or numbers
catch throw -- Catch an exception
catch unload -- Catch unloads of shared libraries
catch vfork -- Catch calls to vfork
clear -- Clear breakpoint at specified location
commands -- Set commands to be executed when a breakpoint is hit
condition -- Specify breakpoint number N to break only if COND is true
delete -- Delete some breakpoints or auto-display expressions
delete bookmark -- Delete a bookmark from the bookmark list
delete breakpoints -- Delete some breakpoints or auto-display expressions
delete checkpoint -- Delete a checkpoint (experimental)
delete display -- Cancel some expressions to be displayed when program stops
delete mem -- Delete memory region
delete tracepoints -- Delete specified tracepoints
delete tvariable -- Delete one or more trace state variables
disable -- Disable some breakpoints
disable breakpoints -- Disable some breakpoints
disable display -- Disable some expressions to be displayed when program stops
disable frame-filter -- GDB command to disable the specified frame-filter
disable mem -- Disable memory region
disable pretty-printer -- GDB command to disable the specified pretty-printer
disable probes -- Disable probes
disable tracepoints -- Disable specified tracepoints
disable type-printer -- GDB command to disable the specified type-printer
disable unwinder -- GDB command to disable the specified unwinder
disable xmethod -- GDB command to disable a specified (group of) xmethod(s)
dprintf -- Set a dynamic printf at specified location
enable -- Enable some breakpoints
enable breakpoints -- Enable some breakpoints
enable breakpoints count -- Enable breakpoints for COUNT hits
enable breakpoints delete -- Enable breakpoints and delete when hit
enable breakpoints once -- Enable breakpoints for one hit
enable count -- Enable breakpoints for COUNT hits
enable delete -- Enable breakpoints and delete when hit
enable display -- Enable some expressions to be displayed when program stops
enable frame-filter -- GDB command to disable the specified frame-filter
enable mem -- Enable memory region
enable once -- Enable breakpoints for one hit
enable pretty-printer -- GDB command to enable the specified pretty-printer
enable probes -- Enable probes
enable tracepoints -- Enable specified tracepoints
enable type-printer -- GDB command to enable the specified type printer
enable unwinder -- GDB command to enable unwinders
enable xmethod -- GDB command to enable a specified (group of) xmethod(s)
ftrace -- Set a fast tracepoint at specified location
hbreak -- Set a hardware assisted breakpoint
ignore -- Set ignore-count of breakpoint number N to COUNT
rbreak -- Set a breakpoint for all functions matching REGEXP
rwatch -- Set a read watchpoint for an expression
save -- Save breakpoint definitions as a script
save breakpoints -- Save current breakpoint definitions as a script
save gdb-index -- Save a gdb-index file
save tracepoints -- Save current tracepoint definitions as a script
skip -- Ignore a function while stepping
skip delete -- Delete skip entries
skip disable -- Disable skip entries
skip enable -- Enable skip entries
skip file -- Ignore a file while stepping
skip function -- Ignore a function while stepping
strace -- Set a static tracepoint at location or marker
tbreak -- Set a temporary breakpoint
tcatch -- Set temporary catchpoints to catch events
tcatch assert -- Catch failed Ada assertions
tcatch catch -- Catch an exception
tcatch exception -- Catch Ada exceptions
tcatch exec -- Catch calls to exec
tcatch fork -- Catch calls to fork
tcatch load -- Catch loads of shared libraries
tcatch rethrow -- Catch an exception
tcatch signal -- Catch signals by their names and/or numbers
tcatch syscall -- Catch system calls by their names and/or numbers
tcatch throw -- Catch an exception
tcatch unload -- Catch unloads of shared libraries
tcatch vfork -- Catch calls to vfork
thbreak -- Set a temporary hardware assisted breakpoint
trace -- Set a tracepoint at specified location
watch -- Set a watchpoint for an expression

Type "help" followed by command name for full documentation.
Type "apropos word" to search for commands related to "word".
Command name abbreviations are allowed if unambiguous.
</pre>

This helps to illustrate how many capabilities gdb has, and how few I needed to use in this example.

## 25. Final Words

Well, that was kind of a nasty issue: an LLVM bug breaking ncurses and causing a Python program to segfault. But the commands and procedures I used to debug it were mostly routine: viewing stack traces, checking registers, setting breakpoints, stepping, and browsing source. 

When I first used gdb (years ago), I really didn't like it. It felt clumsy and limited. gdb has improved a lot since then, as have my gdb skills, and I now see it as a powerful modern debugger. Feature sets vary between debuggers, but gdb may be the most powerful text-based debugger nowadays, with lldb catching up.

I hope anyone searching for gdb examples finds the full output I've shared to be useful, as well as the various caveats I discussed along the way. Maybe I'll post some more gdb sessions when I get a chance, especially for other runtimes like Java.

It's <tt>q</tt> to quit gdb.

[Give me 15 minutes and I'll change your view of GDB]: http://undo.io/resources/presentations/cppcon-2015-greg-law-give-me-15-minutes-ill-change/
[pull request]: https://github.com/iovisor/bcc/pull/615
[cachetop]: https://github.com/iovisor/bcc/blob/master/tools/cachetop.py
[perf-tools]: https://github.com/brendangregg/perf-tools
[bcc]: https://github.com/iovisor/bcc
[became]: https://github.com/llvm-mirror/llvm/commit/d485e7bd7639cd6b39c6113a30fbc3cdc8c41c4c#diff-a4fb6575a290937bc9142e3d7efc8989
[github]: https://github.com/iovisor/bcc/pull/615
[kernel.txt]: https://www.kernel.org/doc/Documentation/sysctl/kernel.txt
[Ubuntu Xenial]: /blog/2016-06-14/ubuntu-xenial-bcc-bpf.html
]]></content:encoded>
      <dc:date>2016-08-09T00:00:00-07:00</dc:date>
    </item>
    <item>
      <title>Deirdré</title>
      <link>http://www.brendangregg.com/blog/2016-07-23/deirdre.html</link>
      <description><![CDATA[

Several years ago I had a burning desire to improve the state of technical education, and wanted to develop books, blog posts, talks, and videos. I&#39;d previously worked as a technical instructor and saw opportunities to try new things, but had encountered resistance to change. Some felt it both ridiculous and egotistical to want to actually film engineers talking about their own engineering! Look at youtube now, where almost every conference talk is videoed, is popular, and is hugely useful. I wanted to do this a decade ago.
]]></description>
      <pubDate>Sat, 23 Jul 2016 00:00:00 -0700</pubDate>
      <guid>http://www.brendangregg.com/blog/2016-07-23/deirdre.html</guid>
      <content:encoded><![CDATA[<div style="float:right;padding-left:10px;padding-right:5px;padding-top:1px;padding-bottom:1px"><a href="/blog/images/2016/deirdre_2011s.jpg"><img src="/blog/images/2016/deirdre_2011s.jpg" width=250 border=0></a></div>
Several years ago I had a burning desire to improve the state of technical education, and wanted to develop books, blog posts, talks, and videos. I'd previously worked as a technical instructor and saw opportunities to try new things, but had encountered resistance to change. Some felt it both ridiculous and egotistical to want to actually film engineers talking about their own engineering! Look at youtube now, where almost every conference talk is videoed, is popular, and is hugely useful. I wanted to do this a decade ago.

At Kernel Conference Australia in 2009, I met Deirdré Straughan. She was an OpenSolaris community manager at the time, and was dragging video equipment with her everywhere to film engineers. She shared my passion and drive to create excellent technical content, pioneer new ideas, and to help the community. She was also wicked smart, and someone who I felt was my equal.

Since then we've worked on many projects together: articles, blog posts, talks, videos, and books. She helped me get my first solo conference talk, USENIX/LISA 2010, and helped with many more. She's filmed me on countless occasions, and edited countless articles. I was never a great writer to begin with, but I've improved a lot with her help, and cannot write anything nowadays without hearing her advice in my head.

The largest projects we worked on were my last two books. We spent hundreds of hours together, discussing ideas, planning, creating, researching, soliciting feedback, and arguing. Deirdré touched every word on every page, had me rewrite sections, delete sections, and rewrote sections herself. She was able to handle any technical depth and improve the content without breaking subtle technical meanings.

Copy-editing large books, when you both really care about the outcome, is an intimate process. You get to know someone. I got to know Deirdré, who is not just an amazing woman in tech, but an amazing person.

Last year Deirdré got cancer. For someone I cared deeply about and who has helped me so much, it was painful to feel so helpless in return, although my pain was nothing compared to what she endured. I couldn't do much, apart from drive her to chemo, and stay with her every night. Fortunately, as far as the doctors can say, the treatment worked! Deirdré is cured.

Will we write another book? We have several good ideas, and we started work on one a while ago, although my spare time has mostly gone to other projects, especially bcc/BPF (my books are spare-time projects). The heavy lifting in bcc/BPF is wrapping up soon, so I'll have more time for books... I'll also have more time with Deirdré.

Deirdré and I are now partners!

You can read more about Deirdré on [her blog] and the [techies] project, and follow her on <a href="https://twitter.com/DeirdreS">twitter</a>.

[her blog]: http://www.beginningwithi.com/
[techies]: http://www.techiesproject.com/deirdre-straughan/
]]></content:encoded>
      <dc:date>2016-07-23T00:00:00-07:00</dc:date>
    </item>
    <item>
      <title>llnode for Node.js Memory Leak Analysis</title>
      <link>http://www.brendangregg.com/blog/2016-07-13/llnode-nodejs-memory-leak-analysis.html</link>
      <description><![CDATA[The memory of your Node.js process is growing endlessly, what do you do?
]]></description>
      <pubDate>Wed, 13 Jul 2016 00:00:00 -0700</pubDate>
      <guid>http://www.brendangregg.com/blog/2016-07-13/llnode-nodejs-memory-leak-analysis.html</guid>
      <content:encoded><![CDATA[The memory of your Node.js process is growing endlessly, what do you do?

I start with a page fault flame graph using Linux [perf], which can be generated on the live running process with low overhead. That only solves some issues, though. Other approaches include analyzing heap snapshots with [heapdump], or taking a core dump and analyzing it with mdb findjsobjects on an old Solaris image.

Fortunately, findjsobjects has just been made available for Linux in [llnode], which I'll write about here. Thanks to Fedor Indutny for creating llnode, to Howard Hellyer for [contributing] findjsobjects support, and to Dave Pacheco for first developing this [kind of analysis].

llnode is not yet well known or documented. To help out, I'll share the commands and screenshots I used for taking a fresh Ubuntu Xenial server with Node v4.4.7 and running some memory object analysis. NOTE: It's 13-Jul-2016, and I'd expect this to be simplified in future versions. See the [llnode] repository for the latest.
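If you'd like something to practice on first, a trivially leaky app is enough. The following is a hypothetical sketch (this `handleRequest` and `leaked` array are made up for illustration, not from the app analyzed below): any object pushed into a long-lived array stays reachable forever, so its instance count climbs with every request, which is exactly the kind of growth findjsobjects makes visible.

```javascript
// Hypothetical sketch of the leak pattern findjsobjects surfaces:
// objects retained in a module-level array are never released.
const leaked = [];

function handleRequest(req) {
  leaked.push(req);       // retained forever: the leak
  return 'hello\n';
}

// Simulate traffic: after 1000 requests, 1000 objects are still live,
// so their type's instance count in findjsobjects keeps growing.
for (let i = 0; i < 1000; i++) {
  handleRequest({ id: i });
}
console.log(leaked.length);  // 1000
```

A real server built this way would show IncomingMessage and ServerResponse counts growing the same way under load.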

## 1. Installing llnode

Just the commands:

<pre>
sudo bash
apt-get update
apt-get install -y lldb-3.8 lldb-3.8-dev gcc g++ make gdb lldb
apt-get install -y python-pip; pip install six
git clone https://github.com/indutny/llnode
cd llnode
git clone https://chromium.googlesource.com/external/gyp.git tools/gyp
./gyp_llnode -Dlldb_dir=/usr/lib/llvm-3.8/
make -C out/ -j2
make install-linux
</pre>

Perhaps in the future this will just be "apt-get install -y llnode". llnode works on Mac OS as well, and can be installed via brew.

## 2. Taking a Core Dump

You can use gcore, noting that it may pause node for some period (short or long). My Node.js PID is 30833.

<pre>
# <b>ulimit -c unlimited</b>
# <b>gcore 30833</b>
[New LWP 30834]
[New LWP 30835]
[New LWP 30836]
[New LWP 30837]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38	../sysdeps/unix/sysv/linux/x86_64/syscall.S: No such file or directory.
warning: target file /proc/30833/cmdline contained unexpected null characters
Saved corefile core.30833
# ls -lh core.30833 
-rw-r--r-- 1 root root 855M Jul 12 20:56 core.30833
</pre>

That's failing, and the core dump isn't readable (lldb will hang if you try to load it). It's likely that the gdb (which provides gcore) installed on Xenial doesn't match this kernel version. I thought I'd better note this in case you hit it too.

If gcore works for you, great. If not, try another method; lldb may be able to do it directly with its "process save-core" command.

I then tried crashing the process instead. WARNING: this will crash the process! To start with I disabled Ubuntu's apport, set up traditional core dumps, then sent a SIGBUS (why not):

<pre>
# <b>mkdir /var/cores</b>
# <b>echo "/var/cores/core.%e.%p.%t" > /proc/sys/kernel/core_pattern</b>
# <b>kill -BUS 30833</b>
# <b>ls -lh /var/cores/core.node.30833.1468367170</b>
-rw------- 1 root root 65M Jul 12 23:46 /var/cores/core.node.30833.1468367170
</pre>

For production use I'll go fix gcore (although we are in a fault-tolerant environment).

## 3. Memory Ranges File

For findjsobjects to work, a memory ranges file and environment variable must be set. There are scripts to generate this file for Mac OS and Linux in llnode. While in the llnode directory:

<pre>
# <b>./scripts/readelf2segments.py /var/cores/core.node.30833.1468367170 > core.30833.ranges</b>
# <b>export LLNODE_RANGESFILE=core.30833.ranges</b>
</pre>

I'd expect this step to vanish one day when lldb can do it all [directly].

## 4. Starting llnode

Starting up lldb (as root):

<pre>
# <b>lldb -c /var/cores/core.node.30833.1468367170</b>
(lldb) target create --core "/var/cores/core.node.30833.1468367170"
Core file '/var/cores/core.node.30833.1468367170' (x86_64) was loaded.
</pre>

The help message:

<pre>
(lldb) <b>v8</b>
The following subcommands are supported:

      bt              -- Show a backtrace with node.js JavaScript functions and their args. An optional argument is accepted; if that argument is a number, it specifies
                         the number of frames to display. Otherwise all frames will be dumped.
                         
                         Syntax: v8 bt [number]
      findjsinstances -- List all objects which share the specified map.
                         Accepts the same options as `v8 inspect`
      findjsobjects   -- List all object types and instance counts grouped by map and sorted by instance count.
                         Requires `LLNODE_RANGESFILE` environment variable to be set to a file containing memory ranges for the core file being debugged.
                         There are scripts for generating this file on Linux and Mac in the scripts directory of the llnode repository.
      inspect         -- Print detailed description and contents of the JavaScript value.
                         
                         Possible flags (all optional):
                         
                          * -F, --full-string    - print whole string without adding ellipsis
                          * -m, --print-map      - print object's map address
                          * --string-length num  - print maximum of `num` characters in string
                         
                         Syntax: v8 inspect [flags] expr
      print           -- Print short description of the JavaScript value.
                         
                         Syntax: v8 print expr
      source          -- Source code information

For more help on any particular subcommand, type 'help <command> <subcommand>'.
</pre>

Stack backtrace (not interesting in this case, just showing that the command exists):

<pre>
(lldb) <b>v8 bt</b>
 * thread #1: tid = 0, 0x00007f75719a7c19 libc.so.6`syscall + 25 at syscall.S:38, name = 'node', stop reason = signal SIGBUS
  * frame #0: 0x00007f75719a7c19 libc.so.6`syscall + 25 at syscall.S:38
    frame #1: 0x0000000000fc375a node`uv__epoll_wait(epfd=<unavailable>, events=<unavailable>, nevents=<unavailable>, timeout=<unavailable>) + 26 at linux-syscalls.c:321
    frame #2: 0x0000000000fc1838 node`uv__io_poll(loop=0x00000000019aa080, timeout=-1) + 424 at linux-core.c:243
    frame #3: 0x0000000000fb2506 node`uv_run(loop=0x00000000019aa080, mode=UV_RUN_ONCE) + 342 at core.c:351
    frame #4: 0x0000000000dfedf8 node`node::Start(int, char**) + 1080
    frame #5: 0x00007f75718c7830 libc.so.6`__libc_start_main(main=(node`main), argc=2, argv=0x00007ffc531bb6b8, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007ffc531bb6a8) + 240 at libc-start.c:291
    frame #6: 0x0000000000722f1d node`_start + 41
</pre>

## 5. llnode Object Analysis

Finding all JavaScript objects (reminds me of Java's jmap -histo):

<pre>
(lldb) <b>v8 findjsobjects</b>
 Instances  Total Size Name
 ---------- ---------- ----
          1         24 (anonymous)
          1         24 JSON
          1         24 process
          1         32 Arguments
          1         32 HTTPParser
          1         32 Signal
          1         32 Timer
          1         32 WriteWrap
          1         56 DefineError.aV
          1         56 MathConstructor
          1         56 RangeError
          1         96 Console
          1        104 Agent
          1        112 Server
          1        120 exports.FreeList
          1        136 Module
          2         64 TTY
          2        208 EventEmitter
          2        208 WriteStream
          6        336 Error
          7        560 (ArrayBufferView)
         29        704 (Object)
         42       2352 NativeModule
       1219     146280 ServerResponse
       1219     263304 IncomingMessage
       1220      39040 TCP
       1859     416416 Socket
       1860     386880 WritableState
       2435     136360 WriteReq
       3080     591360 ReadableState
       3718     178528 CorkedRequest
       9708     543000 Object
       9747     467856 TickObject
      32585    1042720 (Array)
</pre>

This is a simple app without a real issue; I'm just showing the commands and screenshots. At this point you'd be looking for unexpectedly large counts of an object to study further. You could also take two core dumps at different times, then look at the difference between the findjsobjects counts to see which objects were growing.
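That two-core-dump comparison can be sketched as follows. This is a hypothetical helper (the `growth` function and the sample counts are made up, loosely based on the table above): given the per-type instance counts from an earlier and a later findjsobjects run, report which types grew.

```javascript
// Hypothetical helper for comparing two "v8 findjsobjects" runs:
// report which object types grew between the earlier and later dump.
function growth(before, after) {
  const grew = {};
  for (const name of Object.keys(after)) {
    const delta = after[name] - (before[name] || 0);
    if (delta > 0) grew[name] = delta;   // only growing types matter
  }
  return grew;
}

// Made-up counts for illustration: only Socket grew between dumps.
const before = { Socket: 1859, TickObject: 9747, WriteStream: 2 };
const after  = { Socket: 5200, TickObject: 9747, WriteStream: 2 };

console.log(growth(before, after));  // { Socket: 3341 }
```

In practice you'd paste or parse the two findjsobjects tables into objects like `before` and `after`, then study the growing types with findjsinstances and inspect.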

Listing all (2) instances of WriteStream:

<pre>
(lldb) <b>v8 findjsinstances WriteStream</b>
0x0000360583199f89:&lt;Object: WriteStream>
0x0000360583199ff1:&lt;Object: WriteStream>
</pre>

Now inspecting one of those instances of WriteStream:

<pre>
(lldb) <b>v8 i 0x0000360583199f89</b>
0x0000360583199f89:&lt;Object: WriteStream properties {
    ._connecting=0x00001c604c904251:&lt;false>,
    ._hadError=0x00001c604c904251:&lt;false>,
    ._handle=0x0000360583119091:&lt;Object: TTY>,
    ._parent=0x00001c604c904101:&lt;null>,
    ._host=0x00001c604c904101:&lt;null>,
    ._readableState=0x00003605831a8dd1:&lt;Object: ReadableState>,
    .readable=0x00001c604c904251:&lt;false>,
    .domain=0x00001c604c904101:&lt;null>,
    ._events=0x00003605831a8e91:&lt;Object: Object>,
    ._eventsCount=&lt;Smi: 3>,
    ._maxListeners=0x00001c604c9041b9:&lt;undefined>,
    ._writableState=0x00003605831a2609:&lt;Object: WritableState>,
    .writable=0x00001c604c904211:&lt;true>,
    .allowHalfOpen=0x00001c604c904251:&lt;false>,
    .destroyed=0x00001c604c904251:&lt;false>,
    ._bytesDispatched=&lt;Smi: 36>,
    ._sockname=0x00001c604c904101:&lt;null>,
    ._writev=0x00001c604c904101:&lt;null>,
    ._pendingData=0x00001c604c904101:&lt;null>,
    ._pendingEncoding=0x00001c604c904291:&lt;String: "">,
    .server=0x00001c604c904101:&lt;null>,
    ._server=0x00001c604c904101:&lt;null>,
    .&lt;non-string>=&lt;Smi: 0>,
    .columns=&lt;Smi: 172>,
    .rows=&lt;Smi: 42>,
    ._type=0x00001c604c9bd571:&lt;String: "tty">,
    .fd=&lt;Smi: 1>,
    ._isStdio=0x00001c604c904211:&lt;true>,
    .destroySoon=0x00001002183b4821:&lt;function: stdout.destroy.stdout.destroySoon at node.js:633:53>,
    .destroy=0x00001002183b4821:&lt;function: stdout.destroy.stdout.destroySoon at node.js:633:53>}>
</pre>

Listing all instances of Socket:

<pre>
(lldb) <b>v8 findjsinstances Socket</b>
0x00000c41530496f9:&lt;Object: Socket>
0x00000c4153049909:&lt;Object: Socket>
0x00000c4153049b19:&lt;Object: Socket>
0x00000c4153049d29:&lt;Object: Socket>
0x00000c4153049f39:&lt;Object: Socket>
0x00000c415304a149:&lt;Object: Socket>
0x00000c415304a359:&lt;Object: Socket>
[...]
</pre>

Inspecting an instance of Socket:

<pre>
(lldb) <b>v8 i 0x00000c41530496f9</b>
0x00000c41530496f9:&lt;Object: Socket properties {
    ._connecting=0x00001c604c904251:&lt;false>,
    ._hadError=0x00001c604c904251:&lt;false>,
    ._handle=0x00001c604c904101:&lt;null>,
    ._parent=0x00001c604c904101:&lt;null>,
    ._host=0x00001c604c904101:&lt;null>,
    ._readableState=0x00000c4153087361:&lt;Object: ReadableState>,
    .readable=0x00001c604c904251:&lt;false>,
    .domain=0x00001c604c904101:&lt;null>,
    ._events=0x00000c4153087421:&lt;Object: Object>,
    ._eventsCount=&lt;Smi: 10>,
    ._maxListeners=0x00001c604c9041b9:&lt;undefined>,
    ._writableState=0x00000c41530497d9:&lt;Object: WritableState>,
    .writable=0x00001c604c904251:&lt;false>,
    .allowHalfOpen=0x00001c604c904211:&lt;true>,
    .destroyed=0x00001c604c904211:&lt;true>,
    ._bytesDispatched=&lt;Smi: 145>,
    ._sockname=0x00001c604c904101:&lt;null>,
    ._pendingData=0x00001c604c904101:&lt;null>,
    ._pendingEncoding=0x00001c604c904291:&lt;String: "">,
    .server=0x0000360583196da1:&lt;Object: Server>,
    ._server=0x0000360583196da1:&lt;Object: Server>,
    .&lt;non-string>=&lt;Smi: 59>,
    ._idleTimeout=&lt;Smi: -1>,
    ._idleNext=0x00001c604c904101:&lt;null>,
    ._idlePrev=0x00001c604c904101:&lt;null>,
    ._idleStart=&lt;Smi: 3883>,
    .parser=0x00001c604c904101:&lt;null>,
    .on=0x0000360583196801:&lt;function: socketOnWrap at _http_server.js:575:22>,
    ._paused=0x00001c604c904251:&lt;false>,
    .read=0x00001002183260c9:&lt;function: Readable.read at _stream_readable.js:258:35>,
    ._consuming=0x00001c604c904211:&lt;true>,
    ._httpMessage=0x00001c604c904101:&lt;null>}>
</pre>

One of the largest Object counts for my process was TickObject:

<pre>
(lldb) <b>v8 findjsinstances TickObject</b>
0x00000c41535a9221:&lt;Object: TickObject>
0x00000c41535a9301:&lt;Object: TickObject>
0x00000c41535a93e1:&lt;Object: TickObject>
0x00000c41535a94c1:&lt;Object: TickObject>
0x00000c41535aa241:&lt;Object: TickObject>
0x00000c41535ab5a9:&lt;Object: TickObject>
0x00000c41535ab7b1:&lt;Object: TickObject>
[...]
(lldb) <b>v8 i 0x00000c41535a9221</b>
0x00000c41535a9221:&lt;Object: TickObject properties {
    .callback=0x000036058319be09:&lt;function: endReadableNT at _stream_readable.js:916:23>,
    .domain=0x00001c604c904101:&lt;null>,
    .args=0x00000c41535a9171:&lt;Array: length=2>}>
</pre>

There are fewer details for this object, but hopefully still enough to continue investigating: it gives me a callback function, plus I know the object name.

That's all for now. Hopefully these steps and screenshots are helpful, although do note that the steps are likely to be improved in future versions. Follow [llnode] for the latest.

[kind of analysis]: http://dtrace.org/blogs/dap/2011/10/31/nodejs-v8-postmortem-debugging/
[llnode]: https://github.com/indutny/llnode
[perf]: http://www.brendangregg.com/perf.html
[contributing]: https://github.com/indutny/llnode/commit/c517e0ca09b714a487850ca74730d89f4a1c321d
[heapdump]: https://strongloop.com/strongblog/how-to-heap-snapshots/
[directly]: https://github.com/indutny/llnode/pull/16
]]></content:encoded>
      <dc:date>2016-07-13T00:00:00-07:00</dc:date>
    </item>
    <dc:date>2017-04-29T00:00:00-07:00</dc:date>
  </channel>
</rss>