
Performance@Scale 2016: Linux BPF Superpowers

Video: https://www.facebook.com/atscaleevents/videos/1693888610884236/

Opening talk by Brendan Gregg at Facebook's Performance @Scale event.

Description: "Linux performance analysis has been the domain of ancient tools and metrics, but that's now changing in the Linux 4.x series. A new tracer is available in the mainline kernel, built from dynamic tracing (kprobes, uprobes) and enhanced BPF (Berkeley Packet Filter), aka, eBPF. It allows us to measure latency distributions for file system I/O and run queue latency, print details of storage device I/O and TCP retransmits, investigate blocked stack traces and memory leaks, and a whole lot more. These lead to performance wins large and small, especially when instrumenting areas that previously had zero visibility. This talk will summarize this new technology and some long-standing issues that it can solve, and how we intend to use it at Netflix."


PDF: PerformanceAtScale2016_LinuxBPFSuperpowers.pdf

Keywords (from pdftotext):

slide 1:
    Feb 2016
    Linux 4.x Performance
    Using BPF Superpowers
    Brendan Gregg
    Senior Performance Architect
    
slide 2:
    Ten years ago, I gave a talk here about DTrace tools…
      
    
slide 3:
slide 4:
    Superpowers are coming to Linux
    Solve performance issues that were previously impossible
    For example, full off-CPU analysis…
    
slide 5:
slide 6:
slide 7:
    Ideal Thread States
    A starting point for deeper analysis
    
slide 8:
    Linux Thread States
    Based on:
    TASK_RUNNING
    TASK_INTERRUPTIBLE
    TASK_UNINTERRUPTIBLE
    Still a useful starting point
    
slide 9:
    Linux On-CPU Analysis
    • I'll start with on-CPU analysis:
    • Split into user/kernel states using /proc, mpstat(1), ...
    • perf_events ("perf") to analyze further:
    – User & kernel stack sampling (as a CPU flame graph)
    – CPI
    – Should be easy, but…
    [figure: CPU flame graph]
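    For reference, the standard CPU flame graph recipe with perf and the
    FlameGraph scripts (a usage sketch; the sample rate and duration are
    illustrative):

    # perf record -F 99 -a -g -- sleep 30
    # perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu.svg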
    
slide 10:
    Broken Stacks
    Missing Java stacks
    
slide 11:
    Missing Symbols
    "[unknown]"
    
slide 12:
    Java Mixed-Mode CPU Flame Graph
    • Fixed!
    – Java -XX:+PreserveFramePointer
    – Java perf-map-agent
    – Linux perf_events
    [figure labels: Kernel, Java, JVM]
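    A hedged sketch of that workflow (the perf-map-agent helper script name and
    the flamegraph.pl option are assumptions from those projects' documentation;
    <pid> is the JVM process):

    # java -XX:+PreserveFramePointer ...          # lets perf walk Java frames
    # perf record -F 99 -a -g -- sleep 30
    # ./create-java-perf-map.sh <pid>             # writes /tmp/perf-<pid>.map symbols
    # perf script | ./stackcollapse-perf.pl | ./flamegraph.pl --color=java > mixed.svg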
    
slide 13:
    Samples (alphabetical sort)
    Stack depth
    
slide 14:
    Also, CPI Flame Graph
    Cycles Per Instruction:
    - red == instruction heavy
    - blue == cycle heavy (likely mem stalls)
    zoomed:
    
slide 15:
    Linux Off-CPU Analysis
    On Linux, the state isn't helpful, but the code path is
    Off-CPU analysis by measuring blocked time with stack traces
    
slide 16:
    Off-CPU Time Flame Graph
    Stack depth
    Off-CPU time
    From http://www.brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html
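    A usage sketch for generating such a graph with the bcc offcputime tool and
    the FlameGraph scripts (the exact offcputime flags are an assumption; see
    the blog post above for the full workflow):

    # ./offcputime -f 30 > out.stacks
    # ./flamegraph.pl --color=io --countname=us < out.stacks > offcpu.svg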
      
    
slide 17:
    Off-CPU Time (zoomed): tar(1)
    directory read from disk
    file read from disk
    fstat from disk
    path read from disk
    pipe write
    Currently kernel stacks only; user stacks will add more context
    
slide 18:
    Off-CPU Time: more states
    lock contention
    sleep
    run queue latency
    Flame graph quantifies total time spent in states
    
slide 19:
    CPU + Off-CPU == See Everything?
      
    
slide 20:
    Off-CPU Time (zoomed): gzip(1)
    Off-CPU doesn't always make sense:
    what is gzip blocked on?
    
slide 21:
    Wakeup Time Flame Graph
      
    
slide 22:
    Wakeup Time (zoomed): gzip(1)
    gzip(1) is blocked on tar(1)!
    tar cf - * | gzip > out.tar.gz
    Can't we associate off-CPU with wakeup stacks?
    
slide 23:
    Off-Wake Time Flame Graph
      
    
slide 24:
    Wakeup stacks are associated and merged in-kernel using BPF
    We couldn't do this before
    
slide 25:
slide 26:
    Haven't Solved Everything Yet…
    • One wakeup stack is often not enough…
    • Who woke the waker?
    
slide 27:
    Chain Graphs
      
    
slide 28:
    Merging multiple wakeup stacks in kernel using BPF
    With enough stacks, all paths lead to metal
    
slide 29:
    Solve Everything
    CPU + off-CPU analysis can solve most issues
    Flame graph (profiling) types:
    CPU
    CPI
    Off-CPU time
    Wakeup time
    Off-wake time
    Chain
    different off-CPU analysis views, with more context and increasing measurement cost
    BPF makes this all more practical
    
slide 30:
    2. BPF
    "One of the more interesting features in this cycle is the ability to
    attach eBPF programs (user-defined, sandboxed bytecode executed by the
    kernel) to kprobes. This allows user-defined instrumentation on a live
    kernel image that can never crash, hang or interfere with the kernel
    negatively."
    – Ingo Molnár (Linux developer)
    Source: https://lkml.org/lkml/2015/4/14/232
      
    
slide 31:
    2. BPF
    "crazy stuff"
    – Alexei Starovoitov (eBPF lead)
    Source: http://www.slideshare.net/AlexeiStarovoitov/bpf-inkernel-virtual-machine
      
    
slide 32:
    BPF
    • eBPF == enhanced Berkeley Packet Filter; now just BPF
    • Integrated into Linux (in stages: 3.15, 3.19, 4.1, 4.5, …)
    • Uses
    – virtual networking
    – tracing
    – "crazy stuff"
    • Front-ends
    – samples/bpf (raw)
    – bcc: Python, C
    – Linux perf_events
    [figure: BPF mascot]
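    To make the bcc front-end concrete, here is bcc's classic
    hello-world-style example (from the bcc repository; requires bcc and root,
    and kernel symbol names such as sys_clone vary by kernel version). The
    kprobe__ prefix auto-attaches a kprobe to the named function:

        from bcc import BPF

        prog = """
        int kprobe__sys_clone(void *ctx) {
            bpf_trace_printk("Hello, World!\\n");
            return 0;
        }
        """
        BPF(text=prog).trace_print()   # prints one line per sys_clone()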
      
    
slide 33:
    BPF for Tracing
    • Can do per-event output and in-kernel summary statistics (histograms, etc).
    [figure: a User Program 1. generates BPF bytecode and 2. loads it into the
    Kernel, where it attaches to kprobes, uprobes, and tracepoints; per-event
    data is emitted via perf_output, statistics accumulate in maps, and the
    program 3. async reads both]
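    A small runnable sketch of the "statistics via maps" path, modeled on
    bcc's examples/tracing/bitehist.py (a block I/O size histogram, summarized
    in-kernel; only the final histogram crosses to user space):

        from bcc import BPF
        from time import sleep

        prog = """
        #include <uapi/linux/ptrace.h>
        #include <linux/blkdev.h>
        BPF_HISTOGRAM(dist);    // in-kernel histogram map
        int kprobe__blk_account_io_completion(struct pt_regs *ctx,
            struct request *req)
        {
            dist.increment(bpf_log2l(req->__data_len));  // log2 bucket of I/O size
            return 0;
        }
        """
        b = BPF(text=prog)      # 1. generate and 2. load the BPF bytecode
        print("Tracing block I/O... Hit Ctrl-C to end.")
        try:
            sleep(99999999)
        except KeyboardInterrupt:
            pass
        b["dist"].print_log2_hist("bytes")  # 3. async read of the map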
      
    
slide 34:
    Old way: TCP Retransmits
    • tcpdump of all send & receive, dump to FS, post-process
    • Overheads add up on 10GbE+
    [figure: tcpdump 1. reads send/receive packets from a kernel buffer and
    2. dumps them to the file system/disks; an Analyzer then 1. reads the
    file, 2. runs a state machine, and 3. prints retransmits]
      
    
slide 35:
    New way: BPF TCP Retransmits
    • Just trace the retransmit functions
    • Negligible overhead
    [figure: tcpretrans (bcc) 1. configures BPF & a kprobe on the kernel's
    tcp_retransmit_skb(), then 2. reads and prints events; send/receive
    continue as-is]
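    A stripped-down sketch of this approach (the real tcpretrans also reads
    addresses, ports, and TCP state from the socket struct; this version only
    counts events in-kernel, which is what keeps the overhead negligible):

        from bcc import BPF
        from time import sleep

        prog = """
        BPF_HASH(count, u32, u64);      // single-entry counter map
        int kprobe__tcp_retransmit_skb(void *ctx) {
            u32 key = 0;
            count.increment(key);       // count in-kernel; no per-packet copies
            return 0;
        }
        """
        b = BPF(text=prog)              # kprobe__ prefix attaches the kprobe
        sleep(10)
        for k, v in b["count"].items():
            print("retransmits in 10s: %d" % v.value)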
      
    
slide 36:
    BPF: TCP Retransmits
    # ./tcpretrans
    TIME     PID   IP LADDR:LPORT          T> RADDR:RPORT          STATE
    01:55:05 0     4  10.153.223.157:22    R> 69.53.245.40:34619   ESTABLISHED
    01:55:05 0     4  10.153.223.157:22    R> 69.53.245.40:34619   ESTABLISHED
    01:55:17 0     4  10.153.223.157:22    R> 69.53.245.40:22957   ESTABLISHED
    […]
    includes kernel state
      
    
slide 37:
    Old: Off-CPU Time Stack Profiling
    • perf_events tracing of sched events, post-process
    • Despite buffering, usually high cost (>1M events/sec)
    [figure: perf record 1. async reads scheduler events from a kernel buffer
    and 2. dumps them to the file system (or pipe) on disks; perf inject
    1. reads and 2. rewrites the file; perf report/script then reads,
    processes, prints]
      
    
slide 38:
    New: BPF Off-CPU Time Stacks
    • Measure off-CPU time, add to map with key = stack, value = total time.
      Async read map.
    [figure: offcputime (bcc) 1. configures BPF & a kprobe on the scheduler's
    finish_task_switch(); 2. async reads stacks from BPF maps; 3. symbol
    translate; 4. print]
      
    
slide 39:
    Stack Trace Hack
    • For my offcputime tool, I wrote a BPF stack walker:
    
slide 40:
    "Crazy Stuff"
    • … using unrolled loops & goto:
    
slide 41:
    BPF Stack Traces
    • Proper BPF stack support just landed in net-next:
    Date: Sat, 20 Feb 2016 00:25:05 -0500 (EST)
    Subject: Re: [PATCH net-next 0/3] bpf_get_stackid() and stack_trace map
    From: David Miller
    From: Alexei Starovoitov
    Date: Wed, 17 Feb 2016 19:58:56 -0800
    > This patch set introduces new map type to store stack traces and
    > corresponding bpf_get_stackid() helper.
    Series applied, thanks Alexei.
    • Allows more than just chain graphs
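    A simplified sketch of slide 38's off-CPU approach using this new stack
    support, modeled on bcc's offcputime.py (details trimmed; a sketch under
    those assumptions, not the real tool):

        from bcc import BPF
        from time import sleep

        prog = """
        #include <uapi/linux/ptrace.h>
        #include <linux/sched.h>
        BPF_HASH(start, u32, u64);       // pid -> timestamp when it left the CPU
        BPF_HASH(total, int, u64);       // kernel stack id -> total off-CPU ns
        BPF_STACK_TRACE(stacks, 10240);  // the new stack_trace map type

        int oncpu(struct pt_regs *ctx, struct task_struct *prev) {
            // prev is leaving the CPU: record when it blocked
            u32 pid = prev->pid;
            u64 ts = bpf_ktime_get_ns();
            start.update(&pid, &ts);

            // current task is going back on-CPU: sum its blocked time by stack
            pid = bpf_get_current_pid_tgid();
            u64 *tsp = start.lookup(&pid);
            if (tsp == 0)
                return 0;                // missed the blocking event
            u64 delta = bpf_ktime_get_ns() - *tsp;
            start.delete(&pid);
            int stackid = stacks.get_stackid(ctx, 0);  // bpf_get_stackid() helper
            u64 zero = 0, *val = total.lookup_or_init(&stackid, &zero);
            (*val) += delta;
            return 0;
        }
        """
        b = BPF(text=prog)
        b.attach_kprobe(event="finish_task_switch", fn_name="oncpu")
        sleep(10)
        stacks = b["stacks"]
        for stackid, ns in sorted(b["total"].items(), key=lambda kv: kv[1].value):
            print("%d us:" % (ns.value / 1000))
            for addr in stacks.walk(stackid.value):  # read stacks, translate symbols
                print("    %s" % b.ksym(addr))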
    
slide 42:
    memleak
    • Real-time memory growth and leak analysis:
    # ./memleak.py -o 10 60 1
    Attaching to kmalloc and kfree, Ctrl+C to quit.
    [01:27:34] Top 10 stacks with outstanding allocations:
    72 bytes in 1 allocations from stack
        alloc_fdtable [kernel] (ffffffff8121960f)
        expand_files [kernel] (ffffffff8121986b)
        sys_dup2 [kernel] (ffffffff8121a68d)
    […]
    2048 bytes in 1 allocations from stack
        alloc_fdtable [kernel] (ffffffff812195da)
        expand_files [kernel] (ffffffff8121986b)
        sys_dup2 [kernel] (ffffffff8121a68d)
    [figure annotations: trace for 60s; show kernel allocations older than 10s
    that were not freed]
    • Uses my stack hack, but will switch to BPF stacks soon
    • By Sasha Goldshtein. Another bcc tool.
    
slide 43:
    3. bcc
    • BPF Compiler Collection
    – https://github.com/iovisor/bcc
    • Python front-end, C instrumentation
    • Currently beta – in development!
    • Some example tracing tools…
    
slide 44:
    execsnoop
    • Trace new processes:
    # ./execsnoop
    PCOMM    PID   RET ARGS
    bash           0   /usr/bin/man ls
    preconv        0   /usr/bin/preconv -e UTF-8
    man            0   /usr/bin/tbl
    man            0   /usr/bin/nroff -mandoc -rLL=169n -rLT=169n -Tutf8
    man            0   /usr/bin/pager -s
    nroff          0   /usr/bin/locale charmap
    nroff          0   /usr/bin/groff -mtty-char -Tutf8 -mandoc -rLL=169n …
    groff          0   /usr/bin/troff -mtty-char -mandoc -rLL=169n -rLT=169 …
    groff          0   /usr/bin/grotty
    
slide 45:
    biolatency
    • Block device (disk) I/O latency distribution:
    # ./biolatency -mT 1 5
    Tracing block device I/O... Hit Ctrl-C to end.
    06:20:16
    msecs           : count     distribution
        0 -> 1      : 36        |**************************************|
        2 -> 3      : 1         |*
        4 -> 7      : 3         |***
        8 -> 15     : 17        |*****************
       16 -> 31     : 33        |**********************************
       32 -> 63     : 7         |*******
       64 -> 127    : 6         |******
    […]
    
slide 46:
    ext4slower
    • ext4 file system I/O, slower than a threshold:
    # ./ext4slower 1
    Tracing ext4 operations slower than 1 ms
    TIME     COMM   PID   T BYTES   OFF_KB   LAT(ms) FILENAME
    06:49:17 bash         R 128                 7.75 cksum
    06:49:17 cksum        R 39552               1.34 [
    06:49:17 cksum        R 96                  5.36 2to3-2.7
    06:49:17 cksum        R 96                 14.94 2to3-3.4
    06:49:17 cksum        R 10320               6.82 411toppm
    06:49:17 cksum        R 65536               4.01 a2p
    06:49:17 cksum        R 55400               8.77 ab
    06:49:17 cksum        R 36792              16.34 aclocal-1.14
    06:49:17 cksum        R 15008              19.31 acpi_listen
    06:49:17 cksum        R 6123               17.23 add-apt-repository
    06:49:17 cksum        R 6280               18.40 addpart
    06:49:17 cksum        R 27696               2.16 addr2line
    06:49:17 cksum        R 58080              10.11 ag
    06:49:17 cksum        R 906                 6.30 ec2-meta-data
    […]
    
slide 47:
    bashreadline
    • Trace bash interactive commands system-wide:
    # ./bashreadline
    TIME     PID   COMMAND
    05:28:25 21176 ls -l
    05:28:28 21176 date
    05:28:35 21176 echo hello world
    05:28:43 21176 foo this command failed
    05:28:45 21176 df -h
    05:29:04 3059  echo another shell
    05:29:13 21176 echo first shell again
    
slide 48:
    gethostlatency
    • Show latency for getaddrinfo/gethostbyname[2] calls:
    # ./gethostlatency
    TIME     PID   COMM  LATms HOST
    06:10:24 28011 wget  90.00 www.iovisor.org
    06:10:28 28127 wget   0.00 www.iovisor.org
    06:10:41 28404 wget   9.00 www.netflix.com
    06:10:48 28544 curl  35.00 www.netflix.com.au
    06:11:10 29054 curl  31.00 www.plumgrid.com
    06:11:16 29195 curl   3.00 www.facebook.com
    06:11:25 29404 curl  72.00 foo
    06:11:28 29475 curl   1.00 foo
    
slide 49:
    trace
    • Trace custom events. Ad hoc analysis multitool:
    # trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
    TIME     PID   COMM   FUNC     -
    05:18:23 4490         sys_read read 1048576 bytes
    05:18:23 4490         sys_read read 1048576 bytes
    05:18:23 4490         sys_read read 1048576 bytes
    05:18:23 4490         sys_read read 1048576 bytes
    
slide 50:
    Linux bcc/BPF Tracing Tools
      
    
slide 51:
    4. Future Work
      
    All event sources
    Language improvements
    More tools: eg, TCP
    GUI support
    
slide 52:
    Linux Event Sources
    [figure: diagram of Linux event sources, annotated done / todo]
      
    
slide 53:
    BPF/bcc Language Improvements
      
    
slide 54:
    More Tools
      
    • eg, netstat(8)…
    $ netstat -s
    Ip:
    7962754 total packets received
    8 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    7962746 incoming packets delivered
    8019427 requests sent out
    Icmp:
    382 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
    destination unreachable: 125
    timeout in transit: 257
    3410 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
    destination unreachable: 3410
    IcmpMsg:
    InType3: 125
    InType11: 257
    OutType3: 3410
    Tcp:
    17337 active connections openings
    395515 passive connection openings
    8953 failed connection attempts
    240214 connection resets received
    3 connections established
    7198375 segments received
    7504939 segments send out
    62696 segments retransmited
    10 bad segments received.
    1072 resets sent
    InCsumErrors: 5
    Udp:
    759925 packets received
    3412 packets to unknown port received.
    0 packet receive errors
    784370 packets sent
    UdpLite:
    TcpExt:
    858 invalid SYN cookies received
    8951 resets received for embryonic SYN_RECV sockets
    14 packets pruned from receive queue because of socket buffer overrun
    6177 TCP sockets finished time wait in fast timer
    293 packets rejects in established connections because of timestamp
    733028 delayed acks sent
    89 delayed acks further delayed because of locked socket
    Quick ack mode was activated 13214 times
    336520 packets directly queued to recvmsg prequeue.
    43964 packets directly received from backlog
    11406012 packets directly received from prequeue
    1039165 packets header predicted
    7066 packets header predicted and directly queued to user
    1428960 acknowledgments not containing data received
    1004791 predicted acknowledgments
    1 times recovered from packet loss due to fast retransmit
    5044 times recovered from packet loss due to SACK data
    2 bad SACKs received
    Detected reordering 4 times using SACK
    Detected reordering 11 times using time stamp
    13 congestion windows fully recovered
    11 congestion windows partially recovered using Hoe heuristic
    TCPDSACKUndo: 39
    2384 congestion windows recovered after partial ack
    228 timeouts after SACK recovery
    100 timeouts in loss state
    5018 fast retransmits
    39 forward retransmits
    783 retransmits in slow start
    32455 other TCP timeouts
    TCPLossProbes: 30233
    TCPLossProbeRecovery: 19070
    992 sack retransmits failed
    18 times receiver scheduled too late for direct processing
    705 packets collapsed in receive queue due to low socket buffer
    13658 DSACKs sent for old packets
    8 DSACKs sent for out of order packets
    13595 DSACKs received
    33 DSACKs for out of order packets received
    32 connections reset due to unexpected data
    108 connections reset due to early user close
    1608 connections aborted due to timeout
    TCPSACKDiscard: 4
    TCPDSACKIgnoredOld: 1
    TCPDSACKIgnoredNoUndo: 8649
    TCPSpuriousRTOs: 445
    TCPSackShiftFallback: 8588
    TCPRcvCoalesce: 95854
    TCPOFOQueue: 24741
    TCPOFOMerge: 8
    TCPChallengeACK: 1441
    TCPSYNChallenge: 5
    TCPSpuriousRtxHostQueues: 1
    TCPAutoCorking: 4823
    IpExt:
    InOctets: 1561561375
    OutOctets: 1509416943
    InNoECTPkts: 8201572
    InECT1Pkts: 2
    InECT0Pkts: 3844
    InCEPkts: 306
    
slide 55:
slide 56:
    Better TCP Tools
      
    • TCP retransmit by type and time
    • Congestion algorithm metrics
    • etc.
    
slide 57:
    GUI Support
      
    • eg, Netflix Vector: open source instance analyzer:
    
slide 58:
    Summary
      
    • BPF in Linux 4.x makes many new things possible
    – Stack-based thread state analysis (solve all issues!)
    – Real-time memory growth/leak detection
    – Better TCP metrics
    – etc...
    • Get involved: see iovisor/bcc
    • So far just a preview of things to come
    
slide 59:
    Links
      
    iovisor bcc:
    • https://github.com/iovisor/bcc
    • http://www.brendangregg.com/blog/2015-09-22/bcc-linux-4.3-tracing.html
    • http://blogs.microsoft.co.il/sasha/2016/02/14/two-new-ebpf-tools-memleak-and-argdist/
    BPF Off-CPU, Wakeup, Off-Wake & Chain Graphs:
    • http://www.brendangregg.com/blog/2016-01-20/ebpf-offcpu-flame-graph.html
    • http://www.brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html
    • http://www.brendangregg.com/blog/2016-02-05/ebpf-chaingraph-prototype.html
    Linux Performance:
    • http://www.brendangregg.com/linuxperf.html
    Linux perf_events:
    • https://perf.wiki.kernel.org/index.php/Main_Page
    • http://www.brendangregg.com/perf.html
    Flame Graphs:
    • http://techblog.netflix.com/2015/07/java-in-flames.html
    • http://www.brendangregg.com/flamegraphs.html
    Netflix Tech Blog on Vector:
    • http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
    Wordcloud: https://www.jasondavies.com/wordcloud/
    
slide 60:
    Feb 2016
      
    Questions?
    http://slideshare.net/brendangregg
    http://www.brendangregg.com
    bgregg@netflix.com
    @brendangregg
    Thanks to Alexei Starovoitov (Facebook), Brenden
    Blanco (PLUMgrid), Daniel Borkmann (Cisco), Wang
    Nan (Huawei), Sasha Goldshtein (Sela), and other
    BPF and bcc contributors!