Performance@Scale 2016: Linux BPF Superpowers
Video: https://www.facebook.com/atscaleevents/videos/1693888610884236/
Opening talk by Brendan Gregg from Facebook's Performance @Scale.
Description: "Linux performance analysis has been the domain of ancient tools and metrics, but that's now changing in the Linux 4.x series. A new tracer is available in the mainline kernel, built from dynamic tracing (kprobes, uprobes) and enhanced BPF (Berkeley Packet Filter), aka, eBPF. It allows us to measure latency distributions for file system I/O and run queue latency, print details of storage device I/O and TCP retransmits, investigate blocked stack traces and memory leaks, and a whole lot more. These lead to performance wins large and small, especially when instrumenting areas that previously had zero visibility. This talk will summarize this new technology and some long-standing issues that it can solve, and how we intend to use it at Netflix."
PDF: PerformanceAtScale2016_LinuxBPFSuperpowers.pdf
Keywords (from pdftotext):
slide 1:
Feb 2016. Linux 4.x Performance: Using BPF Superpowers. Brendan Gregg, Senior Performance Architect
slide 2:
Ten years ago, I gave a talk here about DTrace tools…
slide 3:
slide 4:
Superpowers are coming to Linux. Solve performance issues that were previously impossible. For example, full off-CPU analysis…
slide 5:
slide 6:
slide 7:
Ideal Thread States: a starting point for deeper analysis
slide 8:
Linux Thread States. Based on: TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE. Still a useful starting point
slide 9:
Linux On-CPU Analysis
• I'll start with on-CPU analysis: split into user/kernel states using /proc, mpstat(1), ... (CPU Flame Graph)
• perf_events ("perf") to analyze further:
  – User & kernel stack sampling (as a CPU flame graph)
  – CPI
  – Should be easy, but…
slide 10:
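A CPU flame graph is built from many sampled stacks collapsed into counts per unique code path. As an illustrative sketch of that folding step (plain Python standing in for the perf/FlameGraph tooling; the sample data is hypothetical):

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse sampled stacks into flame-graph 'folded' lines.

    Each sample is a list of frames, root first; the folded format
    consumed by flamegraph.pl is 'frame;frame;... count'.
    """
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{path} {n}" for path, n in sorted(counts.items())]

# Hypothetical samples, as a stack profiler might collect them:
samples = [
    ["java", "Compute.run", "os::read"],
    ["java", "Compute.run", "os::read"],
    ["java", "Interpreter"],
]
print("\n".join(fold_stacks(samples)))
```

The flame graph renderer then draws each folded path as a column whose width is its count.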
Broken Stacks: missing Java stacks
slide 11:
Missing Symbols: "[unknown]"
slide 12:
Java Mixed-Mode CPU Flame Graph
• Fixed!
  – Java: -XX:+PreserveFramePointer
  – Java: perf-map-agent
  – Linux: perf_events
(flame graph labels: Kernel, Java, JVM)
slide 13:
(axes: samples, alphabetical sort; stack depth)
slide 14:
Also, CPI Flame Graph: Cycles Per Instruction. red == instruction heavy; blue == cycle heavy (likely mem stalls). (zoomed)
slide 15:
Linux Off-CPU Analysis. On Linux, the state isn't helpful, but the code path is. Off-CPU analysis: measuring blocked time with stack traces
slide 16:
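Measuring blocked time per stack amounts to timestamping each context switch out and attributing the elapsed time at switch back in. A plain-Python sketch of that aggregation (the event format here is invented for illustration; the real work happens in the kernel):

```python
def off_cpu_totals(events):
    """Sum off-CPU (blocked) time per stack trace.

    events: (timestamp, pid, kind, stack) tuples, where kind is
    'off' (context switch out) or 'on' (switch back in), and stack
    is the blocked thread's stack as a tuple of frames.
    """
    switched_out = {}  # pid -> (timestamp, stack) at switch-out
    totals = {}        # stack -> total blocked time
    for ts, pid, kind, stack in events:
        if kind == "off":
            switched_out[pid] = (ts, stack)
        elif pid in switched_out:
            t0, st = switched_out.pop(pid)
            totals[st] = totals.get(st, 0) + (ts - t0)
    return totals

# Hypothetical events: pid 1 blocks in a read, then in a lock wait
events = [
    (100, 1, "off", ("main", "read")),
    (150, 1, "on",  ()),
    (160, 1, "off", ("main", "lock_wait")),
    (190, 1, "on",  ()),
]
print(off_cpu_totals(events))
```

The resulting stack-to-time map is exactly what gets rendered as an off-CPU time flame graph.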
Off-CPU Time Flame Graph (axes: stack depth, off-CPU time). From http://www.brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html
slide 17:
Off-CPU Time (zoomed): tar(1). Directory read from disk, file read from disk, fstat from disk, path read from disk, pipe write. Currently kernel stacks only; user stacks will add more context
slide 18:
Off-CPU Time: more states. Lock contention, sleep, run queue latency. Flame graph quantifies total time spent in states
slide 19:
CPU + Off-CPU == See Everything?
slide 20:
Off-CPU Time (zoomed): gzip(1). Off-CPU doesn't always make sense: what is gzip blocked on?
slide 21:
Wakeup Time Flame Graph
slide 22:
Wakeup Time (zoomed): gzip(1). gzip(1) is blocked on tar(1)! tar cf - * | gzip > out.tar.gz. Can't we associate off-CPU with wakeup stacks?
slide 23:
Off-Wake Time Flame Graph
slide 24:
Wakeup stacks are associated and merged in-kernel using BPF. We couldn't do this before
slide 25:
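The merge can be pictured in plain Python (an illustrative sketch, not the BPF implementation): take the blocked thread's stack and append the waker's stack inverted, so the graph reads from the blocked code path up through the wakeup code path:

```python
def off_wake_stack(blocked_stack, waker_stack):
    """Combine a blocked stack with its waker's stack, as in an
    off-wake flame graph: the waker frames are reversed (drawn
    inverted above the blocked frames), joined by a separator.
    Frame lists are root-first."""
    return tuple(blocked_stack) + ("-",) + tuple(reversed(waker_stack))

# Hypothetical frames for the tar -> gzip pipe example:
merged = off_wake_stack(["gzip", "read", "pipe_wait"],
                        ["tar", "write", "pipe_write", "wake_up"])
print(";".join(merged))
```

Each merged stack, weighted by blocked time, then feeds the same folded-stack rendering as any other flame graph.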
slide 26:
Haven't Solved Everything Yet…
• One wakeup stack is often not enough…
• Who woke the waker?
slide 27:
Chain Graphs
slide 28:
Merging multiple wakeup stacks in kernel using BPF. With enough stacks, all paths lead to metal
slide 29:
Solve Everything. CPU + off-CPU analysis can solve most issues. Flame graph (profiling) types:
• CPU
• CPI
• Off-CPU time
• Wakeup time
• Off-wake time
• Chain
Different off-CPU analysis views, with more context and increasing measurement cost. BPF makes this all more practical
slide 30:
2. BPF
"One of the more interesting features in this cycle is the ability to attach eBPF programs (user-defined, sandboxed bytecode executed by the kernel) to kprobes. This allows user-defined instrumentation on a live kernel image that can never crash, hang or interfere with the kernel negatively." – Ingo Molnár (Linux developer)
Source: https://lkml.org/lkml/2015/4/14/232
slide 31:
2. BPF
"crazy stuff" – Alexei Starovoitov (eBPF lead)
Source: http://www.slideshare.net/AlexeiStarovoitov/bpf-inkernel-virtual-machine
slide 32:
BPF
• eBPF == enhanced Berkeley Packet Filter; now just BPF
• Integrated into Linux (in stages: 3.15, 3.19, 4.1, 4.5, …)
• Uses: virtual networking, tracing, "crazy stuff"
• Front-ends: samples/bpf (raw); bcc: Python, C; Linux perf_events
(BPF mascot)
slide 33:
BPF for Tracing
• Can do per-event output and in-kernel summary statistics (histograms, etc).
• Workflow (user program / kernel): 1. generate BPF bytecode; 2. load it, attached to events (kprobes, uprobes, tracepoints); 3. async read of per-event data (perf_output) and statistics (maps).
slide 34:
Old way: TCP Retransmits
• tcpdump of all send & receive, dump to file system, post-process
• Overheads add up on 10GbE+
• tcpdump: 1. read from kernel buffer (send/receive path), 2. dump to file system/disks; analyzer: 1. read, 2. state machine, 3. print
slide 35:
New way: BPF TCP Retransmits
• Just trace the retransmit functions (kprobe on tcp_retransmit_skb(); the send/receive path runs as-is)
• Negligible overhead
• tcpretrans (bcc): 1. config BPF & kprobe, 2. read, print
slide 36:
BPF: TCP Retransmits
# ./tcpretrans
TIME     PID  IP LADDR:LPORT        T> RADDR:RPORT        STATE
01:55:05 0    4  10.153.223.157:22  R> 69.53.245.40:34619 ESTABLISHED
01:55:05 0    4  10.153.223.157:22  R> 69.53.245.40:34619 ESTABLISHED
01:55:17 0    4  10.153.223.157:22  R> 69.53.245.40:22957 ESTABLISHED
[…]
Includes kernel state
slide 37:
Old: Off-CPU Time Stack Profiling
• perf_events tracing of sched events, post-process
• Despite buffering, usually high cost (>1M events/sec)
• perf record: 1. async read from kernel buffer (scheduler events), 2. dump to file system (or pipe)/disks; perf inject: 1. read, 2. rewrite; perf report/script: read, process, print
slide 38:
New: BPF Off-CPU Time Stacks
• Measure off-CPU time, add to map with key = stack, value = total time. Async read map.
• offcputime (bcc): 1. config BPF & kprobe (on finish_task_switch() in the scheduler), 2. async read stacks from BPF maps, 3. symbol translate, 4. print
slide 39:
Stack Trace Hack
• For my offcputime tool, I wrote a BPF stack walker:
slide 40:
"Crazy Stuff"
• … using unrolled loops & goto:
slide 41:
BPF Stack Traces
• Proper BPF stack support just landed in net-next:
Date: Sat, 20 Feb 2016 00:25:05 -0500 (EST)
Subject: Re: [PATCH net-next 0/3] bpf_get_stackid() and stack_trace map
From: David Miller
slide 42:
> From: Alexei Starovoitov
> Date: Wed, 17 Feb 2016 19:58:56 -0800
>
> This patch set introduces new map type to store stack traces and
> corresponding bpf_get_stackid() helper.
Series applied, thanks Alexei.
• Allows more than just chain graphs
memleak
• Real-time memory growth and leak analysis:
# ./memleak.py -o 10 60 1
Attaching to kmalloc and kfree, Ctrl+C to quit.
[01:27:34] Top 10 stacks with outstanding allocations:
72 bytes in 1 allocations from stack
    alloc_fdtable [kernel] (ffffffff8121960f)
    expand_files [kernel] (ffffffff8121986b)
    sys_dup2 [kernel] (ffffffff8121a68d)
[…]
2048 bytes in 1 allocations from stack
    alloc_fdtable [kernel] (ffffffff812195da)
    expand_files [kernel] (ffffffff8121986b)
    sys_dup2 [kernel] (ffffffff8121a68d)
(annotations: trace for 60s; show kernel allocations older than 10s that were not freed)
• Uses my stack hack, but will switch to BPF stacks soon
• By Sasha Goldshtein. Another bcc tool.
slide 43:
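The idea behind memleak's report can be sketched in plain Python (the event shapes here are invented for illustration; the real tool does this with kprobes on kmalloc/kfree and a BPF map): track live allocations by address, forget them when freed, and group what remains by allocation stack, filtered to allocations older than a threshold:

```python
def outstanding_by_stack(events, now, min_age):
    """Report bytes still allocated per stack, for allocations older
    than min_age at time `now`.

    events: ('alloc', ts, addr, size, stack) or ('free', ts, addr).
    """
    live = {}  # addr -> (ts, size, stack)
    for ev in events:
        if ev[0] == "alloc":
            _, ts, addr, size, stack = ev
            live[addr] = (ts, size, stack)
        else:
            live.pop(ev[2], None)  # free: the allocation is no longer a leak suspect
    report = {}
    for ts, size, stack in live.values():
        if now - ts >= min_age:
            report[stack] = report.get(stack, 0) + size
    return report

# Hypothetical trace: one old unfreed alloc, one freed, one too recent
events = [
    ("alloc", 0,  0x100, 72,   ("alloc_fdtable", "expand_files")),
    ("alloc", 5,  0x200, 64,   ("some_other_path",)),
    ("free",  6,  0x200),
    ("alloc", 55, 0x300, 2048, ("alloc_fdtable", "expand_files")),
]
print(outstanding_by_stack(events, now=60, min_age=10))
```

The age filter (-o 10 in the tool's invocation above) is what separates steady-state churn from likely leaks.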
3. bcc
• BPF Compiler Collection – https://github.com/iovisor/bcc
• Python front-end, C instrumentation
• Currently beta – in development!
• Some example tracing tools…
slide 44:
execsnoop
• Trace new processes:
# ./execsnoop
PCOMM   PID RET ARGS
bash        0   /usr/bin/man ls
preconv     0   /usr/bin/preconv -e UTF-8
man         0   /usr/bin/tbl
man         0   /usr/bin/nroff -mandoc -rLL=169n -rLT=169n -Tutf8
man         0   /usr/bin/pager -s
nroff       0   /usr/bin/locale charmap
nroff       0   /usr/bin/groff -mtty-char -Tutf8 -mandoc -rLL=169n …
groff       0   /usr/bin/troff -mtty-char -mandoc -rLL=169n -rLT=169 …
groff       0   /usr/bin/grotty
slide 45:
biolatency
• Block device (disk) I/O latency distribution:
# ./biolatency -mT 1 5
Tracing block device I/O... Hit Ctrl-C to end.
06:20:16
    msecs       : count distribution
    0 -> 1      : 36    |**************************************|
    2 -> 3      : 1     |
    4 -> 7      : 3     |***
    8 -> 15     : 17    |*****************
    16 -> 31    : 33    |**********************************
    32 -> 63    : 7     |*******
    64 -> 127   : 6     |******
[…]
slide 46:
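biolatency's rows are power-of-two buckets, aggregated in-kernel in a BPF map. A minimal sketch of that bucketing in plain Python (illustrative only; latencies are hypothetical):

```python
def log2_hist(values):
    """Bucket values into power-of-two ranges: (0,1), (2,3), (4,7), ...
    Returns {(low, high): count}, mirroring biolatency's output rows."""
    counts = {}
    for v in values:
        b = max(int(v).bit_length() - 1, 0)  # values 0 and 1 share bucket 0
        counts[b] = counts.get(b, 0) + 1
    return {(2**b if b > 0 else 0, 2**(b + 1) - 1): n
            for b, n in sorted(counts.items())}

# Hypothetical I/O latencies in milliseconds:
print(log2_hist([0, 1, 2, 5, 9, 30, 31]))
```

Keeping only bucket counters, rather than per-event records, is why the in-kernel summary approach stays cheap at high event rates.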
ext4slower
• ext4 file system I/O, slower than a threshold:
# ./ext4slower 1
Tracing ext4 operations slower than 1 ms
TIME     COMM  T BYTES  LAT(ms) FILENAME
06:49:17 bash  R 128    7.75    cksum
06:49:17 cksum R 39552  1.34    [
06:49:17 cksum R 96     5.36    2to3-2.7
06:49:17 cksum R 96     14.94   2to3-3.4
06:49:17 cksum R 10320  6.82    411toppm
06:49:17 cksum R 65536  4.01    a2p
06:49:17 cksum R 55400  8.77    ab
06:49:17 cksum R 36792  16.34   aclocal-1.14
06:49:17 cksum R 15008  19.31   acpi_listen
06:49:17 cksum R 6123   17.23   add-apt-repository
06:49:17 cksum R 6280   18.40   addpart
06:49:17 cksum R 27696  2.16    addr2line
06:49:17 cksum R 58080  10.11   ag
06:49:17 cksum R 906    6.30    ec2-meta-data
[…]
slide 47:
bashreadline
• Trace bash interactive commands system-wide:
# ./bashreadline
TIME     PID   COMMAND
05:28:25 21176 ls -l
05:28:28 21176 date
05:28:35 21176 echo hello world
05:28:43 21176 foo this command failed
05:28:45 21176 df -h
05:29:04 3059  echo another shell
05:29:13 21176 echo first shell again
slide 48:
gethostlatency
• Show latency for getaddrinfo/gethostbyname[2] calls:
# ./gethostlatency
TIME     PID   COMM LATms HOST
06:10:24 28011 wget  90.00 www.iovisor.org
06:10:28 28127 wget   0.00 www.iovisor.org
06:10:41 28404 wget   9.00 www.netflix.com
06:10:48 28544 curl  35.00 www.netflix.com.au
06:11:10 29054 curl  31.00 www.plumgrid.com
06:11:16 29195 curl   3.00 www.facebook.com
06:11:25 29404 curl  72.00 foo
06:11:28 29475 curl   1.00 foo
slide 49:
trace
• Trace custom events. Ad hoc analysis multitool:
# trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
TIME     PID  COMM FUNC
05:18:23 4490      sys_read read 1048576 bytes
05:18:23 4490      sys_read read 1048576 bytes
05:18:23 4490      sys_read read 1048576 bytes
05:18:23 4490      sys_read read 1048576 bytes
slide 50:
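The one-liner above combines three pieces: a probe, a filter predicate, and a printf-style format string. A toy model of that pipeline in plain Python (event data and function names invented for illustration; not trace's implementation):

```python
def run_trace(events, predicate, fmt):
    """Emit one formatted line per event that passes the filter,
    modeling: trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'.
    Each event is a dict of probe arguments."""
    return [fmt % ev["arg3"] for ev in events if predicate(ev)]

# Hypothetical sys_read events; only large reads pass the filter:
events = [{"arg3": 1048576}, {"arg3": 512}, {"arg3": 1048576}]
lines = run_trace(events, lambda ev: ev["arg3"] > 20000, "read %d bytes")
print("\n".join(lines))
```

In the real tool the predicate and format are compiled into the BPF program, so filtering happens in-kernel before any data reaches user space.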
Linux bcc/BPF Tracing Tools
slide 51:
4. Future Work
• All event sources
• Language improvements
• More tools: eg, TCP
• GUI support
slide 52:
Linux Event Sources (diagram: some sources done, others still todo)
slide 53:
BPF/bcc Language Improvements
slide 54:
More Tools
• eg, netstat(8)…
$ netstat -s
Ip:
    7962754 total packets received, 8 with invalid addresses, 0 forwarded, 0 incoming packets discarded, 7962746 incoming packets delivered, 8019427 requests sent out
Icmp:
    382 ICMP messages received, 0 input ICMP message failed. ICMP input histogram: destination unreachable: 125, timeout in transit: 257. 3410 ICMP messages sent, 0 ICMP messages failed. ICMP output histogram: destination unreachable: 3410
IcmpMsg:
    InType3: 125, InType11: 257, OutType3: 3410
Tcp:
    17337 active connections openings, 395515 passive connection openings, 8953 failed connection attempts, 240214 connection resets received, 3 connections established, 7198375 segments received, 7504939 segments send out, 62696 segments retransmited, 10 bad segments received, 1072 resets sent, InCsumErrors: 5
Udp:
    759925 packets received, 3412 packets to unknown port received, 0 packet receive errors, 784370 packets sent
UdpLite:
TcpExt:
    858 invalid SYN cookies received, 8951 resets received for embryonic SYN_RECV sockets, 14 packets pruned from receive queue because of socket buffer overrun, 6177 TCP sockets finished time wait in fast timer, 293 packets rejects in established connections because of timestamp, 733028 delayed acks sent, 89 delayed acks further delayed because of locked socket, Quick ack mode was activated 13214 times, 336520 packets directly queued to recvmsg prequeue, 43964 packets directly received from backlog, 11406012 packets directly received from prequeue, 1039165 packets header predicted, 7066 packets header predicted and directly queued to user, 1428960 acknowledgments not containing data received, 1004791 predicted acknowledgments, 1 times recovered from packet loss due to fast retransmit, 5044 times recovered from packet loss due to SACK data, 2 bad SACKs received, Detected reordering 4 times using SACK, Detected reordering 11 times using time stamp, 13 congestion windows fully recovered, 11 congestion windows partially recovered using Hoe heuristic, TCPDSACKUndo: 39, 2384 congestion windows recovered after partial ack, 228 timeouts after SACK recovery, 100 timeouts in loss state, 5018 fast retransmits, 39 forward retransmits, 783 retransmits in slow start, 32455 other TCP timeouts, TCPLossProbes: 30233, TCPLossProbeRecovery: 19070, 992 sack retransmits failed, 18 times receiver scheduled too late for direct processing, 705 packets collapsed in receive queue due to low socket buffer, 13658 DSACKs sent for old packets, 8 DSACKs sent for out of order packets, 13595 DSACKs received, 33 DSACKs for out of order packets received, 32 connections reset due to unexpected data, 108 connections reset due to early user close, 1608 connections aborted due to timeout, TCPSACKDiscard: 4, TCPDSACKIgnoredOld: 1, TCPDSACKIgnoredNoUndo: 8649, TCPSpuriousRTOs: 445, TCPSackShiftFallback: 8588, TCPRcvCoalesce: 95854, TCPOFOQueue: 24741, TCPOFOMerge: 8, TCPChallengeACK: 1441, TCPSYNChallenge: 5, TCPSpuriousRtxHostQueues: 1, TCPAutoCorking: 4823
IpExt:
    InOctets: 1561561375, OutOctets: 1509416943, InNoECTPkts: 8201572, InECT1Pkts: 2, InECT0Pkts: 3844, InCEPkts: 306
slide 55:
slide 56:
Better TCP Tools
• TCP retransmit by type and time
• Congestion algorithm metrics
• etc.
slide 57:
GUI Support
• eg, Netflix Vector: open source instance analyzer
slide 58:
Summary
• BPF in Linux 4.x makes many new things possible
  – Stack-based thread state analysis (solve all issues!)
  – Real-time memory growth/leak detection
  – Better TCP metrics
  – etc...
• Get involved: see iovisor/bcc
• So far just a preview of things to come
slide 59:
Links
iovisor bcc:
• https://github.com/iovisor/bcc
• http://www.brendangregg.com/blog/2015-09-22/bcc-linux-4.3-tracing.html
• http://blogs.microsoft.co.il/sasha/2016/02/14/two-new-ebpf-tools-memleak-and-argdist/
BPF Off-CPU, Wakeup, Off-Wake & Chain Graphs:
• http://www.brendangregg.com/blog/2016-01-20/ebpf-offcpu-flame-graph.html
• http://www.brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html
• http://www.brendangregg.com/blog/2016-02-05/ebpf-chaingraph-prototype.html
Linux Performance:
• http://www.brendangregg.com/linuxperf.html
Linux perf_events:
• https://perf.wiki.kernel.org/index.php/Main_Page
• http://www.brendangregg.com/perf.html
Flame Graphs:
• http://techblog.netflix.com/2015/07/java-in-flames.html
• http://www.brendangregg.com/flamegraphs.html
Netflix Tech Blog on Vector:
• http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
Wordcloud: https://www.jasondavies.com/wordcloud/
slide 60:
Feb 2016. Questions?
http://slideshare.net/brendangregg
http://www.brendangregg.com
bgregg@netflix.com
@brendangregg
Thanks to Alexei Starovoitov (Facebook), Brenden Blanco (PLUMgrid), Daniel Borkmann (Cisco), Wang Nan (Huawei), Sasha Goldshtein (Sela), and other BPF and bcc contributors!