AWS re:Invent 2019: BPF Performance Analysis at Netflix
Talk by Brendan Gregg at AWS re:Invent 2019. Video: https://www.youtube.com/watch?v=16slh29iN1g
Description: "Extended BPF (eBPF) is an open source Linux technology that powers a whole new class of software: mini programs that run on events. Among its many uses, BPF can be used to create powerful performance analysis tools capable of analyzing everything: CPUs, memory, disks, file systems, networking, languages, applications, and more. In this session, Netflix's Brendan Gregg tours BPF tracing capabilities, including many new open source performance analysis tools he developed for his new book "BPF Performance Tools: Linux System and Application Observability." The talk includes examples of using these tools in the Amazon EC2 cloud."
PDF: reInvent2019_BPF_Performance_Analysis.pdf
Slide text (from pdftotext):
slide 1: (no text extracted)
slide 2: OPN303 BPF Performance Analysis at Netflix. Brendan Gregg, Senior Performance Architect, Netflix.
slide 3: Superpowers Demo
slide 4: Agenda: Why BPF Is Changing Linux; BPF Internals; Performance Analysis; Tool Development
slide 5: Why BPF Is Changing Linux (section title)
slide 6: 50 years, one (dominant) OS model. Diagram: Applications -> System Calls -> Kernel -> Hardware.
slide 7: Origins: Multics, 1960s. Diagram: Applications -> Supervisor -> Hardware, with privilege rings 0, 1, and 2.
slide 8: Modern Linux: a new OS model. Diagram: User-mode Applications reach the Kernel via System Calls; Kernel-mode Applications (BPF) reach it via BPF Helper Calls; Kernel -> Hardware.
slide 9: 50 years, one process state model. State diagram: On-CPU (user/kernel) <-> Runnable (schedule; preemption or time quantum expired); Runnable <-> Swapping (swap out / swap in); Off-CPU states Wait, Block, Sleep, Idle, with transitions such as resource I/O, acquire lock, sleep, wait for work, and wakeup / acquired / work arrives. Note: Linux groups most sleep states.
slide 10: BPF uses a new program state model. State diagram: Loaded -> (attach) Enabled -> (event fires) On-CPU -> program ended; while on-CPU a program may call BPF helpers into the kernel, or briefly hold a kernel spin lock (Spinning).
slide 11: Photo: Netconf 2018, Alexei Starovoitov.
slide 12: Kernel Recipes 2019, Alexei Starovoitov: ~40 active BPF programs on every Facebook server.
slide 13: Netflix: >150K Amazon EC2 server instances; ~34% of US Internet traffic at night; >130M subscribers; ~14 active BPF programs on every instance (so far).
slide 14: Modern Linux: event-based. Diagram: User-mode Applications (U.E.) and Kernel-mode Applications (BPF) attach to the Scheduler, Kernel Events, and Hardware Events (incl. clock).
slide 15: Modern Linux is becoming microkernel-ish. Diagram: a smaller kernel on hardware, with kernel-mode services & drivers plus BPF programs beneath user-mode applications. The word "microkernel" has already been invoked by Jonathan Corbet, Thomas Graf, Greg Kroah-Hartman, ...
slide 16: BPF Internals (section title)
slide 17: BPF 1992: Berkeley Packet Filter. A limited virtual machine for efficient packet filters.
# tcpdump -d host 127.0.0.1 and port 80
(000) ldh      [12]
(001) jeq      #0x800           jt 2    jf 18
(002) ld       [26]
(003) jeq      #0x7f000001      jt 6    jf 4
(004) ld       [30]
(005) jeq      #0x7f000001      jt 6    jf 18
(006) ldb      [23]
(007) jeq      #0x84            jt 10   jf 8
(008) jeq      #0x6             jt 10   jf 9
(009) jeq      #0x11            jt 10   jf 18
(010) ldh      [20]
(011) jset     #0x1fff          jt 18   jf 12
(012) ldxb     4*([14]&0xf)
(013) ldh      [x + 14]
(014) jeq      #0x50            jt 17   jf 15
(015) ldh      [x + 16]
(016) jeq      #0x50            jt 17   jf 18
(017) ret      #262144
(018) ret      #0
slide 18: BPF 2019: aka extended BPF. Ecosystem: bpftrace, XDP, bpfconf, the BPF Microconference, Facebook Katran, Google KRSI, Netflix flowsrus, and many more.
slide 19: BPF 2019. Diagram: user-defined BPF programs pass through the kernel runtime verifier and attach to event targets (sockets, kprobes, uprobes, tracepoints, perf_events, device drivers) to implement SDN configuration, DDoS mitigation, intrusion detection, container security, firewalls, observability, and BPF actions.
slide 20: BPF is open source and in the Linux kernel (you're all getting it). BPF is also now a technology name, and no longer an acronym.
slide 21: BPF Internals. Diagram: BPF instructions pass through the verifier, then run via the interpreter or the JIT compiler (machine code execution), using registers, map storage (Mbytes), and BPF helpers; events supply the BPF context, and helpers call into the rest of the kernel.
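Not on the slide, but these internals can be inspected on a live system with the kernel's bpftool utility; a minimal sketch, assuming bpftool is installed (the program ID 180 is a placeholder, taken from the show output):
# bpftool prog show                 # list loaded BPF programs and their IDs
# bpftool prog dump xlated id 180   # dump the verifier-translated BPF instructions
# bpftool prog dump jited id 180    # dump the JIT-compiled machine code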
slide 22: Photo: Is BPF Turing complete?
slide 23: BPF: a new type of software.
Execution model | User-defined | Compile    | Security      | Failure mode  | Resource access
user task       | yes          | any        | user-based    | abort         | syscall, fault
kernel task     | no           | static     | none          | panic         | direct
BPF event       | yes          | JIT, CO-RE | verified, JIT | error message | restricted helpers
slide 24: Performance Analysis (section title)
slide 25: BPF enables a new class of custom, efficient, and production-safe performance analysis tools.
slide 26: BPF Performance Tools (book cover).
slide 27: Tool Examples by Subsystem: 1. CPUs (scheduling), 2. Memory, 3. Disks, 4. File Systems, 5. Networking, 6. Languages, 7. Applications, 8. Kernel, 9. Hypervisors, 10. Containers.
slide 28: Tool Extensions & Sources. .py: BCC (Python); .bt: bpftrace (some tools exist for both). https://github.com/iovisor/bcc, https://github.com/iovisor/bpftrace, https://github.com/brendangregg/bpf-perf-tools-book
slide 29: CPUs: execsnoop, new process trace.
# execsnoop.py -T
Example output (columns TIME(s), PCOMM, PID, PPID, RET, ARGS; PID/PPID values not recovered by pdftotext): a service loop repeatedly executing, all with RET 0:
./run
/bin/bash
/command/svstat /service/nflx-httpd
/usr/bin/perl -e $l=<>;$l=~/(\d+) sec/;pr...
/bin/ps --ppid 1 -o pid,cmd,args
/bin/grep org.apache.catalina
/bin/sed s/^ *//;
/usr/bin/xargs
/usr/bin/cut -d -f 1
/bin/echo
/bin/mkdir -v -p /data/tomcat
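A roughly equivalent bpftrace one-liner, as a minimal sketch (execsnoop.py itself also reports PPID, return values, and timestamps):
# New processes with arguments
bpftrace -e 'tracepoint:syscalls:sys_enter_execve { join(args->argv); }'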
slide 30: CPUs: runqlat, scheduler latency (run queue latency).
# runqlat.py 10 1
Tracing run queue latency... Hit Ctrl-C to end.
usecs           : count
0 -> 1          : 1906
2 -> 3          : 22087
4 -> 7          : 21245
8 -> 15         : 7333
16 -> 31        : 4902
32 -> 63        : 6002
64 -> 127       : 7370
128 -> 255      : 13001
256 -> 511      : 4823
512 -> 1023     : 1519
1024 -> 2047    : 3682
2048 -> 4095    : 3170
4096 -> 8191    : 5759
8192 -> 16383   : 14549
16384 -> 32767  : 5589
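The underlying method can be sketched in bpftrace using scheduler tracepoints; a simplified version (it times wakeup to context switch, and unlike the full runqlat tools it does not re-timestamp preempted tasks that return to the run queue):
bpftrace -e 't:sched:sched_wakeup,t:sched:sched_wakeup_new { @qtime[args->pid] = nsecs; }
    t:sched:sched_switch /@qtime[args->next_pid]/ {
        @usecs = hist((nsecs - @qtime[args->next_pid]) / 1000);   // queued -> on-CPU, in microseconds
        delete(@qtime[args->next_pid]); }'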
slide 31: CPUs: runqlen, run queue length.
# runqlen.py 10 1
Sampling run queue length... Hit Ctrl-C to end.
runqlen   : count
0         : 47284
1         : 211
2         : 28
3         : 6
4         : 4
5         : 1
6         : 1
slide 32: Memory: ffaults (book), page faults by filename.
# ffaults.bt
Attaching 1 probe...
[...]
@[dpkg]: 18
@[sudoers.so]: 19
@[ld.so.cache]: 27
@[libpthread-2.27.so]: 29
@[ld-2.27.so]: 32
@[locale-archive]: 34
@[system.journal]: 39
@[libstdc++.so.6.0.25]: 43
@[libapt-pkg.so.5.0.2]: 47
@[BrowserMetrics-5D8A6422-77F1.pma]: 86
@[libc-2.27.so]: 168
@[i915]: 409
@[pkgcache.bin]: 860
@[]: 25038
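A coarser bpftrace variant using a software event, counting page faults by process rather than by filename (ffaults.bt itself derives the filename; this sketch does not):
bpftrace -e 'software:faults:1 { @[comm] = count(); }'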
slide 33: Disks: biolatency, disk I/O latency histograms, per second.
# biolatency.py -mT 1 5
Tracing block device I/O... Hit Ctrl-C to end.
06:20:16
msecs        : count
0 -> 1       : 36
2 -> 3       : 1
4 -> 7       : 3
8 -> 15      : 17
16 -> 31     : 33
32 -> 63     : 7
64 -> 127    : 6
06:20:17
msecs        : count
0 -> 1       : 96
2 -> 3       : 25
4 -> 7       : 29
[...]
slide 34: File Systems: xfsslower, XFS I/O slower than a threshold (variants exist for ext4, btrfs, zfs).
# xfsslower.py 50
Tracing XFS operations slower than 50 ms
Example output (columns TIME, COMM, PID, T, BYTES, OFF_KB, LAT(ms), FILENAME): java (PID 112789) issuing ~3.5-8 KB reads of file.out at 58-137 ms, then 64 KB reads of shuffle_6_646_0.data at 30-400 ms.
slide 35: File Systems: xfsdist, XFS I/O latency histograms, by operation.
# xfsdist.py 60
Tracing XFS operation latency... Hit Ctrl-C to end.
22:41:24:
operation = 'read'
usecs        : count
0 -> 1       : 382130
2 -> 3       : 85717
4 -> 7       : 23639
8 -> 15      : 5668
16 -> 31     : 3594
32 -> 63     : 21387
[...]
operation = 'write'
usecs        : count
0 -> 1       : 12925
2 -> 3       : 83375
[...]
slide 36: Networking: tcplife, TCP session lifespans with connection details.
# tcplife.py
Example output (columns PID, COMM, LADDR, LPORT, RADDR, RPORT, TX_KB, RX_KB, MS): recordProg (PIDs 22597, 22624, 22647) and redis-serv (PID 3277) sessions over 127.0.0.1 port 28527 lasting ~0.2-0.3 ms, and curl sessions (PIDs 22598, 22604) to 52.205.89.26 and 52.204.43.121 lasting ~92-121 ms.
slide 37: Networking: tcpsynbl (book), TCP SYN backlogs as histograms.
# tcpsynbl.bt
Attaching 4 probes...
Tracing SYN backlog size. Ctrl-C to end.
@backlog[backlog limit]: histogram of backlog size
@backlog[128]:
[0]        2
@backlog[500]:
[0]     2783
[1]        9
[2, 4)     4
[4, 8)     1
slide 38: Languages: funccount, count native function calls (C, C++, Go, etc.).
# funccount.py 'tcp_s*'
Tracing 50 functions for "tcp_s*"... Hit Ctrl-C to end.
FUNC column (counts not recovered by pdftotext): tcp_setsockopt, tcp_shutdown, tcp_sndbuf_expand, tcp_send_delayed_ack, tcp_set_state, tcp_sync_mss, tcp_sendmsg_locked, tcp_sendmsg, tcp_send_mss, tcp_small_queue_check.isra.29, tcp_schedule_loss_probe, tcp_send_ack, tcp_stream_memory_free
Detaching...
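An approximate bpftrace equivalent using kprobe wildcards, as a sketch (funccount.py also supports uprobes, tracepoints, and USDT probes):
bpftrace -e 'kprobe:tcp_s* { @[func] = count(); }'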
slide 39: Applications: mysqld_qslower, MySQL queries slower than a threshold.
# mysqld_qslower.py $(pgrep mysqld)
Tracing MySQL server queries for PID 9908 slower than 1 ms...
Example output (columns TIME(s), PID, MS, QUERY): PID 9962 queries between 95 and 206 ms, e.g. SELECT * FROM words WHERE word REGEXP '^bre.*n$' (169.032 ms), SELECT * FROM words WHERE word REGEXP '^bpf.tools$' (205.787 ms), SELECT COUNT(*) FROM words (95.276 ms).
slide 40: Kernel: workq (book), work queue function execution times.
# workq.bt
Attaching 4 probes...
Tracing workqueue request latencies. Ctrl-C to end.
@us[blk_mq_timeout_work]:
[1]        1
[2, 4)    11
[4, 8)    18
@us[xfs_end_io]:
[1]        2
[2, 4)     6
[4, 8)     6
[8, 16)   12
[16, 32)  12
[32, 64)   3
[...]
slide 41: Hypervisor: xenhyper (book), count hypercalls from Xen PV guests.
# xenhyper.bt
Attaching 1 probe...
@[mmu_update]: 44
@[update_va_mapping]: 78
@[mmuext_op]: 6473
@[stack_switch]: 23445
slide 42: Containers: blkthrot (book), count block I/O throttles by blk cgroup.
# blkthrot.bt
Attaching 3 probes...
Tracing block I/O throttles by cgroup. Ctrl-C to end
@notthrottled[1]: 506
@throttled[1]: 31
slide 43: That was only 14 out of 150+ tools; all are open source (not all 150+ are shown on the slide).
slide 44: Coping with so many BPF tools at Netflix. On Netflix servers, /apps/nflx-bpf-alltools has all the tools: BCC, bpftrace, my book, Netflix internal. Open source at: https://github.com/Netflix-Skunkworks/bpftoolkit. Latest tools are fetched & put in a hierarchy: cpu, disk, ... We are also building GUIs to front these tools.
slide 45: Tool Development (section title)
slide 46: Only one engineer at your company needs to learn tool development; they can turn everyone's ideas into tools.
slide 47: The Tracing Landscape, Dec 2019 (my opinion). Chart: ease of use ("brutal" to "less brutal") vs. scope & capability, annotated by stage of development (alpha to mature). Plotted: bpftrace (eBPF, 0.9.3), ply/BPF, bcc/BPF, C/BPF, raw BPF, sysdig, perf, stap, LTTng, and ftrace (recent changes: hist triggers, synthetic events).
slide 48: bcc/BPF (C & Python). Shown: bcc examples/tracing/bitehist.py, an entire program.
slide 49: bpftrace/BPF. An entire program: bpftrace -e 'kr:vfs_read { @ = hist(retval); }'. https://github.com/iovisor/bpftrace
slide 50: bpftrace Syntax: bpftrace -e 'k:do_nanosleep /pid > 100/ { @[comm]++ }'. Parts: probe (k:do_nanosleep), filter (optional: /pid > 100/), action ({ @[comm]++ }).
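Another one-liner with the same three parts, as an illustration (the probe and the 8192-byte threshold are arbitrary choices for this sketch):
bpftrace -e 'kr:vfs_read /retval > 8192/ { @large_reads[comm] = count() }'
# probe: kr:vfs_read (kretprobe); filter: /retval > 8192/; action: { @large_reads[comm] = count() }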
slide 51: Probe Type Shortcuts.
tracepoint  Kernel static tracepoints
usdt        User-level statically defined tracing
kprobe      Kernel function tracing
kretprobe   Kernel function returns
uprobe      User-level function tracing
uretprobe   User-level function returns
profile     Timed sampling across all CPUs
interval    Interval output
software    Kernel software events
hardware    Processor hardware events
slide 52: Filters: /pid == 181/, /comm != "sshd"/, /@ts[tid]/
slide 53: Actions. Per-event output: printf(), system(), join(), time(). Map summaries: @ = count() or @++, @ = hist(). The functions that follow are documented in the bpftrace reference guide: https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md
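A sketch combining both action styles, per-event output plus a map summary printed at exit (the openat tracepoint is chosen here for illustration):
bpftrace -e 't:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); @opens[comm] = count(); }'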
slide 54: Functions.
hist(n)                   Log2 histogram
lhist(n, min, max, step)  Linear histogram
count()                   Count events
sum(n)                    Sum value
min(n)                    Minimum value
max(n)                    Maximum value
avg(n)                    Average value
stats(n)                  Statistics
str(s)                    String
ksym(p)                   Resolve kernel addr
usym(p)                   Resolve user addr
kaddr(n)                  Resolve kernel symbol
uaddr(n)                  Resolve user symbol
printf(fmt, ...)          Print formatted
print(@x[, top[, div]])   Print map
delete(@x)                Delete map element
clear(@x)                 Delete all keys/values
reg(n)                    Register lookup
join(a)                   Join string array
time(fmt)                 Print formatted time
system(fmt)               Run shell command
cat(file)                 Print file contents
exit()                    Quit bpftrace
slide 55: Variable Types. Basic variables: @global, @thread_local[tid], $scratch. Associative arrays: @array[key] = value. Builtins: pid.
slide 56: Builtin Variables.
pid              Process ID (kernel tgid)
tid              Thread ID (kernel pid)
cgroup           Current Cgroup ID
uid              User ID
gid              Group ID
nsecs            Nanosecond timestamp
cpu              Processor ID
comm             Process name
kstack           Kernel stack trace
ustack           User stack trace
arg0, arg1, ...  Function args
retval           Return value
args             Tracepoint args
func             Function name
probe            Full probe name
curtask          Curr task_struct (u64)
rand             Random number (u32)
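An illustrative one-liner exercising the comm and cpu builtins via timed sampling (99 Hz is a common sampling rate chosen to avoid lockstep with kernel timers):
bpftrace -e 'profile:hz:99 { @[comm, cpu] = count(); }'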
slide 57: bpftrace: BPF observability front-end. Linux 4.9+, https://github.com/iovisor/bpftrace
# Files opened by process
bpftrace -e 't:syscalls:sys_enter_open { printf("%s %s\n", comm, str(args->filename)) }'
# Read size distribution by process
bpftrace -e 't:syscalls:sys_exit_read { @[comm] = hist(args->ret) }'
# Count VFS calls
bpftrace -e 'kprobe:vfs_* { @[func]++ }'
# Show vfs_read latency as a histogram
bpftrace -e 'k:vfs_read { @ts[tid] = nsecs } kr:vfs_read /@ts[tid]/ { @ns = hist(nsecs - @ts[tid]); delete(@ts[tid]) }'
# Trace user-level function
bpftrace -e 'uretprobe:bash:readline { printf("%s\n", str(retval)) }'
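The vfs_read latency one-liner can also be kept as an executable script, which is how the book's .bt tools are structured; a minimal sketch (vfsread.bt is a hypothetical filename):
#!/usr/bin/env bpftrace
// vfsread.bt: vfs_read() latency histogram, in nanoseconds
k:vfs_read { @ts[tid] = nsecs }
kr:vfs_read /@ts[tid]/ { @ns = hist(nsecs - @ts[tid]); delete(@ts[tid]) }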
slide 58: Example: bpftrace biolatency, disk I/O latency histograms, per second.
# biolatency.bt
Attaching 3 probes...
Tracing block device I/O... Hit Ctrl-C to end.
@usecs:
[256, 512)       2
[512, 1K)       10
[1K, 2K)       426
[2K, 4K)       230
[4K, 8K)         9
[8K, 16K)      128
[16K, 32K)      68
[32K, 64K)       0
[64K, 128K)      0
[128K, 256K)    10
[...]
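A minimal sketch of how such a tool can be written in bpftrace (the published biolatency.bt differs, and the traced kernel functions vary by kernel version; some kernels use blk_account_io_completion instead of blk_account_io_done):
k:blk_account_io_start { @start[arg0] = nsecs; }   // key by struct request pointer
kr:blk_account_io_start {}                         // (not needed; shown probes only)
k:blk_account_io_done /@start[arg0]/ {
    @usecs = hist((nsecs - @start[arg0]) / 1000);  // issue -> completion, in microseconds
    delete(@start[arg0]);
}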
slide 59: Example: bpftrace biolatency. Implemented in: (source listing not captured by pdftotext)
slide 60: Netflix Vector (old)
slide 61: Grafana at Netflix
slide 62: Takeaways. Add BCC & bpftrace packages to your servers; start using BPF perf tools directly or via GUIs; identify 1+ engineer at your company to develop tools & GUIs. From: BPF Performance Tools: Linux System and Application Observability, Brendan Gregg, Addison-Wesley 2019.
slide 63: Thanks & URLs. BPF: Alexei Starovoitov, Daniel Borkmann, David S. Miller, Linus Torvalds, BPF community. BCC: Brenden Blanco, Yonghong Song, Sasha Goldshtein, BCC community. bpftrace: Alastair Robertson, Matheus Marchini, Dan Xu, bpftrace community.
https://github.com/iovisor/bcc
https://github.com/iovisor/bpftrace
https://github.com/brendangregg/bpf-perf-tools-book
http://www.brendangregg.com/ebpf.html
http://www.brendangregg.com/bpf-performance-tools-book.html
All diagrams and photos (slides 11 & 22) are my own; slide 12 is from Kernel Recipes: https://www.youtube.com/watch?v=bbHFg9IsTk8
slide 64: Thank you! Brendan Gregg, @brendangregg, bgregg@netflix.com
slide 65: Please complete the session survey in the mobile app.