Ubuntu Masters 2019: Extended BPF, A New Type of Software
Keynote for Ubuntu Masters 2019 by Brendan Gregg, Netflix.
Video: https://www.youtube.com/watch?v=7pmXdG8-7WU&feature=youtu.be
Description: "Extended BPF is a new type of software, and the first fundamental change to how kernels are used in 50 years. This new type of software is already in use by major companies: Netflix has 14 BPF programs running by default on all of its cloud servers, which run Ubuntu Linux. Facebook has 40 BPF programs running by default. Extended BPF is composed of an in-kernel runtime for executing a virtual BPF instruction set through a safety verifier and with JIT compilation. So far it has been used for software defined networking, performance tools, security policies, and device drivers, with more uses planned and more we have yet to think of. It is changing how we use and think about systems. This talk explores the past, present, and future of BPF, with BPF performance tools as a use case."
PDF: UM2019_BPF_a_new_type_of_software.pdf
Keywords (from pdftotext):
slide 1:
Extended BPF: A New Type of Software
Brendan Gregg
UbuntuMasters, Oct 2019
slide 2:
BPF
slide 3:
50 Years, one (dominant) OS model
Applications System Calls Kernel Hardware
slide 4:
Origins: Multics, 1960s
Applications Supervisor Hardware Privilege Ring 0 Ring 1 Ring 2
slide 5:
Modern Linux: A new OS model
User-mode Applications Kernel-mode Applications (BPF) System Calls BPF Helper Calls Kernel Hardware
slide 6:
50 Years, one process state model
User preemption or time quantum expired On-CPU schedule Kernel resource I/O acquire lock Off-CPU swap out sleep wait for work Wait Block Sleep Idle Swapping Runnable wakeup swap in acquired wakeup work arrives
Linux groups most sleep states
slide 7:
BPF program state model
On-CPU Off-CPU Enabled attach Loaded event fires program ended BPF helpers Kernel spin lock Spinning
slide 8:
Netconf 2018
Alexei Starovoitov
slide 9:
Kernel Recipes 2019, Alexei Starovoitov
~40 active BPF programs on every Facebook server
slide 10:
>150k AWS EC2 Ubuntu server instances
~34% US Internet traffic at night
>130M subscribers
~14 active BPF programs on every instance (so far)
slide 11:
Modern Linux: Event-based Applications
User-mode Applications Kernel-mode Applications (BPF) U.E. Scheduler Kernel Kernel Events Hardware Events (incl. clock)
slide 12:
Modern Linux is becoming Microkernel-ish
User-mode Applications Kernel-mode Services & Drivers BPF BPF BPF Smaller Kernel Hardware
The word “microkernel” has already been invoked by Jonathan Corbet, Thomas Graf, Greg Kroah-Hartman, ...
slide 13:
slide 14:
BPF
slide 15:
BPF 1992: Berkeley Packet Filter
# tcpdump -d host 127.0.0.1 and port 80
(000) ldh [12]
(001) jeq #0x800 jt 2 jf 18
(002) ld [26]
(003) jeq #0x7f000001 jt 6 jf 4
(004) ld [30]
(005) jeq #0x7f000001 jt 6 jf 18
(006) ldb [23]
(007) jeq #0x84 jt 10 jf 8
(008) jeq #0x6 jt 10 jf 9
(009) jeq #0x11 jt 10 jf 18
(010) ldh [20]
(011) jset #0x1fff jt 18 jf 12
(012) ldxb 4*([14]&0xf)
(013) ldh [x + 14]
(014) jeq #0x50 jt 17 jf 15
(015) ldh [x + 16]
(016) jeq #0x50 jt 17 jf 18
(017) ret #262144
(018) ret
A limited virtual machine for efficient packet filters
slide 16:
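The classic BPF listing on slide 15 can be read as a program for a tiny register machine: an accumulator A, an index register X, loads from packet offsets, and conditional jumps to absolute instruction numbers. The following is a minimal Python sketch of such an interpreter, covering only the opcodes that appear in that listing; it is not the kernel's implementation. The tuple encoding, the names `PROG`, `load`, `run`, and `make_packet`, and the synthetic packets are all hypothetical. The final `ret` is assumed to return 0 (reject), since its operand does not appear in the listing.

```python
import struct

# Hypothetical encoding of the tcpdump -d listing: (op, k) or (op, k, jt, jf).
# Jump targets are absolute instruction numbers, as tcpdump prints them.
PROG = [
    ("ldh", 12),                  # (000) A = EtherType
    ("jeq", 0x800, 2, 18),        # (001) IPv4?
    ("ld", 26),                   # (002) A = source IP
    ("jeq", 0x7F000001, 6, 4),    # (003) 127.0.0.1?
    ("ld", 30),                   # (004) A = destination IP
    ("jeq", 0x7F000001, 6, 18),   # (005) 127.0.0.1?
    ("ldb", 23),                  # (006) A = IP protocol
    ("jeq", 0x84, 10, 8),         # (007) SCTP?
    ("jeq", 0x6, 10, 9),          # (008) TCP?
    ("jeq", 0x11, 10, 18),        # (009) UDP?
    ("ldh", 20),                  # (010) A = flags + fragment offset
    ("jset", 0x1FFF, 18, 12),     # (011) non-first fragment: reject
    ("ldx_msh", 14),              # (012) X = 4 * (pkt[14] & 0xf), IP header length
    ("ldh_x", 14),                # (013) A = source port
    ("jeq", 0x50, 17, 15),        # (014) port 80?
    ("ldh_x", 16),                # (015) A = destination port
    ("jeq", 0x50, 17, 18),        # (016) port 80?
    ("ret", 262144),              # (017) accept: capture up to 262144 bytes
    ("ret", 0),                   # (018) reject (operand assumed, see lead-in)
]

def load(pkt, off, size):
    """Read a big-endian integer, refusing to read past the packet end."""
    if off + size > len(pkt):
        raise IndexError("load beyond end of packet")
    return int.from_bytes(pkt[off:off + size], "big")

def run(prog, pkt):
    """Interpret prog against pkt; return bytes to capture (0 means drop)."""
    a = x = pc = 0
    try:
        while True:
            ins = prog[pc]
            op, k = ins[0], ins[1]
            pc += 1
            if op == "ld":
                a = load(pkt, k, 4)
            elif op == "ldh":
                a = load(pkt, k, 2)
            elif op == "ldb":
                a = load(pkt, k, 1)
            elif op == "ldx_msh":
                x = 4 * (load(pkt, k, 1) & 0xF)
            elif op == "ldh_x":
                a = load(pkt, x + k, 2)
            elif op == "jeq":
                pc = ins[2] if a == k else ins[3]
            elif op == "jset":
                pc = ins[2] if a & k else ins[3]
            elif op == "ret":
                return k
            else:
                raise ValueError(f"unsupported opcode {op}")
    except IndexError:
        return 0  # an out-of-bounds load drops the packet in this sketch

def make_packet(src, dst, proto, sport, dport):
    """Build a synthetic Ethernet + IPv4 (20-byte header) + L4 packet."""
    eth = bytes(12) + struct.pack("!H", 0x0800)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 0, 0, 64, proto, 0,
                     bytes(src), bytes(dst))
    l4 = struct.pack("!HH", sport, dport) + bytes(16)
    return eth + ip + l4

LOOPBACK = (127, 0, 0, 1)
print(run(PROG, make_packet(LOOPBACK, LOOPBACK, 6, 12345, 80)))   # accepted
print(run(PROG, make_packet(LOOPBACK, LOOPBACK, 6, 12345, 443)))  # rejected
```

The program has no loops and only forward jumps, which is part of what keeps a filter like this cheap and bounded.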
BPF 2019: aka extended BPF
bpftrace XDP bpfconf BPF microconference
& Facebook Katran, Google KRSI, Netflix flowsrus, and many more
slide 17:
BPF 2019
User-Defined BPF Programs SDN Configuration DDoS Mitigation Kernel Runtime Event Targets verifier sockets Intrusion Detection Container Security kprobes BPF Observability Firewalls Device Drivers uprobes tracepoints BPF actions perf_events
slide 18:
BPF is now a technology name, and no longer an acronym
slide 19:
BPF Internals
BPF Instructions Verifier Interpreter JIT Compiler Registers Machine Code Execution Map Storage (Mbytes) Events BPF Context BPF Helpers Rest of Kernel
slide 20:
Is BPF Turing complete?
slide 21:
A New Type of Software
       | Execution model | User defined | Compilation | Security      | Failure mode  | Resource access
User   | task            | yes          | any         | user based    | abort         | syscall, fault
Kernel | task            |              | static      | none          | panic         | direct
BPF    | event           | yes          | JIT, CO-RE  | verified, JIT | error message | restricted helpers
slide 22:
Example Use Case: BPF Observability
slide 23:
BPF enables a new class of custom, efficient, and production safe performance analysis tools
slide 24:
BPF Perf Tools
slide 25:
Ubuntu Install
BCC (BPF Compiler Collection): complex tools
# apt install bcc
bpftrace: custom tools (Ubuntu 19.04+)
# apt install bpftrace
These are default installs at Netflix, Facebook, etc.
slide 26:
Example: BCC tcplife
Which processes are connecting to which port?
slide 27:
Example: BCC tcplife
Which processes are connecting to which port?
# ./tcplife
PID COMM LADDR
22597 recordProg 127.0.0.1
3277 redis-serv 127.0.0.1
22598 curl
22604 curl
22624 recordProg 127.0.0.1
3277 redis-serv 127.0.0.1
22647 recordProg 127.0.0.1
3277 redis-serv 127.0.0.1
[...]
LPORT RADDR
46644 127.0.0.1
28527 127.0.0.1
61620 52.205.89.26
44400 52.204.43.121
46648 127.0.0.1
28527 127.0.0.1
46650 127.0.0.1
28527 127.0.0.1
RPORT TX_KB RX_KB MS
0 0.23
0 0.28
1 91.79
1 121.38
0 0.22
0 0.27
0 0.21
0 0.26
slide 28:
Example: BCC tcplife
# ./tcplife -h
usage: tcplife.py [-h] [-T] [-t] [-w] [-s] [-p PID] [-L LOCALPORT] [-D REMOTEPORT]
Trace the lifespan of TCP sessions and summarize
optional arguments:
  -h, --help            show this help message and exit
  -T, --time            include time column on output (HH:MM:SS)
  -t, --timestamp       include timestamp on output (seconds)
  -w, --wide            wide column output (fits IPv6 addresses)
  -s, --csv             comma separated values output
  -p PID, --pid PID     trace this PID only
  -L LOCALPORT, --localport LOCALPORT
                        comma-separated list of local ports to trace.
  -D REMOTEPORT, --remoteport REMOTEPORT
                        comma-separated list of remote ports to trace.
examples:
  ./tcplife             # trace all TCP connect()s
  ./tcplife -t          # include time column (HH:MM:SS)
  [...]
slide 29:
Example: BCC biolatency
What is the distribution of disk I/O latency? Per second?
slide 30:
Example: BCC biolatency
What is the distribution of disk I/O latency? Per second?
# ./biolatency -mT 1 5
Tracing block device I/O... Hit Ctrl-C to end.
06:20:16
msecs      : count  distribution
0 -> 1     : 36     |**************************************|
2 -> 3     : 1
4 -> 7     : 3      |***
8 -> 15    : 17     |*****************
16 -> 31   : 33     |**********************************
32 -> 63   : 7      |*******
64 -> 127  : 6      |******
06:20:17
msecs      : count  distribution
0 -> 1     : 96     |************************************
2 -> 3     : 25     |*********
4 -> 7     : 29     |***********
[...]
slide 31:
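The biolatency output on slide 30 groups latencies into power-of-two ranges (0 -> 1, 2 -> 3, 4 -> 7, ...), so a wide spread of values fits in a few rows. A minimal Python sketch of that bucketing follows, assuming non-negative integer inputs. The function names, the bar width of 40, and the sample latencies are hypothetical, and this is an illustration of the idea rather than BCC's actual code.

```python
from collections import Counter

def bucket_index(value):
    """Power-of-two bucket: 0-1 -> 0, 2-3 -> 1, 4-7 -> 2, 8-15 -> 3, ..."""
    return max(value, 1).bit_length() - 1

def bucket_range(index):
    """Inclusive (low, high) range covered by a bucket index."""
    low = 0 if index == 0 else 1 << index
    high = (1 << (index + 1)) - 1
    return low, high

def log2_hist(values):
    """Return [((low, high), count), ...] from bucket 0 to the highest used."""
    counts = Counter(bucket_index(v) for v in values)
    top = max(counts) if counts else -1
    return [(bucket_range(i), counts.get(i, 0)) for i in range(top + 1)]

def render(hist, width=40):
    """Format rows with a bar scaled so the largest count fills the width."""
    peak = max((count for _, count in hist), default=0)
    lines = []
    for (low, high), count in hist:
        stars = "*" * (count * width // peak) if peak else ""
        lines.append(f"{low:>6} -> {high:<6} : {count:<6} |{stars:<{width}}|")
    return lines

latencies_ms = [0, 1, 1, 2, 5, 6, 9, 12, 15, 40]  # synthetic sample
for line in render(log2_hist(latencies_ms)):
    print(line)
```

Only one counter per bucket has to be kept, which is why a summary like this is cheap to maintain per event and to print per interval.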
slide 32:
Example: bpftrace readahead
Is readahead polluting the cache?
slide 33:
Example: bpftrace readahead
Is readahead polluting the cache?
# readahead.bt
Attaching 5 probes...
Readahead unused pages: 128
Readahead used page age (ms):
@age_ms:
[1]          2455 |@@@@@@@@@@@@@@@
[2, 4)       8424 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)       4417 |@@@@@@@@@@@@@@@@@@@@@@@@@@@
[8, 16)      7680 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[16, 32)     4352 |@@@@@@@@@@@@@@@@@@@@@@@@@@
[32, 64)        0 |
[64, 128)       0 |
[128, 256)    384 |@@
slide 34:
#!/usr/local/bin/bpftrace

// Flag threads that are currently inside readahead.
kprobe:__do_page_cache_readahead { @in_readahead[tid] = 1; }
kretprobe:__do_page_cache_readahead { @in_readahead[tid] = 0; }

// Record the allocation time of each page allocated during readahead.
kretprobe:__page_cache_alloc
/@in_readahead[tid]/
{
    @birth[retval] = nsecs;
    @rapages++;
}

// On first access of a readahead page, record its age in milliseconds.
kprobe:mark_page_accessed
/@birth[arg0]/
{
    @age_ms = hist((nsecs - @birth[arg0]) / 1000000);
    delete(@birth[arg0]);
    @rapages--;
}

END
{
    printf("\nReadahead unused pages: %d\n", @rapages);
    printf("\nReadahead used page age (ms):\n");
    print(@age_ms); clear(@age_ms);
    clear(@birth); clear(@in_readahead); clear(@rapages);
}
slide 35:
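The readahead.bt script on slide 34 keeps a map from page to allocation time, removes the entry on the page's first access while recording its age, and reports whatever is left in the map as unused. The same bookkeeping can be sketched in Python over a list of events. The event tuples, the function name `readahead_ages`, and the sample data are hypothetical, and the sketch assumes every "alloc" event already comes from readahead (the script filters for that with `@in_readahead`).

```python
def readahead_ages(events):
    """Replay (time_ms, kind, page) events; kind is 'alloc' or 'access'.

    Returns (ages_ms, unused): the age at first access of each readahead
    page, and the number of pages that were never accessed.
    """
    birth = {}  # page -> allocation time, like @birth in the script
    ages = []
    for time_ms, kind, page in events:
        if kind == "alloc":
            birth[page] = time_ms
        elif kind == "access" and page in birth:
            # First access: record the age and forget the page, so later
            # accesses to the same page are not counted again.
            ages.append(time_ms - birth.pop(page))
    return ages, len(birth)

events = [  # synthetic sample
    (0, "alloc", "p1"),
    (0, "alloc", "p2"),
    (1, "alloc", "p3"),
    (3, "access", "p1"),
    (9, "access", "p2"),
    (10, "access", "p1"),  # repeat access, ignored
    (12, "access", "p9"),  # not a readahead page, ignored
]
ages, unused = readahead_ages(events)
print("used page ages (ms):", ages)
print("unused pages:", unused)
```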
Observability Challenges
libc no frame pointer JIT function tracing
Broken off-CPU flame graph (no frame pointer)
slide 36:
Reality Check
Many of our perf wins are from CPU flame graphs not CLI tracing
slide 37:
CPU Flame Graphs
y-axis: Stack depth (0 - max); x-axis: Alphabetical frame sort (A - Z); labels: Kernel, Java, JVM
slide 38:
BPF-based CPU Flame Graphs
Linux 2.6: perf record, perf.data, perf script, stackcollapse-perf.pl, flamegraph.pl
Linux 4.9: profile.py, flamegraph.pl
slide 39:
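Both pipelines on slide 38 end in flamegraph.pl, which reads "folded" stacks: one line per unique stack, frames joined by semicolons from root to leaf, followed by a sample count. A minimal Python sketch of that folding step follows. The function name `fold` and the sample stacks are hypothetical; this illustrates the format, not the code of stackcollapse-perf.pl or profile.py.

```python
from collections import Counter

def fold(samples):
    """Collapse sampled stacks (lists of frames, root first) into folded lines."""
    counts = Counter(";".join(frames) for frames in samples)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]

samples = [  # synthetic stack samples
    ["main", "parse", "read"],
    ["main", "parse", "read"],
    ["main", "render"],
]
for line in fold(samples):
    print(line)
```

Counting identical stacks is the expensive part when every sample must first be written out and re-read. A BPF-based profiler can keep these counts in a kernel map, so only the summary reaches user space.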
Observability of BPF
slide 40:
Processes: top, pmap, strace, gdb
BPF: bpftool, perf, bpflist
slide 41:
bpftool
PID, BPF ID, Event
# bpftool perf
pid 1765 fd 6: prog_id 26 kprobe func blk_account_io_start offset 0
pid 1765 fd 8: prog_id 27 kprobe func blk_account_io_done offset 0
pid 1765 fd 11: prog_id 28 kprobe func sched_fork offset 0
pid 1765 fd 15: prog_id 29 kprobe func ttwu_do_wakeup offset 0
pid 1765 fd 17: prog_id 30 kprobe func wake_up_new_task offset 0
pid 1765 fd 19: prog_id 31 kprobe func finish_task_switch offset 0
pid 1765 fd 26: prog_id 33 tracepoint inet_sock_set_state
pid 21993 fd 6: prog_id 232 uprobe filename /proc/self/exe offset 1781927
pid 21993 fd 8: prog_id 233 uprobe filename /proc/self/exe offset 1781920
pid 21993 fd 15: prog_id 234 kprobe func blk_account_io_done offset 0
pid 21993 fd 17: prog_id 235 kprobe func blk_account_io_start offset 0
pid 25440 fd 8: prog_id 262 kprobe func blk_mq_start_request offset 0
pid 25440 fd 10: prog_id 263 kprobe func blk_account_io_done offset 0
slide 42:
# bpftool prog dump jited id 263
int trace_req_done(struct pt_regs * ctx):
0xffffffffc082dc6f:
; struct request *req = ctx->di;
push %rbp
mov %rsp,%rbp
sub $0x38,%rsp
sub $0x28,%rbp
mov %rbx,0x0(%rbp)
13: mov %r13,0x8(%rbp)
17: mov %r14,0x10(%rbp)
1b: mov %r15,0x18(%rbp)
1f: xor %eax,%eax
21: mov %rax,0x20(%rbp)
25: mov 0x70(%rdi),%rdi
; struct request *req = ctx->di;
29: mov %rdi,-0x8(%rbp)
; tsp = bpf_map_lookup_elem((void *)bpf_pseudo_fd(1, -1), &req);
2d: movabs $0xffff96e680ab0000,%rdi
37: mov %rbp,%rsi
3a: add $0xfffffffffffffff8,%rsi
; tsp = bpf_map_lookup_elem((void *)bpf_pseudo_fd(1, -1), &req);
3e: callq 0xffffffffc39a49c1
slide 43:
LPC 2019, Arnaldo Carvalho de Melo
CPU profiling of BPF programs
slide 44:
“We should be able to single-step execution... We should be able to take a core dump of all state.” – David S. Miller, LSFMM 2019
UNIVAC 1
slide 45:
Future
slide 46:
Future Predictions
More device drivers, incl. USB on BPF (ghk)
Monitoring agents
Intrusion detection systems
TCP congestion controls
CPU & container schedulers
FS readahead policies
CDN accelerator
slide 47:
Take Aways
BPF is a new software type
Start using BPF perf tools on Ubuntu: bcc, bpftrace
slide 48:
Thanks
BPF: Alexei Starovoitov, Daniel Borkmann, David S. Miller, Linus Torvalds, BPF community
BCC: Brenden Blanco, Yonghong Song, Sasha Goldshtein, BCC community
bpftrace: Alastair Robertson, Mary Marchini, Dan Xu, bpftrace community
Canonical: BPF support, and libc-fp (thanks in advance)
All photos credit myself; except slide 2 (Netflix) and 9 (KernelRecipes)