Ubuntu Masters 2019: Extended BPF, A New Type of Software

Keynote for Ubuntu Masters 2019 by Brendan Gregg, Netflix.

Video https://www.youtube.com/watch?v=7pmXdG8-7WU&feature=youtu.be

Description: "Extended BPF is a new type of software, and the first fundamental change to how kernels are used in 50 years. This new type of software is already in use by major companies: Netflix has 14 BPF programs running by default on all of its cloud servers, which run Ubuntu Linux. Facebook has 40 BPF programs running by default. Extended BPF is composed of an in-kernel runtime for executing a virtual BPF instruction set through a safety verifier and with JIT compilation. So far it has been used for software defined networking, performance tools, security policies, and device drivers, with more uses planned and more we have yet to think of. It is changing how we use and think about systems. This talk explores the past, present, and future of BPF, with BPF performance tools as a use case."


PDF: UM2019_BPF_a_new_type_of_software.pdf

Keywords (from pdftotext):

slide 1:
    Extended BPF
    A New Type of Software
    Brendan Gregg
    UbuntuMasters
    Oct 2019
    
slide 2:
    BPF
    
slide 3:
    50 Years, one (dominant) OS model
    Applications
    System Calls
    Kernel
    Hardware
    
slide 4:
    Origins: Multics,
    1960s
    Applications
    Supervisor
    Hardware
    Privilege
    Ring 0
    Ring 1
    Ring 2
    
slide 5:
    Modern Linux: A new OS model
    User-mode
    Applications
    Kernel-mode
    Applications (BPF)
    System Calls
    BPF Helper Calls
    Kernel
    Hardware
    
slide 6:
    50 Years, one process state model
    User
    preemption or time quantum expired
    On-CPU
    schedule
    Kernel
    resource I/O
    acquire lock
    Off-CPU
    swap out
    sleep
    wait for work
    Wait
    Block
    Sleep
    Idle
    Swapping
    Runnable
    wakeup
    swap in
    acquired
    wakeup
    work arrives
    Linux groups
    most sleep states
    
slide 7:
    BPF program state model
    On-CPU
    Off-CPU
    Enabled
    attach
    Loaded
    event fires
    program ended
    BPF
    helpers
    Kernel
    spin lock
    Spinning
    
slide 8:
    Netconf 2018
    Alexei Starovoitov
    
slide 9:
    Kernel Recipes 2019, Alexei Starovoitov
    ~40 active BPF programs on every Facebook server
    
slide 10:
    >150k AWS EC2 Ubuntu server instances
    ~34% US Internet traffic at night
    >130M subscribers
    ~14 active BPF programs on every instance (so far)
    
slide 11:
    Modern Linux: Event-based Applications
    User-mode
    Applications
    Kernel-mode
    Applications (BPF)
    U.E.
    Scheduler
    Kernel
    Kernel
    Events
    Hardware Events (incl. clock)
    
slide 12:
    Modern Linux is becoming Microkernel-ish
    User-mode
    Applications
    Kernel-mode
    Services & Drivers
    BPF
    BPF
    BPF
    Smaller
    Kernel
    Hardware
    The word “microkernel” has already been invoked by Jonathan Corbet, Thomas Graf, Greg Kroah-Hartman, ...
    
slide 13:
slide 14:
    BPF
    
slide 15:
    BPF 1992: Berkeley Packet Filter
    # tcpdump -d host 127.0.0.1 and port 80
    (000) ldh      [12]
    (001) jeq      #0x800           jt 2    jf 18
    (002) ld       [26]
    (003) jeq      #0x7f000001      jt 6    jf 4
    (004) ld       [30]
    (005) jeq      #0x7f000001      jt 6    jf 18
    (006) ldb      [23]
    (007) jeq      #0x84            jt 10   jf 8
    (008) jeq      #0x6             jt 10   jf 9
    (009) jeq      #0x11            jt 10   jf 18
    (010) ldh      [20]
    (011) jset     #0x1fff          jt 18   jf 12
    (012) ldxb     4*([14]&0xf)
    (013) ldh      [x + 14]
    (014) jeq      #0x50            jt 17   jf 15
    (015) ldh      [x + 16]
    (016) jeq      #0x50            jt 17   jf 18
    (017) ret      #262144
    (018) ret      #0
    A limited virtual machine for efficient packet filters
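The listing above is classic BPF's accumulator machine: load a field, compare, branch, return an accept/drop verdict. As a rough illustration only, here is a toy interpreter in that spirit — the opcode tuple encoding is invented for this sketch and is not the real cBPF instruction format:

```python
# Toy interpreter for a classic-BPF-style filter machine.
# Each instruction is (op, arg, jt, jf); jumps are relative to the next insn,
# as in real cBPF. This is an illustrative sketch, not the actual cBPF ISA.
def run(prog, pkt):
    acc = 0                      # single accumulator register
    pc = 0                       # program counter
    while True:
        op, arg, jt, jf = prog[pc]
        if op == "ldh":          # load 16-bit halfword at byte offset arg
            acc = int.from_bytes(pkt[arg:arg + 2], "big")
        elif op == "jeq":        # skip jt insns if acc == arg, else skip jf
            pc += jt if acc == arg else jf
        elif op == "ret":        # return bytes to accept (0 = drop)
            return arg
        pc += 1

# A filter in the spirit of instructions (000)-(001) above:
# accept IPv4 frames (ethertype 0x0800 at offset 12), drop everything else.
prog = [
    ("ldh", 12,     0, 0),
    ("jeq", 0x800,  0, 1),
    ("ret", 262144, 0, 0),   # accept (snap length)
    ("ret", 0,      0, 0),   # drop
]
```

The real filter compiled by tcpdump does the same thing with more steps: it also walks into the IP header to match the 127.0.0.1 addresses and port 80.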
    
slide 16:
    BPF 2019: aka extended BPF
    bpftrace
    XDP
    bpfconf
    BPF microconference
    & Facebook Katran, Google KRSI, Netflix flowsrus,
    and many more
    
slide 17:
    BPF 2019
    User-Defined BPF Programs
    SDN Configuration
    DDoS Mitigation
    Kernel
    Runtime
    Event Targets
    verifier
    sockets
    Intrusion Detection
    Container Security
    kprobes
    BPF
    Observability
    Firewalls
    Device Drivers
    uprobes
    tracepoints
    BPF
    actions
    perf_events
    
slide 18:
    BPF is now a technology name,
    and no longer an acronym
    
slide 19:
    BPF Internals
    BPF Instructions
    Verifier
    Interpreter
    JIT Compiler
    Registers
    Machine Code
    Execution
    Map Storage (Mbytes)
    Events
    BPF
    Context
    BPF
    Helpers
    Rest of
    Kernel
    
slide 20:
    Is BPF Turing complete?
    
slide 21:
    A New Type of Software
                     User task       Kernel task   BPF event
    Execution model  task            task          event
    User defined     yes             no            yes
    Compilation      any             static        JIT, CO-RE
    Security         user based      none          verified, JIT
    Failure mode     abort           panic         error message
    Resource access  syscall, fault  direct        restricted helpers
    
slide 22:
    Example Use Case:
    BPF Observability
    
slide 23:
    BPF enables a new class of
    custom, efficient, and production safe
    performance analysis tools
    
slide 24:
    BPF
    Perf
    Tools
    
slide 25:
    Ubuntu Install
    BCC (BPF Compiler Collection): complex tools
    # apt install bcc
    bpftrace: custom tools (Ubuntu 19.04+)
    # apt install bpftrace
    These are default installs at Netflix, Facebook, etc.
    
slide 26:
    Example: BCC tcplife
    Which processes are connecting to which port?
    
slide 27:
    Example: BCC tcplife
    Which processes are connecting to which port?
    # ./tcplife
    PID   COMM       LADDR      LPORT RADDR          RPORT TX_KB RX_KB MS
    22597 recordProg 127.0.0.1  46644 127.0.0.1      28527     0     0 0.23
    3277  redis-serv 127.0.0.1  28527 127.0.0.1      46644     0     0 0.28
    22598 curl                  61620 52.205.89.26             0     1 91.79
    22604 curl                  44400 52.204.43.121            0     1 121.38
    22624 recordProg 127.0.0.1  46648 127.0.0.1      28527     0     0 0.22
    3277  redis-serv 127.0.0.1  28527 127.0.0.1      46648     0     0 0.27
    22647 recordProg 127.0.0.1  46650 127.0.0.1      28527     0     0 0.21
    3277  redis-serv 127.0.0.1  28527 127.0.0.1      46650     0     0 0.26
    [...]
    
slide 28:
    Example: BCC tcplife
    # tcplife -h
    usage: tcplife.py [-h] [-T] [-t] [-w] [-s] [-p PID] [-L LOCALPORT]
                      [-D REMOTEPORT]

    Trace the lifespan of TCP sessions and summarize

    optional arguments:
      -h, --help            show this help message and exit
      -T, --time            include time column on output (HH:MM:SS)
      -t, --timestamp       include timestamp on output (seconds)
      -w, --wide            wide column output (fits IPv6 addresses)
      -s, --csv             comma separated values output
      -p PID, --pid PID     trace this PID only
      -L LOCALPORT, --localport LOCALPORT
                            comma-separated list of local ports to trace.
      -D REMOTEPORT, --remoteport REMOTEPORT
                            comma-separated list of remote ports to trace.

    examples:
      ./tcplife       # trace all TCP connect()s
      ./tcplife -t    # include time column (HH:MM:SS)
    [...]
    
slide 29:
    Example: BCC biolatency
    What is the distribution of disk I/O latency? Per second?
    
slide 30:
    Example: BCC biolatency
    What is the distribution of disk I/O latency? Per second?
    # ./biolatency -mT 1 5
    Tracing block device I/O... Hit Ctrl-C to end.

    06:20:16
         msecs           : count    distribution
             0 -> 1      : 36      |**************************************|
             2 -> 3      : 1       |*                                     |
             4 -> 7      : 3       |***                                   |
             8 -> 15     : 17      |*****************                     |
            16 -> 31     : 33      |**********************************    |
            32 -> 63     : 7       |*******                               |
            64 -> 127    : 6       |******                                |

    06:20:17
         msecs           : count    distribution
             0 -> 1      : 96      |************************************  |
             2 -> 3      : 25      |*********                             |
             4 -> 7      : 29      |***********                           |
    [...]
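The bucket boundaries in that output (0 -> 1, 2 -> 3, 4 -> 7, ...) are power-of-two ranges, computed in-kernel by the BPF program. A plain-Python sketch of how a latency lands in a bucket (mirroring the log2 histogram only in spirit, not biolatency's actual implementation):

```python
def bucket_range(ms):
    """Return the power-of-two (low, high) msec bucket a latency falls in,
    matching the rows printed by biolatency (0 -> 1, 2 -> 3, 4 -> 7, ...)."""
    if ms <= 1:
        return (0, 1)
    power = ms.bit_length() - 1              # floor(log2(ms))
    return (1 << power, (1 << (power + 1)) - 1)

# e.g. a 12 ms disk I/O increments the "8 -> 15" row
```

Keeping only a fixed number of log2 buckets is why BPF histograms are cheap: the kernel increments one map counter per event and only the summary crosses to user space.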
    
slide 31:
slide 32:
    Example: bpftrace readahead
    Is readahead polluting the cache?
    
slide 33:
    Example: bpftrace readahead
    Is readahead polluting the cache?
    # readahead.bt
    Attaching 5 probes...
    Readahead unused pages: 128
    Readahead used page age (ms):
    @age_ms:
    [1]          2455 |@@@@@@@@@@@@@@@                                     |
    [2, 4)       8424 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [4, 8)       4417 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
    [8, 16)      7680 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
    [16, 32)     4352 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
    [32, 64)        0 |                                                    |
    [64, 128)       0 |                                                    |
    [128, 256)    384 |@@                                                  |
    
slide 34:
    #!/usr/local/bin/bpftrace

    kprobe:__do_page_cache_readahead    { @in_readahead[tid] = 1; }
    kretprobe:__do_page_cache_readahead { @in_readahead[tid] = 0; }

    kretprobe:__page_cache_alloc
    /@in_readahead[tid]/
    {
        @birth[retval] = nsecs;
        @rapages++;
    }

    kprobe:mark_page_accessed
    /@birth[arg0]/
    {
        @age_ms = hist((nsecs - @birth[arg0]) / 1000000);
        delete(@birth[arg0]);
        @rapages--;
    }

    END
    {
        printf("\nReadahead unused pages: %d\n", @rapages);
        printf("\nReadahead used page age (ms):\n");
        print(@age_ms); clear(@age_ms);
        clear(@birth); clear(@in_readahead); clear(@rapages);
    }
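The script's core pattern — save a per-object birth timestamp in one probe, then histogram the age when the object is touched in another — is worth seeing in isolation. A user-space Python sketch of that logic (illustrative only; the real work happens in kernel BPF maps):

```python
import time
from collections import defaultdict

birth = {}                    # like @birth[page]: allocation time in ns
age_hist = defaultdict(int)   # like @age_ms: counts per log2 bucket
rapages = 0                   # like @rapages: readahead pages not yet used

def on_page_alloc(page):
    # analogous to kretprobe:__page_cache_alloc: remember the birth time
    global rapages
    birth[page] = time.monotonic_ns()
    rapages += 1

def on_page_accessed(page):
    # analogous to kprobe:mark_page_accessed: histogram the page's age
    global rapages
    t = birth.pop(page, None)
    if t is None:
        return None           # not a readahead page we tracked
    rapages -= 1
    age_ms = (time.monotonic_ns() - t) // 1_000_000
    age_hist[age_ms.bit_length()] += 1   # log2 bucket, like hist()
    return age_ms
```

Pages that are allocated but never accessed stay in `birth`, which is how the script can report "Readahead unused pages" at the end: readahead that polluted the cache without ever being used.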
    
slide 35:
    Observability Challenges
    libc no frame pointer
    JIT function tracing
    Broken off-CPU flame graph (no frame pointer)
    
slide 36:
    Reality Check
    Many of our perf wins are from CPU flame graphs
    not CLI tracing
    
slide 37:
    Stack depth (0 - max)
    CPU Flame Graphs
    Kernel
    Java
    JVM
    Alphabetical frame sort (A - Z)
    
slide 38:
    BPF-based CPU Flame Graphs
    Linux 2.6: perf record -> perf.data -> perf script -> stackcollapse-perf.pl -> flamegraph.pl
    Linux 4.9: profile.py -> flamegraph.pl
    
slide 39:
    Observability of BPF
    
slide 40:
    Processes
    BPF
    top
    pmap
    strace
    gdb
    bpftool
    perf
    bpflist
    
slide 41:
    bpftool
    PID
    BPF ID
    Event
    # bpftool perf
    pid 1765 fd 6: prog_id 26 kprobe func blk_account_io_start offset 0
    pid 1765 fd 8: prog_id 27 kprobe func blk_account_io_done offset 0
    pid 1765 fd 11: prog_id 28 kprobe func sched_fork offset 0
    pid 1765 fd 15: prog_id 29 kprobe func ttwu_do_wakeup offset 0
    pid 1765 fd 17: prog_id 30 kprobe func wake_up_new_task offset 0
    pid 1765 fd 19: prog_id 31 kprobe func finish_task_switch offset 0
    pid 1765 fd 26: prog_id 33 tracepoint inet_sock_set_state
    pid 21993 fd 6: prog_id 232 uprobe filename /proc/self/exe offset 1781927
    pid 21993 fd 8: prog_id 233 uprobe filename /proc/self/exe offset 1781920
    pid 21993 fd 15: prog_id 234 kprobe func blk_account_io_done offset 0
    pid 21993 fd 17: prog_id 235 kprobe func blk_account_io_start offset 0
    pid 25440 fd 8: prog_id 262 kprobe func blk_mq_start_request offset 0
    pid 25440 fd 10: prog_id 263 kprobe func blk_account_io_done offset 0
    
slide 42:
    # bpftool prog dump jited id 263
    int trace_req_done(struct pt_regs * ctx):
    0xffffffffc082dc6f:
    ; struct request *req = ctx->di;
        push   %rbp
        mov    %rsp,%rbp
        sub    $0x38,%rsp
        sub    $0x28,%rbp
        mov    %rbx,0x0(%rbp)
    13: mov    %r13,0x8(%rbp)
    17: mov    %r14,0x10(%rbp)
    1b: mov    %r15,0x18(%rbp)
    1f: xor    %eax,%eax
    21: mov    %rax,0x20(%rbp)
    25: mov    0x70(%rdi),%rdi
    ; struct request *req = ctx->di;
    29: mov    %rdi,-0x8(%rbp)
    ; tsp = bpf_map_lookup_elem((void *)bpf_pseudo_fd(1, -1), &req);
    2d: movabs $0xffff96e680ab0000,%rdi
    37: mov    %rbp,%rsi
    3a: add    $0xfffffffffffffff8,%rsi
    ; tsp = bpf_map_lookup_elem((void *)bpf_pseudo_fd(1, -1), &req);
    3e: callq  0xffffffffc39a49c1
    
slide 43:
    LPC 2019, Arnaldo Carvalho de Melo
    CPU profiling of BPF programs
    
slide 44:
    “We should be able to single-step execution...
    We should be able to take a core dump of all state.”
    – David S. Miller, LSFMM 2019
    UNIVAC 1
    
slide 45:
    Future
    
slide 46:
    Future Predictions
    More device drivers, incl. USB on BPF (gkh)
    Monitoring agents
    Intrusion detection systems
    TCP congestion controls
    CPU & container schedulers
    FS readahead policies
    CDN accelerator
    
slide 47:
    Take Aways
    BPF is a new software type
    Start using BPF perf tools on Ubuntu:
    bcc, bpftrace
    
slide 48:
    Thanks
    BPF: Alexei Starovoitov, Daniel Borkmann, David S. Miller, Linus Torvalds, BPF
    community
    BCC: Brenden Blanco, Yonghong Song, Sasha Goldshtein, BCC community
    bpftrace: Alastair Robertson, Mary Marchini, Dan Xu, bpftrace community
    Canonical: BPF support, and libc-fp (thanks in advance)
    All photos credit myself; except slide 2 (Netflix) and 9 (KernelRecipes)