eBPF Summit 2023: Fast by Friday: Why eBPF is Essential

Keynote by Brendan Gregg for eBPF Summit 2023 (online).

Video: https://www.youtube.com/watch?v=s1mobd8t_u0

Description: "It is not ok that we spend weeks, even months, trying to solve why software is slow. It should not take more than a week to identify the root cause or causes for a performance issue, such that any performance issue reported on a Monday should be solved by Friday, or sooner. This talk explores the dream of 'fast by Friday,' and shows how kernel technologies like eBPF, and performance methodologies, can get us there. The end goal is not more tools and metrics or having everyone learn eBPF bytecode. It's about efficient computing, and solving inefficiencies as quickly as possible to save cycles and carbon."


PDF: eBPFSummit2023_FastByFriday.pdf

Keywords (from pdftotext):

slide 1:
    Fast by Friday
    Why eBPF is Essential
    Brendan Gregg
    eBPF Summit 2023
    Fast by Friday: Why eBPF is Essential
    
slide 2:
    What would it take to solve any computer performance issue in 5 days?
    
slide 3:
    Imagine solving the performance of anything
    Operating systems, kernels, web browsers, phones, applications, websites, microservices, processors, AI, etc., …
    Examples: Linux, Windows, Firefox, Google Docs, Minecraft, Amazon.com, Intel GPUs, PyTorch, etc., …
    Websites should load in the blink of an eye.
    
slide 4:
    A vision:
    "Fast by Friday":
    Any computer performance issue reported on Monday should be solved by Friday (or sooner)
    
slide 5:
    "Fast by Friday" is…
    A vision
    A way of thinking
    A call to action
    A methodology
    A practical deadline
    I want to completely understand the performance of everything…in 5 days
    
slide 6:
    Why
    
slide 7:
    Expected performance improvement for computing products
    Performance
    Product Performance: Hypothetical
    
slide 8:
    Example reality
    Performance
    Product Performance: Actual
    
slide 9:
    Example reality
    Performance
    Product Performance: Actual
    Bottleneck not found in time
    Not enough time to properly analyze performance of all new software/hardware options
    New bottleneck not found in time
    We, engineers, have to fix this!
    
slide 10:
    Problem: Computers are getting increasingly complex
    Just one example (computer hardware) of increasing complexity.
    Software is worse!
    Performance issues can now go unsolved for weeks, months, years
    Product decisions miss improvements as analysis and tuning takes too long
    
slide 11:
    A common scenario at product vendors
    Your product is probably the fastest
    But there's likely some config/tunable error
    It's the final week of the customer eval
    You have to make it fast by Friday
    
slide 12:
    Why this matters
    Timely performance analysis allows faster and more efficient software/hardware/tuning options to be adopted
    Good for the environment: Less cycles, energy, carbon
    Good for innovation: Rewards investment in engineering
    Good for companies: Less compute expense
    Good for end-users: Lower latency, cheaper products
    
slide 13:
    How
    
slide 14:
    Definitions
    "Fast by Friday":
    Any computer performance issue reported on Monday should be solved by Friday (or sooner)
    Issues: bottlenecks, evaluations, etc.
    Solved by Friday: root cause(s) known
    
slide 15:
    "Fast by Friday": Proposed Agenda
    Prior weeks: Preparation
    Monday: Quantify, static tuning, load
    Tuesday: Checklists, elimination
    Wednesday: Profiling
    Thursday: Latency, logs, critical path
    Friday: Efficiency, algorithms
    Post weeks: Case study, retrospective
    
slide 16:
    Prior weeks: Preparation
    Everything must work on Monday!
    Critical analysis tools ("crisis tools") must be preinstalled; e.g., Linux: procps, sysstat, linux-tools-common, bcc-tools, bpftrace, …
    Stack tracing and symbols should work for the kernel, libraries, and applications
    Tracing (host & distributed) must work
    The performance engineers must already have host SSH root access
    A functional diagram of the system must be known
    Source code should be available
    Example functional diagram
    Source: "Lunar Module - LM10 Through LM14 Familiarization Manual" (1969)
    Current industry status: 1 out of 5
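
The "crisis tools must be preinstalled" item can be checked ahead of time. The following is a minimal sketch, assuming one representative command per package named on the slide; the package-to-command mapping and the function name are illustrative choices, not from the talk, and bcc-tools is left out because its command names and install paths vary by distribution.

```python
import shutil

# Assumed mapping: package named on the slide -> one representative command.
# The choice of command per package is illustrative, not from the talk.
CRISIS_COMMANDS = {
    "procps": "vmstat",
    "sysstat": "iostat",
    "linux-tools-common": "perf",
    "bpftrace": "bpftrace",
}


def missing_crisis_tools(commands=CRISIS_COMMANDS, which=shutil.which):
    """Return the package names whose representative command is not on PATH."""
    return sorted(pkg for pkg, cmd in commands.items() if which(cmd) is None)


if __name__ == "__main__":
    missing = missing_crisis_tools()
    print("missing:", ", ".join(missing) if missing else "none")
```

Running a check like this in the prior weeks, rather than on Monday, is the point: a missing package is cheap to fix before the issue is reported.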
    
slide 17:
    Monday: Quantify, static tuning, load
    1. Quantify the problem
    Problem statement method
    2. Static performance tuning
    The system without load
    Check all hardware and software versions, past errors, config
    Covered in sysperf
    3. Load vs implementation
    Just a problem of load?
    Usually solved via basic monitoring and line charts
    Current industry status: 4 out of 5
    Problem Statement method
    Source: Systems Performance 2nd edition, page 44
    A familiar pattern of load
    Source: https://www.brendangregg.com/Slides/SREcon_2016_perf_checklists
    
slide 18:
    Tuesday: Checklists, elimination
    1. Recent issue checklist
    Often need new tools for ad hoc checks
    Can now be automated by AI auto-tuners (e.g., Intel Granulate)
    2. Elimination: Subsystems it isn't
    It's impossible to deep-dive everything in one week; need to narrow down
    New tools to exonerate components
    Dashboards of health check traffic lights
    Include experiments: microbenchmarks
    Generic system diagram
    Current industry status: 2 out of 5
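
The "health check traffic lights" idea can be sketched as a threshold classifier per subsystem. This is a minimal illustration only: the metric (percent utilization), the thresholds, and the sample readings are hypothetical, and a real health tool would use subsystem-specific checks rather than a single number.

```python
def traffic_light(value, warn, crit):
    """Classify a metric where higher is worse."""
    if value >= crit:
        return "red"
    if value >= warn:
        return "amber"
    return "green"


# Hypothetical readings: subsystem -> (value, warn threshold, crit threshold).
CHECKS = {
    "cpu": (35.0, 70.0, 90.0),
    "disk": (82.0, 60.0, 80.0),
    "network": (65.0, 60.0, 80.0),
}


def exonerated(checks):
    """Subsystems reporting green, which can be deprioritized for deep dives."""
    return sorted(
        name for name, (value, warn, crit) in checks.items()
        if traffic_light(value, warn, crit) == "green"
    )


if __name__ == "__main__":
    for name, (value, warn, crit) in CHECKS.items():
        print(f"{name}: {traffic_light(value, warn, crit)}")
    print("exonerated:", exonerated(CHECKS))
```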
    
slide 19:
    New observability tools often need eBPF
    eBPF is a superpower that can answer
    any software performance question,
    in-situ and immediately
    
slide 20:
    Tuesday (cont.): eBPF Tools
    Current eBPF tools
    *snoop, *top, *stat, *count, *slower, *dist
    Supports later methodologies
    Workload characterization, latency analysis, off-CPU analysis, USE method, etc.
    Future elimination tools
    *health, *diagnosis
    Supports "Fast by Friday"
    Open source & in the target code repo
    They should not be in bcc/bpftrace or proprietary.
    Linux subsystem health tools should be in Linux, like unit tests, ideally written by the developers!
    Current eBPF performance tools
    Source: BPF Performance Tools, cover art [Gregg 2019]
    
slide 21:
    Wednesday: Profiling
    1. CPU Flame Graphs
    More efficient with eBPF
    eBPF runtime stack walkers
    2. Off-CPU Flame Graphs
    Impractical without eBPF
    CPU flame graph
    Solves most performance issues
    Needs preparation!
    Current industry status: 3 out of 5
    Off-CPU/waker time flame graph
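
Flame graphs are typically rendered from "folded" stacks: one line per unique stack, frames joined by semicolons from root to leaf, followed by a sample count. The sketch below shows only that aggregation step, assuming the samples have already been collected by a profiler; the frame names are made up for illustration.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse stack samples (root-first lists of frames) into folded lines."""
    counts = Counter(";".join(frames) for frames in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


# Hypothetical samples, each listed from the root frame to the leaf frame.
samples = [
    ["main", "handle_request", "parse"],
    ["main", "handle_request", "parse"],
    ["main", "handle_request", "compress"],
    ["main", "idle"],
]

for line in fold_stacks(samples):
    print(line)
```

The width of each frame in the rendered graph is proportional to these counts, which is why a stack that dominates the samples stands out visually.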
    
slide 22:
    Thursday: Latency, logs, critical path
    1. Latency drilldowns
    Latency histograms
    Latency heat maps
    Latency outliers
    Drill down to origin of latency
    2. Logs, event tracing
    Custom event logs
    3. Critical path analysis
    Multi-threaded tracing
    Distributed tracing across a distributed environment
    Current industry status: 3 out of 5
    Latency heat maps
    Source: https://www.brendangregg.com/HeatMaps/latency.html
    Distributed tracing
    Source: https://www.brendangregg.com/Slides/Monitorama2015_NetflixInstanceAnalysis
    
slide 23:
    Thursday: Latency, logs, critical path
    eBPF Tools
    1. Latency drilldowns
    Latency histograms: *dist
    Latency heat maps
    Latency outliers: *slower
    Drill down to origin of latency
    2. Logs, event tracing
    Custom event logs: *snoop, bpftrace
    3. Critical path analysis
    Multi-threaded tracing
    Distributed tracing across a distributed environment: "Zero instrumentation"
    Current industry status: 3 out of 5
    (when faster uprobes is done; current status: https://dont-ship.it)
    Latency heat maps
    Source: https://www.brendangregg.com/HeatMaps/latency.html
    Distributed tracing
    Source: https://www.brendangregg.com/Slides/Monitorama2015_NetflixInstanceAnalysis
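
Latency histograms and outlier checks can be illustrated without any tracing at all. The sketch below buckets latencies into power-of-2 ranges, similar in spirit to the summaries printed by histogram-style tools, and filters values above a threshold in the way a "slower than N" tool would. The sample latencies and the threshold are hypothetical, and the function names are not from the talk.

```python
from collections import Counter


def log2_histogram(latencies_us):
    """Count values into power-of-2 buckets: bucket k holds [2**k, 2**(k+1)).

    Values below 1 are clamped into bucket 0.
    """
    buckets = Counter()
    for value in latencies_us:
        buckets[max(int(value), 1).bit_length() - 1] += 1
    return dict(sorted(buckets.items()))


def outliers(latencies_us, threshold_us):
    """Return the latencies at or above the threshold, in input order."""
    return [value for value in latencies_us if value >= threshold_us]


# Hypothetical I/O latencies in microseconds.
latencies = [3, 5, 6, 9, 120, 130, 4000]

for k, count in log2_histogram(latencies).items():
    print(f"[{2**k}, {2**(k + 1)}) us: {count}")
print("outliers >= 1000 us:", outliers(latencies, 1000))
```

A histogram makes multi-modal distributions visible where an average would hide them; the outlier list is the starting point for drilling down to the origin of the latency.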
    
slide 24:
    Friday: Efficiency, algorithms
    1. Is the target efficient?
    A largely unsolved problem
    Cycles/carbon per request
    Compare with similar products
    New efficiency tools (eBPF?)
    Modeling, theory
    Example efficiency comparisons (made up): Cycles(k) per 1k read, by protocol (CIFS, iSCSI, FTP, NFSv3, NFSv4)
    2. Use faster algorithms?
    Big O Notation
    Current industry status: 1 out of 5
    Source: Systems Performance 2nd Edition, page 175
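
Both Friday questions reduce to simple arithmetic once the inputs are measured. The sketch below, under the assumption that cycle and request counts are already available from some profiler or counter, computes cycles per request and contrasts the step counts of an O(n) linear scan with an O(log n) binary search on sorted data. All numbers are hypothetical.

```python
def cycles_per_request(cycles, requests):
    """Efficiency metric: CPU cycles consumed per request served."""
    return cycles / requests


def linear_search_steps(items, target):
    """Number of elements examined by a linear scan: O(n)."""
    steps = 0
    for item in items:
        steps += 1
        if item == target:
            break
    return steps


def binary_search_steps(items, target):
    """Number of probes made by binary search on sorted items: O(log n)."""
    lo, hi, steps = 0, len(items), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] == target:
            break
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return steps


if __name__ == "__main__":
    items = list(range(1024))
    print("cycles/request:", cycles_per_request(2_500_000, 1000))
    print("linear steps:", linear_search_steps(items, 1023))
    print("binary steps:", binary_search_steps(items, 1023))
```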
    
slide 25:
    Post weeks: Case study, retrospective
    1. Document as a case study
    JIRA, wiki, gist
    External blog/talk
    Repetition?
    Add to Tuesday's "Recent issue checklist"
    2. Retrospective
    How to debug it faster by Friday?
    A new way of thinking: If it took over 1 week to solve, that's a failure.
    Current industry status: 1 out of 5
    Example blog post: https://www.brendangregg.com/blog
    
slide 26:
    "Fast by Friday": My current industry ratings (5 == best)
    Prior weeks: Preparation
    Monday: Quantify, static tuning, load
    Tuesday: Checklists, elimination
    Wednesday: Profiling
    Thursday: Latency, logs, critical path
    Friday: Efficiency, algorithms
    Post weeks: Case study, retrospective
    We are not currently good at this
    
slide 27:
    "Fast by Friday": eBPF is Essential
    Prior weeks: Preparation
    Monday: Quantify, static tuning, load
    Tuesday: Checklists, elimination
    Wednesday: Profiling
    Thursday: Latency, logs, critical path
    Friday: Efficiency, algorithms
    Post weeks: Case study, retrospective
    eBPF
    
slide 28:
    Take Aways
    "Fast by Friday": Any computer performance issue reported on
    Monday should be solved by Friday (or sooner)
    eBPF is essential for such fast in-situ production analysis
    It will take all of us many years: OS changes, kernel support, new tools, methodologies
    
slide 29:
    Thanks
    Jesper Dangaard Brouer
    eBPF: Alexei Starovoitov (Meta), Daniel Borkmann (Isovalent), David S. Miller (Red Hat), Jakub Kicinski (Meta), Yonghong Song (Meta), Andrii Nakryiko (Meta), Martin KaFai Lau (Meta), John Fastabend (Isovalent), Quentin Monnet (Isovalent), Jesper Dangaard Brouer (Red Hat), Andrey Ignatov (Meta), Stanislav Fomichev (Google), Joe Stringer (Isovalent), KP Singh (Google), Dave Thaler (Microsoft), Chris Wright (Red Hat), Linus Torvalds, and many more in the BPF community