
Kernel Recipes 2023: Fast By Friday: Why Kernel Superpowers are Essential

Talk by Brendan Gregg for Kernel Recipes 2023

Video: https://www.youtube.com/watch?v=XudHNF4k_x0

Description: "It is not ok that we spend weeks, even months, trying to solve why software is slow. Companies waste money on compute costs, users are unhappy with latency, and product evaluations run out of investigation time. It should not take more than a week to identify the root cause or causes for a performance issue, such that any performance issue reported on a Monday should be solved by Friday, or sooner. The kernel superpowers we have been building are essential for this dream, and allow us to explore performance analysis methodologies to achieve this that were previously a fantasy.

This talk explores the dream of "fast by Friday," and shows how kernel technologies like eBPF, and performance methodologies, can get us there. The end goal is not more tools and metrics or having everyone learn eBPF bytecode. It's about efficient computing, and solving inefficiencies as quickly as possible. It's about saving cycles and carbon.

To be fast by Friday requires observability tools to work on Monday, and right now for many Linux environments that means /proc based tools and Ftrace, sometimes perf, and rarely the eBPF tracing tools: bcc and bpftrace. This and other current and future technical challenges will be discussed, including eBPF stack walking, runtime behavior and uprobes, compiler optimization defaults, OS default packages, and non-CPU targets (GPUs, accelerators)."


PDF: KernelRecipes2023_FastByFriday.pdf

Keywords (from pdftotext):

slide 1:
    Fast by Friday
    Why Kernel Superpowers are Essential
    Brendan Gregg
    Kernel Recipes 2023
    Fast by Friday: Why Kernel Superpowers are Essential
    
slide 2:
    What would it take
    to solve any computer performance issue
    in 5 days?
    
slide 3:
    Imagine solving the performance of anything
    Operating systems, kernels, web browsers, phones, applications,
    websites, microservices, processors, AI, etc., …
    Examples: Linux, Windows, Firefox, Google docs, Minecraft,
    Amazon.com, Intel GPUs, pytorch, etc., …
    Websites should load in the blink of an eye.
    
slide 4:
    Why
    Timely performance analysis allows faster and more efficient
    software/hardware/tuning options to be adopted
    Good for the environment: Fewer cycles, less energy, less carbon
    Good for innovation: Rewards investment in engineering
    Good for companies: Less compute expense
    Good for end-users: Lower latency, cheaper products
    
slide 5:
    A vision:
    "Fast by Friday":
    Any computer performance issue
    reported on Monday
    should be solved by Friday
    (or sooner)
    
slide 6:
    Definitions
    "Fast by Friday":
    Any computer performance issue
    reported on Monday
    should be solved by Friday
    (or sooner)
    Issues: any performance analysis task, especially SW/HW evaluations
    Solved by Friday: doesn't mean fixed, it means root cause(s) known
    
slide 7:
    "Fast by Friday" is…
    A vision
    A way of thinking
    A call to action
    A methodology
    A practical deadline
    I want to completely understand the performance of everything…in 5 days
    
slide 8:
    The first of three activities
    1. Found: Performance root cause(s) known
    2. Fixed: Fix developed
    3. Deployed: Fixed everywhere
    "Fast by Friday" focuses on (1) as it's often the biggest obstacle.
    Yes, even for the Linux kernel. Show me a 2x perf fix and I'll show you companies running it by Friday.
    If the wasted cores paper was widely applicable, I'd have a pretty good example.
    
slide 9:
    The Problem
    
slide 10:
    Expected performance improvement for computing products
    Performance
    Product Performance: Hypothetical
    
slide 11:
    Example reality
    Performance
    Product Performance: Actual
    
slide 12:
    Example reality: 3 issues
    Performance
    Product Performance: Actual
    Not enough time to properly analyze all new software/hardware/compiler options (e.g., icx!)
    Regression not solved in time
    Bottleneck not found in time
    Amount of lost performance
    We, engineers, have to fix this!
    
slide 13:
    Problem: Computers are getting increasingly complex
    Just one example (computer hardware) of increasing complexity.
    Software is worse!
    Performance issues can now go unsolved for weeks, months, years
    Product decisions miss improvements as analysis and tuning takes too long
    
slide 14:
    Analogy: Car performance
    You build the world's fastest car, but the customer says: "it isn't"
    You investigate and discover:
    They were sent the wrong car
    … with flat tires
    … unbalanced wheels
    … a minor engine issue
    … and older firmware
    They also weren't told how to drive it
    … and left economy enabled
    … and didn't use the turbo button
    This may take too long to debug and the customer may leave.
    Computers are like this too!
    
slide 15:
    A common scenario at product vendors
    Your product is probably the fastest
    But there's likely some config/tunable error
    It's the final week of the customer eval
    You have to make it fast by Friday
    
slide 16:
    How
    
slide 17:
    "Fast by Friday": Proposed Agenda
    Prior weeks: Preparation
    Monday: Quantify, static tuning, load
    Tuesday: Checklists, elimination
    Wednesday: Profiling
    Thursday: Latency, logs, critical path
    Friday: Efficiency, algorithms
    Post weeks: Case study, retrospective
    
slide 18:
    Prior weeks: Preparation
    Everything must work on Monday!
    Critical analysis tools ("crisis tools") must be preinstalled; e.g., Linux: procps, sysstat, linux-tools-common, bcc-tools, bpftrace, …
    Stack tracing and symbols should work for the kernel, libraries, and applications
    Tracing (host & distributed) must work
    The performance engineers must already have host SSH root access
    A functional diagram of the system must be known
    Source code should be available
    Example functional diagram
    Source: "Lunar Module - LM10 Through LM14 Familiarization Manual" (1969)
    Current industry status: 1 out of 5
    
slide 19:
    Prior weeks: "Crisis Tools"
    No time to "apt-get update; apt-get install…" during a perf crisis.
    Ftrace is great as it's usually there; my Ftrace/perf tools: https://github.com/brendangregg/perf-tools
    Source: Systems Performance 2nd Edition, page 131-132
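
A preinstall check can be run long before any crisis. The following is a minimal sketch, assuming the command names below are the ones provided by the packages listed in the preparation slide (procps, sysstat, linux-tools-common, bpftrace); the list and the function name `missing_tools` are illustrative, not from the talk. bcc tool names vary by distro, so they are left out.

```python
import shutil

# Hypothetical checklist: commands typically provided by procps (ps, vmstat),
# sysstat (iostat, mpstat, sar), linux-tools-common (perf), and bpftrace.
CRISIS_COMMANDS = ["ps", "vmstat", "iostat", "mpstat", "sar", "perf", "bpftrace"]


def missing_tools(commands, which=shutil.which):
    """Return the commands that cannot be found on PATH."""
    return [cmd for cmd in commands if which(cmd) is None]


if __name__ == "__main__":
    missing = missing_tools(CRISIS_COMMANDS)
    if missing:
        print("Missing crisis tools:", ", ".join(missing))
    else:
        print("All listed crisis tools are installed")
```

Run from configuration management or a periodic job, a check like this reports gaps while there is still time to fix them.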
    
slide 20:
    Monday: Quantify, static tuning, load
    1. Quantify the problem
    Problem statement method
    Figure: Problem Statement method. Source: Systems Performance 2nd edition, page 44
    2. Static performance tuning
    The system without load
    Check all hardware, software versions, past errors, config
    Covered in sysperf
    3. Load vs implementation
    Just a problem of load?
    Usually solved via basic monitoring and line charts
    Figure: A familiar pattern of load. Source: https://www.brendangregg.com/Slides/SREcon_2016_perf_checklists
    Current industry status: 4 out of 5
    
slide 21:
    Monday (cont.): End-of-day Status
    If still unsolved, we now know:
    - It’s a real issue, of this magnitude, affecting these systems
    - It’s not just config
    - It’s not just load
    
slide 22:
    Tuesday: Checklists, elimination
    1. Recent issue checklist
    Often need new tools for ad hoc checks
    Can now be automated by AI auto-tuners (e.g., Intel Granulate)
    2. Elimination: Subsystems it isn't
    It's impossible to deep-dive everything in one week, need to narrow down
    New tools to exonerate components
    Dashboards of health check traffic lights
    Include experiments: microbenchmarks
    Generic system diagram
    Current industry status: 2 out of 5
    
slide 23:
    New observability tools often need kernel superpowers
    We need new tools for broad and deep custom performance analysis, ideally that can be developed and run in-situ by Friday. No restarts.
    eBPF is a kernel superpower that makes this possible.
    (e.g., show me how much workload A queued behind workload B: This is not just queue latency histograms, but needs programmatic filters.)
    Ftrace/perf/perf+eBPF also have kernel superpowers in the hands of wizards.
    eBPF, Ftrace, perf
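
To make the "workload A queued behind workload B" question concrete, here is a minimal offline sketch of the logic such a programmatic filter would need. It assumes a single queue with one server and already-collected timestamps; the `Request` record and `queued_behind` function are hypothetical names, and a real tool would compute this in-kernel with eBPF rather than in Python.

```python
from dataclasses import dataclass


@dataclass
class Request:
    workload: str
    enqueue: float  # time the request entered the queue
    start: float    # time service began
    end: float      # time service completed


def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def queued_behind(requests, waiter, blocker):
    """Total time `waiter` requests sat queued while `blocker` requests were in service."""
    total = 0.0
    for w in requests:
        if w.workload != waiter:
            continue
        for b in requests:
            if b.workload == blocker:
                total += overlap(w.enqueue, w.start, b.start, b.end)
    return total


if __name__ == "__main__":
    # Hypothetical trace: B is served first while two A requests wait.
    trace = [
        Request("B", 0, 0, 5),
        Request("A", 1, 5, 6),
        Request("A", 2, 6, 7),
    ]
    print("A queued behind B:", queued_behind(trace, "A", "B"))
```

A queue latency histogram alone would show that A waited, but not who it waited for; attributing the wait to a specific workload needs the per-request comparison above.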
    
slide 24:
    Tuesday (cont.): eBPF Tools
    Current eBPF tools
    *snoop, *top, *stat, *count, *slower, *dist
    Supports later methodologies
    Workload characterization, latency analysis, off-CPU analysis, USE method, etc.
    Future elimination tools
    *health, *diagnosis
    Supports "fast by Friday"
    Analyzes existing dynamic workload
    Open source & in the target code repo
    E.g., Linux subsystem tools should be in Linux, like unit tests, accepted by maintainers, and ideally written by the developers! E.g., dctcphealth should ideally be written by the dctcp author: Daniel Borkmann!
    This ensures they are accurate and maintained.
    They should not be in bcc/bpftrace or proprietary.
    Figure: Current eBPF performance tools. Source: BPF Performance Tools, cover art [Gregg 2019]
    
slide 25:
    Tuesday (cont.): Health Tool Example 1/2
    I wrote the ZFS L2ARC (second level cache) so I should write the
    health check tool, or at least share thoughts for others to follow:
    I designed it to either help or do nothing, so shouldn’t be an issue, but... It could burn CPU for scanning, memory for metadata, and disk I/O throughput for caching, while not providing a net win, especially if someone set the record size to very small. Plus there could be outright bugs by now: There was that ARC bug I talked about at the last KR.
    Experimental is easiest: It’s a cache, so turn it off! Are things now faster or slower?
    Accurate observability is hard: Measure CPU burn (profiling or eBPF tracing), disk I/O, and impact of L2ARC kernel metadata preventing app WSS from caching, but measuring WSS is hard, and my website is overdue an update: www.brendangregg.com/wss.html
    Rough observability: From kernel counters: Is the L2ARC in use? Is the recsize 
slide 26:
    Tuesday (cont.): Health Tool Example 2/2
    I wrote the ZFS L2ARC (second level cache) so I should write the health check tool, or at least share thoughts for others to follow.
    In summary, a practical L2ARC health tool could:
    1. Use kernel counters to check for possible resource contention versus handpicked thresholds, and report “good” or “maybe issue”.
    2. If maybe, prompt for an invasive test that disables the L2ARC while monitoring systemic throughput. Report “good” or “bad” and quantify.
    
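
The first step of such a health tool (kernel counters versus handpicked thresholds) could be sketched as follows. This is an illustration only: it assumes OpenZFS on Linux, where ARC counters are exposed in /proc/spl/kstat/zfs/arcstats, and the two thresholds and the function names are hypothetical choices, not values from the talk. The invasive second step is not attempted.

```python
ARCSTATS_PATH = "/proc/spl/kstat/zfs/arcstats"  # OpenZFS on Linux

# Hypothetical, handpicked thresholds: tune for the environment.
MIN_L2_HIT_RATIO = 0.10   # L2ARC hits / L2ARC lookups
MAX_HDR_FRACTION = 0.10   # L2ARC header memory / total ARC size


def parse_arcstats(text):
    """Parse 'name type data' kstat rows into a {name: int} dict."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[2].isdigit():
            stats[fields[0]] = int(fields[2])
    return stats


def l2arc_health(stats):
    """Return (verdict, reasons); verdict is 'not in use', 'good' or 'maybe issue'."""
    if stats.get("l2_size", 0) == 0:
        return "not in use", []
    reasons = []
    hits = stats.get("l2_hits", 0)
    lookups = hits + stats.get("l2_misses", 0)
    if lookups and hits / lookups < MIN_L2_HIT_RATIO:
        reasons.append(f"low L2ARC hit ratio: {hits / lookups:.2f}")
    arc_size = stats.get("size", 0)
    hdr_size = stats.get("l2_hdr_size", 0)
    if arc_size and hdr_size / arc_size > MAX_HDR_FRACTION:
        reasons.append(f"L2ARC headers use {hdr_size / arc_size:.0%} of ARC memory")
    return ("maybe issue" if reasons else "good"), reasons


if __name__ == "__main__":
    try:
        with open(ARCSTATS_PATH) as f:
            verdict, reasons = l2arc_health(parse_arcstats(f.read()))
    except OSError:
        verdict, reasons = "not in use", ["no ZFS arcstats found"]
    print(verdict, *reasons, sep="\n  ")
```

Reporting "maybe issue" with reasons, rather than a hard verdict, matches the idea that a rough counter check only decides whether the invasive test is worth prompting for.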
slide 27:
    Tuesday (cont.): Health Tool Points
    An ugly half-good tool is better than nothing
    Sharing thoughts can let others write it (Documentation/*/health.txt)
    Reporting "maybe" is ok
    Not a C64 diagnostics cart: Has to analyze existing workloads
    Test hierarchy: safe -> violent, only progress if needed, can prompt
    Be pragmatic: eBPF, perf, Ftrace, /proc, use anything
    Current tools: "Here's data, you figure it out"
    Health tools: "I figured it out"
    
slide 28:
    Tuesday (cont.): End-of-day Status
    If still unsolved, we now know:
    - It’s a real issue, of this magnitude, affecting these systems
    - It’s not just config
    - It’s not just load
    - It’s not a recent issue
    - It’s caused by these components
    
slide 29:
    Wednesday: Profiling
    1. CPU Flame Graphs
    More efficient with eBPF
    eBPF runtime stack walkers
    2. CPI Flame Graphs
    Needs PMCs; PEBS on Intel for accuracy
    3. Off-CPU Flame Graphs
    Impractical without eBPF
    Solves most performance issues
    Needs preparation!
    Current industry status: 3 out of 5
    Figures: CPU flame graph; Off-CPU/waker time flame graph
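
Flame graphs are built from aggregated stack samples. As a minimal sketch of that aggregation step, the code below folds sampled stacks into the semicolon-separated "folded" text format that flame graph renderers commonly take as input. The function names and the sample stacks are hypothetical; collecting the samples (perf, eBPF) is outside this sketch.

```python
from collections import Counter


def fold_stacks(samples):
    """Count identical stacks; each sample is a list of frames, root first."""
    return Counter(";".join(frames) for frames in samples)


def to_folded_lines(counts):
    """Render counts as 'frame1;frame2 count' lines, sorted by stack."""
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


if __name__ == "__main__":
    # Hypothetical samples, as a profiler might collect at a fixed rate.
    samples = [
        ["main", "read"],
        ["main", "read"],
        ["main", "compute", "hash"],
    ]
    for line in to_folded_lines(fold_stacks(samples)):
        print(line)
```

Counting in one place, as a map keyed by stack, is also what makes in-kernel aggregation with eBPF cheaper than dumping every sample to user space.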
    
slide 30:
    Wednesday (cont.): End-of-day Status
    If still unsolved, we now know:
    - It’s a real issue, of this magnitude, affecting these systems
    - It’s not just config
    - It’s not just load
    - It’s not a recent issue
    - It’s caused by these components
    - It’s caused by these codepaths
    
slide 31:
    Thursday: Latency, logs, critical path, HW
    1. Latency drilldowns
    Latency histograms
    Latency heat maps
    Latency outliers
    2. Logs, event tracing
    Custom event logs
    3. Critical path analysis
    Multi-threaded tracing
    Distributed tracing across a distributed environment
    4. Hardware counters
    Current industry status: 3 out of 5
    Figure: Latency heat maps. Source: https://www.brendangregg.com/HeatMaps/latency.html
    Figure: Distributed tracing. Source: https://www.brendangregg.com/Slides/Monitorama2015_NetflixInstanceAnalysis
    
slide 32:
    Thursday: Latency, logs, critical path, HW
    1. Latency drilldowns
    Latency histograms
    Latency heat maps
    Latency outliers
    eBPF tools: *dist, *slower
    2. Logs, event tracing
    Custom event logs
    eBPF tools: *snoop, bpftrace
    3. Critical path analysis
    Multi-threaded tracing
    Distributed tracing across a distributed environment
    eBPF tools: "Zero instrumentation" (when faster uprobes is done; currently: https://dont-ship.it)
    4. Hardware counters
    Tools: perf & its subcommands
    Current industry status: 3 out of 5
    Figure: Latency heat maps. Source: https://www.brendangregg.com/HeatMaps/latency.html
    Figure: Distributed tracing. Source: https://www.brendangregg.com/Slides/Monitorama2015_NetflixInstanceAnalysis
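
The latency drilldown steps can be illustrated with a small sketch: a power-of-two histogram (the bucketing style that bpftrace's hist() and the *dist tools print) and a threshold filter in the spirit of the *slower tools. This is plain user-space Python over already-collected latencies; the function names and sample values are hypothetical.

```python
def log2_bucket(value):
    """Index i of the power-of-two bucket [2**i, 2**(i+1)) holding value.

    Values below 1 are clamped into the first bucket.
    """
    return max(int(value), 1).bit_length() - 1


def latency_histogram(latencies):
    """Count latencies per power-of-two bucket."""
    counts = {}
    for lat in latencies:
        bucket = log2_bucket(lat)
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts


def slower_than(latencies, threshold):
    """Return only the latencies above a threshold (the outliers)."""
    return [lat for lat in latencies if lat > threshold]


def render(counts, width=40):
    """Render the histogram as text rows with proportional bars."""
    if not counts:
        return []
    peak = max(counts.values())
    rows = []
    for b in range(min(counts), max(counts) + 1):
        n = counts.get(b, 0)
        rows.append(f"[{2**b}, {2**(b + 1)}) {n:6d} |{'@' * (n * width // peak)}")
    return rows


if __name__ == "__main__":
    sample_us = [1, 3, 3, 900, 1000, 1100]  # hypothetical latencies, microseconds
    print("\n".join(render(latency_histogram(sample_us))))
    print("outliers:", slower_than(sample_us, 500))
```

The histogram shows whether latency is bimodal or has a long tail; the threshold filter then isolates the individual slow events for closer inspection.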
    
slide 33:
    Thursday (cont.): End-of-day Status
    If still unsolved, we now know:
    - It’s a real issue, of this magnitude, affecting these systems
    - It’s not just config
    - It’s not just load
    - It’s not a recent issue
    - It’s caused by these components
    - It’s caused by these codepaths
    - Latency has this distribution, over time, and these outliers
    - Latency is coming from this specific component
    - It's not a low-level hardware issue
    
slide 34:
    Friday: Efficiency, algorithms
    1. Is the target efficient?
    A largely unsolved problem
    Cycles/carbon per request
    Compare with similar products
    New efficiency tools (eBPF?)
    System efficiency equals the least efficient component
    Modeling, theory
    Figure: Example efficiency comparisons (made up): Cycles(k) per 1k read, by protocol (CIFS, iSCSI, FTP, NFSv3, NFSv4)
    2. Use faster algorithms?
    Big O Notation
    Source: Systems Performance 2nd Edition, page 175
    Current industry status: 1 out of 5
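
The "cycles per request" metric is simple arithmetic once cycle and request counts are available (for example, cycle counts from hardware counters). Below is a minimal sketch, assuming per-component counts have already been measured; the component names and numbers are made up for illustration, and treating the highest cycles-per-request component as the "least efficient" is one possible reading of the slide.

```python
def cycles_per_request(cycles, requests):
    """CPU cycles consumed per request served."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return cycles / requests


def least_efficient(components):
    """Given {name: (cycles, requests)}, return (name, cycles/request) of the costliest."""
    costs = {name: cycles_per_request(c, r) for name, (c, r) in components.items()}
    worst = max(costs, key=costs.get)
    return worst, costs[worst]


if __name__ == "__main__":
    # Hypothetical measurements over the same interval.
    measured = {
        "proxy": (2_000_000, 1000),
        "app": (9_000_000, 1000),
        "db": (4_000_000, 2000),
    }
    name, cost = least_efficient(measured)
    print(f"least efficient: {name} at {cost:.0f} cycles/request")
```

Comparing the same metric across similar products or protocol implementations is what turns it from a raw number into an efficiency judgment.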
    
slide 35:
    Friday (cont.): End-of-day Status
    If still unsolved, we now know:
    - It’s a real issue, of this magnitude, affecting these systems
    - It’s not just config
    - It’s not just load
    - It’s not a recent issue
    - It’s caused by this component
    - It’s caused by these codepaths
    - Latency has this distribution, over time, and these outliers
    - Latency is coming from this specific component
    - It's not a low-level hardware issue
    - The code is efficient already. There is no “problem”!
    
slide 36:
    Post weeks: Case study, retrospective
    1. Document as a case study
    JIRA, wiki, gist
    External blog/talk
    Including (redacted) flame graphs is great: You may find overlooked perf issues years later from them.
    Repetition?
    Add to Tuesday's "Recent issue checklist"
    2. Retrospective
    How to debug it faster by Friday?
    Current industry status: 1 out of 5
    Example blog post: https://www.brendangregg.com/blog
    
slide 37:
    "Fast by Friday": My current industry ratings (5 == best)
    Prior weeks: Preparation (1 out of 5)
    Monday: Quantify, static tuning, load (4 out of 5)
    Tuesday: Checklists, elimination (2 out of 5)
    Wednesday: Profiling (3 out of 5)
    Thursday: Latency, logs, critical path (3 out of 5)
    Friday: Efficiency, algorithms (1 out of 5)
    Post weeks: Case study, retrospective (1 out of 5)
    We are not currently good at this
    
slide 38:
    "Fast by Friday": Linux Kernel Superpowers
    Prior weeks: Preparation
    Monday: Quantify, static tuning, load
    Tuesday: Checklists, elimination
    Wednesday: Profiling
    Thursday: Latency, logs, critical path
    Friday: Efficiency, algorithms
    Post weeks: Case study, retrospective
    Superpowers: eBPF, perf, Ftrace
    
slide 39:
    What Needs to Change
    
slide 40:
    A way of thinking, a call for action
    Consider perf wins that took weeks as room for improvement
    New tracing tools needed: *diagnose, *health
    Crisis tools should be installed by default in enterprise distros
    Stack walking should work by default for everything
    
slide 41:
    Stack walking, frame pointers, and eBPF walking
    Frame pointers already enabled at major companies.
    Fedora first distro to offer it?
    Can't we be smarter if needed?
    NOP/__fentry__ style rewrites (Rostedt)? Options with LD/ELF.
    Reasons FPs were disabled in 2004:
    - i386
    - gdb doesn't need them
    - gcc vs icc
    https://gcc.gnu.org/legacy-ml/gcc-patches/2004-08/msg01033.html
    eBPF custom runtime stack walkers (Java, etc.)
    Yes, multiple people are doing this. They should ship as open source with the runtime code.
    
slide 42:
    Summary
    
slide 43:
    "Fast by Friday" Summary
    Prior weeks: Preparation
    Day 1: Quantify, static tuning, load
    Day 2: Checklists, elimination
    Day 3: Profiling
    Day 4: Latency, logs, critical path
    Day 5: Efficiency, algorithms
    Post weeks: Case study, retrospective
    Fast by Friday: Any computer performance issue reported on Monday should be solved by Friday (or sooner)
    
slide 44:
    "Fixed by Friday" (a different talk) sample
    Performance Mantras:
    Don't do it
    Do it, but don't do it again
    Do it less
    Do it later
    Do it when they're not looking
    Do it concurrently
    Do it cheaper
    Fixed by Friday:
    Any known performance bug
    reported on Monday
    should have a fix by Friday
    (or sooner)
    AFAIK these mantras are from Craig Hanson and Pat Crain (I'm still looking for a reference)
    
slide 45:
    Take Aways
    "Fast by Friday": Any computer performance issue reported on
    Monday should be solved by Friday (or sooner)
    Kernel superpowers, especially eBPF, are essential for such
    fast in-situ production analysis
    It will take all of us many years: OS changes, kernel support,
    new tools, methodologies. How can you help? One step at a time!
    
slide 46:
    Q&A
    
slide 47:
    Thanks
    Jesper Dangaard Brouer
    eBPF: Alexei Starovoitov (Meta), Daniel Borkmann (Isovalent), David S. Miller (Red Hat),
    Jakub Kicinski (Meta), Yonghong Song (Meta), Andrii Nakryiko (Meta), Thomas Graf
    (Isovalent), Martin KaFai Lau (Meta), John Fastabend (Isovalent), Quentin Monnet
    (Isovalent), Jesper Dangaard Brouer (Red Hat), Andrey Ignatov (Meta), Stanislav
    Fomichev (Google), Joe Stringer (Isovalent), KP Singh (Google), Dave Thaler
    (Microsoft), Liz Rice (Isovalent), Chris Wright (Red Hat), Linus Torvalds, and many
    more in the BPF community
    Ftrace: Steven Rostedt (Google) and the Ftrace community
    Perf: Arnaldo Carvalho de Melo (Red Hat) and the
    perf community
    Kernel Recipes 10th edition!