Brendan Gregg's Homepage

G'Day. I use this site to share and bookmark various things, mostly my work with computers. While I currently work on large scale cloud computing performance at Intel (previously Netflix), this site reflects my own opinions and work from over the years. I have a personal blog, and I'm also on Mastodon and Twitter. Here is my bio and anti-bio.

This page lists everything: Documentation, Videos, Software, Misc. For a short selection of most popular content, see my Overview page.

Documentation

Documents I've written, in approximately reverse chronological order:

A page to summarize my Linux Performance related material.
The post The Return of the Frame Pointers, explaining how we're finally fixing stack walking and thus profiling in major Linux distros (Fedora, Ubuntu) after 20 years, with a summary of current and future stack walkers. (2024)
The documentary eBPF: Unlocking the Kernel has been released, with interviews from the key players from the origin of the technology in 2014 (blog, youtube). (2024)
An early look at a new performance analysis methodology, "Fast by Friday," presented as an eBPF Summit keynote and then updated for Kernel Recipes (KR slides, PDF, youtube). (2023)
A post on eBPF Observability Tools Are Not Security Tools to help the new wave of eBPF-based security products not shoot themselves in the foot. (2023)
Slides for my YOW! talk on Visualizing Performance: The Developers' Guide to Flame Graphs, summarizing their state in 2022: Over 80 implementations, over 400 related projects, etc. The slides include many implementation screenshots as a visual tour. (slides, PDF, youtube). (2022)
My SREcon APAC 2022 keynote on Computing Performance: What's on the Horizon, covering processors, memory, disks, networking, hypervisors, and more. To keynote the first USENIX conference in my home country Australia was an honour, and a highlight of my career. (slides, PDF, youtube) (2022).
Slides for my YOW! talk on Computing Performance 2021: What's on the Horizon (slides, PDF).
I made an appearance at IntelON to talk about Processor Benchmarking for the Netflix cloud (slides, PDF, youtube). (2021)
The story of ZFS Is Mysteriously Eating My CPU including a hard-to-believe flame graph.
Rough notes for Analyzing a High Rate of Paging, using a mix of traditional, Ftrace, and BPF performance tools.
Rough notes for Slack's Secret STDERR Messages, where I used eBPF and other tools to debug Slack's mysterious crashes (2021).
Slides for my eBPF Summit keynote Performance Wins with eBPF: Getting Started (slides, PDF, youtube) (2021).
Slides for my Facebook Systems@Scale talk Performance Wins with BPF: Getting Started (slides, PDF, video) (2021).
My plenary talk on Computing Performance: On the Horizon for the USENIX LISA 2021 conference, covering the present and future of processors, memory, disks, networking, hypervisors, and more. (blog, slides, PDF, youtube) (2021).
A post on How To Add eBPF Observability To Your Product aimed at those adding it to commercial or internal observability platforms (2021).
A 122-slide talk on BPF Internals (eBPF) for the USENIX LISA 2021 conference, where I show all the steps from the high-level bpftrace language to machine code (blog, slides, PDF, youtube) (2021).
The story of the most surprising software demo I've been given: An Unbelievable Demo.
As it has become an FAQ, a post on What is Observability, a term that I and other performance engineers have used since before it was popular (2021).
A post on BPF binaries: BTF, CO-RE, and the future of eBPF perf tools (2020).
Slides for my eBPF summit keynote Performance Wins with BPF: Getting Started, where I showed how to get started the easy way, as I kept seeing people start the hard way and waste their own time (slides, PDF, youtube) (2020).
Slides for my YOW2020 talk Linux Systems Performance, delivered online (slides, PDF, youtube) (2020).
My latest book website: Systems Performance: Enterprise and the Cloud, Second Edition and blog post about it. The publisher is Addison-Wesley, and the draft book is over 800 pages. This updates the Linux and cloud content, and adds chapters on perf, Ftrace, and BPF (2020).
In my post BPF Theremin, Tetris, and Typewriters I share the source for the BPF theremin I've demoed during conference keynotes, which turns WiFi signal strength into sound. (2019)
My book website: BPF Performance Tools: Linux System and Application Observability and blog post about it. The publisher is Addison-Wesley, and the book is 880 pages, including over 100 new BPF tools to analyze all the things (2019).
Slides for my AWS re:Invent 2019 talk BPF Performance Analysis at Netflix (slides, youtube).
Slides for my Ubuntu Masters 2019 keynote Extended BPF: A New Type of Software, where I first discussed how BPF is a new type of software and a fundamental change to the 50-year-old kernel model, and how we observe this new software that's already in production at major companies (slides, PDF, youtube).
An article for opensource.com: An introduction to bpftrace for Linux, summarizing the syntax and including a one-liners tutorial (PDF) (2019).
Slides for my USENIX LISA 2019 talk on Linux Systems Performance, covering observability, methodologies, benchmarking, profiling, tracing, and tuning. It's intended for everyone as a tour of fundamentals, and some companies have indicated they will use it for new hire training. (slides, PDF, youtube).
Slides for my LSFMM 2019 keynote on BPF Observability, which was also summarized by lwn.net (slides, PDF).
Slides for my SCaLE17x talk eBPF Perf Tools, although I spent half the talk demonstrating how to instrument Minecraft using eBPF live, which is not captured by the slides (slides, PDF, youtube) (2019).
Slides for my YOW2018 keynote Cloud Performance Root Cause Analysis at Netflix which I delivered in Sydney, Brisbane, and Melbourne (slides, PDF, youtube).
Slides for my YOW2018 CTO summit keynote Working at Netflix (slides, PDF, youtube).
Slides for my All Things Open (ATO) talk Linux Performance 2018 (it was not videoed) (slides, PDF) (2018).
I posted FlameScope Pattern Recognition, showing how to interpret the subsecond-offset heatmap view of profile data, and also spoke about this with Martin Spier at the LinkedIn performance engineering meetup (slides, PDF, youtube) (2018).
Linux gets a high-level eBPF front-end: bpftrace (DTrace 2.0) for Linux 2018. The repository has tools and docs I developed, including a bpftrace one-liners tutorial, a bpftrace reference guide, and an internals development guide (2018).
To help everyone make fewer mistakes when benchmarking, I published Evaluating the Evaluation: A Benchmarking Checklist (2018).
At NetConf 2018 I gave a talk/discussion on BPF Observability, attended by top contributors of the Linux TCP/IP stack (slides, PDF).
Slides from my PerconaLive 2018 keynote on summarizing Linux Performance 2018 (slides, PDF, youtube).
For the Netflix TechBlog I posted Netflix FlameScope (PDF), showing the new open source performance analysis tool for visualizing and exploring profiles with flame graphs, that Martin Spier and I have been working on (2018).
A post on KPTI/KAISER Meltdown Initial Performance Regressions, showing the estimated performance losses on Linux for simulated workloads (2018).
A page on Working Set Size Estimation, covering what I know and have developed on the topic. It includes new tools that use Linux referenced and idle page flags for page-based WSS estimation (2018-).
A post on AWS EC2 Virtualization 2017, explaining how virtualization performance has been improving over the years, and details of the new AWS Nitro hypervisor (2017).
Slides for my AWS re:Invent 2017 talk on How Netflix Tunes EC2 Instances for Performance, describing instance selection, EC2 features, kernel tuning, and observability (slides, youtube, PDF).
A post on 7 tools for analyzing performance in Linux with bcc/BPF for opensource.com, which is also my first post on eBPF for the Fedora/Red Hat family of operating systems (PDF) (2017).
A post on Brilliant Jerks in Engineering which describes two fictional jerks, one selfless and one selfish, to explore their behaviors, the damage caused, and the use of a "no brilliant jerks" policy (2017).
Slides for my USENIX LISA 2017 talk Linux Container Performance Analysis (slides, youtube, PDF).
Slides for my Kernel Recipes 2017 talk on Using Linux perf at Netflix, focusing on fixing CPU profiling (slides, youtube, PDF).
For EuroBSDcon 2017, I gave the closing keynote on System Performance Analysis Methodologies (BSD), and I used FreeBSD for many examples. In this talk I introduced a FreeBSD static perf tools diagram, the BSD perf analysis in 60 seconds checklist, and the tstate.d tool for thread state analysis (slides, youtube, PDF).
Slides for my OSSNA 2017 talk Performance Analysis Superpowers with Linux BPF. This talk was not videoed. (slides, PDF).
A post on Solaris to Linux Migration 2017. Solaris was an OS that was once in widespread use, but development now seems to have ceased. I've been asked about Solaris to Linux migrations, but I've never seen a resource that covers them both in depth (maybe because no one else has the expertise or inclination to write it), so this is my summary.
A post on Linux Load Averages: Solving the Mystery, where I dug up the patch that added the uninterruptible sleep state to Linux load averages, and measured code paths in that state as an off-CPU flame graph (2017).
My slides for PMCs on the Cloud at the SBSRE meetup, with the Netflix perf team (slides, youtube, meetup, PDF) (2017).
Slides for my 2017 USENIX ATC Performance Analysis Superpowers with Linux eBPF, updating my earlier talk (slides, youtube, PDF).
Slides for my 2017 USENIX ATC talk Visualizing Performance with Flame Graphs, with updates and challenges (slides, youtube, PDF).
Slides for my Velocity talk Performance Analysis Superpowers with Linux eBPF (slides, youtube, PDF) (2017).
A post on Working at Netflix 2017, following on from my earlier posts. I expanded on mundane topics like how many meetings I have each week, as I was getting asked this recently. (2017).
CPU Utilization is Wrong: a post explaining the growing problem of memory stall cycles dominating %CPU (2017).
In my post The PMCs of EC2: Measuring IPC, I demonstrate measuring Performance Monitoring Counters (PMCs) for the first time in the AWS EC2 cloud, something I'd been working on with Amazon and others for a while. PMCs are crucial for understanding CPU behavior, including memory I/O, stall cycles, and cache misses. (2017).
A post about my DockerCon 2017 talk on Container Performance Analysis, where I showed how to find bottlenecks in the host vs the container, how to profile container apps, and dig deeper into the kernel (blog, slides, PDF, youtube). (2017).
Slides for my SCaLE15x talk on Linux 4.x Tracing: Performance Analysis with bcc/BPF, where I also included a ply demo for the first time (slides, youtube, PDF) (2017).
At the IO Visor (iovisor) Summit 2017 I led a discussion on BPF Tools for performance and observability, covering strategy, current tools, and future challenges (slides, PDF).
Slides for BSidesSF 2017 with Alex Maestretti on Linux Monitoring at Scale with eBPF (slides, youtube, PDF).
I posted Where has my disk space gone? Flame graphs for file systems and Flame Graphs vs Tree Maps vs Sunburst (2017).
Slides for my Linux.conf.au 2017 talk on BPF: Tracing and More, summarizing other uses for enhanced BPF. (slides, youtube, PDF).
A post on Golang bcc/eBPF Function Tracing, where I figured out how to trace functions and arguments for different Go compilers (2017).
A page to summarize eBPF Tools using Linux eBPF and the bcc front end for advanced observability and tracing tools (2016+).
Slides for my USENIX LISA 2016 talk Linux 4.x Tracing Tools: Using BPF Superpowers, which focused on bcc tools I've developed. This included my demo Give me 15 minutes and I'll change your view of Linux tracing. (slides, PDF, youtube demo, youtube talk).
DTrace for Linux 2016, announcing that the Linux kernel now has raw capabilities similar to those of DTrace as of Linux 4.9, via enhanced BPF. I've been heavily involved in this project, especially as the number one user, and it was great to reach this milestone.
Slides for a talk at the first sysdig conference on Designing Tracing Tools (slides, youtube, PDF) (2016).
I wrote the original bcc/BPF end user tutorial, Python developer tutorial, and reference guide (2016).
Several posts introducing new Linux bcc/BPF tracing tools: bcc/BPF Tracing Security Capabilities, bcc/BPF MySQL Slow Query Tracing, bcc/BPF ext4 Latency Tracing, Linux bcc/BPF Run Queue (Scheduler) Latency, Linux bcc/BPF Node.js USDT Tracing, Linux bcc tcptop, Linux 4.9's Efficient BPF-based Profiler, Linux bcc/BPF tcplife: TCP Lifespans (2016).
My JavaOne 2016 slides for Java Performance Analysis on Linux with Flame Graphs (slides only: slides, PDF), and a follow-up post on Java Warmup Analysis with Flame Graphs.
A post on gdb Debugging Full Example (Tutorial): ncurses where I shared a full debugging session, including all output and dead ends. It includes a little ftrace and BPF (2016).
My keynote slides for ACM Applicative 2016 on System Methodology: Holistic Performance Analysis on Modern Systems, where I used several different operating systems as examples (slides, PDF, youtube).
A post demonstrating new capabilities by llnode for Node.js Memory Leak Analysis (2016).
A post on Linux Hist Triggers in Linux 4.7 demonstrating this new tracing feature (2016).
For PerconaLive 2016, slides for my Linux Systems Performance 2016 overview talk (slides, PDF, video).
The Flame Graph article for ACMQ/CACM that defines flame graphs, describes their origin, explains how to interpret them, and discusses possible future developments (2016).
For SREcon 2016 Santa Clara, slides for my Performance Checklists for SREs talk, which was also my first talk about my recent SRE (Site Reliability Engineering) work at Netflix (blog, slides, PDF, youtube, usenix).
A post on Working at Netflix 2016, as a follow-on from my 2015 post. This is still worth talking about: freedom and responsibility, outstanding and professional coworkers, etc. It differs from many other Silicon Valley companies.
For Facebook's Performance @Scale conference, slides for my Linux BPF Superpowers talk, which introduced the tracing capabilities of this new feature in the Linux 4.x series (slides, PDF, video) (2016).
More Linux BPF/bcc posts showing how to analyze off-CPU time and drill deeper: Linux eBPF Off-CPU Flame Graph, Linux Wakeup and Off-Wake Profiling, Who is waking the waker? (Linux chain graph prototype) (2016).
A post on Unikernel Profiling: Flame Graphs from dom0, where I showed that observability was indeed possible with some engineering work, as some believed otherwise (2016).
Slides for my Broken Linux Performance Tools SCaLE14x talk, similar to my earlier QCon talk but for Linux only, and including a bit more advice (slides, PDF, video) (2016).
Linux Performance Analysis in 60,000 Milliseconds shows the first ten commands one can use (video, PDF). Written by myself and the performance engineering team at Netflix (2015).
Slides for my QConSF 2015 talk Broken Performance Tools highlighting common pitfalls with system metrics, tools, and methodologies for generic Linux/Unix systems (slides, PDF, video) (2015).
At JavaOne 2015, I gave a talk on Java Mixed-Mode Flame Graphs, utilizing the new -XX:+PreserveFramePointer feature in JDK8u60 (blog, slides, youtube, PDF) (2015).
Using eBPF via bcc, tcpconnect and tcpaccept (bcc), and eBPF Stack Trace Hack (bcc), which show some new Linux tracing performance tools I've developed that use eBPF via the bcc frontend (2015-6).
For the Netflix Tech Blog I posted Java in Flames (PDF) with Martin Spier, which shows mixed-mode flame graphs using the new -XX:+PreserveFramePointer JDK option. Great to see all CPU consumers in one visualization (2015).
A summary, and recommendations, for navigating the different Linux tracers: Choosing a Linux Tracer (2015).
Some posts on uprobes: Linux uprobe: User-Level Dynamic Tracing, which demonstrates uprobes via my uprobe tool, and Hacking Linux USDT with Ftrace (2015).
My Netflix Instance Performance Analysis Requirements talk for Monitorama 2015, where I showed desirable and undesirable features of these products. This is intended for the numerous vendors who keep trying to sell me these products, and for customers who can use this talk as a source of feature requests. (slides, PDF, vimeo)
My first post on Linux eBPF, which is bringing in-kernel maps to Linux tracing (2015).
My slides for my Linux Performance Tools tutorial at Velocity 2015, which was an expanded version of my earlier talk on the same topic. This is the most detailed version I've done (slides, PDF, youtube).
Slides from an internal Netflix presentation that was published on RxNetty vs Tomcat Performance Results, where I had used many tools, including tracing and PMCs, to explain the performance difference between these frameworks (slides, PDF) (2015).
My Linux Profiling at Netflix talk for SCALE 13x (2015), where I covered getting CPU profiling to work, including for Java and Node.js, and a tour of other perf_events features (slides, PDF, youtube).
A post about Working at Netflix, describing the culture. This is worth writing about, as Netflix is pioneering with company culture as well as technology, and showing that culture can be engineered to be positive (2015).
For LISA2014, my Linux Performance Analysis: New Tools and Old Secrets talk, where I covered ftrace and perf_events tools I've recently developed (slides, PDF, youtube, USENIX).
My talk on Performance Tuning Linux Instances on EC2 from AWS re:Invent 2014, where I covered how Netflix selects, tunes, and then observes the performance of Linux cloud instances (slides, youtube).
A post introducing Differential Flame Graphs, which can be used for performance regression analysis. I also wrote about CPI flame graphs, which use the differential flame graph code, and pmcstat on FreeBSD (2014).
My Flame Graphs on FreeBSD talk, for the FreeBSD Dev and Vendor Summit 2014. I summarized the different types, with the FreeBSD commands to create them (slides, PDF, youtube).
For my first BSD conference, MeetBSDCA 2014, my Performance Analysis for BSD talk, where I discussed 5 facets: observability, methodologies, benchmarking, profiling, and tuning (slides, PDF, youtube).
Contributed documentation on the DTrace on FreeBSD wiki page: the initial one-liners list (PDF) and a 12-part tutorial (PDF).
My Linux Performance Tools talk for LinuxCon Europe 2014, where I summarized observability, benchmarking, tuning, static perf tuning tools, and tracing. This was an updated version of my earlier LinuxCon talk on the same topic (slides, PDF, youtube).
For the 2014 Tracing Summit, my From DTrace to Linux talk, summarizing what Linux can learn from DTrace (slides, PDF, youtube).
My Surge 2014 talk From Clouds to Roots, on how Netflix does root cause performance analysis on a Linux cloud. It's my most comprehensive performance analysis talk to date. Instead of just focusing on low-level tools, I provided context and then showed the full path from clouds to roots. (slides, PDF, youtube).
My post The MSRs of EC2 was the first to show that CPU Model Specific Registers were available in the cloud and could be used to measure interesting details such as the real CPU clock rate and temperature of instances (Xen guests). Weeks after my post (and possibly inspired by it) a security researcher found an MSR vulnerability in EC2 that required a cloud-wide reboot (2014).
Slides from my LinuxCon North America 2014 talk Linux Performance Tools, which summarizes performance observability, benchmarking, and tuning tools, and illustrates their role on Linux system functional diagrams (slides, PDF).
Posts summarizing Linux Java CPU Flame Graphs and Node.js CPU Flame Graphs (2014).
My lwn.net article Ftrace: The Hidden Light Switch (2014).
Posts describing perf-tools based on Linux perf_events and ftrace (both core Linux kernel tracers): perf Hacktogram, iosnoop, iosnoop Latency Heat Maps, opensnoop, execsnoop, tcpretrans (2014).
Posts about Linux perf_events: perf CPU Sampling, perf Static Tracepoints, perf Heat Maps, perf Counting, perf Kernel Line Tracing (2014).
A page for perf Examples with perf_events, the standard Linux profiler. Page includes one-liners and flame graphs.
I wrote a warning post titled strace Wow Much Syscall, which discusses strace(1) for production use, includes an interesting example, and many bad strace-related jokes (2014).
Two posts to explain Xen modes on AWS EC2: What Color is Your Xen and Xen Feature Detection (2014).
The Benchmark Paradox: a short post explaining a seeming paradox in benchmark evaluations (2014).
My Analyzing OS X Systems Performance with the USE Method talk at MacIT 2014 (slides, PDF).
At SCaLE12x (2014) I gave the keynote on What Linux can learn from Solaris perf. and vice-versa (slides, PDF, youtube).
The Case of the Clumsy Kernel (PDF): a kernel performance analysis article for USENIX ;login (2013).
My USENIX/LISA 2013 talk Blazing Performance with Flame Graphs, which was two talks in one: part 1 covered the commonly used CPU flame graphs, and part 2 covered various advanced flame graphs (slides, PDF, youtube).
A page of ktap Examples for the lua-based Linux dynamic tracing tool, including one-liners and tools (no longer maintained) (2013).
The TSA Method, a performance analysis methodology for identifying issues causing poor application performance. This is a thread-oriented methodology, and is complementary to the resource-oriented USE Method. It has solved countless issues.
A page of my Performance Analysis Methodology summaries, and links.
Systems Performance: Enterprise and the Cloud, Prentice Hall, 2013 (ISBN 0133390098). This book covers new developments in systems performance: in particular, dynamic tracing and cloud computing. It also introduces many new methodologies to help a wide audience get started. It leads with Linux examples from Ubuntu, Fedora, and CentOS, and also covers illumos distributions. Covering two different kernels provides additional perspective that enhances the reader's understanding of each. The book is 635 pages plus appendices.
My slides for a brief talk on The New Systems Performance, where I summarized how the topic has changed from the 1990's to today (July 2013, slides, PDF, youtube).
Active Benchmarking: a methodology for successful benchmarking, and an example of its use for Bonnie++.
My OSCON 2013 slides for Open Source Systems Performance, where I provided a unique perspective I'm best positioned to give about both open- and close-sourcing software, and what this means for systems performance analysis (slides, PDF, youtube).
Visualizing distributions using Frequency Trails, explained in the Introduction, then using them for Detecting Outliers, measuring Modes and Modality, and What the Mean Really Means.
My slides for Stop the Guessing: Performance Methodologies for Production Systems talk at Velocity 2013 (slides, PDF, youtube).
The very popular slide deck for my Linux Performance Analysis and Tools talk at SCaLE11x (2013), which includes lesser known tools such as perf's dynamic tracing and static trace points. I've been told people want slide 16 on a coffee cup! (slides, PDF, youtube).
A summary of Virtualization Performance: Zones, KVM, Xen, focusing on I/O path overheads (2013).
The Thinking Methodically about Performance article for ACMQ (2012), and CACM, based on my earlier USE Method articles.
USENIX/LISA 2012 slides on Performance Analysis Methodology, summarizing ten methods and anti-methods (slides, PDF, youtube).
My FISL'13 slides on The USE Method for systems performance analysis, including some other methods for comparison (slides, youtube, blog) (2012).
For illumosday and zfsday, my slides for DTracing the Cloud (slides, PDF, youtube) and ZFS Performance Analysis and Tools (slides, PDF, youtube).
My SCaLE10x talk slides Performance Analysis: new tools and concepts from the cloud, with examples (slides, youtube, PDF) (2012).
The introduction of a new visualization type: Subsecond Offset Heat Maps, which allow behavior within a second to be seen.
The USE Method, which I developed for identifying common system bottlenecks and errors, and have used successfully for many years in enterprise and cloud performance environments. Based on the USE method: the Linux Performance Checklist, the Solaris Performance Checklist, the SmartOS Performance Checklist, the Mac OS X Performance Checklist, the FreeBSD Performance Checklist, and the Unix 7th Edition Performance Checklist. There is also the USE Method Rosetta Stone of Performance Checklists.
The Flame Graph visualization and separate pages on using them for CPU Flame Graphs including how to fix stack traces and symbols for Java and Node.js; different techniques for Memory Flame Graphs including allocator tracing and page fault flame graphs; and different techniques for Off-CPU Flame Graphs, including block I/O flame graphs, wakeup flame graphs, off-wake flame graphs, and chain graphs.
Colony Graphs, a visualization of computer life forms, and their use for Visualizing the Cloud, Process Snapshots and Process Execution.
Demonstrations of different visualizations for Device Utilization, which was described as blog post of the year (2011).
Narrow topics in operating system performance: Activity of the ZFS ARC.
A long post about Using SystemTap on the Ubuntu and CentOS Linux distributions, written in late 2011.
An introduction to the technique of Off-CPU Performance Analysis, which can identify the cause of high latency due to blocking events.
Top 10 DTrace Scripts for Mac OS X performance analysis and troubleshooting, written to reach the broader Mac OS X community. This includes step by step instructions on how to find and run the Terminal application and sudo.
A series of blog posts on File System latency, using MySQL as an example application (1, 2, 3, 4, 5) (2011).
MySQL Query Latency using DTrace (2011).
A series of blog posts on the DTrace pid provider, going beyond what was covered in the DTrace book (2011).
The DTrace book with Jim Mauro (Prentice Hall, 2011; ISBN 0132091518). A sample chapter on File Systems is online. This 1152 page book took over a year to write, including the research, development and testing of dozens of new DTrace scripts and one-liners, and soliciting input from many experts. Solaris was used as the primary OS for examples, with additional examples from Mac OS X and FreeBSD. The most difficult challenge for using a dynamic tracing tool (DTrace, SystemTap, etc.) is knowing what to do with it. This book provides over one hundred use cases (scripts), which will be invaluable even after the example code becomes out of date.
A page on Heat Maps, and a demonstration of Latency Heat Maps which includes example software to generate them.
Slides for my Percona Live New York 2011 talk on Breaking Down MySQL/Percona Query Latency With DTrace (PDF).
The Visualizations for Performance Analysis slide deck, USENIX/LISA 2010. This describes two different approaches (methodologies) for systems performance: workload analysis and latency analysis, the metrics used, and then introduces a variety of heat map visualizations. This talk ends by describing the challenges of cloud computing, and how heat maps are well suited for the scale of data (slides, PDF, youtube).
Slides by Jim Mauro and myself for our How to Build Better Applications With Oracle Solaris DTrace (PDF) talk at Oracle OpenWorld 2010.
An article for ACMQ, also published by CACM, on Visualizing System latency using latency heat maps (2010). This includes interesting latency heat maps I had found using Sun Storage Analytics, including the Rainbow Pterodactyl and the Icy Lake. This was pioneering work, and it took several years for latency heat maps to appear in other performance analysis products.
My DTrace Cheatsheet, summarizing probes, variables, and actions. Inspired by the mdb cheatsheet by Jonathan Adams. (2009).
The post 7410 hardware update, and analyzing the HyperTransport, where I used PICs (aka PMCs) to analyze the CPU interconnect and other busses, and find a significant performance win (2009).
A post about the ZFS separate intent log: SLOG Screenshots, including a discussion on the technology and latency heat maps from Analytics showing the improvements (2009).
A series of posts on performance testing a line of storage appliances (1, 2, 3). I wrote these in 2009, when I was often saving benchmarking mishaps. They were very successful (and thanks to those who read them) as the calls for help were greatly reduced.
For the Front Range OpenSolaris User Group (FROSUG), in Denver, Colorado, my slides for my Little Shop of Performance Horrors talk, where I discussed things going wrong instead of right. It was a lot of fun, and people showed up despite a massive snow storm (slides, PDF, youtube) (2009).
Posts showing new (and interesting) performance visualizations using Sun Storage Analytics, especially latency heat maps: Latency Art: Rainbow Pterodactyl, Heat Map Analytics, Latency Art: X marks the spot (2009).
Visualizing DRAM Latency using latency heat maps from Sun Storage Analytics (2009).
Demonstrating the Kstat/DTrace-based Analytics tool from the Sun Storage 7000 product: JBOD Analytics Example, Networking Analytics Example, NFS Analytics example (2008-9).
Posts demonstrating the performance limits of the Sun Storage 7000 series of ZFS-based storage appliance I was working on, as well as our Analytics observability tool: A quarter million NFS IOPS, Up to 2 Gbytes/sec NFS, 1 Gbyte/sec NFS, streaming from disk, My Sun Storage 7410 perf limits, CIFS at 1 Gbyte/sec, My Sun Storage 7310 perf limits, Hybrid Storage Pool: Top Speeds (2008-9).
The storage appliance dashboard where I used weather icons to highlight performance issues and convey ambiguity for certain metrics (2008).
Slides for Fishworks Analytics (PDF) for CEC2008 with Bryan Cantrill, where we launched a storage appliance performance analysis tool that was many years ahead of the industry. A real-time dynamic tracing GUI, latency heat maps, etc.
Slides for Fishworks Overview (PDF) at CEC2008 with Cindi McGuire, where we introduced the first ZFS-based storage appliance, the Sun Storage 7000 series. Fishworks was the team that developed it. We worked at a private site, set up to mimic a San Francisco startup.
The original ZFS L2ARC post (2008) and later L2ARC Screenshots (2009). Since code changes were public each night, my block comment in usr/src/uts/common/fs/zfs/arc.c (added in Nov 2007) disguised the then-secret intent of this technology by listing "short-stroked disks" as the first intended device, instead of SSDs.
I wrote the original Sun ZFS Storage 7000 admin guide and online help, and while doing so created the most advanced content system within Sun Microsystems: A content wiki that could auto-generate Sun-styled PDFs and other formats, allowing versions to be built in seconds instead of the usual 2-week process. This 2010 version has many updates from other staff (I've yet to find my original 2008 PDF) (2008).
A post on the DTraceToolkit in MacOS X, which included (and updated) 44 of my DTraceToolkit tools in /usr/bin by default (2008).
A post announcing the DTraceToolkit ver 0.99, a major release where I added many language provider tools (2007).
My Solaris Performance: Introduction slides from May 2007, covering Solaris performance features and observability. This includes two of my methodologies for performance analysis: the "By-Layer Strategy" and the "3-Metric Strategy" (back when I spelled utilization with an "s"). The latter strategy is what I later called the USE Method (slides, PDF).
Slides for a Sun Microsystems talk on Virtualization: Zones (OS containers) (slides, PDF) (2007).
Slide decks (PDFs) from DTrace talks in 2007: DTrace Intro, DTraceToolkit, and DTrace Java.
Posts demonstrating the DTrace Bourne shell (sh) provider and iSCSI DTrace Provider (2007).
A post on DTracing Off-CPU Time, where I used DTrace and my DTraceToolkit to analyze gnome-terminal startup latency (2007).
Slides for the OSCON 2007 talk Observability Matters: How DTrace Helped Twitter by Adam Leventhal and myself. At the time, Twitter was a one-year-old startup running on Solaris with crippling performance issues, and Sun's top engineering team (including Adam and myself) visited the SF headquarters to help. These are the findings by the team (slides, PDF) (2007).
A post on AMD64 PICs, CPI, where I used Solaris cpustat and PICs (aka PMCs) to measure cycles per instruction, and summarized tuning strategies for low or high CPI (2007).
A post on Colortrace, where I developed a DTrace tool for flow tracing across different OS stack layers and colorizing the output (2007).
A post on DTracing vim Latency, where I showed how I improved vim startup time by 340x using DTrace and my DTraceToolkit (2006).
The design document for the DTrace Network Providers I developed, and a CEC2006 demos page showing the first live demonstrations of my DTrace TCP provider, at the CEC conference (2006).
My post DTrace meets JavaScript, where I announced the first DTrace provider for JavaScript I had been developing, which was for the Mozilla Spider Monkey engine, and then JavaScript Provider ver 2.0 (2006-7).
The companion book to Solaris Internals 2nd Edition: Solaris Performance and Tools, with Richard McDougall and Jim Mauro (Prentice Hall, 2006; ISBN 0131568191). These chapters began during development of Solaris Internals 2nd Edition, and were later split into a separate companion volume. It worked well: a reference book on internals, and a companion book for practitioners on performance. 444 pages.
The post How much CPU does a down interface chew?, solving a mysterious performance issue on Solaris (2006).
A 233-slide deck for a DTrace workshop that I developed and delivered in London, that summarized Solaris performance and DTrace analysis. As part of this workshop I created performance labs for the students to solve (not included in the slides) (slides, PDF) (2006).
My Solaris 10 Zones page from 2005, where I developed models for configuring Zones with Resource Controls. This was pioneering work for the performance isolation of containers (nowadays the realm of Linux cgroups). I was not a Sun employee at the time. Sun later based their official docs on my work, without attribution (they were not allowed to include my home page URL in the official Sun references).
DTrace vs truss, a page to explain the virtues of DTrace over truss, the Solaris syscall tracer (2005).
Analysis of prstat vs top using DTrace, and showing why these perform differently (2005).
Another DTrace case study: DTracing SMC, the Solaris Management Console, which took 30 minutes to start up (yes, minutes). I found various issues using DTrace, including 12 million mostly 1 byte sequential read()s. (2005).
A DTrace case study: DTracing Lost CPU, demonstrating DTrace and my new DTrace tools. I published this and others before Solaris 10 was released, at a time when there were no other websites on using DTrace, and so these became important first use cases from a customer (2005).
DTrace where I shared the first publicly available DTrace tools and later my DTraceToolkit. I created this page in 2004 and it was also the first public website on DTrace.
I had two prior professional blogs (see my blog archive for archived posts): blogs.oracle.com/brendan (formerly blogs.sun.com/brendan), where I discussed performance, DTrace, and the ZFS storage appliance (2006-2010); and dtrace.org/blogs/brendan, where I continued posting about cloud performance and DTrace (2010-2014). Posts from my prior personal blog at bdgregg.blogspot.com are also in the archive.
My old and unmaintained Unix and Sun Solaris material is labeled as being in my Crypt, and kept online for historical interest only. The Crypt contains a few extra things not listed above. (circa 2003-2005.)

Videos

The documentary eBPF: Unlocking the Kernel with interviews with myself and others on the origin of eBPF in 2014. To see all the familiar faces and hear their familiar voices discussing the points we were making back then brings me right back, and it lets you experience it as well (youtube, blog). (30 mins) (2024)
I made an unplanned appearance at GopherConAU to give a lightning talk on Golang Flame Graphs. (5 mins) (2023)
Early talks about a new performance analysis methodology, "Fast by Friday," at Kernel Recipes and the eBPF Summit (KR youtube, slides). (43 mins) (2023)
Video for my YOW! talk on Visualizing Performance: The Developers' Guide to Flame Graphs, summarizing their state in 2022: Over 80 implementations, over 400 related projects, etc. (youtube, slides) (49 mins) (2022)
At LSFMMBPF I led two group discussions in a meeting of eBPF lead developers. These low-level discussions among developers aren't normally videoed, so this is a rare view into the room where it happens. BPF observability tools update (38 mins) and Developing BPF guidelines (25 mins). (2022).
My SREcon APAC 2022 keynote on Computing Performance: What's on the Horizon, covering processors, memory, disks, networking, hypervisors, and more (youtube, slides) (2022).
My video appearance (with pandemic hair) at Intel InnovatiON 2021 about Processor Benchmarking for the Netflix cloud (video, slides). (2021).
Video for my eBPF Summit keynote Performance Wins with eBPF: Getting Started (youtube, slides) (17 mins) (2021).
Video for my Facebook Systems@Scale talk Performance Wins with BPF: Getting Started (video, slides) (21 mins) (2021).
My USENIX LISA2021 online plenary video on Computing Performance: On the Horizon, covering the present and future of performance. (youtube, blog) (41 mins) (2021).
My USENIX LISA2021 online talk video on BPF Internals (eBPF) showing high-level to machine code (youtube, blog) (39 mins) (2021).
My eBPF summit online keynote video on Performance Wins with BPF: Getting Started. This was seven months into a California lockdown and, like many, I needed a haircut (youtube, slides) (19 mins).
My YOW2020 online talk video for Linux Systems Performance, my first online conference talk (youtube, slides) (46 mins) (2020).
My AWS re:Invent 2019 talk video BPF Performance Analysis at Netflix includes my BPF theremin demo (youtube, slides) (57 mins).
My Ubuntu Masters 2019 keynote Extended BPF: A New Type of Software, where I first discussed how BPF is a new type of software and a fundamental change to the 50-year-old kernel model, and how we observe this new software that's already in production at major companies (youtube, slides) (31 mins).
My USENIX LISA 2019 talk on Linux Systems Performance, covering observability, methodologies, benchmarking, profiling, tracing, and tuning. Some companies have indicated they will use this video for new hire training. (youtube, slides) (40 mins).
My SCaLE17x talk eBPF Perf Tools where I instrumented Minecraft using eBPF live (youtube, slides) (60 mins) (2019).
My YOW2018 keynote Cloud Performance Root Cause Analysis at Netflix which I delivered in Sydney, Brisbane, and Melbourne (youtube, slides) (59 mins).
My YOW2018 CTO summit keynote Working at Netflix, which was my first non-strictly-technical talk (youtube, slides) (28 mins).
A LinkedIn Performance meetup talk on FlameScope with my colleague Martin Spier (youtube, slides) (24 mins) (2018).
My PerconaLive 2018 keynote on summarizing Linux Performance 2018 (slides, PDF, youtube) (20 mins).
I made two short videos for the launch of Netflix FlameScope: FlameScope Intro (1 min) and FlameScope Examples (11 mins) (2018).
I gave a lightning talk at SCALE16x on CPU utilization is WRONG, which was Ignite-style: auto advancing slides (youtube) (5 mins) (2018).
My AWS re:Invent 2017 talk on How Netflix Tunes EC2 Instances for Performance, describing instance selection, EC2 features, kernel tuning, and observability (youtube, slides) (56 mins).
My USENIX LISA 2017 talk Linux Container Performance Analysis (youtube, slides) (42 mins).
My Kernel Recipes 2017 talk on Using Linux perf at Netflix, focusing on fixing CPU profiling (youtube, slides) (51 mins).
My Kernel Recipes 2017 talk on Performance Analysis with BPF, including a 14 minute demo (youtube, slides) (42 mins).
For EuroBSDcon 2017, I gave the closing keynote on System Performance Analysis Methodologies (BSD) using FreeBSD for many examples, and introduced some new content (youtube, slides) (60 mins).
My talk on PMCs on the Cloud at the SBSRE meetup, with my Netflix cloud performance team colleagues (youtube, meetup). (16 mins of 85) (2017).
My USENIX ATC 2017 talk Performance Analysis Superpowers with Linux eBPF, updating my earlier talk (youtube, slides) (39 mins).
My USENIX ATC 2017 talk Visualizing Performance with Flame Graphs, with updates and challenges (youtube, slides) (61 mins).
My Velocity 2017 talk Performance Analysis Superpowers with Linux eBPF (youtube, slides) (43 mins).
My DockerCon 2017 talk on Container Performance Analysis, where I showed how to find bottlenecks in the host vs the container, how to profile container apps, and dig deeper into the kernel (youtube, slides). (42 mins).
My SCaLE15x talk on Linux 4.x Tracing: Performance Analysis with bcc/BPF, including a ply demo (youtube, slides) (64 mins) (2017).
My BSidesSF 2017 talk with Alex Maestretti on Linux Monitoring at Scale with eBPF, including our diagram of events to monitor for intrusion detection (youtube, slides) (28 mins).
My Linux.conf.au 2017 talk on BPF: Tracing and More, where I summarized other uses for enhanced BPF. (youtube, slides) (46 mins).
My USENIX/LISA 2016 full talk Linux 4.x Tracing: Using BPF Superpowers (youtube, slides) (44 mins).
At LISA 2016, my Give me 15 minutes and I'll change your view of Linux tracing demo, showing ftrace, perf, bcc/BPF (youtube) (18 mins).
My talk at the first sysdig conference on Designing Tracing Tools (youtube, slides) (2016) (46 mins).
My keynote talk for ACM Applicative 2016 on System Methodology: Holistic Performance Analysis on Modern Systems, where I used several different operating systems as examples (youtube, slides) (57 mins).
For PerconaLive 2016, my Linux Systems Performance 2016 summary of this topic in 50 minutes (youtube, slides) (50 mins).
I gave the closing address at SREcon16 Santa Clara on Performance Checklists for SREs, which was also my first talk about my SRE (Site Reliability Engineering) work at Netflix (blog, youtube, usenix, slides) (61 mins).
For Facebook's Performance @Scale conference, my Linux BPF Superpowers talk video where I introduced the tracing capabilities of this new feature in the Linux 4.x series (facebook, slides) (2016) (34 mins).
Broken Linux Performance Tools for SCaLE14x, focusing on Linux problems with a bit more advice (youtube, slides) (2016) (1 hr).
For QConSF 2015 my Broken Performance Tools talk highlighting common pitfalls with system metrics, tools, and methodologies for generic Linux/Unix systems; good quality video is synced with the slides on the infoq site (infoq, slides) (2015) (50 mins).
My JavaOne 2015 talk on Java Mixed-Mode Flame Graphs introducing -XX:+PreserveFramePointer; as it wasn't officially recorded this is a capture from my laptop (youtube, slides) (2015).
My Monitorama 2015 talk Netflix Instance Performance Analysis Requirements, where I showed the different features that are desirable and undesirable for an instance analysis product, aimed at both vendors and customers (blog, vimeo, slides) (34 mins).
My Velocity 2015 tutorial Linux Performance Tools, which summarizes performance observability, benchmarking, tuning, static performance tuning, and tracing tools. This is an expanded and more complete version of an earlier talk of mine, and I was able to include some live demos of tools and methodology. It should be useful for everyone working on Linux systems. (youtube, slides) (100 mins).
At SCALE13x (2015), my Linux Profiling at Netflix talk on perf_events CPU profiling and features (blog, youtube, slides) (59 mins).
For USENIX LISA14, I gave a talk about my Linux perf-tools collection, which is based on ftrace and perf_events: Linux Performance Analysis: New Tools and Old Secrets (youtube, slides, USENIX) (43 mins).
My AWS re:Invent 2014 talk Performance Tuning EC2 Instances: selection, Linux tuning, observability (youtube, slides) (45 mins).
I was interviewed by BSD Now about BSD and benchmarking in Episode 065: 8,000,000 Mogofoo-ops (youtube) (2014) (28 mins).
My FreeBSD dev summit 2014 talk on Flame Graphs on FreeBSD. Since this talk wasn't videoed, I captured it using screenflow from my laptop. It's better than nothing, and shows my live demos well. (post, youtube) (53 mins).
My MeetBSDCA 2014 talk Performance Analysis, summarizing 5 facets of perf analysis on BSD (youtube, slides) (53 mins).
My popular talk on Linux Performance Tools, which quickly summarizes performance observability, benchmarking, tuning, static tuning, and tracing tools. This was for LinuxCon Europe 2014 (youtube, slides) (49 mins).
For the Tracing Summit 2014, my talk From DTrace to Linux summarized lessons Linux tracing can learn (youtube, slides) (61 mins).
My Surge 2014 talk From Clouds to Roots, showing how Netflix does perf analysis and the tools involved (youtube, slides) (56 mins).
My SCaLE12x (2014) keynote on What Linux can learn from Solaris performance and vice-versa (youtube, slides) (60 mins).
Deirdré Straughan has a youtube playlist named Brendan Gregg's Best, which has many of my talks (many of which she filmed).
My plenary session at USENIX/LISA 2013: Blazing Performance with Flame Graphs (youtube, slides, usenix) (90 mins).
A talk for BayLISA October 2013 to describe and launch the Systems Performance book (60 mins).
A lightning talk for Surge 2013 on Benchmarking Gone Wrong, which includes the craziest line graph I've ever seen (~5 mins).
The New Systems Performance, a meetup talk I gave in 2013 about modern systems performance (23 mins).
My OSCON 2013 talk on Open Source Systems Performance, a tale of three parts (youtube, slides) (32 mins).
My Stop the Guessing: Performance Methodologies for Production Systems talk at Velocity 2013 (youtube, slides) (46 mins).
At SCaLE11x (2013) I gave a talk on Linux Performance Analysis and Tools, summarizing basic to advanced analysis tools, and including some methodologies (youtube, slides, blog) (60 mins).
My LISA 2012 talk on Performance Analysis Methodology named and summarized 10 methods (youtube, usenix, slides, blog) (86 mins).
ZFS: Performance Analysis and Tools at zfsday was probably my best talk of 2012 (youtube, slides, blog) (43 mins).
At illumosday I gave a talk on DTracing the Cloud, showing what can be done (youtube, slides) (44 mins).
At FISL'13 (2012) I gave a talk on The USE Method for systems performance analysis, including some other methods for comparison (youtube, slides, blog) (56 mins).
At dtrace.conf(12) I gave an unconference-style talk on various Visualizations (youtube, blog) (35 mins).
An online talk on Benchmarking the Cloud with demos on Solaris/illumos (vimeo) (58 mins).
My SCaLE10x talk (2012) on Performance Analysis: new tools and concepts from the cloud, with examples (youtube, slides, PDF) (1 hr).
Short talks on performance tools for Solaris-based operating systems: vmstat, mpstat, and load averages, filmed during 2011.
A talk for BayLISA in 2011 on Dynamic Tracing and DTrace (vimeo) (1.5 hrs).
My extended Percona Live New York 2011 talk: Breaking Down MySQL/Percona Query Latency With DTrace (youtube, blog) (90 mins).
My LISA 2010 talk on Visualizations for Performance, which explains the need for heat maps (youtube, usenix, slides, blog) (80 mins).
At Oracle Open World 2010, I gave a talk on How to Build Better Applications with DTrace, which is on youtube: part 1, part 2. (64 mins).
I've given many technical talks of things going right. This was about things going wrong: Little Shop of Performance Horrors at FROSUG in Colorado, 2009 (youtube, blog) (2.5 hours).
Shouting in the Datacenter (youtube, blog) was a video that Bryan Cantrill and I made on the spur of the moment on New Year's Eve 2008, which went viral (1M+ views). I've had many emails about it: It has spawned an industry of sound proofing data centers (2 mins). There is also a making of (youtube) video (5 mins).

Software

The following are my spare time software projects, and are open source with no warranty – use at your own risk. Some are computer security tools, which may be illegal to own or run in your country if they are misidentified as cracking tools.

I've also developed software as a professional kernel engineer, which isn't listed below (e.g., the ZFS L2ARC).

Linux - tracing

eBPF Tools using Linux eBPF and the bcc front end for advanced observability and tracing tools.
bcc tools (github), BPF compiler collection, for which I'm a major contributor, especially for performance tools.
bpftrace tools (github) a high-level BPF tracing language, for which I'm a major contributor.
perf Examples for perf_events, the standard Linux profiler. Page including one-liners and flame graphs.
perf-tools (github) is a collection of ftrace- and perf_events-based performance analysis tools for Linux.
ktap Examples for the lua-based Linux dynamic tracing tool, including one liners and tools (no longer maintained).
msr-cloud-tools model specific register observability tools intended for cloud instances.

FreeBSD/OS X/Solaris - DTrace

DTrace Tools for FreeBSD.
DTrace book scripts from the DTrace book, which demonstrates many new uses of dynamic tracing.
DTraceToolkit a collection of over 200 scripts, with man pages and example files (no longer maintained).
DTrace Tools original versions of iosnoop, opensnoop, bitesize.d, execsnoop, shellsnoop, tcpsnoop, iotop, ...

Unix/Linux - C

Dump2PNG visualizes file data as a PNG (uses libpng). An experimental tool intended for core dump analysis. screenshot.
nicstat network interface stats for Solaris (uses Kstat). example. There is also a Perl version, and Tim Cook added Linux support.

Unix/Linux/Windows - Perl

FlameGraph: a visualization for sampled stack traces, used for performance analysis. See the Flame Graphs page for an explanation.
HeatMap: a program for generating interactive SVG heat maps from trace data. See the page about it.
Chaosreader: A forensics and network troubleshooting tool that extracts and reassembles application data from sniffed TCP/UDP sessions in tcpdump or snoop logs. Supports HTTP transfers, FTP transfers, SMTP emails, telnet sessions, etc. This example output was created by Chaosreader to link to the extracted HTTP sections, telnet sessions, and FTP files found in a snoop log. This can also create telnet replay programs that play back sessions in realtime: example. Created in 2003. download code (github).
Perl modules: Net::SnoopLog for snoop packet logs (RFC1761), Net::TcpDumpLog for tcpdump/libpcap logs, Algorithm::Hamming::Perl.
FreqCount is a simple frequency counter. Useful for processing logs (most common IP addr, port, etc.). example.
PortPing is a version of ping that connects using ssh (or other ports), not ICMP. Good for checking firewalls. example.
MTUfinder tests different sized HTTP requests to a web server, highlighting MTU size problems. example.
Specials is a collection of "special" programs for system administrators. Mostly Perl.

Unix/Linux - Bourne/Korn Shell

DtkshDemos a collection of X11 dtksh scripts. They include xvmstat - a GUI version of vmstat, and xplot - a generic data plotter. Written for any OS with dtksh.
total is a simple awk script to sum a field (example); field prints a field (example). These exist for convenience at the shell.

Windows - Delphi

Quick Text Toaster v1.0 An editor I wrote many years ago to grab text from corrupted files. Works with executables, documents, etc.

MSDOS - QBASIC

QBASIC CRO v1.2 I still find this old program amusing. It is a digital (on/off) CRO that samples the parallel port at 1KHz. screenshot.

Other

Guessing Game is written in awk, C, C++, csh, Fortran, java, ksh, Pascal, Perl, QBASIC, sh, and more as a language comparison.
The Crypt has some of my older Solaris and Unix software, including the K9Toolkit collection of kstat-based performance tools, Psio for disk I/O by-process, and CacheKit for hardware and software cache analysis.

Misc

Recommended Reading: A list of my favourite technology books.
Other Sites: Other interesting places on the web.
Photos: Some photos I've taken.
Games: My favourite computer games.
Overview: A summary of my popular content.