
SCALE13x (2015): Linux Profiling at Netflix

Talk for Southern California Linux Expo 13x (2015) by Brendan Gregg, Netflix.

Video: https://www.youtube.com/watch?v=_Ik8oiQvWgo

Description: "Profiling can show what your Linux kernel and appliacations are doing in detail, across all software stack layers. This talk shows how we are using Linux perf_events (aka "perf") and flame graphs at Netflix to understand CPU usage in detail, to optimize our cloud usage, solve performance issues, and identify regressions. This will be more than just an intro: profiling difficult targets, including Java and Node.js, will be covered, which includes ways to resolve JITed symbols and broken stacks. Included are the easy examples, the hard, and the cutting edge."


PDF: SCALE2015_Linux_perf_profiling.pdf

Keywords (from pdftotext):

slide 1:
    Feb 2015
      
    Linux Profiling at Netflix
    using perf_events (aka "perf")
    Brendan Gregg
    Senior Performance Architect
    Performance Engineering Team
    bgregg@netflix.com @brendangregg
    
slide 2:
    This Talk
    • This talk is about Linux profiling using perf_events
    – How to get CPU profiling to work, and overcome gotchas
    – A tour of perf_events and its features
    • This is based on our use of perf_events at Netflix
    
slide 3:
    • Massive Amazon EC2 Linux cloud
    – Tens of thousands of instances
    – Autoscale by ~3k each day
    – CentOS and Ubuntu, Java and Node.js
    • FreeBSD for content delivery
    – Approx 33% of the US Internet traffic at night
    • Performance is critical
    – Customer satisfaction: >50M subscribers
    – $$$ price/performance
    – Develop tools for cloud-wide analysis, and
    make them open source: NetflixOSS
    – Use server tools as needed
    
slide 4:
    Agenda
    1. Why We Need Linux Profiling
    2. Crash Course
    3. CPU Profiling
    4. Gotchas
       – Stacks (gcc, Java)
       – Symbols (Node.js, Java)
       – Guest PMCs
       – PEBS
       – Overheads
    5. Tracing
    
slide 5:
    1. Why We Need Linux Profiling
    
slide 6:
    Why We Need Linux Profiling
    • Our primary motivation is simple:
    Understand CPU usage quickly and completely
    
slide 7:
    Netflix Vector
    Quickly:
    1. Observe high CPU usage
    2. Generate a perf_events-based flame graph
    
slide 8:
    Flame Graph
    Completely:
    Java	
      
    JVM	
      
    (C++)	
      
    Kernel	
      
    (C)	
      
    
slide 9:
    Value for Netflix
    • Uses for CPU profiling:
    – Help with incident response
    – Non-regression testing
    – Software evaluations
    – Identify performance tuning targets
    – Part of CPU workload characterization
    • Built into Netflix Vector
    – A near real-time instance analysis tool (will be NetflixOSS)
    
slide 10:
    Workload Characterization
    • For CPUs:
    Who
    Why
    What
    How
    
slide 11:
    Workload Characterization
    • For CPUs:
    Who: which PIDs, programs, users
    Why: code paths, context
    What: CPU instructions, cycles
    How: changing over time
    • Can you currently answer them? How?
    
slide 12:
    CPU Tools
    Who
    Why
    How
    What
    
slide 13:
    CPU Tools
    Who:  top, htop
    Why:  perf record -g, flame graphs
    What: perf stat -a -d
    How:  monitoring
    
slide 14:
    Most companies & monitoring products today
    Who:  top, htop
    Why:  perf record -g, flame graphs
    What: perf stat -a -d
    How:  monitoring
    
slide 15:
    Re-setting Expectations
    • That was pretty good… 20 years ago.
    • Today you should easily understand why CPUs are used:
    – A profile of all CPU consumers and code paths
    – Visualized effectively
    – This should be easy to do
    • Best done using:
    – A perf_events CPU profile of stack traces
    – Visualized as a flame graph
    – This will usually mean some sysadmin/devops work, to get
    perf_events working, and to automate profiling
    
slide 16:
    Recent Example
    1. Poor performance, and 1
    CPU at 100%
    2. perf_events flame graph
    shows JVM stuck
    compiling
    
slide 17:
    2. Crash Course
    
slide 18:
    perf_events
    The main Linux profiler, used via the "perf" command
    Add from linux-tools-common, etc.
    Source code & docs in Linux: tools/perf
    Supports many profiling/tracing features:
    – CPU Performance Monitoring Counters (PMCs)
    – Statically defined tracepoints
    – User and kernel dynamic tracing
    – Kernel line and local variable tracing
    – Efficient in-kernel counts and filters
    – Stack tracing, libunwind
    – Code annotation
    • Some bugs in the past; has been stable for us
    [image: perf_events ponycorn]
      
    
slide 19:
    A Multitool of Subcommands
    # perf
    usage: perf [--version] [--help] COMMAND [ARGS]

    The most commonly used perf commands are:
      annotate        Read perf.data (created by perf record) and display annotated code
      archive         Create archive with object files with build-ids found in perf.data
      bench           General framework for benchmark suites
      buildid-cache   Manage build-id cache.
      buildid-list    List the buildids in a perf.data file
      diff            Read two perf.data files and display the differential profile
      evlist          List the event names in a perf.data file
      inject          Filter to augment the events stream with additional information
      kmem            Tool to trace/measure kernel memory(slab) properties
      kvm             Tool to trace/measure kvm guest os
      list            List all symbolic event types
      lock            Analyze lock events
      probe           Define new dynamic tracepoints
      record          Run a command and record its profile into perf.data
      report          Read perf.data (created by perf record) and display the profile
      sched           Tool to trace/measure scheduler properties (latencies)
      script          Read perf.data (created by perf record) and display trace output
      stat            Run a command and gather performance counter statistics
      test            Runs sanity tests.
      timechart       Tool to visualize total system behavior during a workload
      top             System profiling tool.

    See 'perf help COMMAND' for more information on a specific command.
    
slide 20:
    perf Command Format
    • perf instruments using stat or record. This has three
    main parts: action, event, scope.
    • e.g., profiling on-CPU stack traces:

    perf record -F 99 -a -g -- sleep 10

      Action: record stack traces
      Event:  99 Hertz
      Scope:  all CPUs
      Note:   sleep 10 is a dummy command to set the duration
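    As an extra worked illustration of the action/event/scope split (my own example,
    not from the slide): counting context switches for one PID for 10 seconds uses
    the stat action, a software event, and a PID scope:
    # action: count (perf stat); event: context-switches; scope: -p PID
    perf stat -e context-switches -p PID sleep 10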
      
    
slide 21:
    perf Actions
    • Count events (perf stat …)
    – Uses an efficient in-kernel counter, and prints the results!
    • Sample events (perf record …)
    – Records details of every event to a dump file (perf.data)
    • Timestamp, CPU, PID, instruction pointer, …
    – This incurs higher overhead, relative to the rate of events!
    – Include the call graph (stack trace) using -g!
    • Other actions include:
    – List events (perf list)
    – Report from a perf.data file (perf report)
    – Dump a perf.data file as text (perf script)
    – top style profiling (perf top)
    
slide 22:
    perf Actions: Workflow
    Typical workflow:
      list events     ->  perf list
      count events    ->  perf stat
      capture stacks  ->  perf record  ->  perf.data
      perf.data  ->  text UI       ->  perf report
      perf.data  ->  dump profile  ->  perf script
      perf script  ->  flame graph visualization  ->  stackcollapse-perf.pl | flamegraph.pl
    
slide 23:
    perf Events
    • Custom timers
    – e.g., 99 Hertz (samples per second)
    • Hardware events
    – CPU Performance Monitoring Counters (PMCs)
    • Tracepoints
    – Statically defined in software
    • Dynamic tracing
    – Created using uprobes (user) or kprobes (kernel)
    – Can do kernel line tracing with local variables (needs kernel
    debuginfo)
    
slide 24:
    perf Events: Map
    
slide 25:
    perf Events: List
    # perf list
    List of pre-defined events (to be used in -e):
      cpu-cycles OR cycles                               [Hardware event]
      instructions                                       [Hardware event]
      cache-references                                   [Hardware event]
      cache-misses                                       [Hardware event]
      branch-instructions OR branches                    [Hardware event]
      branch-misses                                      [Hardware event]
      bus-cycles                                         [Hardware event]
      stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
      stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
      […]
      cpu-clock                                          [Software event]
      task-clock                                         [Software event]
      page-faults OR faults                              [Software event]
      context-switches OR cs                             [Software event]
      cpu-migrations OR migrations                       [Software event]
      […]
      L1-dcache-loads                                    [Hardware cache event]
      L1-dcache-load-misses                              [Hardware cache event]
      L1-dcache-stores                                   [Hardware cache event]
      […]
      skb:kfree_skb                                      [Tracepoint event]
      skb:consume_skb                                    [Tracepoint event]
      skb:skb_copy_datagram_iovec                        [Tracepoint event]
      net:net_dev_xmit                                   [Tracepoint event]
      net:net_dev_queue                                  [Tracepoint event]
      net:netif_receive_skb                              [Tracepoint event]
      net:netif_rx                                       [Tracepoint event]
      […]
    
slide 26:
    perf Scope
    System-wide: all CPUs (-a)
    Target PID (-p PID)
    Target command (…)
    Specific CPUs (-C …)
    User-level only (<event>:u)
    Kernel-level only (<event>:k)
    A custom filter to match variables (--filter …)
    The following one-liner tour includes some complex action,
    event, and scope combinations.
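    As an illustrative combination of these scope options (not from the slide):
    sampling user-level-only CPU stacks for one PID, at 99 Hertz, for 10 seconds:
    perf record -F 99 -e cycles:u -p PID -g -- sleep 10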
    
slide 27:
    One-Liners: Listing Events
    # Listing all currently known events:
    perf list
    # Searching for "sched" tracepoints:
    perf list | grep sched
    # Listing sched tracepoints:
    perf list 'sched:*'
    
slide 28:
    One-Liners: Counting Events
    # CPU counter statistics for the specified command:
    perf stat command
    # Detailed CPU counter statistics (includes extras) for the specified command:
    perf stat -d command
    # CPU counter statistics for the specified PID, until Ctrl-C:
    perf stat -p PID
    # CPU counter statistics for the entire system, for 5 seconds:
    perf stat -a sleep 5
    # Various CPU last level cache statistics for the specified command:
    perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command
    # Count system calls for the specified PID, until Ctrl-C:
    perf stat -e 'syscalls:sys_enter_*' -p PID
    # Count scheduler events for the specified PID, for 10 seconds:
    perf stat -e 'sched:*' -p PID sleep 10
    # Count block device I/O events for the entire system, for 10 seconds:
    perf stat -e 'block:*' -a sleep 10
    # Show system calls by process, refreshing every 2 seconds:
    perf top -e raw_syscalls:sys_enter -ns comm
    
slide 29:
    One-Liners: Profiling Events
    # Sample on-CPU functions for the specified command, at 99 Hertz:
    perf record -F 99 command
    # Sample on-CPU functions for the specified PID, at 99 Hertz, until Ctrl-C:
    perf record -F 99 -p PID
    # Sample CPU stack traces for the specified PID, at 99 Hertz, for 10 seconds:
    perf record -F 99 -p PID -g -- sleep 10
    # Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds:
    perf record -F 99 -ag -- sleep 10
    # Sample CPU stack traces, once every 10,000 Level 1 data cache misses, for 5 s:
    perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5
    # Sample CPU stack traces, once every 100 last level cache misses, for 5 seconds:
    perf record -e LLC-load-misses -c 100 -ag -- sleep 5
    # Sample on-CPU kernel instructions, for 5 seconds:
    perf record -e cycles:k -a -- sleep 5
    # Sample on-CPU user instructions, for 5 seconds:
    perf record -e cycles:u -a -- sleep 5
    
slide 30:
    One-Liners: Reporting
    # Show perf.data in an ncurses browser (TUI) if possible:
    perf report
    # Show perf.data with a column for sample count:
    perf report -n
    # Show perf.data as a text report, with data coalesced and percentages:
    perf report --stdio
    # List all raw events from perf.data:
    perf script
    # List all raw events from perf.data, with customized fields:
    perf script -f comm,tid,pid,time,cpu,event,ip,sym,dso
    # Dump raw contents from perf.data as hex (for debugging):
    perf script -D
    # Disassemble and annotate instructions with percentages (needs some debuginfo):
    perf annotate --stdio
    
slide 31:
    And More
    • perf can also probe and record dynamic tracepoints, and
    custom CPU PMCs
    • This can get a little advanced
    • I'll start with CPU profiling, then gotchas
    
slide 32:
    3. CPU Profiling
    
slide 33:
    CPU Profiling
    • Record stacks at a timed interval: simple and effective
    – Pros: Low (deterministic) overhead
    – Cons: Coarse accuracy, but usually sufficient
    [diagram: stack samples taken at points in time along a thread's timeline,
    which spans syscall, on-CPU, off-CPU, block, and interrupt states]
      
    
slide 34:
    perf Screenshot
    • Sampling full stack traces at 99 Hertz:
    # perf record -F 99 -ag -- sleep 30
    [ perf record: Woken up 9 times to write data ]
    [ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]
    # perf report -n --stdio
    1.40%  java  [kernel.kallsyms]  [k] _raw_spin_lock
           --- _raw_spin_lock
              |--63.21%-- try_to_wake_up
              |--63.91%-- default_wake_function
              |--56.11%-- __wake_up_common
              |           __wake_up_locked
              |           ep_poll_callback
              |           __wake_up_common
              |           __wake_up_sync_key
              |--59.19%-- sock_def_readable
    […78,000 lines truncated…]
    
slide 35:
    perf Reporting
    • perf report summarizes by combining common paths
    • Previous output truncated 78,000 lines of summary
    • The following is what a mere 8,000 lines looks like…
    
slide 36:
    perf report
    
slide 37:
    … as a Flame Graph
    
slide 38:
    Flame Graphs
    git clone --depth 1 https://github.com/brendangregg/FlameGraph
    cd FlameGraph
    perf record -F 99 -a -g -- sleep 30
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
    • Flame Graphs:
    – x-axis: alphabetical stack sort, to maximize merging
    – y-axis: stack depth
    – color: random, or hue can be a dimension
    • e.g., software type, or difference between two profiles for
    non-regression testing ("differential flame graphs")
    – interpretation: top edge is on-CPU, beneath it is ancestry
    • Just a Perl program to convert perf stacks into SVG
    – Includes JavaScript: open in a browser for interactivity
    • Easy to get working
    http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
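    For the differential flame graphs mentioned above, a hedged sketch using the
    difffolded.pl script from the same FlameGraph repository (file names here are
    placeholders):
    perf script | ./stackcollapse-perf.pl > out.folded1    # baseline profile
    perf script | ./stackcollapse-perf.pl > out.folded2    # comparison profile
    ./difffolded.pl out.folded1 out.folded2 | ./flamegraph.pl > diff.svg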
    
slide 39:
    4. Gotchas
    
slide 40:
    When you try to use perf
    Stacks don't work (missing)
    Symbols don't work (hex numbers)
    Can't profile Java
    Can't profile Node.js/io.js
    PMCs aren't available
    Dynamic tracing function arguments don't work
    perf locks up
    
slide 41:
    How to really get started
    1. Get "perf" to work
    2. Get stack walking to work
    3. Fix symbol translation
    4. Get IPC to work
    5. Test perf under load
    This is my actual checklist.
    
slide 42:
    How to really get started
    1. Get "perf" to work
    2. Get stack walking to work
    3. Fix symbol translation
    4. Get IPC to work
    5. Test perf under load
    This is my actual checklist.
    Install the linux-tools-common and linux-tools-`uname -r` packages,
    or compile in the Linux source: tools/perf
    The "gotchas"…
      
    
slide 43:
    Gotcha #1 Broken Stacks
    perf record -F 99 -a -g -- sleep 30
    perf report -n --stdio
    Start by testing stacks:
    1. Take a CPU profile
    2. Run perf report
    3. If stacks are often 
slide 44:
    Identifying Broken Stacks
    broken:
    28.10%     sed  libc-2.19.so       [.] re_search_internal
               --- re_search_internal
                  |--12.25%-- 0x3
                  |           0x100007

    not broken:
    12.06%     sed  sed                [.] re_search_internal
               --- re_search_internal
                  |--96.78%-- re_search_stub
                  |           rpl_re_search
                  |           match_regex
                  |           do_subst
                  |           execute_program
                  |           process_files
                  |           main
                  |           __libc_start_main
                   --3.22%-- rpl_re_search
                             match_regex
                             do_subst
                             execute_program
                             process_files
                             main
                             __libc_start_main
      
    
slide 45:
    Identifying Broken Stacks
    broken:
    78.50%     sed  libc-2.19.so       [.] 0x00000000000dd7d4
               |--3.65%-- 0x7f2516d5d10d
               |--2.19%-- 0x7f2516d0332f
               |--1.22%-- 0x7f2516cffbd2
               |--1.22%-- 0x7f2516d5d5ad

    probably not broken (missing symbols, but that's another problem):
    71.79%     sed  sed                [.] 0x000000000001afc1
               |--11.65%-- 0x40a447
               |           0x40659a
               |           0x408dd8
               |           0x408ed1
               |           0x402689
               |           0x7fa1cd08aec5
               |--1.33%-- 0x40a4a1
               |--60.01%-- 0x40659a
               |           0x408dd8
               |           0x408ed1
               |           0x402689
               |           0x7fa1cd08aec5
      
    
slide 46:
    Broken Stacks Flame Graph
    [flame graph: broken Java stacks (missing frame pointer);
    Java == green, system == red, C++ == yellow]
      
    
slide 47:
    Fixing Broken Stacks
    Either:
    A. Fix frame pointer-based stack walking (the default)
    – Pros: simple, supports any system stack walker
    – Cons: might cost a little extra CPU to make available
    B. Use libunwind and DWARF: perf record -g dwarf!
    – Pros: more debug info
    – Cons: not on older kernels, and inflates instance size
    C. Use a custom walker (probably needs kernel support)
    – Pros: full stack walking (e.g., unwind inlined frames) & args
    – Cons: custom kernel code, can cost more CPU when in use
    Our current preference is (A)
    – So how do we fix the frame pointer…
    
slide 48:
    gcc -fno-omit-frame-pointer
    • Once upon a time, x86 had fewer registers, and the frame
    pointer register was reused for general purpose to improve
    performance. This breaks system stack walking.
    • gcc provides -fno-omit-frame-pointer to fix this
    – Please make this the default in gcc!
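    A hedged example of compiling an application with frame pointers preserved
    (flags and program name below are illustrative):
    gcc -O2 -fno-omit-frame-pointer -g -o myapp myapp.c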
    
slide 49:
    JDK-8068945
    • Java's compilers also reuse the frame pointer, and
    unfortunately there's no -XX:no-omit-frame-pointer (yet)
    • I hacked hotspot to fix the frame pointer, and published the
    patch as a prototype suggestion (JDK-8068945)
    --- openjdk8clean/hotspot/src/cpu/x86/vm/x86_64.ad 2014-03-04 …!
    +++ openjdk8/hotspot/src/cpu/x86/vm/x86_64.ad 2014-11-08 …!
    @@ -166,10 +166,9 @@!
    // 3) reg_class stack_slots( /* one chunk of stack-based "registers" */ )!
    //!
    -// Class for all pointer registers (including RSP)!
    +// Class for all pointer registers (including RSP, excluding RBP)!
    reg_class any_reg(RAX, RAX_H,!
    RDX, RDX_H,!
    RBP, RBP_H,!
    RDI, RDI_H,!
    RSI, RSI_H,!
    RCX, RCX_H,!
    [...]!
    Remove RBP from register pools
      
    
slide 50:
    JDK-8068945
    --- openjdk8clean/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-03-04…!
    +++ openjdk8/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-11-07 …!
    @@ -5236,6 +5236,7 @@!
    // We always push rbp, so that on return to interpreter rbp, will be!
    // restored correctly and we can correct the stack.!
    push(rbp);!
    mov(rbp, rsp);!
    // Remove word for ebp!
    framesize -= wordSize;!
    --- openjdk8clean/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …!
    +++ openjdk8/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …!
    [...]!
    Fix x86-64 function prologues
      
    • We've been using our patched OpenJDK for profiling
    • To do: make this an option (-XX:MoreFramePointer), and (at
    some point) fix for invokedynamic
    – See "A hotspot patch for stack profiling (frame pointer)" on the
    hotspot compiler dev mailing list, being discussed now
    
slide 51:
    Broken Java Stacks
    # perf script!
    […]!
    java 4579 cpu-clock: !
    ffffffff8172adff tracesys ([kernel.kallsyms])!
    7f4183bad7ce pthread_cond_timedwait@@GLIBC_2…!
    java 4579 cpu-clock: !
    7f417908c10b [unknown] (/tmp/perf-4458.map)!
    java 4579 cpu-clock: !
    7f4179101c97 [unknown] (/tmp/perf-4458.map)!
    java 4579 cpu-clock: !
    7f41792fc65f [unknown] (/tmp/perf-4458.map)!
    a2d53351ff7da603 [unknown] ([unknown])!
    java 4579 cpu-clock: !
    7f4179349aec [unknown] (/tmp/perf-4458.map)!
    java 4579 cpu-clock: !
    7f4179101d0f [unknown] (/tmp/perf-4458.map)!
    java 4579 cpu-clock: !
    7f417908c194 [unknown] (/tmp/perf-4458.map)!
    […]!
    • Check with "perf script"
    to see stack samples
    • These are 1 or 2 levels
    deep (junk values)
    
slide 52:
    Fixed Java Stacks
    # perf script!
    […]!
    java 8131 cpu-clock: !
    7fff76f2dce1 [unknown] ([vdso])!
    7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…!
    7fd301861e46 [unknown] (/tmp/perf-8131.map)!
    7fd30184def8 [unknown] (/tmp/perf-8131.map)!
    7fd30174f544 [unknown] (/tmp/perf-8131.map)!
    7fd30175d3a8 [unknown] (/tmp/perf-8131.map)!
    7fd30166d51c [unknown] (/tmp/perf-8131.map)!
    7fd301750f34 [unknown] (/tmp/perf-8131.map)!
    7fd3016c2280 [unknown] (/tmp/perf-8131.map)!
    7fd301b02ec0 [unknown] (/tmp/perf-8131.map)!
    7fd3016f9888 [unknown] (/tmp/perf-8131.map)!
    7fd3016ece04 [unknown] (/tmp/perf-8131.map)!
    7fd30177783c [unknown] (/tmp/perf-8131.map)!
    7fd301600aa8 [unknown] (/tmp/perf-8131.map)!
    7fd301a4484c [unknown] (/tmp/perf-8131.map)!
    7fd3010072e0 [unknown] (/tmp/perf-8131.map)!
    7fd301007325 [unknown] (/tmp/perf-8131.map)!
    7fd301007325 [unknown] (/tmp/perf-8131.map)!
    7fd3010004e7 [unknown] (/tmp/perf-8131.map)!
    7fd3171df76a JavaCalls::call_helper(JavaValue*,…!
    7fd3171dce44 JavaCalls::call_virtual(JavaValue*…!
    7fd3171dd43a JavaCalls::call_virtual(JavaValue*…!
    7fd31721b6ce thread_entry(JavaThread*, Thread*)…!
    7fd3175389e0 JavaThread::thread_main_inner() (/…!
    7fd317538cb2 JavaThread::run() (/usr/lib/jvm/nf…!
    7fd3173f6f52 java_start(Thread*) (/usr/lib/jvm/…!
    7fd317a7e182 start_thread (/lib/x86_64-linux-gn…!
    • With JDK-8068945
    stacks are full, and go
    all the way to
    start_thread()
    • This is what the CPUs
    are really running:
    inlined frames are not
    present
    
slide 53:
    Fixed Stacks Flame Graph
    [flame graph: Java (no symbols)]
      
    
slide 54:
    Gotcha #2 Missing Symbols
    • Missing symbols should be obvious in perf report/script:
    broken:
    71.79%     sed  sed                [.] 0x000000000001afc1
               |--11.65%-- 0x40a447
               |           0x40659a
               |           0x408dd8
               |           0x408ed1
               |           0x402689
               |           0x7fa1cd08aec5

    not broken:
    12.06%     sed  sed                [.] re_search_internal
               --- re_search_internal
                  |--96.78%-- re_search_stub
                  |           rpl_re_search
                  |           match_regex
                  |           do_subst
                  |           execute_program
                  |           process_files
                  |           main
                  |           __libc_start_main
    
slide 55:
    Fixing Symbols
    • For installed packages:
    A. Add a -dbgsym package, if available
    B. Recompile from source
    • For JIT (Java, Node.js, …):
    A. Create a /tmp/perf-PID.map file. perf already looks for this
    B. Or use one of the new symbol loggers (see lkml)
    # perf script
    Failed to open /tmp/perf-8131.map, continuing without symbols
    […]
    java 8131 cpu-clock:
          7fff76f2dce1 [unknown] ([vdso])
          7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
          7fd301861e46 [unknown] (/tmp/perf-8131.map)
    […]
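    For the -dbgsym option above, a hedged Ubuntu example (libc6-dbg is in the
    standard archive; most other packages need -dbgsym packages from the ddebs
    repository):
    sudo apt-get install libc6-dbg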
    
slide 56:
    perf JIT symbols: Java, Node.js
    • Using the /tmp map file for symbol translation:
    – Pros: simple, can be low overhead (snapshot on demand)
    – Cons: stale symbols
    – Map format is "START SIZE symbolname"
    – Another gotcha: If perf is run as root, chown root the map file
    • Java
    – https://github.com/jrudolph/perf-map-agent
    – Agent attaches and writes the map file on demand (previous
    versions attached on Java start, and wrote continually)
    • Node.js
    – node --perf_basic_prof writes the map file continually
    – Available from 0.11.13+
    – Currently has a file growth issue; see my patch in
    https://code.google.com/p/v8/issues/detail?id=3453
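    A hedged illustration of that "START SIZE symbolname" map format (the PID,
    addresses, sizes, and symbol names below are made up):
    # cat /tmp/perf-12345.map
    7f6bf0841000 2c8 java.lang.String::hashCode
    7f6bf0842000 1a0 MyClass::myMethod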
    
slide 57:
    Java: Stacks & Symbols
    flamegraph.pl --color=java
    [flame graph: Kernel, JVM, and Java frames shown in distinct hues]
      
    
slide 58:
    Java: Inlining
    • Disabling inlining:
    – -XX:-Inline
    – Many more Java frames
    – 80% slower (in this case)
    • Not really necessary
    – Inlined flame graphs often
    make enough sense
    – Or tune -XX:MaxInlineSize
    and -XX:InlineSmallCode a
    little to reveal more frames,
    without costing much perf
    – Can even go faster!
    [flame graph: no inlining]
      
    
slide 59:
    Node.js: Stacks & Symbols
    • Covered previously on the Netflix Tech Blog
    http://techblog.netflix.com/2014/11/nodejs-in-flames.html
    
slide 60:
    Gotcha #3 Guest PMCs
    • Using PMCs from a Xen guest (currently):
    # perf stat -a -d sleep 5

     Performance counter stats for 'system wide':

        10003.718595  task-clock (msec)        #  2.000 CPUs utilized    [100.00%]
                 323  context-switches         #  0.032 K/sec            [100.00%]
                  17  cpu-migrations           #  0.002 K/sec            [100.00%]
                 233  page-faults              #  0.023 K/sec
     <not supported>  cycles
     <not supported>  stalled-cycles-frontend
     <not supported>  stalled-cycles-backend
     <not supported>  instructions
     <not supported>  branches
     <not supported>  branch-misses
     <not supported>  L1-dcache-loads
     <not supported>  L1-dcache-load-misses
     <not supported>  LLC-loads
     <not supported>  LLC-load-misses

       5.001607197 seconds time elapsed
    
slide 61:
    Guest PMCs
    • Without PMCs, %CPU is ambiguous. We can't measure:
    – Instructions Per Cycle (IPC)
    – CPU cache hits/misses
    – MMU TLB hits/misses
    – Branch misprediction
    – Stall cycles
    • Should be fixable: hypervisors can expose PMCs
    – At the very least, enough PMCs for IPC to work:
    INST_RETIRED.ANY_P & CPU_CLK_UNHALTED.THREAD_P
    • In the meantime
    – I'm using a physical server for PMC analysis
    – Also some MSRs on the cloud
    
slide 62:
    MSRs
    • Model Specific Registers (MSRs) may be exposed when
    PMCs are not
    • Better than nothing. Can solve some issues.
    # ./showboost
    CPU MHz     : 2500
    Turbo MHz   : 2900 (10 active)
    Turbo Ratio : 116% (10 active)
    CPU 0 summary every 5 seconds...

    TIME       C0_MCYC    C0_ACYC    UTIL   RATIO   MHz
    17:28:03       ...        ...     33%    116%   2900
    17:28:08       ...        ...     35%    116%   2900
    17:28:13       ...        ...     36%    116%   2900
    17:28:18       ...        ...     37%    115%   2899
    17:28:23       ...        ...     38%    115%   2899
    [...]
    – showboost is from my msr-cloud-tools collection (on github)
    
slide 63:
    Gotcha #4 Instruction Profiling
    # perf annotate -i perf.data.noplooper --stdio
    Percent |    Source code & Disassembly of noplooper
    --------------------------------------------------------
    Disassembly of section .text:

    00000000004004ed <…>:
       0.00 :   4004ed:  push   %rbp
       0.00 :   4004ee:  mov    %rsp,%rbp
      20.86 :   4004f1:  nop
       0.00 :   4004f2:  nop
       0.00 :   4004f3:  nop
       0.00 :   4004f4:  nop
      19.84 :   4004f5:  nop
       0.00 :   4004f6:  nop
       0.00 :   4004f7:  nop
       0.00 :   4004f8:  nop
      18.73 :   4004f9:  nop
       0.00 :   4004fa:  nop
       0.00 :   4004fb:  nop
       0.00 :   4004fc:  nop
      19.08 :   4004fd:  nop
       0.00 :   4004fe:  nop
       0.00 :   4004ff:  nop
       0.00 :   400500:  nop
      21.49 :   400501:  jmp    4004f1 <…>

    16 NOPs in a loop.
    Let's profile instructions to see which are hot!
    (have I lost my mind?)
      
    
slide 64:
    Instruction Profiling
    • Even distribution (A)? Or something else?
    [diagrams of candidate distributions: (A), (B), (C), (D)]
      
    
slide 65:
    Instruction Profiling
    # perf annotate -i perf.data.noplooper --stdio
    Percent |    Source code & Disassembly of noplooper
    --------------------------------------------------------
    Disassembly of section .text:

    00000000004004ed <…>:
       0.00 :   4004ed:  push   %rbp
       0.00 :   4004ee:  mov    %rsp,%rbp
      20.86 :   4004f1:  nop
       0.00 :   4004f2:  nop
       0.00 :   4004f3:  nop
       0.00 :   4004f4:  nop
      19.84 :   4004f5:  nop
       0.00 :   4004f6:  nop
       0.00 :   4004f7:  nop
       0.00 :   4004f8:  nop
      18.73 :   4004f9:  nop
       0.00 :   4004fa:  nop
       0.00 :   4004fb:  nop
       0.00 :   4004fc:  nop
      19.08 :   4004fd:  nop
       0.00 :   4004fe:  nop
       0.00 :   4004ff:  nop
       0.00 :   400500:  nop
      21.49 :   400501:  jmp    4004f1 <…>

    Go home instruction pointer, you're drunk
    
slide 66:
    PEBS
    • I believe this is sample "skid", plus parallel and out-of-order execution of micro-ops: the sampled IP is the
    resumption instruction, not what is currently executing.
    • PEBS may help: Intel's Precise Event Based Sampling
    • perf_events has support:
    – perf record -e cycles:pp!
    – The 'p' can be specified multiple times:
    0 - SAMPLE_IP can have arbitrary skid
    1 - SAMPLE_IP must have constant skid
    2 - SAMPLE_IP requested to have 0 skid
    3 - SAMPLE_IP must have 0 skid
    – … from tools/perf/Documentation/perf-list.txt
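    A hedged example of requesting more precise samples with that modifier (the
    same style of profile as earlier, with :pp added to the event):
    perf record -F 99 -e cycles:pp -a -g -- sleep 10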
    
slide 67:
    Gotcha #5 Overhead
    • Overhead is relative to the rate of events instrumented
    • perf stat does in-kernel counts, with relatively low
    CPU overhead
    • perf record writes perf.data, which has slightly
    higher CPU overhead, plus file system and disk I/O
    • Test before use
    – In the lab
    – Run perf stat to understand rate, before perf record
    • Also consider --filter, to filter events in-kernel
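    An illustrative sketch of that check (the event here is just an example):
    count the event rate first, then decide whether recording is affordable:
    # low overhead: in-kernel count of the event rate over 10 seconds
    perf stat -e sched:sched_switch -a sleep 10
    # higher overhead: record each event (and stacks) to perf.data
    perf record -e sched:sched_switch -ag -- sleep 10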
    
slide 68:
    5. Tracing
    
slide 69:
    Profiling vs Tracing
    • Profiling takes samples. Tracing records every event.
    • There are many tracers for Linux (SystemTap, ktap, etc),
    but only two in mainline: perf_events and ftrace

    Linux Tracing Stack:
      one-liners:               many
      front-end tools:          perf, perf-tools
      tracing frameworks:       perf_events, ftrace, eBPF, …
      tracing instrumentation:  tracepoints, kprobes, uprobes
      
    
slide 70:
    Tracing Example
    # perf record -e block:block_rq_insert -a
    ^C[ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.172 MB perf.data (~7527 samples) ]
    # perf script
    […]
    java 9940 [015] 1199510.044783: block_rq_insert: 202,1 R 0 () 4783360 + 88 [java]
    java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783448 + 88 [java]
    java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783536 + 24 [java]
    java 9940 [000] 1199510.065195: block_rq_insert: 202,1 R 0 () 4864088 + 88 [java]
    […]
    
slide 71:
    Tracing Example
    # perf record -e block:block_rq_insert -a
    ^C[ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.172 MB perf.data (~7527 samples) ]
    # perf script
    […]
    java 9940 [015] 1199510.044783: block_rq_insert: 202,1 R 0 () 4783360 + 88 [java]
    java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783448 + 88 [java]
    java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783536 + 24 [java]
    java 9940 [000] 1199510.065195: block_rq_insert: 202,1 R 0 () 4864088 + 88 [java]
    […]

    Output fields: process PID [CPU] timestamp: eventname: format string
    e.g.: java 9940 [015] 1199510.044783: block_rq_insert: 202,1 R 0 () 4783360 + 88 [java]

    The format string comes from the kernel source (which may be the only docs),
    include/trace/events/block.h:

    DECLARE_EVENT_CLASS(block_rq,
    [...]
        TP_printk("%d,%d %s %u (%s) %llu + %u [%s]",
                  MAJOR(__entry->dev), MINOR(__entry->dev),
                  __entry->rwbs, __entry->bytes, __get_str(cmd),
                  (unsigned long long)__entry->sector,
                  __entry->nr_sector, __entry->comm)
    
slide 72:
    perf Block I/O Latency Heat Map
    • Trace data may only make sense when visualized
    • e.g., block I/O latency from perf_events trace data:
    [heat maps: SSD I/O (fast, with queueing); HDD I/O (random, modes)]
    http://www.brendangregg.com/blog/2014-07-01/perf-heat-maps.html
      
    
slide 73:
    One-Liners: Static Tracing
    # Trace new processes, until Ctrl-C:
    perf record -e sched:sched_process_exec -a
    # Trace all context-switches with stack traces, for 1 second:
    perf record -e context-switches -ag -- sleep 1
    # Trace CPU migrations, for 10 seconds:
    perf record -e migrations -a -- sleep 10
    # Trace all connect()s with stack traces (outbound connections), until Ctrl-C:
    perf record -e syscalls:sys_enter_connect -ag
    # Trace all block device (disk I/O) requests with stack traces, until Ctrl-C:
    perf record -e block:block_rq_insert -ag
    # Trace all block device issues and completions (has timestamps), until Ctrl-C:
    perf record -e block:block_rq_issue -e block:block_rq_complete -a
    # Trace all block completions, of size at least 100 Kbytes, until Ctrl-C:
    perf record -e block:block_rq_complete --filter 'nr_sector > 200'
    # Trace all block completions, synchronous writes only, until Ctrl-C:
    perf record -e block:block_rq_complete --filter 'rwbs == "WS"'
    # Trace all block completions, all types of writes, until Ctrl-C:
    perf record -e block:block_rq_complete --filter 'rwbs ~ "*W*"'
    # Trace all ext4 calls, and write to a non-ext4 location, until Ctrl-C:
    perf record -e 'ext4:*' -o /tmp/perf.data -a
    
slide 74:
    Tracepoint Variables
    • Some previous one-liners used variables with --filter
    • The ftrace interface has a way to print them:
    # cat /sys/kernel/debug/tracing/events/block/block_rq_insert/format
    name: block_rq_insert
    ID: 884
    format:
        field:unsigned short common_type;          offset:0;   size:2;   signed:0;
        field:unsigned char common_flags;          offset:2;   size:1;   signed:0;
        field:unsigned char common_preempt_count;  offset:3;   size:1;   signed:0;
        field:int common_pid;                      offset:4;   size:4;   signed:1;

        field:dev_t dev;                offset:8;   size:4;   signed:0;      <- variables
        field:sector_t sector;          offset:16;  size:8;   signed:0;
        field:unsigned int nr_sector;   offset:24;  size:4;   signed:0;
        field:unsigned int bytes;       offset:28;  size:4;   signed:0;
        field:char rwbs[8];             offset:32;  size:8;   signed:1;
        field:char comm[16];            offset:40;  size:16;  signed:1;
        field:__data_loc char[] cmd;    offset:56;  size:4;   signed:1;

    print fmt: "%d,%d %s %u (%s) %llu + %u [%s]", ((unsigned int) ((REC->dev) >> 20)),
    ((unsigned int) ((REC->dev) & ((1U << 20) - 1))), REC->rwbs, REC->bytes, __get_str(cmd),
    (unsigned long long)REC->sector, REC->nr_sector, REC->comm      (format string internals)
    
slide 75:
    One-Liners: Dynamic Tracing
    # Add a tracepoint for the kernel tcp_sendmsg() function entry (--add optional):
    perf probe --add tcp_sendmsg
    # Remove the tcp_sendmsg() tracepoint (or use --del):
    perf probe -d tcp_sendmsg
    # Add a tracepoint for the kernel tcp_sendmsg() function return:
    perf probe 'tcp_sendmsg%return'
    # Show avail vars for the tcp_sendmsg(), plus external vars (needs debuginfo):
    perf probe -V tcp_sendmsg --externs
    # Show available line probes for tcp_sendmsg() (needs debuginfo):
    perf probe -L tcp_sendmsg
    # Add a tracepoint for tcp_sendmsg() line 81 with local var seglen (debuginfo):
    perf probe 'tcp_sendmsg:81 seglen'
    # Add a tracepoint for do_sys_open() with the filename as a string (debuginfo):
    perf probe 'do_sys_open filename:string'
    # Add a tracepoint for myfunc() return, and include the retval as a string:
    perf probe 'myfunc%return +0($retval):string'
    # Add a tracepoint for the user-level malloc() function from libc:
    perf probe -x /lib64/libc.so.6 malloc
    # List currently available dynamic probes:
    perf probe -l
    
slide 76:
    One-Liners: Advanced Dynamic Tracing
    # Add a tracepoint for tcp_sendmsg(), with three entry regs (platform specific):
    perf probe 'tcp_sendmsg %ax %dx %cx'
    # Add a tracepoint for tcp_sendmsg(), with an alias ("bytes") for %cx register:
    perf probe 'tcp_sendmsg bytes=%cx'
    # Trace previously created probe when the bytes (alias) var is greater than 100:
    perf record -e probe:tcp_sendmsg --filter 'bytes > 100'
    # Add a tracepoint for tcp_sendmsg() return, and capture the return value:
    perf probe 'tcp_sendmsg%return $retval'
    # Add a tracepoint for tcp_sendmsg(), and "size" entry argument (debuginfo):
    perf probe 'tcp_sendmsg size'
    # Add a tracepoint for tcp_sendmsg(), with size and socket state (debuginfo):
    perf probe 'tcp_sendmsg size sk->__sk_common.skc_state'
    # Trace previous probe when size > 0, and state != TCP_ESTABLISHED(1) (debuginfo):
    perf record -e probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a
    Kernel debuginfo is an onerous requirement for the Netflix cloud
    We can use registers instead (as above). But which registers?
    
slide 77:
    The Rosetta Stone of Registers
    One server instance with kernel debuginfo, and -nv (dry run, verbose):
    # perf probe -nv 'tcp_sendmsg size sk->__sk_common.skc_state'
    […]
    Added new event:
    Writing event: p:probe/tcp_sendmsg tcp_sendmsg+0 size=%cx:u64 skc_state=+18(%si):u8
    probe:tcp_sendmsg (on tcp_sendmsg with size skc_state=sk->__sk_common.skc_state)
    You can now use it in all perf tools, such as:
        perf record -e probe:tcp_sendmsg -aR sleep 1

    All other instances (of the same kernel version): copy-n-paste!
    # perf probe 'tcp_sendmsg+0 size=%cx:u64 skc_state=+18(%si):u8'
    Failed to find path of kernel module.
    Added new event:
    probe:tcp_sendmsg (on tcp_sendmsg with size=%cx:u64 skc_state=+18(%si):u8)
    You can now use it in all perf tools, such as:
        perf record -e probe:tcp_sendmsg -aR sleep 1
    Masami Hiramatsu was investigating a way to better automate this
    
slide 78:
    perf_events Scripting
    • perf also has a scripting interface (Perl or Python)
    – These run perf and post-process trace data in user-space
    – Data path has some optimizations
    • Kernel buffering, and dynamic (optimal) number of buffer reads
    • But may still not be enough for high volume events
    • Andi Kleen has scripted perf for his PMC tools
    – https://github.com/andikleen/pmu-tools
    – Includes toplev for applying "Top Down" methodology
    • I've developed my own tools for perf & ftrace
    – https://github.com/brendangregg/perf-tools
    – Each tool has a man page and examples file
    – These are unsupported temporary hacks; their internals
    should be rewritten when kernel features arrive (e.g., eBPF)
    
slide 79:
    perf-tools: bitesize
    • Block (disk) I/O size distributions:
    # ./bitesize
    Tracing block I/O size (bytes), until Ctrl-C...
    ^C
            Kbytes          : I/O       Distribution
            -> 0.9          : 0
    1.0     -> 7.9          : 38
    8.0     -> 63.9         : 10108     |######################################|
    64.0    -> 127.9        : 13
    128.0   ->              : 1
    […]
    • This automates perf with a set of in-kernel filters and
    counts for each bucket, to reduce overhead
    • Will be much easier with eBPF
    
slide 80:
    eBPF
    • Extended BPF: programs on tracepoints
    – High-performance filtering: JIT
    – In-kernel summaries: maps
    • e.g., in-kernel latency heat map (showing bimodal):
    [heat map over time: low latency (cache hits) vs high latency (device I/O)]
      
    
slide 81:
    Linux Profiling Future
    • eBPF is integrating, and provides the final missing piece
    of tracing infrastructure: efficient kernel programming
    – perf_events + eBPF?
    – ftrace + eBPF?
    – Other tracers + eBPF?
    • At Netflix, the future is Vector, and more self-service
    automation of perf_events
    
slide 82:
    Summary & Your Action Items
    • Short term: get full CPU profiling to work
    A. Automate perf CPU profiles with flame graphs. See this talk!
    B. … or use Netflix Vector when it is open sourced
    C. … or ask performance monitoring vendors for this
    – Most importantly, you should expect that full CPU profiles are
    available at your company. The ROI is worth it.
    • Long term: PMCs & tracing
    – Use perf_events to profile
    other targets: CPU cycles,
    file system I/O, disk I/O,
    memory usage, …
    • Go forth and profile!
    The "real" checklist reminder:
    1. Get "perf" to work
    2. Get stack walking to work
    3. Fix symbol translation
    4. Get IPC to work
    5. Test perf under load
      
    
slide 83:
    Links & References
    perf_events
    • Kernel source: tools/perf/Documentation
    • https://perf.wiki.kernel.org/index.php/Main_Page
    • http://www.brendangregg.com/perf.html
    • http://web.eece.maine.edu/~vweaver/projects/perf_events/
    • Mailing list http://vger.kernel.org/vger-lists.html#linux-perf-users
    perf-tools: https://github.com/brendangregg/perf-tools
    PMU tools: https://github.com/andikleen/pmu-tools
    perf, ftrace, and more: http://www.brendangregg.com/linuxperf.html
    Java frame pointer patch
    http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2014-December/016477.html
    • https://bugs.openjdk.java.net/browse/JDK-8068945
    Node.js: http://techblog.netflix.com/2014/11/nodejs-in-flames.html
    Methodology: http://www.brendangregg.com/methodology.html
    Flame graphs: http://www.brendangregg.com/flamegraphs.html
    Heat maps: http://www.brendangregg.com/heatmaps.html
    eBPF: http://lwn.net/Articles/603983/
    
slide 84:
    Thanks
    Questions?
    http://slideshare.net/brendangregg
    http://www.brendangregg.com
    bgregg@netflix.com
    – Performance and Reliability Engineering
    • @brendangregg