Systems Performance 2nd Ed.



BPF Performance Tools book

Recent posts:
Blog index
About
RSS

JavaOne 2015: Java Mixed-Mode Flame Graphs

Video: https://www.youtube.com/watch?v=BHA65BqlqSk

Slides for JavaOne 2015 talk by Brendan Gregg, Netflix.

Description: "At Netflix we dreamed of one visualization to show all CPU consumers: Java methods, GC, JVM internals, system libraries, and the kernel. With the help of Oracle this is now possible on x86 systems using system profilers (eg, Linux perf_events) and the new JDK option -XX:+PreserveFramePointer. This lets us create Java mixed-mode CPU flame graphs, exposing all CPU consumers. We can also use system profilers to analyze memory page faults, TCP events, storage I/O, and scheduler events, also with Java method context. This talk describes the background for this work, instructions generating Java mixed-mode flame graphs, and examples from our use at Netflix where Java on x86 is the primary platform for the Netflix cloud."

next
prev
1/92
next
prev
2/92
next
prev
3/92
next
prev
4/92
next
prev
5/92
next
prev
6/92
next
prev
7/92
next
prev
8/92
next
prev
9/92
next
prev
10/92
next
prev
11/92
next
prev
12/92
next
prev
13/92
next
prev
14/92
next
prev
15/92
next
prev
16/92
next
prev
17/92
next
prev
18/92
next
prev
19/92
next
prev
20/92
next
prev
21/92
next
prev
22/92
next
prev
23/92
next
prev
24/92
next
prev
25/92
next
prev
26/92
next
prev
27/92
next
prev
28/92
next
prev
29/92
next
prev
30/92
next
prev
31/92
next
prev
32/92
next
prev
33/92
next
prev
34/92
next
prev
35/92
next
prev
36/92
next
prev
37/92
next
prev
38/92
next
prev
39/92
next
prev
40/92
next
prev
41/92
next
prev
42/92
next
prev
43/92
next
prev
44/92
next
prev
45/92
next
prev
46/92
next
prev
47/92
next
prev
48/92
next
prev
49/92
next
prev
50/92
next
prev
51/92
next
prev
52/92
next
prev
53/92
next
prev
54/92
next
prev
55/92
next
prev
56/92
next
prev
57/92
next
prev
58/92
next
prev
59/92
next
prev
60/92
next
prev
61/92
next
prev
62/92
next
prev
63/92
next
prev
64/92
next
prev
65/92
next
prev
66/92
next
prev
67/92
next
prev
68/92
next
prev
69/92
next
prev
70/92
next
prev
71/92
next
prev
72/92
next
prev
73/92
next
prev
74/92
next
prev
75/92
next
prev
76/92
next
prev
77/92
next
prev
78/92
next
prev
79/92
next
prev
80/92
next
prev
81/92
next
prev
82/92
next
prev
83/92
next
prev
84/92
next
prev
85/92
next
prev
86/92
next
prev
87/92
next
prev
88/92
next
prev
89/92
next
prev
90/92
next
prev
91/92
next
prev
92/92

PDF: JavaOne2015_MixedModeFlameGraphs.pdf

Keywords (from pdftotext):

slide 1:
    Oct	
      2015	
      
    Java Mixed-Mode
    Flame Graphs
    Brendan Gregg Senior Performance Architect
    
slide 2:
    Understanding Java CPU usage
    quickly and completely
    
slide 3:
    Quickly
    • Via SSH and open source tools (covered in this talk)
    • Or using Netflix Vector GUI (also open source):
    1. Observe high CPU usage
    2. Generate a flame graph
    
slide 4:
    Completely
    Java Mixed-Mode Flame Graph
    via Linux perf_events:
    Kernel
    Java
    JVM
    
slide 5:
    Messy House Fallacy
    Fallacy:	
      my	
      code	
      is	
      a	
      mess,	
      I	
      bet	
      yours	
      is	
      
    immaculate,	
      therefore	
      the	
      bug	
      must	
      be	
      mine	
      
    	
      
    Reality:	
      everyone's	
      code	
      is	
      terrible	
      and	
      buggy	
      
    • Don't overlook system code: kernel, libraries, etc.
    
slide 6:
    Context
    
slide 7:
    • Over 60 million subscribers
    – Just launched in Spain!
    • AWS EC2 Linux cloud
    • FreeBSD CDN
    • Awesome place to work
    
slide 8:
    Cloud
    • Tens of thousands of AWS EC2 instances
    • Mostly running Java applications (Oracle JVM)
    Linux	
      (usually	
      Ubuntu)	
      
    OpMonal	
      Apache,	
      
    memcached,	
      Node.js,	
      
    …	
      
    Atlas,	
      S3	
      log	
      rotaMon,	
      
    sar,	
      Trace,	
      perf,	
      stap,	
      
    perf-­‐tools	
      
    Vector,	
      pcp	
      
    Java	
      (JDK	
      8)	
      
    GC	
      and	
      
    thread	
      
    dump	
      
    logging	
      
    Tomcat	
      
    ApplicaMon	
      war	
      files,	
      
    plaYorm,	
      base	
      servlet	
      
    hystrix,	
      metrics	
      (Servo),	
      
    health	
      check	
      
    
slide 9:
    Why we need CPU profiling
    • Improving performance
    • Cost savings
    – ASGs often scale on load
    average (CPU), so CPU
    usage is proportional to cost
    Auto	
      Scaling	
      
    Group	
      
    Scaling	
      Policy	
      
    CloudWatch,	
      Servo	
      
    – Identify tuning targets
    – Incident response
    – Non-regression testing
    – Software evaluations
    – CPU workload
    characterization
    loadavg,	
      latency,	
      …	
      
    	
      
    Instance	
      
    Instance	
      
    Instance	
      
    
slide 10:
    The Problem with Profilers
    
slide 11:
    Java Profilers
    Kernel,
    libraries,
    JVM
    Java
    
slide 12:
    Java Profilers
    • Visibility
    – Java method execution
    – Object usage
    – GC logs
    – Custom Java context
    • Typical problems:
    – Sampling often happens at safety/yield points (skew)
    – Method tracing has massive observer effect
    – Misidentifies RUNNING as on-CPU (e.g., epoll)
    – Doesn't include or profile GC or JVM CPU time
    – Tree views not quick (proportional) to comprehend
    • Inaccurate (skewed) and incomplete profiles
    
slide 13:
    System Profilers
    Kernel
    TCP/IP
    Java
    JVM
    Locks
    Time
    epoll
    Idle
    thread
    
slide 14:
    System Profilers
    • Visibility
    – JVM (C++)
    – GC (C++)
    – libraries (C)
    – kernel (C)
    • Typical problems (x86):
    – Stacks missing for Java
    – Symbols missing for Java methods
    • Other architectures (e.g., SPARC) have fared better
    • Profile everything except Java
    
slide 15:
    Workaround
    • Capture both Java and system profiles, and examine
    side by side
    Java
    System
    • An improvement, but Java context is often crucial for
    interpreting system profiles
    
slide 16:
    Solution
    Java Mixed-Mode Flame Graph
    Kernel
    Java
    JVM
    
slide 17:
    Solution
    • Fix system profiling
    – Only way to see it all
    Kernel
    • Visibility is everything:
    – Java methods
    – JVM (C++)
    – GC (C++)
    – libraries (C)
    – kernel (C)
    Java
    JVM
    • Minor Problems:
    – 0-3% CPU overhead to enable frame pointers (usually 
slide 18:
    Simple Production Example
    1. Poor performance,
    and one CPU at 100%
    2. perf_events flame
    graph shows JVM
    stuck compiling
    
slide 19:
    Another System Example
    Exception handling consuming CPU
    
slide 20:
    DEMO
    FlameGraph_tomcat01.svg
    
slide 21:
    Exonerating The System
    • From last week:
    - Frequent thread creation/
    destruction assumed to be
    consuming CPU resources.
    Recode application?
    - A flame graph quantified this
    CPU time: near zero
    - Time mostly other Java methods
    
slide 22:
    Profiling GC
    GC internals, visualized:
    
slide 23:
    CPU Profiling
    
slide 24:
    CPU Profiling
    • Record stacks at a timed interval: simple and effective
    – Pros: Low (deterministic) overhead
    – Cons: Coarse accuracy, but usually sufficient
    stack
    samples:
    syscall
    on-CPU
    time
    off-CPU
    block
    interrupt
    
slide 25:
    Stack Traces
    • A code path snapshot. e.g., from jstack(1):
    $ jstack 1819
    […]
    "main" prio=10 tid=0x00007ff304009000
    nid=0x7361 runnable [0x00007ff30d4f9000]
    java.lang.Thread.State: RUNNABLE
    running
    at Func_abc.func_c(Func_abc.java:6)
    codepath
    at Func_abc.func_b(Func_abc.java:16)
    start
    at Func_abc.func_a(Func_abc.java:23)
    at Func_abc.main(Func_abc.java:27)
    running
    parent
    g.parent
    g.g.paren
    
slide 26:
    System Profilers
    • Linux
    – perf_events (aka "perf")
    • Oracle Solaris
    – DTrace
    • OS X
    – Instruments
    • Windows
    – XPerf
    • And many others…
    
slide 27:
    Linux perf_events
    • Standard Linux profiler
    – Provides the perf command (multi-tool)
    – Usually pkg added by linux-tools-common, etc.
    • Features:
    – Timer-based sampling
    – Hardware events
    – Tracepoints
    – Dynamic tracing
    • Can sample stacks of (almost) everything on CPU
    – Can miss hard interrupt ISRs, but these should be near-zero. They can
    be measured if needed (I wrote my own tools)
    
slide 28:
    perf record Profiling
    • Stack profiling on all CPUs at 99 Hertz, then dump:
    # perf record -F 99 -ag -- sleep 30
    [ perf record: Woken up 9 times to write data ]
    [ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]
    # perf script
    […]
    bash 13204 cpu-clock:
    459c4c dequote_string (/root/bash-4.3/bash)
    465c80 glob_expand_word_list (/root/bash-4.3/bash)
    466569 expand_word_list_internal (/root/bash-4.3/bash)
    465a13 expand_words (/root/bash-4.3/bash)
    43bbf7 execute_simple_command (/root/bash-4.3/bash)
    435f16 execute_command_internal (/root/bash-4.3/bash)
    one
    435580 execute_command (/root/bash-4.3/bash)
    stack
    43a771 execute_while_or_until (/root/bash-4.3/bash)
    sample
    43a636 execute_while_command (/root/bash-4.3/bash)
    436129 execute_command_internal (/root/bash-4.3/bash)
    435580 execute_command (/root/bash-4.3/bash)
    420cd5 reader_loop (/root/bash-4.3/bash)
    41ea58 main (/root/bash-4.3/bash)
    7ff2294edec5 __libc_start_main (/lib/x86_64-linux-gnu/libc-2.19.so)
    [… ~47,000 lines truncated …]
    
slide 29:
    perf report Summary
    • Generates a call tree and combines samples:
    # perf report -n -stdio
    […]
    # Overhead
    Samples Command
    Shared Object
    Symbol
    # ........ ............ ....... ................. .............................
    20.42%
    bash [kernel.kallsyms] [k] xen_hypercall_xen_version
    --- xen_hypercall_xen_version
    check_events
    call tree
    summary
    |--44.13%-- syscall_trace_enter
    tracesys
    |--35.58%-- __GI___libc_fcntl
    |--65.26%-- do_redirection_internal
    do_redirections
    execute_builtin_or_function
    execute_simple_command
    [… ~13,000 lines truncated …]
    
slide 30:
    Flame Graphs
    
slide 31:
    perf report Verbosity
    • Despite summarizing, output is still verbose
    # perf report -n -stdio
    […]
    # Overhead
    Samples Command
    Shared Object
    Symbol
    # ........ ............ ....... ................. .............................
    20.42%
    bash [kernel.kallsyms] [k] xen_hypercall_xen_version
    --- xen_hypercall_xen_version
    check_events
    |--44.13%-- syscall_trace_enter
    tracesys
    |--35.58%-- __GI___libc_fcntl
    |--65.26%-- do_redirection_internal
    do_redirections
    execute_builtin_or_function
    execute_simple_command
    [… ~13,000 lines truncated …]
    
slide 32:
    Full perf report Output
    
slide 33:
    … as a Flame Graph
    
slide 34:
    Flame Graphs
    git clone --depth 1 https://github.com/brendangregg/FlameGraph
    cd FlameGraph
    perf record -F 99 -a –g -- sleep 30
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl >gt; perf.svg
    • Flame Graphs:
    – x-axis: alphabetical stack sort, to maximize merging
    – y-axis: stack depth
    – color: random (default), or a dimension
    • Currently made from Perl + SVG + JavaScript
    – Multiple d3 versions are being developed
    • Easy to get working
    – http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
    – Above commands are Linux; see URL for other OSes
    
slide 35:
    Linux perf_events Workflow
    list events
    count events
    capture stacks
    perf list
    perf stat
    perf record
    Typical
    Workflow
    perf.data	
      
    text UI
    dump profile
    perf report
    flame graph
    visualization
    perf script
    stackcollapse-perf.pl
    flamegraph.pl
    
slide 36:
    Flame Graph Interpretation
    g()
    e()
    f()
    d()
    c()
    i()
    b()
    h()
    a()
    
slide 37:
    Flame Graph Interpretation (1/3)
    Top edge shows who is running on-CPU,
    and how much (width)
    g()
    e()
    f()
    d()
    c()
    i()
    b()
    h()
    a()
    
slide 38:
    Flame Graph Interpretation (2/3)
    Top-down shows ancestry
    e.g., from g():
    g()
    e()
    f()
    d()
    c()
    i()
    b()
    h()
    a()
    
slide 39:
    Flame Graph Interpretation (3/3)
    Widths are proportional to presence in samples
    e.g., comparing b() to h() (incl. children)
    g()
    e()
    f()
    d()
    c()
    i()
    b()
    h()
    a()
    
slide 40:
    Flame Graph Colors
    • Randomized by default
    • Can be used as a dimension. e.g.:
    – Mixed-mode flame graphs
    – Differential flame graphs
    – Search
    
slide 41:
    Mixed-Mode Flame Graphs
    • Hues:
    – green == Java
    – red == system
    – yellow == C++
    • Intensity randomized
    to differentiate frames
    – Or hashed based on
    function name
    Mixed-Mode
    Kernel
    Java
    JVM
    
slide 42:
    Differential Flame Graphs
    • Hues:
    Differential
    – red == more samples
    – blue == less samples
    • Intensity shows the
    degree of difference
    • Used for comparing
    two profiles
    • Also used for showing
    other metrics: e.g., CPI
    more
    less
    
slide 43:
    Flame Graph Search
    • Color: magenta to show matched frames
    search
    button
    
slide 44:
    Flame Charts
    • Final note: these are useful, but are not flame graphs
    • Flame charts: x-axis is time
    • Flame graphs: x-axis is population (maximize merging)
    
slide 45:
    Stack Tracing
    
slide 46:
    System Profiling Java on x86
    • For example,
    using Linux perf
    • The stacks are
    1 or 2 levels
    deep, and have
    junk values
    # perf record –F 99 –a –g – sleep 30
    # perf script
    […]
    java 4579 cpu-clock:
    ffffffff8172adff tracesys ([kernel.kallsyms])
    7f4183bad7ce pthread_cond_timedwait@@GLIBC_2…
    java
    4579 cpu-clock:
    7f417908c10b [unknown] (/tmp/perf-4458.map)
    java
    4579 cpu-clock:
    7f4179101c97 [unknown] (/tmp/perf-4458.map)
    java
    4579 cpu-clock:
    7f41792fc65f [unknown] (/tmp/perf-4458.map)
    a2d53351ff7da603 [unknown] ([unknown])
    java
    4579 cpu-clock:
    7f4179349aec [unknown] (/tmp/perf-4458.map)
    java
    4579 cpu-clock:
    7f4179101d0f [unknown] (/tmp/perf-4458.map)
    […]
    
slide 47:
    … as a Flame Graph
    Broken Java stacks
    (missing frame pointer)
    
slide 48:
    Why Stacks are Broken
    • On x86 (x86_64), hotspot uses
    the frame pointer register (RBP)
    as general purpose
    • This "compiler optimization"
    breaks (simple) stack walking
    • Once upon a time, x86 had fewer
    registers, and this made much more sense
    • gcc provides -fno-omit-frame-pointer to avoid
    doing this, but the JVM had no such option…
    
slide 49:
    Fixing Stack Walking
    Possibilities:
    A. Fix frame pointer-based stack walking (the default)
    – Pros: simple, supported by many tools
    – Cons: might cost a little extra CPU
    B. Use a custom walker (likely needing kernel support)
    – Pros: full stack walking (incl. inlining) & arguments
    – Cons: custom kernel code, can cost more CPU when in use
    C. Try libunwind and DWARF
    – Even feasible with JIT?
    Our current preference is (A)
    
slide 50:
    Hacking OpenJDK (1/2)
    • As a proof of concept, I hacked hotspot to support an
    x86_64 frame pointer
    --- openjdk8clean/hotspot/src/cpu/x86/vm/x86_64.ad 2014-03-04 …
    +++ openjdk8/hotspot/src/cpu/x86/vm/x86_64.ad 2014-11-08 …
    @@ -166,10 +166,9 @@
    // 3) reg_class stack_slots( /* one chunk of stack-based "registers" */ )
    -// Class for all pointer registers (including RSP)
    +// Class for all pointer registers (including RSP, excluding RBP)
    reg_class any_reg(RAX, RAX_H,
    RDX, RDX_H,
    RBP, RBP_H,
    Remove RBP from
    RDI, RDI_H,
    RSI, RSI_H,
    register pools
    RCX, RCX_H,
    [...]
    
slide 51:
    Hacking OpenJDK (2/2)
    --- openjdk8clean/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-03-04…
    +++ openjdk8/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-11-07 …
    @@ -5236,6 +5236,7 @@
    // We always push rbp, so that on return to interpreter rbp, will be
    // restored correctly and we can correct the stack.
    push(rbp);
    mov(rbp, rsp);
    Fix x86_64 function
    // Remove word for ebp
    prologues
    framesize -= wordSize;
    --- openjdk8clean/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …
    +++ openjdk8/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …
    [...]
    • We used this patched version successfully for some limited
    (and urgent) performance analysis
    
slide 52:
    -XX:+PreserveFramePointer
    • We shared our patch publicly
    – See "A hotspot patch for stack profiling (frame pointer)" on the
    hotspot complier dev mailing list
    – It became JDK-8068945 for JDK 9 and JDK-8072465 for JDK 8,
    and the -XX:+PreserveFramePointer option
    • Zoltán Majó (Oracle) took this on, rewrote it, and it is now:
    – In JDK 9
    – In JDK 8 update 60 build 19
    – Thanks to Zoltán, Oracle, and the other hotspot engineers for
    helping get this done!
    • It might cost 0 – 3% CPU, depending on workload
    
slide 53:
    Broken Java Stacks (before)
    # perf script
    […]
    java 4579 cpu-clock:
    ffffffff8172adff tracesys ([kernel.kallsyms])
    7f4183bad7ce pthread_cond_timedwait@@GLIBC_2…
    java
    4579 cpu-clock:
    7f417908c10b [unknown] (/tmp/perf-4458.map)
    java
    4579 cpu-clock:
    7f4179101c97 [unknown] (/tmp/perf-4458.map)
    java
    4579 cpu-clock:
    7f41792fc65f [unknown] (/tmp/perf-4458.map)
    a2d53351ff7da603 [unknown] ([unknown])
    java
    4579 cpu-clock:
    7f4179349aec [unknown] (/tmp/perf-4458.map)
    java
    4579 cpu-clock:
    7f4179101d0f [unknown] (/tmp/perf-4458.map)
    java
    4579 cpu-clock:
    7f417908c194 [unknown] (/tmp/perf-4458.map)
    […]
    • Check with "perf
    script" to see stack
    samples
    • These are 1 or 2
    levels deep (junk
    values)
    
slide 54:
    Fixed Java Stacks
    # perf script
    […]
    java 8131 cpu-clock:
    7fff76f2dce1 [unknown] ([vdso])
    7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
    7fd301861e46 [unknown] (/tmp/perf-8131.map)
    7fd30184def8 [unknown] (/tmp/perf-8131.map)
    7fd30174f544 [unknown] (/tmp/perf-8131.map)
    7fd30175d3a8 [unknown] (/tmp/perf-8131.map)
    7fd30166d51c [unknown] (/tmp/perf-8131.map)
    7fd301750f34 [unknown] (/tmp/perf-8131.map)
    7fd3016c2280 [unknown] (/tmp/perf-8131.map)
    7fd301b02ec0 [unknown] (/tmp/perf-8131.map)
    7fd3016f9888 [unknown] (/tmp/perf-8131.map)
    7fd3016ece04 [unknown] (/tmp/perf-8131.map)
    7fd30177783c [unknown] (/tmp/perf-8131.map)
    7fd301600aa8 [unknown] (/tmp/perf-8131.map)
    7fd301a4484c [unknown] (/tmp/perf-8131.map)
    7fd3010072e0 [unknown] (/tmp/perf-8131.map)
    7fd301007325 [unknown] (/tmp/perf-8131.map)
    7fd301007325 [unknown] (/tmp/perf-8131.map)
    7fd3010004e7 [unknown] (/tmp/perf-8131.map)
    7fd3171df76a JavaCalls::call_helper(JavaValue*,…
    7fd3171dce44 JavaCalls::call_virtual(JavaValue*…
    7fd3171dd43a JavaCalls::call_virtual(JavaValue*…
    7fd31721b6ce thread_entry(JavaThread*, Thread*)…
    7fd3175389e0 JavaThread::thread_main_inner() (/…
    7fd317538cb2 JavaThread::run() (/usr/lib/jvm/nf…
    7fd3173f6f52 java_start(Thread*) (/usr/lib/jvm/…
    7fd317a7e182 start_thread (/lib/x86_64-linux-gn…
    • With -XX:
    +PreserveFramePointer
    stacks are full, and
    go all the way to
    start_thread()
    • This is what the
    CPUs are really
    running: inlined
    frames are not
    present
    
slide 55:
    Fixed Stacks Flame Graph
    Java stacks
    (but no symbols)
    
slide 56:
    Stacks & Inlining
    • Frames may be missing (inlined)
    • Disabling inlining:
    No inlining
    – -XX:-Inline
    – Many more Java frames
    – Can be 80% slower!
    • May not be necessary
    – Inlined flame graphs often make
    enough sense
    – Or tune -XX:MaxInlineSize and
    -XX:InlineSmallCode a little to reveal more frames
    • Can even improve performance!
    • perf-map-agent (next) has experimental un-inline support
    
slide 57:
    Symbols
    
slide 58:
    Missing Symbols
    • Missing symbols may show up as hex; e.g., Linux perf:
    71.79%
    sed
    sed
    |--11.65%-- 0x40a447
    0x40659a
    0x408dd8
    0x408ed1
    0x402689
    0x7fa1cd08aec5
    12.06%
    sed
    sed
    --- re_search_internal
    |--96.78%-- re_search_stub
    rpl_re_search
    match_regex
    do_subst
    execute_program
    process_files
    main
    __libc_start_main
    [.] 0x000000000001afc1
    broken
    [.] re_search_internal
    not broken
    
slide 59:
    Fixing Symbols
    For JIT'd code, Linux perf already looks for an
    externally provided symbol file: /tmp/perf-PID.map, and
    warns if it doesn't exist
    # perf script
    Failed to open /tmp/perf-8131.map, continuing without symbols
    […]
    java 8131 cpu-clock:
    7fff76f2dce1 [unknown] ([vdso])
    7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
    7fd301861e46 [unknown] (/tmp/perf-8131.map)
    […]
    This file can be created by a Java agent
    
slide 60:
    Java Symbols for perf
    • perf-map-agent
    – https://github.com/jrudolph/perf-map-agent
    – Agent attaches and writes the /tmp file on demand (previous
    versions attached on Java start, wrote continually)
    – Thanks Johannes Rudolph!
    • Use of a /tmp symbol file
    – Pros: simple, can be low overhead (snapshot on demand)
    – Cons: stale symbols
    • Using a symbol logger with perf instead
    – Patch by Stephane Eranian currently being discussed on
    lkml; see "perf: add support for profiling jitted code"
    
slide 61:
    Stacks & Symbols
    Java Mixed-Mode Flame Graph
    Kernel
    Java
    JVM
    
slide 62:
    Stacks & Symbols (zoom)
    
slide 63:
    Instructions
    
slide 64:
    Instructions
    1. Check Java version
    2. Install perf-map-agent
    3. Set -XX:+PreserveFramePointer
    4. Profile Java
    5. Dump symbols
    6. Generate Mixed-Mode Flame Graph
    Note these are unsupported: use at your own risk
    Reference: http://techblog.netflix.com/2015/07/java-in-flames.html
    
slide 65:
    1. Check Java Version
    • Need JDK8u60 or better
    – for -XX:+PreserveFramePointer
    $ java -version
    java version "1.8.0_60"
    Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
    Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
    • Upgrade Java if necessary
    
slide 66:
    2. Install perf-map-agent
    • Check https://github.com/jrudolph/perf-map-agent for the
    latest instructions. e.g.:
    $ sudo bash
    # apt-get install -y cmake
    # export JAVA_HOME=/usr/lib/jvm/java-8-oracle
    # cd /usr/lib/jvm
    # git clone --depth=1 https://github.com/jrudolph/perf-map-agent
    # cd perf-map-agent
    # cmake .
    # make
    
slide 67:
    3. Set -XX:+PreserveFramePointer
    • Needs to be set on Java startup
    • Check it is enabled (on Linux):
    $ ps wwp `pgrep –n java` | grep PreserveFramePointer
    
slide 68:
    4. Profile Java
    • Using Linux perf_events to profile all processes, at 99
    Hertz, for 30 seconds (as root):
    # perf record -F 99 -a -g -- sleep 30
    • Just profile one PID (broken on some older kernels):
    # perf record -F 99 -p PID -g -- sleep 30
    • These create a perf.data file
    
slide 69:
    5. Dump Symbols
    • See perf-map-agent docs for updated usage
    • e.g., as the same user as java:
    $ cd /usr/lib/jvm/perf-map-agent/out
    $ java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar \
    net.virtualvoid.perf.AttachOnce PID
    • perf-map-agent contains helper scripts. I wrote my own:
    – https://github.com/brendangregg/Misc/blob/master/java/jmaps
    • Dump symbols quickly after perf record to minimize stale
    symbols. How I do it:
    # perf record -F 99 -a -g -- sleep 30; jmaps
    
slide 70:
    6. Generate a Mixed-Mode Flame Graph
    • Using my FlameGraph software:
    # perf script >gt; out.stacks01
    # git clone --depth=1 https://github.com/brendangregg/FlameGraph
    # cat out.stacks01 | ./FlameGraph/stackcollapse-perf.pl | \
    ./FlameGraph/flamegraph.pl --color=java --hash >gt; flame01.svg
    – perf script reads perf.data with /tmp/*.map
    – out.stacks01 is an intermediate file; can be handy to keep
    • Finally open flame01.svg in a browser
    • Check for newer flame graph implementations (e.g., d3)
    
slide 71:
    Automation
    
slide 72:
    Netflix Vector
    
slide 73:
    Netflix Vector
    Select Instance
    Select
    Metrics
    Flame Graphs
    Near real-time,
    per-second metrics
    
slide 74:
    Netflix Vector
    • Open source, on-demand, instance analysis tool
    – https://github.com/netflix/vector
    • Shows various real-time metrics
    • Flame graph support currently in development
    – Automating previous steps
    – Using it internally already
    – Also developing a new d3 front end
    
slide 75:
    DEMO
    d3-flame-graph
    
slide 76:
    Advanced Analysis
    
slide 77:
    Linux perf_events Coverage
    … all possible with Java stacks
    
slide 78:
    Advanced Flame Graphs
    • Examples:
    – Page faults
    – Context switches
    – Disk I/O requests
    – TCP events
    – CPU cache misses
    – CPI
    • Any event issued in synchronous Java context
    
slide 79:
    Synchronous Java Context
    • Java thread still on-CPU, and event is directly triggered
    • Examples:
    – Disk I/O requests issued directly by Java à yes
    • direct reads, sync writes, page faults
    – Disk I/O completion interrupts à no*
    – Disk I/O requests triggered async, e.g., readahead à no*
    * can be made yes by tracing and associating context
    
slide 80:
    Page Faults
    • Show what triggered main memory (resident) to grow:
    # perf record -e page-faults -p PID -g -- sleep 120
    • "fault" as (physical) main memory is allocated ondemand, when a virtual page is first populated
    • Low overhead tool to solve some types of memory leak
    RES column in top(1)
    grows
    because
    
slide 81:
    Page Fault Flame Graph
    Java code
    epoll
    
slide 82:
    Context Switches
    • Show why Java blocked and stopped running on-CPU:
    # perf record -e context-switches -p PID -g -- sleep 5
    • Identifies locks, I/O, sleeps
    – If code path shouldn't block and looks random, it's an involuntary context switch. I
    could filter these, but you should have solved them beforehand (CPU load).
    • e.g., was used to understand framework differences:
    rxNetty
    Tomcat
    
slide 83:
    Context Switch Flame Graph (1/2)
    rxNetty
    epoll
    futex
    
slide 84:
    Context Switch Flame Graph (2/2)
    Tomcat
    sys_poll
    futex
    
slide 85:
    Disk I/O Requests
    • Shows who issued disk I/O (sync reads & writes):
    # perf record -e block:block_rq_insert -a -g -- sleep 60
    • e.g.: page faults in GC? This JVM has swapped out!:
    
slide 86:
    TCP Events
    • TCP transmit, using dynamic tracing:
    # perf probe tcp_sendmsg
    # perf record -e probe:tcp_sendmsg -a -g -- sleep 1; jmaps
    # perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso,trace >gt; out.stacks
    # perf probe --del tcp_sendmsg
    • Note: can be high overhead for high packet rates
    – For the current perf trace, dump, post-process cycle
    • Can also trace TCP connect & accept (lower overhead)
    • TCP receive is async
    – Could trace via socket read
    
slide 87:
    TCP Send Flame Graph
    Only one code-path
    taken in this example
    kernel
    Java
    JVM
    ab (client process)
    
slide 88:
    CPU Cache Misses
    • In this example, sampling via Last Level Cache loads:
    # perf record -e LLC-loads -c 10000 -a -g -- sleep 5; jmaps
    # perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso >gt; out.stacks
    • -c is the count (samples
    once per count)
    • Use other CPU counters to
    sample hits, misses, stalls
    
slide 89:
    One Last Example
    • Back to a
    mixed-mode
    CPU flame graph
    • What else can we
    show with color?
    
slide 90:
    CPI Flame Graph
    • Cycles Per
    Instruction!
    – red == instruction
    heavy
    – blue == cycle
    heavy (likely mem
    stall cycles)
    zoomed:
    
slide 91:
    Links & References
    Flame Graphs
    – http://www.brendangregg.com/flamegraphs.html
    – http://techblog.netflix.com/2015/07/java-in-flames.html
    – http://techblog.netflix.com/2014/11/nodejs-in-flames.html
    – http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html
    Linux perf_events
    – https://perf.wiki.kernel.org/index.php/Main_Page
    – http://www.brendangregg.com/perf.html
    – http://www.brendangregg.com/blog/2015-02-27/linux-profiling-at-netflix.html
    Netflix Vector
    – https://github.com/netflix/vector
    – http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
    JDK tickets
    – JDK8: https://bugs.openjdk.java.net/browse/JDK-8072465
    – JDK9: https://bugs.openjdk.java.net/browse/JDK-8068945
    hprof: http://www.brendangregg.com/blog/2014-06-09/java-cpu-sampling-using-hprof.html
    
slide 92:
    Oct	
      2015	
      
    Thanks
    Questions?
    http://techblog.netflix.com
    http://slideshare.net/brendangregg
    http://www.brendangregg.com
    bgregg@netflix.com
    @brendangregg