PerconaLive 2016: Linux Systems Performance

Talk for PerconaLive 2016 by Brendan Gregg.

Video: https://www.youtube.com/watch?v=CbmEDXq7es0

Description: "Systems performance provides a different perspective for analysis and tuning, and can help you find performance wins for your databases, applications, and the kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes six important areas of Linux systems performance in 50 minutes: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events), static tracing (tracepoints), and dynamic tracing (kprobes, uprobes), and much advice about what is and isn't important to learn. This talk is aimed at everyone: DBAs, developers, operations, etc, and in any environment running Linux, bare-metal or the cloud."


PDF: Percona2016_LinuxSystemsPerf.pdf

Keywords (from pdftotext):

slide 1:
    Linux Systems Performance
    Apr 2016
    Brendan Gregg
    Senior Performance Architect
    
slide 2:
    Systems Performance in 50 mins
    
slide 3:
slide 4:
    Agenda	
      
    A brief discussion of 6 facets of Linux performance:
    Observability
    Methodologies
    Benchmarking
    Profiling
    Tracing
    Tuning
    Audience: Everyone (DBAs, developers, operations, …)
    
slide 5:
    1. Observability
    
slide 6:
    How do you measure these?
    
slide 7:
    Linux Observability Tools
    
slide 8:
    Observability Tools
    • Tools showcase common metrics
    – Learning Linux tools is useful even if you never use them:
    the same metrics are in GUIs
    • We usually use these metrics via:
    – Netflix Atlas: cloud-wide monitoring
    – Netflix Vector: instance analysis
    • Linux has many tools
    – Plus many extra kernel sources of data that lack tools, are
    harder to use, and are practically undocumented
    • Some tool examples…
    
slide 9:
    uptime
    • One way to print load averages:
    $ uptime
     07:42:06 up 8:16, 1 user, load average: 2.27, 2.84, 2.91
    • A measure of resource demand: CPUs + disks
    – Other OSes only show CPUs: easier to interpret
    • Exponentially-damped moving averages
    • Time constants of 1, 5, and 15 minutes
    – Historic trend without the line graph
    • Load > # of CPUs may mean CPU saturation
    – Don’t spend more than 5 seconds studying these
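The exponentially-damped moving average behind these numbers can be sketched in Python (a simplified model for illustration, not the kernel's fixed-point implementation; the 5-second sampling interval below is an assumption):

```python
import math

def damped_average(samples, interval_s, period_s):
    """Exponentially-damped moving average, as used for load averages.

    samples: instantaneous demand (count of runnable + uninterruptible
    tasks), sampled every interval_s seconds; period_s is 60, 300, or
    900 for the 1-, 5-, and 15-minute averages.
    """
    decay = math.exp(-interval_s / period_s)
    avg = 0.0
    for s in samples:
        # New sample is blended in; old history decays exponentially.
        avg = avg * decay + s * (1.0 - decay)
    return avg

# A constant demand of 2.0 tasks converges toward 2.0 over time:
one_minute = damped_average([2.0] * 1000, interval_s=5, period_s=60)
```

The shorter the period, the faster the average tracks current demand, which is why comparing the three numbers reveals the trend.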
    
slide 10:
    top (or htop)
    • System and per-process interval summary:
    $ top - 18:50:26 up 7:43, 1 user, load average: 4.11, 4.91, 5.22
    Tasks: 209 total, 1 running, 206 sleeping, 0 stopped, 2 zombie
    Cpu(s): 47.1%us, 4.0%sy, 0.0%ni, 48.4%id, 0.0%wa, 0.0%hi, 0.3%si, 0.2%st
    Mem:  70197156k total, 44831072k used, 25366084k free,    36360k buffers
    Swap:        0k total,        0k used,        0k free, 11873356k cached
      PID USER     NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
     5738 apiprod   0 62.6g  29g 352m S  417 44.2  2144:15 java
     1386 apiprod   0 17452 1388  964 R    0  0.0  0:00.02 top
        1 root      0 24340 2272 1340 S    0  0.0  0:01.51 init
        2 root      0     0    0    0 S    0  0.0  0:00.00 kthreadd
    […]
    • %CPU is summed across all CPUs
    • Can miss short-lived processes (atop won’t)
    • Can consume noticeable CPU to read /proc
    
slide 11:
    htop
    
slide 12:
    vmstat
    • Virtual memory statistics and more:
    $ vmstat -Sm 1
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
     r  b  swpd  free  buff  cache   si   so    bi    bo   in   cs us sy id wa
     8  0  […]
     7  0  […]
     8  0  […]
     8  0  […]
    […]
    • USAGE: vmstat [interval [count]]
    • First output line has some summary since boot values
    – Should be all; partial is confusing
    • High level CPU summary
    – “r” is runnable tasks
    
slide 13:
    iostat
    • Block I/O (disk) stats. 1st output is since boot.
    $ iostat -xmdz 1
    Linux 3.13.0-29 (db001-eb883efa)  08/18/2014  _x86_64_  (16 CPU)
    Device:  rrqm/s  wrqm/s  r/s  w/s  rMB/s  wMB/s ... avgqu-sz await r_await w_await svctm %util
    xvda  […]
    xvdb  […]
    xvdc  […]
    md0   […]
    • Very useful set of stats
    (the left columns describe the workload; avgqu-sz, await, and %util show
    the resulting performance)
    
slide 14:
    free
    • Main memory usage:
    $ free -m
                 total   used   free  shared  buffers  cached
    Mem:           […]
    -/+ buffers/cache:   […]
    Swap:          […]
    • buffers: block device I/O cache
    • cached: virtual page cache
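The "-/+ buffers/cache" line is simple arithmetic over the Mem: row, since buffers and cached pages can be reclaimed when applications need memory; a sketch (the sample values are illustrative, not taken from the output above):

```python
def adjust_for_caches(used_mb, free_mb, buffers_mb, cached_mb):
    """Recompute used/free the way free(1)'s '-/+ buffers/cache' line
    does: buffers and cached pages are reclaimable, so treat them as
    free rather than used."""
    adjusted_used = used_mb - buffers_mb - cached_mb
    adjusted_free = free_mb + buffers_mb + cached_mb
    return adjusted_used, adjusted_free

# Illustrative numbers only:
used, free = adjust_for_caches(used_mb=44000, free_mb=25000,
                               buffers_mb=35, cached_mb=11500)
```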
    
slide 15:
    strace
    • System call tracer:
    $ strace -tttT -p 313
    1408393285.779746 getgroups(0, NULL)    = 1 <…>
    1408393285.779873 getgroups(1, [0])     = 1 <…>
    1408393285.780797 close(3)              = 0 <…>
    1408393285.781338 write(1, "LinuxCon 2014!\n", 15) = 15 <…>
    • Eg, -ttt: time (us) since epoch; -T: syscall time (s)
    • Translates syscall args
    – Very helpful for solving system usage issues
    • Currently has massive overhead (ptrace based)
    – Can slow the target by > 100x. Use extreme caution.
    
slide 16:
    tcpdump	
      
    • Sniff network packets for post analysis:
    $ tcpdump -i eth0 -w /tmp/out.tcpdump
    tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
    ^C7985 packets captured
    8996 packets received by filter
    1010 packets dropped by kernel
    # tcpdump -nr /tmp/out.tcpdump | head
    reading from file /tmp/out.tcpdump, link-type EN10MB (Ethernet)
    20:41:05.038437 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 18...
    20:41:05.038533 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 48...
    20:41:05.038584 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 96...
    […]
    • Study packet sequences with timestamps (us)
    • CPU overhead optimized (socket ring buffers), but can
    still be significant. Use caution.
    
slide 17:
    netstat	
      
    • A multi-tool:
    – -i: interface stats
    – -r: route table
    – default: list conns
    – -p: shows process details
    – -c: per-second interval
    • Various network protocol statistics using -s:
    $ netstat -s
    […]
    Tcp:
    736455 active connections openings
    176887 passive connection openings
    33 failed connection attempts
    1466 connection resets received
    3311 connections established
    91975192 segments received
    180415763 segments send out
    223685 segments retransmited
    2 bad segments received.
    39481 resets sent
    […]
    TcpExt:
    12377 invalid SYN cookies received
    2982 delayed acks sent
    […]
    
slide 18:
    slabtop	
      
    • Kernel slab allocator memory usage:
    $ slabtop
    Active / Total Objects (% used)    : 4692768 / 4751161 (98.8%)
    Active / Total Slabs (% used)      : 129083 / 129083 (100.0%)
    Active / Total Caches (% used)     : 71 / 109 (65.1%)
    Active / Total Size (% used)       : 729966.22K / 738277.47K (98.9%)
    Minimum / Average / Maximum Object : 0.01K / 0.16K / 8.00K
       OBJS  ACTIVE  USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
    3565575 3565575 100%    0.10K 91425     […]     365700K buffer_head
     314916  314066  99%    0.19K 14996     […]      59984K dentry
     184192  183751  99%    0.06K   […]     […]      11512K kmalloc-64
     138618  138618 100%    0.94K   […]     […]     130464K xfs_inode
     138602  138602 100%    0.21K   […]     […]      29968K xfs_ili
     102116   99012  96%    0.55K   […]     […]      58352K radix_tree_node
      97482   49093  50%    0.09K   […]     […]       9284K kmalloc-96
      22695   20777  91%    0.05K   […]     […]       1068K shared_policy_node
      21312   21312 100%    0.86K   […]     […]      18432K ext4_inode_cache
      16288   14601  89%    0.25K   […]     […]       4072K kmalloc-256
    […]
    
slide 19:
    pcstat	
      
    • Show page cache residency by file:
    # ./pcstat data0*
    |----------+----------------+------------+-----------+---------|
    | Name     | Size           | Pages      | Cached    | Percent |
    |----------+----------------+------------+-----------+---------|
    | data00   | 104857600      | 25600      | 25600     | 100.000 |
    | data01   | 104857600      | 25600      | 25600     | 100.000 |
    | data02   | 104857600      | 25600      | 4080      | 015.938 |
    | data03   | 104857600      | 25600      | 25600     | 100.000 |
    | data04   | 104857600      | 25600      | 16010     | 062.539 |
    | data05   | 104857600      | 25600      | 0         | 000.000 |
    |----------+----------------+------------+-----------+---------|
    • Uses the mincore(2) syscall. Useful for database
    performance analysis.
    
slide 20:
    perf_events	
      
    • Provides the "perf" command
    • In Linux source code: tools/perf
    – Usually pkg added by linux-tools-common, etc.
    • Multi-tool with many capabilities
    – CPU profiling
    – PMC profiling
    – Static & dynamic tracing
    • Covered later in Profiling & Tracing
    
slide 21:
    Where do you start? ...and stop?
    
slide 22:
    2. Methodologies
    
slide 23:
    Anti-Methodologies
    • The lack of a deliberate methodology…
    • Street Light Anti-Method:
    – 1. Pick observability tools that are
    • Familiar
    • Found on the Internet
    • Found at random
    – 2. Run tools
    – 3. Look for obvious issues
    • Drunk Man Anti-Method:
    – Tune things at random until the problem goes away
    
slide 24:
    Methodologies	
      
    Linux Performance Analysis in 60 seconds
    The USE method
    CPU Profile Method
    Resource Analysis
    Workload Analysis
    Others include:
    – Workload characterization
    – Drill-down analysis
    – Off-CPU analysis
    – Static performance tuning
    – 5 whys
    – …
    
slide 25:
    Linux Perf Analysis in 60s
    1. uptime
    2. dmesg | tail
    3. vmstat 1
    4. mpstat -P ALL 1
    5. pidstat 1
    6. iostat -xz 1
    7. free -m
    8. sar -n DEV 1
    9. sar -n TCP,ETCP 1
    10. top
    
slide 26:
    Linux Perf Analysis in 60s
    1. uptime                 → load averages
    2. dmesg | tail           → kernel errors
    3. vmstat 1               → overall stats by time
    4. mpstat -P ALL 1        → CPU balance
    5. pidstat 1              → process usage
    6. iostat -xz 1           → disk I/O
    7. free -m                → memory usage
    8. sar -n DEV 1           → network I/O
    9. sar -n TCP,ETCP 1      → TCP stats
    10. top                   → check overview
    http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
    
slide 27:
    The USE Method
    • For every resource, check:
    Utilization
    Saturation
    Errors
    • Definitions:
    – Utilization: busy time
    – Saturation: queue length or queued time
    – Errors: easy to interpret (objective)
    • Helps if you have a functional (block) diagram of your
    system / software / environment, showing all resources
    Start with the questions, then find the tools
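Utilization here is just busy time over elapsed time; a generic sketch of the calculation from any pair of cumulative busy/total counters (a model of the method, not any specific tool):

```python
def use_utilization(busy_t0, total_t0, busy_t1, total_t1):
    """Percent utilization over an interval, from cumulative busy-time
    and total-time counters sampled at t0 and t1 (e.g. jiffies from
    /proc/stat for a CPU, or io ticks for a disk)."""
    delta_total = total_t1 - total_t0
    if delta_total <= 0:
        return 0.0  # no elapsed time between samples
    return 100.0 * (busy_t1 - busy_t0) / delta_total

# 50 busy ticks out of 100 elapsed ticks -> 50% utilized:
util = use_utilization(busy_t0=100, total_t0=1000,
                       busy_t1=150, total_t1=1100)
```

Saturation needs a different counter (queue length or wait time), which is why the method asks for all three per resource.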
    
slide 28:
    USE Method for Hardware
    • For every resource, check:
    Utilization
    Saturation
    Errors
    • Including busses & interconnects
    
slide 29:
    (http://www.brendangregg.com/USEmethod/use-linux.html)
    
slide 30:
    CPU Profile Method
    1. Take a CPU profile
    2. Understand all software in profile > 1%
    • Discovers a wide range of issues by their CPU usage
    – Directly: CPU consumers
    – Indirectly: initialization of I/O, locks, times, ...
    • Narrows target of study to only its running components
    [Flame Graph]
    
slide 31:
    Resource Analysis
    • Typical approach for system performance analysis:
    begin with system tools & metrics
    • Pros:
    – Generic
    – Aids resource perf tuning
    • Cons:
    – Uneven coverage
    – False positives
    [diagram: analysis begins at the System, System Calls, Kernel, and
    Hardware layers, below the Workload, Application, and Libraries]
    
slide 32:
    Workload Analysis
    • Begin with application metrics & context
    • Pros:
    – Accurate, proportional metrics
    – App context
    • Cons:
    – App specific
    – Difficult to dig from app to resource
    [diagram: analysis begins at the Workload and Application, above the
    Libraries, System Calls, Kernel, and Hardware]
    
slide 33:
    3. Benchmarking
    
slide 34:
    ~100% of benchmarks are
    wrong
    
slide 35:
    Benchmarking	
      
    • Apart from observational analysis, benchmarking is a
    useful form of experimental analysis
    – Try observational first; benchmarks can perturb
    • However, benchmarking is error prone:
    – Testing the wrong target: eg, FS cache instead of disk
    – Choosing the wrong target: eg, disk instead of FS cache
    … doesn’t resemble real world usage
    – Invalid results: eg, bugs
    – Misleading results: you benchmark A, but actually measure B,
    and conclude you measured C
    • The energy needed to refute benchmarks is multiple
    orders of magnitude bigger than to run them
    
slide 36:
    Benchmark Examples
    • Micro benchmarks:
    – File system maximum cached read operations/sec
    – Network maximum throughput
    • Macro (application) benchmarks:
    – Simulated application maximum request rate
    • Bad benchmarks:
    – getpid() in a tight loop
    – Context switch timing
    
slide 37:
    The Benchmark Paradox
    • Benchmarking is used for product evaluations
    – Eg, evaluating cloud vendors
    • The Benchmark Paradox:
    – If your product’s chances of winning a benchmark are 50/50,
    you’ll usually lose
    – http://www.brendangregg.com/blog/2014-05-03/the-benchmark-paradox.html
    • Solving this seeming paradox (and benchmarking in
    general)…
    
slide 38:
    For any given benchmark result,
    ask:
    why isn’t it 10x?
    
slide 39:
    Active Benchmarking
    • Root cause performance analysis while the benchmark is
    still running
    – Use the earlier observability tools
    – Identify the limiter (or suspected limiter) and include it with the
    benchmark results
    – Answer: why not 10x?
    • This takes time, but uncovers most mistakes
    
slide 40:
    4. Profiling
    
slide 41:
    Profiling	
      
    • Can you do this?
    “As an experiment to investigate the performance of the
    resulting TCP/IP implementation ... the 11/750 is CPU
    saturated, but the 11/780 has about 30% idle time. The time
    spent in the system processing the data is spread out among
    handling for the Ethernet (20%), IP packet processing
    (10%), TCP processing (30%), checksumming (25%), and
    user system call handling (15%), with no single part of the
    handling dominating the time in the system.”
    
slide 42:
    Profiling	
      
    • Can you do this?
    “As an experiment to investigate the performance of the
    resulting TCP/IP implementation ... the 11/750 is CPU
    saturated, but the 11/780 has about 30% idle time. The time
    spent in the system processing the data is spread out among
    handling for the Ethernet (20%), IP packet processing
    (10%), TCP processing (30%), checksumming (25%), and
    user system call handling (15%), with no single part of the
    handling dominating the time in the system.”
    – Bill Joy, 1981, TCP-IP Digest, Vol 1 #6
    https://www.rfc-editor.org/rfc/museum/tcp-ip-digest/tcp-ip-digest.v1n6.1
    
slide 43:
    perf_events	
      
    • Introduced earlier: multi-tool, profiler. Provides "perf".
    usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]
    The most commonly used perf commands are:
      annotate        Read perf.data (created by perf record) and display annotated code
      archive         Create archive with object files with build-ids found in perf.data file
      bench           General framework for benchmark suites
      buildid-cache   Manage build-id cache.
      buildid-list    List the buildids in a perf.data file
      data            Data file related processing
      diff            Read perf.data files and display the differential profile
      evlist          List the event names in a perf.data file
      inject          Filter to augment the events stream with additional information
      kmem            Tool to trace/measure kernel memory(slab) properties
      kvm             Tool to trace/measure kvm guest os
      list            List all symbolic event types
      lock            Analyze lock events
      mem             Profile memory accesses
      record          Run a command and record its profile into perf.data
      report          Read perf.data (created by perf record) and display the profile
      sched           Tool to trace/measure scheduler properties (latencies)
      script          Read perf.data (created by perf record) and display trace output
      stat            Run a command and gather performance counter statistics
      test            Runs sanity tests.
      timechart       Tool to visualize total system behavior during a workload
      top             System profiling tool.
      trace           strace inspired tool
      probe           Define new dynamic tracepoints
    See 'perf help COMMAND' for more information on a specific command.
    
slide 44:
    perf_events: CPU profiling
    • Sampling full stack traces at 99 Hertz, for 30 secs:
    # perf record -F 99 -ag -- sleep 30
    [ perf record: Woken up 9 times to write data ]
    [ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]
    # perf report -n --stdio
    1.40%  java  [kernel.kallsyms]  [k] _raw_spin_lock
           --- _raw_spin_lock
               |--63.21%-- try_to_wake_up
               |--63.91%-- default_wake_function
               |--56.11%-- __wake_up_common
               |           __wake_up_locked
               |           ep_poll_callback
               |           __wake_up_common
               |           __wake_up_sync_key
               |--59.19%-- sock_def_readable
    […78,000 lines truncated…]
    
slide 45:
    perf_events: Full "report" Output
      
    
slide 46:
    … as a Flame Graph
    
slide 47:
    perf_events: Flame Graphs
    git clone --depth 1 https://github.com/brendangregg/FlameGraph
    cd FlameGraph
    perf record -F 99 -a -g -- sleep 30
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
    • Flame Graphs:
    – x-axis: alphabetical stack sort, to maximize merging
    – y-axis: stack depth
    – color: random, or hue can be a dimension (eg, diff)
    • Interpretation:
    – top edge is on-CPU, beneath it is ancestry
    • Easy to get working
    – http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
    • Also in Mar/Apr issue of ACMQ
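The stackcollapse step folds identical stacks into single "frame1;frame2;… count" lines, which flamegraph.pl then renders; a minimal sketch of that folding (the input format here is simplified from what stackcollapse-perf.pl actually parses out of perf script output):

```python
from collections import Counter

def collapse(stacks):
    """Fold a list of stack traces (each a list of frames, root first)
    into sorted 'root;child;leaf count' lines, one per unique stack."""
    counts = Counter(";".join(stack) for stack in stacks)
    # Sorting alphabetically is what maximizes frame merging on the
    # flame graph's x-axis.
    return ["%s %d" % (stack, n) for stack, n in sorted(counts.items())]

folded = collapse([
    ["main", "parse", "read"],
    ["main", "parse", "read"],
    ["main", "emit"],
])
```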
    
slide 48:
    • The first ever flame graph was generated for MySQL
    • This background is the output of a DTrace CPU profile,
    which only printed unique stacks with counts
    [annotation: the size of one stack]
    
slide 49:
    First Flame Graph: MySQL
    … same data
    
slide 50:
    perf_events: Workflow
    Typical workflow:
    list events:     perf list
    count events:    perf stat
    capture stacks:  perf record  →  perf.data
    text UI:         perf report
    dump profile:    perf script
    flame graph visualization:
                     perf script | stackcollapse-perf.pl | flamegraph.pl
    
slide 51:
    perf_events: Counters
    • Performance Monitoring Counters (PMCs):
    $ perf list | grep -i hardware
      cpu-cycles OR cycles                              [Hardware event]
      stalled-cycles-frontend OR idle-cycles-frontend   [Hardware event]
      stalled-cycles-backend OR idle-cycles-backend     [Hardware event]
      instructions                                      [Hardware event]
      […]
      branch-misses                                     [Hardware event]
      bus-cycles                                        [Hardware event]
      L1-dcache-loads                                   [Hardware cache event]
      L1-dcache-load-misses                             [Hardware cache event]
      […]
      rNNN (see 'perf list --help' on how to encode it) [Raw hardware event …]
      mem:<addr>[:access]                               [Hardware breakpoint]
    • Identify CPU cycle breakdowns, esp. stall types
    • PMCs not enabled by default in clouds (yet)
    • Can be time-consuming to use (CPU manuals)
    
slide 52:
    5. Tracing
    
slide 53:
    Linux Event Sources
    
slide 54:
    Tracing Stack
    add-on tools:             trace-cmd, perf-tools, bcc, …
    front-end tools:          perf
    tracing frameworks:       ftrace, perf_events, BPF       (in Linux)
    back-end instrumentation: tracepoints, kprobes, uprobes  (in Linux)
    
slide 55:
    ftrace
    • Added by Steven Rostedt and others since 2.6.27
    – CONFIG_FTRACE, CONFIG_FUNCTION_PROFILER, …
    – See Linux source: Documentation/trace/ftrace.txt
    – A collection of powerful features, good for hacking
    • Use directly via /sys/kernel/debug/tracing (not easy):
    linux-4.0.0+# ls /sys/kernel/debug/tracing/
    available_events            max_graph_depth   stack_max_size
    available_filter_functions  options           stack_trace
    available_tracers           per_cpu           stack_trace_filter
    buffer_size_kb              printk_formats    trace
    […]
    • Or via front-ends:
    – Steven's trace-cmd
    – my perf-tools: iosnoop, iolatency, funccount, kprobe, …
    
slide 56:
    ftrace: perf-tools iosnoop
    • Block I/O (disk) events with latency:
    # ./iosnoop -ts
    Tracing block I/O. Ctrl-C to end.
    STARTs          ENDs            COMM       PID  TYPE DEV    BLOCK  BYTES  LATms
    5982800.302061  5982800.302679  supervise  […]  […]  202,1  […]
    5982800.302423  5982800.302842  supervise  […]  […]  202,1  […]
    5982800.304962  5982800.305446  supervise  […]  […]  202,1  […]
    5982800.305250  5982800.305676  supervise  […]  […]  202,1  […]
    […]
    # ./iosnoop -h
    USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name] [duration]
       -d device   # device string (eg, "202,1)
       -i iotype   # match type (eg, '*R*' for all reads)
       -n name     # process name to match on I/O issue
       -p PID      # PID to match on I/O issue
       -Q          # include queueing time in LATms
       -s          # include start time of I/O (s)
       -t          # include completion time of I/O (s)
       -h          # this usage message
       duration    # duration seconds, and use buffers
    […]
    
slide 57:
    ftrace: perf-tools iolatency
    • Block I/O (disk) latency distributions:
    # ./iolatency
    Tracing block I/O. Output every 1 seconds. Ctrl-C to end.
    >=(ms) .. <(ms)  : I/O  |Distribution                          |
          0 -> 1     : 2104 |######################################|
          1 -> 2     : 280  |######                                |
          2 -> 4     : 2    |#                                     |
          4 -> 8     : 0    |                                      |
          8 -> 16    : 202  |####                                  |
    >=(ms) .. <(ms)  : I/O  |Distribution                          |
          0 -> 1     : 1144 |######################################|
          1 -> 2     : 267  |#########                             |
          2 -> 4     : 10   |#                                     |
          4 -> 8     : 5    |#                                     |
          8 -> 16    : 248  |#########                             |
         16 -> 32    : 601  |####################                  |
         32 -> 64    : 117  |####                                  |
    […]
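Power-of-two latency buckets like these can be produced with a simple log2 histogram; a sketch that mirrors the bucket edges above (the tool itself counts in-kernel via ftrace, so this only models the binning):

```python
def log2_histogram(latencies_ms, max_bucket=64):
    """Count latencies into power-of-two millisecond buckets:
    [0,1), [1,2), [2,4), [4,8), ... as in iolatency's output.
    Returns (low_edge, high_edge, count) tuples."""
    edges = [0, 1]
    while edges[-1] < max_bucket:
        edges.append(edges[-1] * 2)
    counts = [0] * (len(edges) - 1)
    for lat in latencies_ms:
        for i in range(len(edges) - 1):
            if edges[i] <= lat < edges[i + 1]:
                counts[i] += 1
                break
    return list(zip(edges[:-1], edges[1:], counts))

hist = log2_histogram([0.5, 1.5, 3, 3, 9])
```

Logarithmic buckets keep the output compact while still exposing multi-modal distributions (e.g. cache hits vs disk reads) that an average would hide.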
    
slide 58:
    ftrace: perf-tools funccount
    • Count a kernel function call rate:
    # ./funccount -i 1 'bio_*'
    Tracing "bio_*"... Ctrl-C to end.
    FUNC                      COUNT
    bio_attempt_back_merge      […]
    bio_get_nr_vecs             […]
    bio_alloc                   […]
    bio_alloc_bioset            […]
    bio_endio                   […]
    bio_free                    […]
    bio_fs_destructor           […]
    bio_init                    […]
    bio_integrity_enabled       […]
    bio_put                     […]
    bio_add_page                […]
    [...]
    (counts are in-kernel, for low overhead)
    – -i: set an output interval (seconds), otherwise until Ctrl-C
    
slide 59:
    ftrace: perf-tools uprobe
    • Dynamically trace user-level functions; eg, MySQL:
    # ./uprobe 'p:dispatch_command /opt/mysql/bin/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcj
    +0(%dx):string'
    Tracing uprobe dispatch_command (p:dispatch_command /opt/mysql/bin/mysqld:0x2dbd40 +0(%dx):string).
    Ctrl-C to end.
    mysqld-2855 [001] d... 19956674.509085: dispatch_command: (0x6dbd40) arg1="show tables"
    mysqld-2855 [001] d... 19956675.541155: dispatch_command: (0x6dbd40) arg1="SELECT * FROM
    numbers where number > 32000"
    – Filter on string match; eg, "SELECT":
    # ./uprobe 'p:dispatch_command /opt/mysql/bin/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcj
    cmd=+0(%dx):string' 'cmd ~ "SELECT*"'
    Tracing uprobe dispatch_command (p:dispatch_command /opt/mysql/bin/mysqld:0x2dbd40 cmd=
    +0(%dx):string). Ctrl-C to end.
    mysqld-2855 [001] d... 19956754.619958: dispatch_command: (0x6dbd40) cmd="SELECT * FROM
    numbers where number > 32000"
    mysqld-2855 [001] d... 19956755.060125: dispatch_command: (0x6dbd40) cmd="SELECT * FROM
    numbers where number > 32000"
    • Ok for hacking, but not friendly; need perf_events/BPF
    
slide 60:
    perf_events	
      
    • Powerful profiler (covered earlier) and tracer:
    – User-level and kernel dynamic tracing
    – Kernel line tracing and local variables (debuginfo)
    – Kernel filtering expressions
    – Efficient in-kernel counts (perf stat)
    • Intended as the official Linux tracer/profiler
    • Becoming more programmable with BPF support (2016)
    – Search lkml for "perf" and "BPF"
    
slide 61:
    perf_events: Listing Events
    # perf list
      cpu-cycles OR cycles             [Hardware event]
      instructions                     [Hardware event]
      cache-references                 [Hardware event]
      cache-misses                     [Hardware event]
      branch-instructions OR branches  [Hardware event]
      […]
      skb:kfree_skb                    [Tracepoint event]
      skb:consume_skb                  [Tracepoint event]
      skb:skb_copy_datagram_iovec      [Tracepoint event]
      net:net_dev_xmit                 [Tracepoint event]
      net:net_dev_queue                [Tracepoint event]
      net:netif_receive_skb            [Tracepoint event]
      net:netif_rx                     [Tracepoint event]
      […]
      block:block_rq_abort             [Tracepoint event]
      block:block_rq_requeue           [Tracepoint event]
      block:block_rq_complete          [Tracepoint event]
      block:block_rq_insert            [Tracepoint event]
      block:block_rq_issue             [Tracepoint event]
      block:block_bio_bounce           [Tracepoint event]
      block:block_bio_complete         [Tracepoint event]
      […]
    
slide 62:
    perf_events: Tracing Tracepoints
    # perf record -e block:block_rq_complete -a sleep 10
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.428 MB perf.data (~18687 samples) ]
    # perf script
    run 30339 [000] 2083345.722767: block:block_rq_complete: 202,1 W () 12984648 + 8 [0]
    run 30339 [000] 2083345.722857: block:block_rq_complete: 202,1 W () 12986336 + 8 [0]
    run 30339 [000] 2083345.723180: block:block_rq_complete: 202,1 W () 12986528 + 8 [0]
    swapper     0 [000] 2083345.723489: block:block_rq_complete: 202,1 W () 12986496 + 8 [0]
    swapper     0 [000] 2083346.745840: block:block_rq_complete: 202,1 WS () 1052984 + 144 [0]
    supervise 30342 [000] 2083346.746571: block:block_rq_complete: 202,1 WS () 1053128 + 8 [0]
    supervise 30342 [000] 2083346.746663: block:block_rq_complete: 202,1 W () 12986608 + 8 [0]
    run 30342 [000] 2083346.747003: block:block_rq_complete: 202,1 W () 12986832 + 8 [0]
    [...]
    • If -g is used in "perf record", stack traces are included
    • If "perf script" output is too verbose, try "perf report",
    or making a flame graph
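Each of these output lines has a fixed shape, so post-processing with a script is another option; a sketch that extracts the device, sector, and I/O size fields (the regex is an assumption fitted to the sample lines above, not a general perf script parser):

```python
import re

# Matches lines like:
#   run 30339 [000] 2083345.722767: block:block_rq_complete: 202,1 W () 12984648 + 8 [0]
LINE = re.compile(
    r"(?P<comm>\S+)\s+(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<ts>[\d.]+):\s+block:block_rq_complete:\s+"
    r"(?P<dev>\d+,\d+)\s+(?P<rwbs>\S+)\s+\(\)\s+"
    r"(?P<sector>\d+)\s+\+\s+(?P<nr_sectors>\d+)")

def parse_complete(line):
    """Return a dict of fields from one block_rq_complete line,
    or None if the line doesn't match."""
    m = LINE.match(line.strip())
    return m.groupdict() if m else None

ev = parse_complete(
    "run 30339 [000] 2083345.722767: block:block_rq_complete: "
    "202,1 W () 12984648 + 8 [0]")
```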
    
slide 63:
    BPF	
      
    • Enhanced Berkeley Packet Filter, now just "BPF"
    – Enhancements added in Linux 3.15, 3.19, 4.1, 4.5, …
    • Provides programmatic tracing
    – measure latency, custom histograms, …
    System dashboards of 2017+ should look very different
    
slide 64:
    BPF: bcc ext4slower
    • ext4 operations slower than the threshold:
    # ./ext4slower 1
    Tracing ext4 operations slower than 1 ms
    TIME     COMM   PID  T BYTES  OFF_KB  LAT(ms) FILENAME
    06:49:17 bash   […]  R 128    […]        7.75 cksum
    06:49:17 cksum  […]  R 39552  […]        1.34 [
    06:49:17 cksum  […]  R 96     […]        5.36 2to3-2.7
    06:49:17 cksum  […]  R 96     […]       14.94 2to3-3.4
    06:49:17 cksum  […]  R 10320  […]        6.82 411toppm
    06:49:17 cksum  […]  R 65536  […]        4.01 a2p
    06:49:17 cksum  […]  R 55400  […]        8.77 ab
    06:49:17 cksum  […]  R 36792  […]       16.34 aclocal-1.14
    06:49:17 cksum  […]  R 15008  […]       19.31 acpi_listen
    […]
    • Better indicator of application pain than disk I/O
    • Measures & filters in-kernel for efficiency using BPF
    – https://github.com/iovisor/bcc
    
slide 65:
    BPF: bcc tools (early 2016)
    
slide 66:
    6. Tuning
    
slide 67:
    Ubuntu Trusty Tuning: Early 2016 (1/2)
    CPU
    schedtool -B PID
    disable Ubuntu apport (crash reporter)
    also upgrade to Xenial (better HT scheduling)
    Virtual Memory
    vm.swappiness = 0                  # from 60
    kernel.numa_balancing = 0          # temp workaround
    Huge Pages
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    File System
    vm.dirty_ratio = 80                # from 40
    vm.dirty_background_ratio = 5      # from 10
    vm.dirty_expire_centisecs = 12000  # from 3000
    mount -o defaults,noatime,discard,nobarrier …
    Storage I/O
    /sys/block/*/queue/rq_affinity
    /sys/block/*/queue/scheduler    noop
    /sys/block/*/queue/nr_requests
    
slide 68:
    Ubuntu Trusty Tuning: Early 2016 (2/2)
    Storage (continued)
    /sys/block/*/queue/read_ahead_kb 256
    mdadm --chunk=64 …
    Networking
    net.core.somaxconn = 1000
    net.core.netdev_max_backlog = 5000
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.ipv4.tcp_wmem = 4096 12582912 16777216
    net.ipv4.tcp_rmem = 4096 12582912 16777216
    net.ipv4.tcp_max_syn_backlog = 8096
    net.ipv4.tcp_slow_start_after_idle = 0
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.ip_local_port_range = 10240 65535
    net.ipv4.tcp_abort_on_overflow = 1   # maybe
    Hypervisor (Xen)
    echo tsc > /sys/devices/…/current_clocksource
    Plus PVHVM (HVM), SR-IOV, …
    
slide 69:
    Summary	
      
    A brief discussion of 6 facets of Linux performance:
    Observability
    Methodologies
    Benchmarking
    Profiling
    Tracing
    Tuning
    
slide 70:
    Takeaways	
      
    Some things to print out for your office wall:
    uptime
    dmesg -T | tail
    vmstat 1
    mpstat -P ALL 1
    pidstat 1
    iostat -xz 1
    free -m
    sar -n DEV 1
    sar -n TCP,ETCP 1
    top
    
slide 71:
    More Links
    Netflix Tech Blog on Linux:
    http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
    http://techblog.netflix.com/2015/08/netflix-at-velocity-2015-linux.html
    Linux Performance:
    http://www.brendangregg.com/linuxperf.html
    USE Method Linux:
    http://www.brendangregg.com/USEmethod/use-linux.html
    Linux ftrace:
    https://www.kernel.org/doc/Documentation/trace/ftrace.txt
    https://github.com/brendangregg/perf-tools
    Linux perf_events:
    https://perf.wiki.kernel.org/index.php/Main_Page
    http://www.brendangregg.com/perf.html
    Flame Graphs:
    http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
    http://queue.acm.org/detail.cfm?id=2927301
    
slide 72:
    Thanks	
      
    Questions?
    http://slideshare.net/brendangregg
    http://www.brendangregg.com
    bgregg@netflix.com
    @brendangregg