Velocity 2015: linux perf tools

Video: https://www.youtube.com/watch?v=FJW8nGV4jxY and https://www.youtube.com/watch?v=zrr2nUln9Kk

Tutorial slides for O'Reilly Velocity Santa Clara 2015, by Brendan Gregg.

Description: "There are many performance tools nowadays for Linux, but how do they all fit together, and when do we use them? This tutorial explains methodologies for using these tools, and provides a tour of four tool types: observability, benchmarking, tuning, and static tuning. Many tools will be discussed, including top, iostat, tcpdump, sar, perf_events, ftrace, SystemTap, sysdig, and others, as well as observability frameworks in the Linux kernel: PMCs, tracepoints, kprobes, and uprobes.

This tutorial updates and extends an earlier talk that summarized the Linux performance tool landscape. The value of this tutorial is not just learning that these tools exist and what they do, but hearing when and how they are used by a performance engineer to solve real world problems — important context that is typically not included in the standard documentation."


PDF: Velocity2015_LinuxPerfTools.pdf

Keywords (from pdftotext):

slide 1:
    Linux Performance Tools

    Brendan Gregg
    Senior Performance Architect
    Performance Engineering Team
    bgregg@netflix.com
    @brendangregg
    
slide 2:
    This Tutorial

    • A tour of many Linux performance tools
    – To show you what can be done
    – With guidance for how to do it
    • This includes objectives, discussion, live demos
    – See the video of this tutorial
    Observability
    Benchmarking
    Tuning
    Static Tuning
slide 3:
    • Massive AWS EC2 Linux cloud
    – 10s of thousands of cloud instances
    • FreeBSD for content delivery
    – ~33% of US Internet traffic at night
    • Over 50M subscribers
    – Recently launched in ANZ
    • Use Linux server tools as needed
    – After cloud monitoring (Atlas, etc.) and
    instance monitoring (Vector) tools
    
slide 4:
    Agenda	
      
    • Methodologies
    • Tools
    • Tool Types:
    – Observability
    – Benchmarking
    – Tuning
    – Static
    • Profiling
    • Tracing
    
slide 5:
    Methodologies	
      
    
slide 6:
    Methodologies	
      
    • Objectives:
    – Recognize the Streetlight Anti-Method
    – Perform the Workload Characterization Method
    – Perform the USE Method
    – Learn how to start with the questions, before using tools
    – Be aware of other methodologies
    
slide 7:
    My system is slow…

    DEMO
    DISCUSSION
    
slide 8:
    Methodologies	
      
    • There are dozens of performance tools for Linux
    – Packages: sysstat, procps, coreutils, …
    – Commercial products
    • Methodologies can provide guidance for choosing and
    using tools effectively
    • A starting point, a process, and an ending point
    
slide 9:
    Anti-Methodologies
      
    • The lack of a deliberate methodology…
    
slide 10:
    Streetlight Anti-Method
slide 11:
    Drunk Man Anti-Method
slide 12:
    Blame Someone Else Anti-Method
slide 13:
    Actual Methodologies

    Problem Statement Method
    Workload Characterization Method
    USE Method
    Off-CPU Analysis
    CPU Profile Method
    RTFM Method
    Active Benchmarking (covered later)
    Static Performance Tuning (covered later)
    
slide 14:
    Problem Statement Method

    1. What makes you think there is a performance problem?
    2. Has this system ever performed well?
    3. What has changed recently? (Software? Hardware?
    Load?)
    4. Can the performance degradation be expressed in
    terms of latency or run time?
    5. Does the problem affect other people or applications (or
    is it just you)?
    6. What is the environment? Software, hardware,
    instance types? Versions? Configuration?
    
slide 15:
    Workload Characterization Method
slide 16:
    The USE Method

    • For every resource, check:
    Utilization
    Saturation
    Errors
    [diagram: a resource, annotated with its utilization, saturation, and errors]
slide 17:
    USE Method for Hardware

    • For every resource, check:
    Utilization
    Saturation
    Errors
    • Including busses & interconnects
    
slide 18:
    Linux USE Method Example

    http://www.brendangregg.com/USEmethod/use-linux.html
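A minimal sketch of how that checklist maps to standard commands (the tool choices here are illustrative; the use-linux.html page above gives the full mapping):

    # CPU: utilization and saturation (run queue "r" vs CPU count)
    $ mpstat -P ALL 1
    $ vmstat 1
    # Memory: capacity utilization and swapping (si/so columns)
    $ free -m
    $ vmstat 1
    # Disks: utilization (%util) and saturation (avgqu-sz, await)
    $ iostat -xz 1
    # Network: per-interface throughput vs line rate
    $ sar -n DEV 1
    # Errors (examples)
    $ netstat -s | grep -i error
    $ dmesg | tail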
      
    
slide 19:
    Off-CPU Analysis

    On-CPU Profiling
    Off-CPU Profiling
    (everything else)
    Thread State Transitions
slide 20:
    CPU Profile Method

    1. Take a CPU profile
    2. Understand all software in profile > 1%
    • Discovers a wide range of performance issues by their
    CPU usage
    • Narrows software to study
    [figure: Flame Graph]
    
slide 21:
    RTFM Method

    • How to understand performance tools or metrics:
    Man pages
    Books
    Web search
    Co-workers
    Prior talk slides/video (this one!)
    Support services
    Source code
    Experimentation
    Social
    
slide 22:
    Tools	
      
    
slide 23:
    Tools	
      
    • Objectives:
    – Perform the USE Method for resource utilization
    – Perform Workload Characterization for disks, network
    – Perform the CPU Profile Method using flame graphs
    – Have exposure to various observability tools:
    • Basic: vmstat, iostat, mpstat, ps, top, …
    • Intermediate: tcpdump, netstat, nicstat, pidstat, sar, …
    • Advanced: ss, slabtop, perf_events, …
    – Perform Active Benchmarking
    – Understand tuning risks
    – Perform Static Performance Tuning
    
slide 24:
    Command Line Tools

    • Useful to study even if you never use them: GUIs and
    commercial products often use the same interfaces
    [diagram: tools reading from the kernel via /proc, /sys, …]
    $ vmstat 1
    procs -----------memory---------- ---swap-- …
     r  b   swpd     free    buff   cache    so …
     9  0      0 29549320  29252 9299060       …
     2  0      0 29547876  29252 9299332       …
     4  0      0 29548124  29252 9299460       …
     5  0      0 29548840  29252 9299592       …
    
slide 25:
    Tool Types

    Type           Characteristic
    Observability  Watch activity. Safe, usually, depending on
                   resource overhead.
    Benchmarking   Load test. Caution: production tests can
                   cause issues due to contention.
    Tuning         Change. Danger: changes could hurt
                   performance, now or later with load.
    Static         Check configuration. Should be safe.
    
slide 26:
    Observability Tools
    
slide 27:
    How do you measure these?
    
slide 28:
    Observability Tools: Basic

    uptime
    top (or htop)
    vmstat
    iostat
    mpstat
    free
    
slide 29:
    uptime

    • Load average > # of CPUs, may mean CPU saturation
    – Don’t spend more than 5 seconds studying these
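A quick sketch of that check (the uptime line is the one from the top example on the next slide; nproc is an assumed way to get the CPU count):

    $ uptime
     18:50:26 up 7:43,  1 user,  load average: 4.11, 4.91, 5.22
    $ nproc
    # compare the 1-minute load average (4.11 here) against the CPU count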
    
slide 30:
    top (or htop)

    • System and per-process interval summary:
    $ top - 18:50:26 up 7:43, 1 user, load average: 4.11, 4.91, 5.22!
    Tasks: 209 total,
    1 running, 206 sleeping,
    0 stopped,
    2 zombie!
    Cpu(s): 47.1%us, 4.0%sy, 0.0%ni, 48.4%id, 0.0%wa, 0.0%hi, 0.3%si, 0.2%st!
    Mem: 70197156k total, 44831072k used, 25366084k free,
    36360k buffers!
    Swap:
    0k total,
    0k used,
    0k free, 11873356k cached!
    PID USER
    PR NI VIRT RES SHR S %CPU %MEM
    TIME+ COMMAND
    5738 apiprod
    0 62.6g 29g 352m S 417 44.2
    2144:15 java
    1386 apiprod
    0 17452 1388 964 R
    0 0.0
    0:00.02 top
    1 root
    0 24340 2272 1340 S
    0 0.0
    0:01.51 init
    2 root
    0 S
    0 0.0
    0:00.00 kthreadd
    […]!
    • %CPU is summed across all CPUs
    • Can miss short-lived processes (atop won’t)
    • Can consume noticeable CPU to read /proc
    
slide 31:
    htop	
      
    
slide 32:
    ps	
      
    • Process status listing (eg, “ASCII art forest”):
    $ ps -ef f!
    UID
    PID PPID
    […]!
    root
    root
    28261 4546
    prod
    28287 28261
    prod
    28288 28287
    prod
    3156 28288
    root
    root
    4969 4965
    […]!
    C STIME TTY
    STAT
    TIME CMD!
    0 11:08 ?
    0 17:24 ?
    0 17:24 ?
    0 17:24 pts/0
    0 19:15 pts/0
    0 11:08 ?
    0 11:08 ?
    0:00 /usr/sbin/sshd -D!
    0:00 \_ sshd: prod [priv]!
    0:00
    \_ sshd: prod@pts/0 !
    0:00
    \_ -bash!
    0:00
    \_ ps -ef f!
    0:00 /bin/sh /usr/bin/svscanboot!
    0:00 \_ svscan /etc/service!
    • Custom fields:
    $ ps -eo user,sz,rss,minflt,majflt,pcpu,args!
    USER
    RSS MINFLT MAJFLT %CPU COMMAND!
    root
    6085 2272 11928
    24 0.0 /sbin/init!
    […]!
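A further (assumed, not from the slide) use of custom fields: sorting to find the largest processes by resident memory:

    $ ps -eo pid,user,rss,vsz,pcpu,args --sort=-rss | head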
    
slide 33:
    vmstat	
      
    • Virtual memory statistics and more:
    $ vmstat –Sm 1!
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----!
    r b
    swpd
    free
    buff cache
    cs us sy id wa!
    8 0
    12 25 34 0 0!
    7 0
    0 205 186 46 13 0 0!
    8 0
    8 210 435 39 21 0 0!
    8 0
    0 218 219 42 17 0 0!
    […]!
    • USAGE: vmstat [interval [count]]
    • First output line has some summary since boot values
    – Should be all; partial is confusing
    • High level CPU summary
    – “r” is runnable tasks
    
slide 34:
    iostat	
      
    • Block I/O (disk) stats. 1st output is since boot.
    $ iostat -xmdz 1!
    Linux 3.13.0-29 (db001-eb883efa)
    08/18/2014
    _x86_64_
    Device:
    rrqm/s
    wrqm/s
    r/s
    w/s
    rMB/s
    xvda
    xvdb
    0.00 15299.00
    xvdc
    0.00 15271.00
    md0
    0.00 31082.00
    (16 CPU)!
    wMB/s \ ...!
    0.00 / ...!
    0.00 \ ...!
    0.01 / ...!
    0.01 \ ...!
    Workload	
      
    • Very useful
    set of stats
    ... \ avgqu-sz
    ... /
    ... \
    ... /
    ... \
    await r_await w_await
    Resulting Performance
slide 35:
    mpstat	
      
    • Multi-processor statistics, per-CPU:
    $ mpstat –P ALL 1!
    […]!
    08:06:43 PM CPU
    %usr
    08:06:44 PM all 53.45
    08:06:44 PM
    0 49.49
    08:06:44 PM
    1 51.61
    08:06:44 PM
    2 58.16
    08:06:44 PM
    3 54.55
    08:06:44 PM
    4 47.42
    08:06:44 PM
    5 65.66
    08:06:44 PM
    6 50.00
    […]!
    %nice
    %sys %iowait
    %irq
    %soft %steal %guest
    • Look for unbalanced workloads, hot CPUs.
    %idle!
    42.26!
    45.45!
    41.94!
    33.67!
    40.40!
    49.48!
    31.31!
    47.92!
    
slide 36:
    free	
      
    • Main memory usage:
    $ free -m!
    total
    Mem:
    -/+ buffers/cache:
    Swap:
    used
    free
    3313!
    • buffers: block device I/O cache
    • cached: virtual page cache
    shared
    buffers
    cached!
    527!
    
slide 37:
    Latency is now much higher…

    DEMO
    DISCUSSION
    
slide 38:
    Observability Tools: Basic
    
slide 39:
    Observability Tools: Intermediate

    strace
    tcpdump
    netstat
    nicstat
    pidstat
    swapon
    lsof
    sar (and collectl, dstat, etc.)
    
slide 40:
    strace	
      
    • System call tracer:
    $ strace –tttT –p 313!
    1408393285.779746 getgroups(0, NULL)   = 1 <…>
    1408393285.779873 getgroups(1, [0])    = 1 <…>
    1408393285.780797 close(3)             = 0 <…>
    1408393285.781338 write(1, "LinuxCon 2014!\n", 15LinuxCon 2014!
    ) = 15 <…>
    • Eg, -ttt: time (us) since epoch; -T: syscall time (s)
    • Translates syscall args
    – Very helpful for solving system usage issues
    • Currently has massive overhead (ptrace based)
    – Can slow the target by > 100x. Use extreme caution.
    
slide 41:
    tcpdump	
      
    • Sniff network packets for post analysis:
    $ tcpdump -i eth0 -w /tmp/out.tcpdump!
    tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes!
    ^C7985 packets captured!
    8996 packets received by filter!
    1010 packets dropped by kernel!
    # tcpdump -nr /tmp/out.tcpdump | head
    reading from file /tmp/out.tcpdump, link-type EN10MB (Ethernet)
    20:41:05.038437 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 18...!
    20:41:05.038533 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 48...!
    20:41:05.038584 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 96...!
    […]!
    • Study packet sequences with timestamps (us)
    • CPU overhead optimized (socket ring buffers), but can
    still be significant. Use caution.
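A hedged example of reducing that overhead by capturing only the traffic of interest with a BPF filter expression (the port here is illustrative):

    $ tcpdump -i eth0 -w /tmp/out.tcpdump 'tcp port 5433'
    $ tcpdump -nr /tmp/out.tcpdump | head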
    
slide 42:
    netstat	
      
    • Various network protocol statistics using -s:
    • A multi-tool:
    $ netstat –s!
    -i: interface stats
    -r: route table
    default: list conns
    • netstat -p: shows
    process details!
    • Per-second interval
    with -c
    […]!
    Tcp:!
    736455 active connections openings!
    176887 passive connection openings!
    33 failed connection attempts!
    1466 connection resets received!
    3311 connections established!
    91975192 segments received!
    180415763 segments send out!
    223685 segments retransmited!
    2 bad segments received.!
    39481 resets sent!
    […]!
    TcpExt:!
    12377 invalid SYN cookies received!
    2982 delayed acks sent!
    […]!
    
slide 43:
    nicstat	
      
    • Network interface stats, iostat-like output:
    $ ./nicstat 1!
    Time
    Int
    21:21:43
    21:21:43
    eth0
    Time
    Int
    21:21:44
    21:21:44
    eth0
    Time
    Int
    21:21:45
    21:21:45
    eth0
    […]!
    rKB/s
    rKB/s
    rKB/s
    wKB/s
    wKB/s
    wKB/s
    rPk/s
    rPk/s
    rPk/s
    wPk/s
    wPk/s
    wPk/s
    rAvs
    wAvs %Util
    4915.4 4915.4 0.00
    139.8 0.00
    rAvs
    wAvs %Util
    0.00 0.00
    58.76 40441.3 0.00
    rAvs
    wAvs %Util
    4400.8 4400.8 0.00
    54.99 2979.1 0.00
    • Check network throughput and interface %util
    • I wrote this years ago; Tim Cook ported to Linux
    Sat!
    0.00!
    0.00!
    Sat!
    0.00!
    0.00!
    Sat!
    0.00!
    0.00!
    
slide 44:
    pidstat	
      
    • Very useful process stats. eg, by-thread, disk I/O:
    $ pidstat -t 1!
    Linux 3.2.0-54 (db002-91befe03) !08/18/2014
    !_x86_64_ !(8 CPU)!
    08:57:52 PM
    TGID
    TID
    %usr %system %guest
    %CPU
    08:57:54 PM
    - 484.75
    0.00 524.58
    08:57:54 PM
    08:57:54 PM
    08:57:54 PM
    08:57:54 PM
    […]!
    $ pidstat -d 1!
    […]!
    08:58:27 PM
    PID
    kB_rd/s
    kB_wr/s kB_ccwr/s Command!
    08:58:28 PM
    0.00 java!
    […]!
    • I usually prefer this over top(1)
    CPU
    Command!
    java!
    |__java!
    |__java!
    |__java!
    |__java!
    
slide 45:
    swapon	
      
    • Show swap device usage:
    $ swapon -s!
    Filename
    /dev/sda3
    Type
    partition
    Size
    • If you have swap enabled…
    Used
    Priority!
    -1!
    
slide 46:
    lsof	
      
    • More a debug tool, lsof(8) shows file descriptor usage,
    which for some apps, equals current active network
    connections:
    # lsof -iTCP -sTCP:ESTABLISHED!
    COMMAND
    PID USER FD TYPE DEVICE SIZE/OFF NODE NAME!
    sshd
    755 root 3r IPv4 13576887
    0t0 TCP bgregg-test-i-f106:ssh->prod100.netflix.com:
    15241 (ESTABLISHED)!
    platforms 2614 app1 8u IPv4
    0t0 TCP localhost:33868->localhost:5433 (ESTABLISHED)!
    postgres 2648 app1 7u IPv4
    0t0 TCP localhost:5433->localhost:33868 (ESTABLISHED)!
    epic_plug 2857 app1 7u IPv4
    0t0 TCP localhost:33885->localhost:5433 (ESTABLISHED)!
    postgres 2892 app1 7u IPv4
    0t0 TCP localhost:5433->localhost:33885 (ESTABLISHED)!
    […]!
    • I’d prefer to: echo /proc/PID/fd | wc -l!
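A sketch of that /proc-based count (PID stands in for the target process; this counts fd entries directly instead of parsing lsof output):

    $ ls /proc/PID/fd | wc -l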
    
slide 47:
    sar	
      
    • System Activity Reporter. Many stats, eg:
    $ sar -n TCP,ETCP,DEV 1!
    Linux 3.2.55 (test-e4f1a80b)
    !08/18/2014
    !_x86_64_ !(8 CPU)!
    09:10:43 PM IFACE rxpck/s txpck/s
    rxkB/s
    txkB/s rxcmp/s txcmp/s
    09:10:44 PM
    09:10:44 PM
    eth0 4114.00 4186.00 4537.46 28513.24
    09:10:43 PM active/s passive/s
    iseg/s
    oseg/s!
    09:10:44 PM
    4107.00 22511.00!
    09:10:43 PM atmptf/s estres/s retrans/s isegerr/s
    orsts/s!
    09:10:44 PM
    1.00!
    […]!
    • Archive or live mode: (interval [count])
    • Well designed. Header naming convention, logical
    groups: TCP, ETCP, DEV, EDEV, …
    rxmcst/s!
    0.00!
    0.00!
    
slide 48:
    Observability: sar
    
slide 49:
    Other Tools

    • You may also use collectl, atop, dstat, or another
    measure-all tool
    • The tool isn’t important – it’s important to have a way to
    measure everything
    • In cloud environments, you are probably using a
    monitoring product, developed in-house or commercial.
    – We develop Atlas for cloud-wide monitoring, and Vector for
    instance-level analysis (both NetflixOSS)
    – Same method applies…
    
slide 50:
    How does your monitoring tool measure these?
    
slide 51:
    App is taking forever…

    DEMO
    DISCUSSION
    
slide 52:
    Observability Tools: Intermediate
    
slide 53:
    Advanced Observability Tools

    • Misc:
    – ltrace, ss, iptraf, ethtool, snmpget, lldptool, iotop, blktrace,
    slabtop, /proc, pcstat
    • CPU Performance Counters:
    – perf_events, tiptop, rdmsr
    • Advanced Tracers:
    – perf_events, ftrace, eBPF, SystemTap, ktap, LTTng,
    dtrace4linux, sysdig
    • Some selected demos…
    
slide 54:
    ss	
      
    • More socket statistics:
    $ ss -mop!
    State
    Recv-Q Send-Q
    Local Address:Port
    Peer Address:Port
    CLOSE-WAIT 1
    127.0.0.1:42295
    127.0.0.1:28527
    users:(("apacheLogParser",2702,3))!
    ! mem:(r1280,w0,f2816,t0)!
    ESTAB
    127.0.0.1:5433
    127.0.0.1:41312
    timer:(keepalive,36min,0) users:(("postgres",2333,7))!
    ! mem:(r0,w0,f0,t0)!
    […]!
    $ ss –i!
    State
    Recv-Q Send-Q
    Local Address:Port
    Peer Address:Port
    CLOSE-WAIT 1
    127.0.0.1:42295
    127.0.0.1:28527
    cubic wscale:6,6 rto:208 rtt:9/6 ato:40 cwnd:10 send 145.6Mbps rcv_space:32792!
    ESTAB
    10.144.107.101:ssh
    10.53.237.72:4532
    cubic wscale:4,6 rto:268 rtt:71.5/3 ato:40 cwnd:10 send 1.5Mbps rcv_rtt:72
    rcv_space:14480!
    […]!
    
slide 55:
    iptraf	
      
    
slide 56:
    iotop	
      
    • Block device I/O (disk) by process:
    $ iotop!
    Total DISK READ:
    TID PRIO USER
    959 be/4 root
    6641 be/4 root
    1 be/4 root
    2 be/4 root
    3 be/4 root
    4 be/4 root
    5 be/4 root
    6 rt/4 root
    […]!
    50.47 M/s | Total DISK WRITE:
    59.21 M/s!
    DISK READ DISK WRITE SWAPIN
    IO>
    COMMAND
    0.00 B/s
    0.00 B/s 0.00 % 99.99 % [flush-202:1]!
    50.47 M/s
    82.60 M/s 0.00 % 32.51 % java –Dnop –X!
    0.00 B/s
    0.00 B/s 0.00 % 0.00 % init!
    0.00 B/s
    0.00 B/s 0.00 % 0.00 % [kthreadd]!
    0.00 B/s
    0.00 B/s 0.00 % 0.00 % [ksoftirqd/0]!
    0.00 B/s
    0.00 B/s 0.00 % 0.00 % [kworker/0:0]!
    0.00 B/s
    0.00 B/s 0.00 % 0.00 % [kworker/u:0]!
    0.00 B/s
    0.00 B/s 0.00 % 0.00 % [migration/0]!
    • Needs kernel support enabled
    – CONFIG_TASK_IO_ACCOUNTING
    
slide 57:
    slabtop	
      
    • Kernel slab allocator memory usage:
    $ slabtop!
    Active / Total Objects (% used)
    : 4692768 / 4751161 (98.8%)!
    Active / Total Slabs (% used)
    : 129083 / 129083 (100.0%)!
    Active / Total Caches (% used)
    : 71 / 109 (65.1%)!
    Active / Total Size (% used)
    : 729966.22K / 738277.47K (98.9%)!
    Minimum / Average / Maximum Object : 0.01K / 0.16K / 8.00K!
    OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
    3565575 3565575 100%
    0.10K 91425
    365700K buffer_head!
    314916 314066 99%
    0.19K 14996
    59984K dentry!
    184192 183751 99%
    0.06K
    11512K kmalloc-64!
    138618 138618 100%
    0.94K
    130464K xfs_inode!
    138602 138602 100%
    0.21K
    29968K xfs_ili!
    102116 99012 96%
    0.55K
    58352K radix_tree_node!
    97482 49093 50%
    0.09K
    9284K kmalloc-96!
    22695 20777 91%
    0.05K
    1068K shared_policy_node!
    21312 21312 100%
    0.86K
    18432K ext4_inode_cache!
    16288 14601 89%
    0.25K
    4072K kmalloc-256!
    […]!
    
slide 58:
    pcstat	
      
    • Show page cache residency by file:
    # ./pcstat data0*!
    |----------+----------------+------------+-----------+---------|!
    | Name
    | Size
    | Pages
    | Cached
    | Percent |!
    |----------+----------------+------------+-----------+---------|!
    | data00
    | 104857600
    | 25600
    | 25600
    | 100.000 |!
    | data01
    | 104857600
    | 25600
    | 25600
    | 100.000 |!
    | data02
    | 104857600
    | 25600
    | 4080
    | 015.938 |!
    | data03
    | 104857600
    | 25600
    | 25600
    | 100.000 |!
    | data04
    | 104857600
    | 25600
    | 16010
    | 062.539 |!
    | data05
    | 104857600
    | 25600
    | 0
    | 000.000 |!
    |----------+----------------+------------+-----------+---------|!
    • Uses the mincore(2) syscall. Useful for database
    performance analysis.
    
slide 59:
    perf_events	
      
    • Provides the "perf" command
    • In Linux source code: tools/perf
    – Usually pkg added by linux-tools-common, etc.
    • Multi-tool with many capabilities
    – CPU profiling
    – PMC profiling
    – Static & dynamic tracing
    • Covered later in Profiling & Tracing
    
slide 60:
slide 61:
    rdmsr	
      
    • Model Specific Registers (MSRs), unlike PMCs, can be
    read by default in Xen guests
    – Timestamp clock, temp, power, …
    – Use rdmsr(1) from the msr-tools package to read them
    – From https://github.com/brendangregg/msr-cloud-tools:
    ec2-guest# ./showboost!
    [...]!
    TIME
    C0_MCYC
    C0_ACYC
    UTIL RATIO
    06:11:35
    51%
    116%
    06:11:40
    50%
    115%
    06:11:45
    49%
    115%
    [...]!
    ec2-guest# ./cputemp 1!
    CPU1 CPU2 CPU3 CPU4!
    61 61 60 59!
    CPU	
      Temperature	
      
    60 61 60 60!
    [...]!
    MHz!
    2900!
    2899!
    2899!
    Real	
      CPU	
      MHz	
      
    
slide 62:
    More Advanced Tools…

    • Some others worth mentioning:
    Tool       Description
    ltrace     Library call tracer
    ethtool    Mostly interface tuning; some stats
    snmpget    SNMP network host statistics
slide 63:
    Advanced Tracers

    • Many options on Linux:
    – perf_events, ftrace, eBPF, SystemTap, ktap, LTTng,
    dtrace4linux, sysdig
    • Most can do static and dynamic tracing
    – Static: pre-defined events (tracepoints)
    – Dynamic: instrument any software (kprobes, uprobes).
    Custom metrics on-demand. Catch all.
    • Many are in-development
    
slide 64:
    Linux Observability Tools
    
slide 65:
    Linux Observability Tools
    
slide 66:
    Benchmarking Tools
    
slide 67:
    Benchmarking Tools

    • Multi:
    – UnixBench, lmbench, sysbench, perf bench
    • FS/disk:
    – dd, hdparm, fio
    • App/lib:
    – ab, wrk, jmeter, openssl
    • Networking:
    – ping, hping3, iperf, ttcp, traceroute, mtr, pchar
    
slide 68:
    Benchmarking	
      
    • ~100% of benchmarks are wrong
    • Results are usually misleading:
    you benchmark A, but actually measure B, and conclude
    you measured C
    • Common mistakes:
    – Testing the wrong target: eg, FS cache instead of disk
    – Choosing the wrong target: eg, disk instead of FS cache
    … doesn’t resemble real world usage
    – Invalid results: eg, bugs
    • The energy needed to refute benchmarks is multiple
    orders of magnitude bigger than to run them
    
slide 69:
    Active Benchmarking
slide 70:
    lmbench	
      
    • CPU, memory, and kernel micro-benchmarks
    • Eg, memory latency by stride size:
    $ lat_mem_rd 100m 128 > out.latencies
    some R processing…
    [plot: memory access latency vs size, showing L1 cache, L2 cache, L3 cache, and main memory levels]
    
slide 71:
    fio	
      
    • FS or disk I/O micro-benchmarks
    $ fio --name=seqwrite --rw=write --bs=128k --size=122374m!
    […]!
    seqwrite: (groupid=0, jobs=1): err= 0: pid=22321!
    write: io=122374MB, bw=840951KB/s, iops=6569 , runt=149011msec!
    clat (usec): min=41 , max=133186 , avg=148.26, stdev=1287.17!
    lat (usec): min=44 , max=133188 , avg=151.11, stdev=1287.21!
    bw (KB/s) : min=10746, max=1983488, per=100.18%, avg=842503.94,
    stdev=262774.35!
    cpu
    : usr=2.67%, sys=43.46%, ctx=14284, majf=1, minf=24!
    IO depths
    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%!
    submit
    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%!
    complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%!
    issued r/w/d: total=0/978992/0, short=0/0/0!
    lat (usec): 50=0.02%, 100=98.30%, 250=1.06%, 500=0.01%, 750=0.01%!
    lat (usec): 1000=0.01%!
    lat (msec): 2=0.01%, 4=0.01%, 10=0.25%, 20=0.29%, 50=0.06%!
    lat (msec): 100=0.01%, 250=0.01%!
    • Results include basic latency distribution
    
slide 72:
    pchar	
      
    • Traceroute with bandwidth per hop!
    $ pchar 10.71.83.1!
    […]!
    4: 10.110.80.1 (10.110.80.1)!
    Partial loss:
    0 / 5 (0%)!
    Partial char:
    rtt = 9.351109 ms, (b = 0.004961 ms/B), r2 = 0.184105!
    stddev rtt = 4.967992, stddev b = 0.006029!
    Partial queueing: avg = 0.000000 ms (0 bytes)!
    Hop char:
    rtt = --.--- ms, bw = 1268.975773 Kbps!
    Hop queueing:
    avg = 0.000000 ms (0 bytes)!
    5: 10.193.43.181 (10.193.43.181)!
    Partial loss:
    0 / 5 (0%)!
    Partial char:
    rtt = 25.461597 ms, (b = 0.011934 ms/B), r2 = 0.228707!
    stddev rtt = 10.426112, stddev b = 0.012653!
    Partial queueing: avg = 0.000000 ms (0 bytes)!
    Hop char:
    rtt = 16.110487 ms, bw = 1147.210397 Kbps!
    Hop queueing:
    avg = 0.000000 ms (0 bytes)!
    […]!
    • Needs love. Based on pathchar (Linux 2.0.30).
    
slide 73:
    Benchmarking Tools
    
slide 74:
    Tuning Tools
    
slide 75:
    Tuning Tools

    • Generic interfaces:
    – sysctl, /sys
    • Many areas have custom tuning tools:
    – Applications: their own config
    – CPU/scheduler: nice, renice, taskset, ulimit, chcpu
    – Storage I/O: tune2fs, ionice, hdparm, blockdev, …
    – Network: ethtool, tc, ip, route
    – Dynamic patching: stap, kpatch
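As a minimal sketch of the generic sysctl interface listed above (the parameter is only an example; remember the tuning risks noted elsewhere in this tutorial):

    $ sysctl net.ipv4.tcp_congestion_control
    net.ipv4.tcp_congestion_control = cubic
    # sysctl -w net.ipv4.tcp_congestion_control=cubic          # set for the running kernel
    # echo cubic > /proc/sys/net/ipv4/tcp_congestion_control   # equivalent via /proc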
    
slide 76:
    Tuning Methods

    • Scientific Method:
    Question
    Hypothesis
    Prediction
    Test
    Analysis
    • Any observational or benchmarking tests you can try
    before tuning?
    • Consider risks, and see previous tools
    
slide 77:
    Tuning Tools
    
slide 78:
    Static Performance Tuning
slide 79:
    Static Performance Tuning
slide 80:
    CPU Types & Flags

    $ more /proc/cpuinfo !
    processor!
    !: 0!
    vendor_id!
    !: GenuineIntel!
    cpu family
    !: 6!
    model
    !: 42!
    model name
    !: Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz!
    stepping !
    !: 7!
    microcode!
    !: 0x1a!
    cpu MHz !
    !: 1600.000!
    cache size
    !: 6144 KB!
    physical id
    !: 0!
    siblings !
    !: 4!
    core id !
    !: 0!
    cpu cores!
    !: 4!
    apicid
    !: 0!
    initial apicid!: 0!
    fpu !
    !: yes!
    fpu_exception !: yes!
    cpuid level
    !: 13!
    wp !
    !: yes!
    flags
    !: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
    clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc ar!
    ch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni
    pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2
    x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts
    dtherm tpr_shadow vnmi flexpriority ept vpid!
    […]!
    CPU	
      speed	
      s
slide 81:
    CPU Frequency Scaling

    • Kernel may be configured to dynamically modify CPU
    frequency:
    # cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies !
    3101000 3100000 2900000 2700000 2500000 2300000 2100000 1900000 1700000 1600000 !
    # cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor !
    ondemand!
    – See Documentation/cpu-freq/governors.txt,
    and scaling_governor == performance
    • Not to be confused with Intel Turbo Boost (which is H/W)
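A sketch of checking and (as root) switching the governor, using the sysfs paths from the slide; the tee-over-a-glob form is an assumption:

    $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    ondemand
    # echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor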
    
slide 82:
    Storage Devices

    # cat /proc/scsi/scsi!
    Attached devices:!
    Host: scsi0 Channel: 00 Id: 00 Lun: 00!
    Vendor: ATA
    Model: ST3320413AS
    Rev: JC65!
    Type:
    Direct-Access
    ANSI SCSI revision: 05!
    Host: scsi1 Channel: 00 Id: 00 Lun: 00!
    Vendor: PLDS
    Model: DVD-RW DH16ABSH Rev: YL32!
    Type:
    CD-ROM
    ANSI SCSI revision: 05!
    # lsscsi!
    [0:0:0:0]
    disk
    ATA
    ST3320413AS
    JC65 /dev/sda !
    [1:0:0:0]
    cd/dvd PLDS
    DVD-RW DH16ABSH YL32 /dev/sr0 !
    • Micro-benchmarking disks (not file systems!) is also
    useful for understanding their characteristics
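One possible (illustrative, not from the slide) device micro-benchmark is a direct read of the raw block device, bypassing the file system and page cache; read-only, but still use caution on busy production systems:

    # dd if=/dev/sda of=/dev/null bs=1M count=1024 iflag=direct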
    
slide 83:
    Routing Table
slide 84:
    etc…	
      
    System messages: dmesg!
    Network interface config: ifconfig –a; ip link
    File system capacity: df -h!
    Volume config: mdadm --misc -D /dev/md0 !
    Storage device info: smartctl!
    NUMA config: numactl -s; numactl -H!
    PCI info: lspci
    Installed kernel modules: lsmod!
    Root crontab config: crontab –l!
    Services: service --status-all!
    
slide 85:
    Static Performance Tuning Tools
slide 86:
    Profiling	
      
    
slide 87:
    Profiling	
      
    • Objectives:
    – Profile CPU usage by stack sampling
    – Generate CPU flame graphs
    – Understand gotchas with stacks & symbols
    
slide 88:
    CPU Profiling

    • Record stacks at a timed interval: simple and effective
    – Pros: Low (deterministic) overhead
    – Cons: Coarse accuracy, but usually sufficient
    [diagram: stack samples over time, on-CPU vs off-CPU (syscall, block, interrupt)]
    
slide 89:
    perf_events	
      
    • Introduced earlier: multi-tool, profiler. Provides "perf".
    usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]!
    The most commonly used perf commands are:!
    annotate
    Read perf.data (created by perf record) and display annotated code!
    archive
    Create archive with object files with build-ids found in perf.data file!
    bench
    General framework for benchmark suites!
    buildid-cache
    Manage build-id cache.!
    buildid-list
    List the buildids in a perf.data file!
    data
    Data file related processing!
    diff
    Read perf.data files and display the differential profile!
    evlist
    List the event names in a perf.data file!
    inject
    Filter to augment the events stream with additional information!
    kmem
    Tool to trace/measure kernel memory(slab) properties!
    kvm
    Tool to trace/measure kvm guest os!
    list
    List all symbolic event types!
    lock
    Analyze lock events!
    mem
    Profile memory accesses!
    record
    Run a command and record its profile into perf.data!
    report
    Read perf.data (created by perf record) and display the profile!
    sched
    Tool to trace/measure scheduler properties (latencies)!
    script
    Read perf.data (created by perf record) and display trace output!
    stat
    Run a command and gather performance counter statistics!
    test
    Runs sanity tests.!
    timechart
    Tool to visualize total system behavior during a workload!
    top
    System profiling tool.!
    trace
    strace inspired tool!
    probe
    Define new dynamic tracepoints!
    See 'perf help COMMAND' for more information on a specific command.!
    
slide 90:
    perf_events: CPU profiling

    • Sampling full stack traces at 99 Hertz, for 30 secs:
    # perf record -F 99 -ag -- sleep 30!
    [ perf record: Woken up 9 times to write data ]!
    [ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]!
    # perf report -n --stdio!
    1.40%
    java [kernel.kallsyms]
    [k] _raw_spin_lock
    --- _raw_spin_lock!
    |--63.21%-- try_to_wake_up!
    |--63.91%-- default_wake_function!
    |--56.11%-- __wake_up_common!
    __wake_up_locked!
    ep_poll_callback!
    __wake_up_common!
    __wake_up_sync_key!
    |--59.19%-- sock_def_readable!
    […78,000 lines truncated…]!
    
slide 91:
    perf_events: Full "report" Output
    
slide 92:
    … as a Flame Graph
    
slide 93:
    perf_events: Flame Graphs

    git clone --depth 1 https://github.com/brendangregg/FlameGraph
    cd FlameGraph
    perf record -F 99 -a -g -- sleep 30
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
    • Flame Graphs:
    – x-axis: alphabetical stack sort, to maximize merging
    – y-axis: stack depth
    – color: random, or hue can be a dimension (eg, diff)
    • Interpretation:
    – Top edge is on-CPU, beneath it is ancestry
    • Currently made from Perl + JavaScript & SVG
    • Easy to get working
    – http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
    
slide 94:
    Mysterious CPU consumer…

    DEMO
    DISCUSSION
    
slide 95:
    perf_events: Workflow

    [diagram: typical workflow]
    list events:    perf list
    count events:   perf stat
    capture stacks: perf record -> perf.data
    text UI / dump profile: perf report
    flame graph visualization
slide 96:
    perf_events: CPU Flame Graph

    [flame graph annotations: Kernel, TCP/IP, broken Java stacks (missing frame pointer),
     GC, Locks, Time, Idle thread, epoll]
    
slide 97:
    perf_events: Mixed-Mode
    CPU Flame Graph
    [flame graph annotations: Kernel, JVM (C++), (C), Fixed Java Stacks!]
    
slide 98:
    perf_events: Gotchas

    • Stack traces and symbols often don't work.
    – Can be a significant project to fix them. It is worth it!
    • C:
    – stacks: -fno-omit-frame-pointer
    • Java:
    – stacks on x86: -XX:+PreserveFramePointer
    (JDK-8068945 for JDK9, JDK-8072465 for JDK8)
    – symbols: perf-map-agent (other solutions exist)
    • Node.js:
    – symbols: run v8 with --perf_basic_prof
    http://www.brendangregg.com/blog/2015-02-27/linux-profiling-at-netflix.html
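A minimal sketch of applying those fixes (the flags are from this slide; the exact build and run commands are illustrative):

    $ gcc -O2 -fno-omit-frame-pointer -o myapp myapp.c    # C: keep frame pointers
    $ java -XX:+PreserveFramePointer ...                  # Java, per the JDK bugs above
    $ node --perf_basic_prof app.js                       # Node.js: v8 emits a JIT symbol map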
    
slide 99:
    perf_events: Counters

    • Performance Monitoring Counters (PMCs):
    $ perf list | grep –i hardware!
    cpu-cycles OR cycles
    stalled-cycles-frontend OR idle-cycles-frontend
    stalled-cycles-backend OR idle-cycles-backend
    instructions
    […]!
    branch-misses
    bus-cycles
    L1-dcache-loads
    L1-dcache-load-misses
    […]!
    rNNN (see 'perf list --help' on how to encode it)
    mem:<addr>[:access]
    [Hardware event]!
    [Hardware event]!
    [Hardware event]!
    [Hardware event]!
    [Hardware event]!
    [Hardware event]!
    [Hardware cache event]!
    [Hardware cache event]!
    [Raw hardware event … !
    [Hardware breakpoint]!
    • Identify CPU cycle breakdowns, esp. stall types
    • PMCs not enabled by-default in clouds (yet)
    • Can be time-consuming to use (CPU manuals)
    – Please develop front-ends. Eg, tiptop.
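A hedged first pass at a cycle breakdown using these counters (system-wide for 10 seconds; the event list is an example, and may be unavailable where PMCs are disabled):

    # perf stat -a -- sleep 10
    # perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend -a -- sleep 10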
    
slide 100:
    Tracing	
      
    
slide 101:
    Tracing	
      
    • Objectives:
    – Understand frameworks: tracepoints, kprobes, uprobes
    – Understand mainline tracers: ftrace, perf_events, eBPF
    – Awareness of other tracers: SystemTap, LTTng, ktap, sysdig
    – Awareness of what tracing can accomplish (eg, perf-tools)
    
slide 102:
    Tracing Frameworks: Tracepoints

    • Statically placed at logical places in the kernel
    • Provides key event details as a “format” string
    
slide 103:
    Tracing Frameworks: + probes

    • kprobes: dynamic kernel tracing
    – function calls, returns, line numbers
    • uprobes: dynamic user-level tracing
    
slide 104:
    Linux Tracing Tools

    ftrace
    perf_events
    LTTng
    ktap
    eBPF
    dtrace4linux
    OEL DTrace
    SystemTap
    sysdig
    • Many choices (too many?), all still in development
    
slide 105:
    Linux Tracing is Magic!

    ftrace
    perf_events
    LTTng
    ktap
    eBPF
    dtrace4linux
    OEL DTrace
    SystemTap
    sysdig
    (Thanks Deirdré Straughan & General Zoi's Pony Creator)
      
    
slide 106:
    Choosing a Tracer

    • Some companies standardize on one tracer
    – eg, SystemTap, LTTng, …
    
slide 107:
    Choosing a Tracer

    • My approach is:
    [flowchart: study what Linux already has built-in (perf_events, ftrace, eBPF?) ->
     is it sufficient? -> what is the purpose? (live tracing, counting, …)]
slide 108:
    ftrace
    
slide 109:
    ftrace

    • Added by Steven Rostedt and others since 2.6.27
    • Already enabled on our servers (3.2+)
    – CONFIG_FTRACE, CONFIG_FUNCTION_PROFILER, …
    – Use directly via /sys/kernel/debug/tracing:
    linux-4.0.0+# ls /sys/kernel/debug/tracing/!
    available_events
    max_graph_depth
    available_filter_functions options
    available_tracers
    per_cpu
    buffer_size_kb
    printk_formats
    buffer_total_size_kb
    README
    current_tracer
    saved_cmdlines
    dyn_ftrace_total_info
    saved_cmdlines_size
    enabled_functions
    set_event
    events
    set_ftrace_filter
    free_buffer
    set_ftrace_notrace
    function_profile_enabled
    set_ftrace_pid
    instances
    set_graph_function
    kprobe_events
    set_graph_notrace
    kprobe_profile
    snapshot!
    stack_max_size!
    stack_trace!
    stack_trace_filter!
    trace!
    trace_clock!
    trace_marker!
    trace_options!
    trace_pipe!
    trace_stat!
    tracing_cpumask!
    tracing_max_latency!
    tracing_on!
    tracing_thresh!
    – See Linux source: Documentation/trace/ftrace.txt
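A minimal sketch of driving that directory directly (standard ftrace usage per ftrace.txt; the traced function is just an example):

    # cd /sys/kernel/debug/tracing
    # echo do_sys_open > set_ftrace_filter       # limit tracing to one function
    # echo function_graph > current_tracer
    # cat trace_pipe                             # watch live events; Ctrl-C to stop
    # echo nop > current_tracer                  # disable tracing
    # echo > set_ftrace_filter                   # clear the filter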
    
slide 110:
    ftrace Front-Ends

    • Steven wrote a front-end: trace-cmd
    – Multi-tool, works well
    • I've developed the "perf-tools" front-ends
    – https://github.com/brendangregg/perf-tools
    – Both single & multi-purpose, Unix-like
    – Unsupported hacks: see WARNINGs
    • perf-tools:
    – single-purpose: iosnoop, iolatency, opensnoop
    – multi-tools: funccount, funcgraph, kprobe
    
slide 111:
    iosnoop	
      
    • Block I/O (disk) events with latency:
    # ./iosnoop –ts!
    Tracing block I/O. Ctrl-C to end.!
    STARTs
    ENDs
    COMM
    5982800.302061 5982800.302679 supervise
    5982800.302423 5982800.302842 supervise
    5982800.304962 5982800.305446 supervise
    5982800.305250 5982800.305676 supervise
    […]!
    PID
    TYPE DEV
    202,1
    202,1
    202,1
    202,1
    BLOCK
    BYTES LATms!
    0.62!
    0.42!
    0.48!
    0.43!
    # ./iosnoop –h!
    USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name] [duration]!
    -d device
    # device string (eg, "202,1)!
    -i iotype
    # match type (eg, '*R*' for all reads)!
    -n name
    # process name to match on I/O issue!
    -p PID
    # PID to match on I/O issue!
    # include queueing time in LATms!
    # include start time of I/O (s)!
    # include completion time of I/O (s)!
    # this usage message!
    duration
    # duration seconds, and use buffers!
    […]!
    
slide 112:
    iolatency	
      
    • Block I/O (disk) latency distributions:
    # ./iolatency
    Tracing block I/O. Output every 1 seconds. Ctrl-C to end.

    >=(ms) .. <(ms)   : I/O      |Distribution                          |
         0 -> 1       : 2104     |######################################|
         1 -> 2       : 280      |######                                |
         2 -> 4       : 2        |                                      |
         4 -> 8       : 0        |                                      |
         8 -> 16      : 202      |####                                  |

    >=(ms) .. <(ms)   : I/O      |Distribution                          |
         0 -> 1       : 1144     |######################################|
         1 -> 2       : 267      |#########                             |
         2 -> 4       : 10       |                                      |
         4 -> 8       : 5        |                                      |
         8 -> 16      : 248      |#########                             |
        16 -> 32      : 601      |####################                  |
        32 -> 64      : 117      |####                                  |
    […]
    
slide 113:
    opensnoop	
      
    • Trace open() syscalls showing filenames:
    # ./opensnoop -t!
    Tracing open()s. Ctrl-C to end.!
    TIMEs
    COMM
    PID
    postgres
    postgres
    postgres
    postgres
    postgres
    postgres
    postgres
    svstat
    svstat
    stat
    stat
    stat
    stat
    stat
    stat
    […]!
    FD FILE!
    0x8 /proc/self/oom_adj!
    0x5 global/pg_filenode.map!
    0x5 global/pg_internal.init!
    0x5 base/16384/PG_VERSION!
    0x5 base/16384/pg_filenode.map!
    0x5 base/16384/pg_internal.init!
    0x5 base/16384/11725!
    0x4 supervise/ok!
    0x4 supervise/status!
    0x3 /etc/ld.so.cache!
    0x3 /lib/x86_64-linux-gnu/libselinux…!
    0x3 /lib/x86_64-linux-gnu/libc.so.6!
    0x3 /lib/x86_64-linux-gnu/libdl.so.2!
    0x3 /proc/filesystems!
    0x3 /etc/nsswitch.conf!
    
slide 114:
    tpoint	
      
    • Who is creating disk I/O, and of what type?
    # ./tpoint -H block:block_rq_insert!
    tracepoint	
      
    Tracing block:block_rq_insert. Ctrl-C to end.!
    type	
      
    size	
      (sectors)	
      
    # tracer: nop!
    dev	
      
    offset	
      
    TASK-PID
    CPU#
    TIMESTAMP FUNCTION!
    | |
    flush-9:0-9318 [013] 1936182.007914: block_rq_insert: 202,16 W 0 () 160186560 + 8 [flush-9:0]!
    flush-9:0-9318 [013] 1936182.007939: block_rq_insert: 202,16 W 0 () 280100936 + 8 [flush-9:0]!
    java-9469 [014] 1936182.316184: block_rq_insert: 202,1 R 0 () 1319592 + 72 [java]!
    java-9469 [000] 1936182.331270: block_rq_insert: 202,1 R 0 () 1125744 + 8 [java]!
    java-9469 [000] 1936182.341418: block_rq_insert: 202,1 R 0 () 2699008 + 88 [java]!
    java-9469 [000] 1936182.341419: block_rq_insert: 202,1 R 0 () 2699096 + 88 [java]!
    java-9469 [000] 1936182.341419: block_rq_insert: 202,1 R 0 () 2699184 + 32 [java]!
    java-9469 [000] 1936182.345870: block_rq_insert: 202,1 R 0 () 1320304 + 24 [java]!
    java-9469 [000] 1936182.351590: block_rq_insert: 202,1 R 0 () 1716848 + 16 [java]!
    ^C!
    Ending tracing...!
    • tpoint traces a given tracepoint. -H prints the header.
    
slide 115:
    tpoint -l

    # ./tpoint -l!
    block:block_bio_backmerge!
    block:block_bio_bounce!
    block:block_bio_complete!
    block:block_bio_frontmerge!
    block:block_bio_queue!
    block:block_bio_remap!
    block:block_getrq!
    block:block_plug!
    block:block_rq_abort!
    block:block_rq_complete!
    block:block_rq_insert!
    block:block_rq_issue!
    block:block_rq_remap!
    block:block_rq_requeue!
    […]!
    # ./tpoint –l | wc –l!
    1257!
    Listing tracepoints
slide 116:
    funccount	
      
    • Count a kernel function call rate:
    # ./funccount -i 1 'bio_*'!
    Tracing "bio_*"... Ctrl-C to end.!
    FUNC
    COUNT!
    bio_attempt_back_merge
    26!
    bio_get_nr_vecs
    361!
    bio_alloc
    536!
    bio_alloc_bioset
    536!
    bio_endio
    536!
    bio_free
    536!
    bio_fs_destructor
    536!
    bio_init
    536!
    bio_integrity_enabled
    536!
    bio_put
    729!
    bio_add_page
    1004!
    [...]!
    Counts are in-kernel, for low overhead
    – -i: set an output interval (seconds), otherwise until Ctrl-C
    
slide 117:
    funcgraph	
      
    • Trace a graph of kernel code flow:
    # ./funcgraph -Htp 5363 vfs_read!
    Tracing "vfs_read" for PID 5363... Ctrl-C to end.!
    # tracer: function_graph!
    TIME
    CPU DURATION
    FUNCTION CALLS!
    4346366.073832 |
    | vfs_read() {!
    4346366.073834 |
    rw_verify_area() {!
    4346366.073834 |
    security_file_permission() {!
    4346366.073834 |
    apparmor_file_permission() {!
    4346366.073835 |
    0.153 us
    common_file_perm();!
    4346366.073836 |
    0.947 us
    4346366.073836 |
    0.066 us
    __fsnotify_parent();!
    4346366.073836 |
    0.080 us
    fsnotify();!
    4346366.073837 |
    2.174 us
    4346366.073837 |
    2.656 us
    4346366.073837 |
    tty_read() {!
    4346366.073837 |
    0.060 us
    tty_paranoia_check();!
    […]!
    
slide 118:
    kprobe	
      
    • Dynamically trace a kernel function call or return, with
    variables, and in-kernel filtering:
    # ./kprobe 'p:open do_sys_open filename=+0(%si):string' 'filename ~ "*stat"'!
    Tracing kprobe myopen. Ctrl-C to end.!
    postgres-1172 [000] d... 6594028.787166: open: (do_sys_open
    +0x0/0x220) filename="pg_stat_tmp/pgstat.stat"!
    postgres-1172 [001] d... 6594028.797410: open: (do_sys_open
    +0x0/0x220) filename="pg_stat_tmp/pgstat.stat"!
    postgres-1172 [001] d... 6594028.797467: open: (do_sys_open
    +0x0/0x220) filename="pg_stat_tmp/pgstat.stat”!
    ^C!
    Ending tracing...!
    • Add -s for stack traces; -p for PID filter in-kernel.
    • Quickly confirm kernel behavior; eg: did a tunable take
    effect?
    
slide 119:
    perf-tools (so far…)
    
slide 120:
    perf-tools (so far…)
    
slide 121:
    perf_events	
      
    
slide 122:
    perf_events	
      
    • Powerful profiler (covered earlier)
    • … and tracer:
    – User-level and kernel dynamic tracing
    – Kernel line tracing and local variables (debuginfo)
    – Kernel filtering expressions
    – Efficient in-kernel counts (perf stat)
    • Not very programmable, yet
    – Limited kernel summaries. Should improve with eBPF.
    
slide 123:
    perf_events: Listing Events
slide 124:
    Linux Event Sources
    
slide 125:
    perf_events	
      Tracing	
      Tracepoints	
      
    # perf record -e block:block_rq_complete -a sleep 10!
    [ perf record: Woken up 1 times to write data ]!
    [ perf record: Captured and wrote 0.428 MB perf.data (~18687 samples) ]!
    # perf script!
    run 30339 [000] 2083345.722767: block:block_rq_complete: 202,1 W () 12984648 + 8 [0]!
    run 30339 [000] 2083345.722857: block:block_rq_complete: 202,1 W () 12986336 + 8 [0]!
    run 30339 [000] 2083345.723180: block:block_rq_complete: 202,1 W () 12986528 + 8 [0]!
    swapper
    0 [000] 2083345.723489: block:block_rq_complete: 202,1 W () 12986496 + 8 [0]!
    swapper
    0 [000] 2083346.745840: block:block_rq_complete: 202,1 WS () 1052984 + 144 [0]!
    supervise 30342 [000] 2083346.746571: block:block_rq_complete: 202,1 WS () 1053128 + 8 [0]!
    supervise 30342 [000] 2083346.746663: block:block_rq_complete: 202,1 W () 12986608 + 8 [0]!
    run 30342 [000] 2083346.747003: block:block_rq_complete: 202,1 W () 12986832 + 8 [0]!
    [...]!
    • If -g is used in "perf record", stack traces are included
    • If "perf script" output is too verbose, try "perf report",
    or making a flame graph
    
slide 126:
    perf_events Report

    # perf record –e skb:consume_skb -ag!
    ^C[ perf record: Woken up 1 times to write data ]!
    [ perf record: Captured and wrote 0.065 MB perf.data (~2851 samples) ]!
    # perf report -n --stdio!
    [...]!
    74.42% swapper [kernel.kallsyms] [k] consume_skb!
    --- consume_skb!
    arp_process!
    arp_rcv!
    (annotation: summarizing stack traces for a tracepoint)
    __netif_receive_skb_core!
    __netif_receive_skb!
    netif_receive_skb!
    virtnet_poll!
    net_rx_action!
    __do_softirq!
    irq_exit!
    do_IRQ!
    ret_from_intr!
    default_idle!
    cpu_idle!
    start_secondary!
    […]!
    
slide 127:
    eBPF	
      
    
slide 128:
    eBPF	
      
    • Extended BPF: programs on tracepoints
    – High performance filtering: JIT
    – In-kernel summaries: maps
    – Developed by Alexei Starovoitov (Plumgrid)
    • Currently being integrated in parts (Linux 3.18, 4.1, …)
    # ./bitehist
    Tracing block device I/O... Interval 5 secs. Ctrl-C to end.

         kbytes      : count     distribution
         0 -> 1      : 3         |                                      |
         2 -> 3      : 0         |                                      |
         4 -> 7      : 3395      |************************************* |
         8 -> 15     : 1         |                                      |
        16 -> 31     : 2         |                                      |
        32 -> 63     : 738       |*******                               |
        64 -> 127    : 3         |                                      |
       128 -> 255    : 1         |                                      |
    (annotation: in-kernel summary)
    
slide 129:
    eBPF	
      
    • Example in-kernel latency heat map:
    
slide 130:
    Other Tracers
    
slide 131:
    SystemTap	
      
    • Fully programmable, fully featured
    – Including access to user-level tracepoints
    • Compiles tracing programs into kernel modules
    – Needs a compiler, and takes time
    • “Works great on Red Hat”
    – I keep trying on other distros and have hit trouble in the past
    – Make sure you are on the latest version (>= 2.6)
    • "Works great with kernel debuginfo"
    – Suited for kernel developers who have it handy
    – A difficult requirement for our cloud environment
    
slide 132:
    "Lightweight"	
      SystemTap	
      
    • SystemTap can be used without kernel debuginfo
    – Makes life harder, but some tasks are still possible
    – providers: nd_syscall, kprobe.function, kernel.trace, …
    – args via: int_arg(), uint_arg(), pointer_arg(), user_string()
    • Something I've experimented with. Examples:
    – https://github.com/brendangregg/systemtap-lwtools/
    # stap -e 'global a; probe nd_syscall.write { a 
slide 133:
    ktap	
      
    • Was a very promising new Linux tracer:
    – Sampling, static & dynamic tracing
    – Lightweight, simple. Uses bytecode.
    – Suited for embedded devices
    • Development suspended while eBPF integrates
    • Will it restart?
    
slide 134:
    sysdig	
      
    • sysdig: Innovative new tracer. Simple expressions:
    sysdig fd.type=file and evt.failed=true!
    sysdig evt.type=open and fd.name contains /etc!
    sysdig -p"%proc.name %fd.name" "evt.type=accept and proc.name!=httpd”!
    • Replacement for strace? (or “perf trace” will)
    • Programmable “chisels”. Eg, one of mine:
    # sysdig -c fileslower 1!
    TIME
    PROCESS
    2014-04-13 20:40:43.973 cksum
    2014-04-13 20:40:44.187 cksum
    2014-04-13 20:40:44.689 cksum
    […]!
    TYPE
    read
    read
    read
    LAT(ms) FILE!
    2 /mnt/partial.0.0!
    1 /mnt/partial.0.0!
    2 /mnt/partial.0.0!
    • Currently syscalls and user-level processing only
    – I'm not sure it can be optimized enough for kernel tracing,
    unless it adopts eBPF for in-kernel processing & summaries
    
slide 135:
    Present & Future

    • Present:
    – ftrace & perf_events solving many needs today:
    • PMC profiling, stack profiling, tracepoint & dynamic tracing, …
    • Expected Future:
    – eBPF for kernel summaries & advanced programs
    – eBPF perf integration to improve ease of use
    • Possible Future:
    – eBPF high level language (ktap?)
    – ftrace/eBPF integration
    – Other tracer eBPF integration (SystemTap, LTTng, sysdig?)
    – One of the other tracers going mainline?
    
slide 136:
    The Tracing Landscape, May 2015

    (my opinion)
    [chart: ease of use (brutal … less brutal) vs scope & capability, with stage of
     development (alpha … mature); plotting sysdig, perf, ftrace, stap, eBPF, ktap, dtrace4L.]
    
slide 137:
    In Summary…
    
slide 138:
    Methodologies Summary

    • Objectives:
    – Recognize the Streetlight Anti-Method
    – Perform the Workload Characterization Method
    – Perform the USE Method
    – Be aware of other methodologies
    Try to start with the questions (methodology), to help guide
    your use of the tools
    
slide 139:
    Tools Summary

    • Objectives:
    – Perform the USE Method for resource utilization
    – Perform Workload Characterization for disks, network
    – Have exposure to various observability tools:
    • Basic: vmstat, iostat, mpstat, ps, top, …
    • Intermediate: tcpdump, netstat, nicstat, pidstat, sar, …
    • Advanced: ss, slabtop, perf_events, …
    – Perform Active Benchmarking
    – Understand tuning risks
    – Perform Static Performance Tuning
    Print out the tools diagrams for your office wall
    
slide 140:
    Profiling & Tracing Summary

    • Objectives:
    – Profile CPU usage by stack sampling
    – Generate CPU flame graphs
    – Understand gotchas with stacks & symbols
    – Understand frameworks: tracepoints, kprobes, uprobes
    – Understand mainline tracers: ftrace, perf_events, eBPF
    – Awareness of other tracers: SystemTap, LTTng, ktap, sysdig
    – Awareness of what tracing can accomplish (eg, perf-tools)
    I've hopefully turned some unknown unknowns into known
    unknowns
    
slide 141:
    References & Links

    Systems Performance: Enterprise and the Cloud, Prentice Hall
slide 142:
    Thanks	
      
    Questions?
    http://slideshare.net/brendangregg
    http://www.brendangregg.com
    bgregg@netflix.com
    @brendangregg