
SCaLE11x (2013): Linux Performance Analysis and Tools

Video: http://www.youtube.com/watch?v=0yyorhl6IjM

Talk for the Southern California Linux Expo 11x (2013) by Brendan Gregg.

Description: "This talk for SCaLE11x covers system performance analysis methodologies and the Linux tools to support them, so that you can get the most out of your systems and solve performance issues quickly. This includes a wide variety of tools, including basics like top(1), advanced tools like perf, and new tools like the DTrace for Linux prototypes."


PDF: SCaLE_Linux_Performance2013.pdf

Keywords (from pdftotext):

slide 1:
    Linux Performance
    Analysis and Tools
    Brendan Gregg
    Lead Performance Engineer
    brendan@joyent.com
    @brendangregg
    Sunday, February 24, 13
    SCaLE11x
    February, 2013
    
slide 2:
    Find the Bottleneck
    [Figure: the full system diagram – Applications (DBs, all server types, ...), System Libraries, System Call Interface, the Linux kernel (VFS, ext3/..., ZFS, LVM, Block Device Interface, Sockets, TCP/UDP, Ethernet, Scheduler, Virtual Memory, Device Drivers), and the hardware (CPUs and CPU interconnect, memory bus and DRAM, I/O bus, I/O bridge, expander interconnect, I/O controller, disks, network controller, interface transports and ports)]
    Sunday, February 24, 13
    
slide 3:
    whoami
    • Lead Performance Engineer
    • Work/Research: tools, visualizations, methodologies
    • Was Brendan@Sun Microsystems, Oracle, now Joyent
    Sunday, February 24, 13
    
slide 4:
    Joyent
    • High-Performance Cloud Infrastructure
    • Compete on cloud instance/OS performance
    • Public/private cloud provider
    • OS-Virtualization for bare metal performance (Zones)
    • Core developers of SmartOS and node.js
    • KVM for Linux guests
    Sunday, February 24, 13
    
slide 5:
    SCaLE10x: Cloud Performance Analysis
    • Example perf issues, including new tools and visualizations:
    http://dtrace.org/blogs/brendan/2012/01/30/performance-analysis-talk-at-scale10x/
    Sunday, February 24, 13
    
slide 6:
    SCaLE11x: Linux Performance Analysis
    • The primary operating system for my next book:
    Systems Performance: Enterprise and the Cloud,
    Brendan Gregg, Prentice Hall, 2013
    (secondary is the OpenSolaris-illumos-based SmartOS)
    Sunday, February 24, 13
    
slide 7:
    Agenda
    • Background
    • Linux Analysis and Tools
    • Basic
    • Intermediate
    • Advanced
    • Methodologies
    • Challenges
    Sunday, February 24, 13
    
slide 8:
    Performance
    • Why do performance analysis?
    • Reduce IT spend – find and eliminate waste, find areas to
    tune, and do more with less
    • Build scalable architectures – understand system limits
    and develop around them
    • Solve issues – locate bottlenecks and latency outliers
    Sunday, February 24, 13
    
slide 9:
    Systems Performance
    • Why study the operating system?
    • Find and fix kernel-based perf issues
    • 2-20% wins: I/O or buffer size tuning, NUMA config, etc
    • 2-200x wins: bugs, disabled features, perturbations causing
    latency outliers
    • Kernels change, new devices are added, workloads scale, and
    new perf issues are encountered.
    • Analyze application perf from kernel/system context
    • 2-2000x wins: identifying and eliminating unnecessary work
    Sunday, February 24, 13
    
slide 10:
    Perspectives
    • System analysis can be top-down, or bottom-up:
    [Figure: the software stack from Workload → Application → System Libraries → System Calls → Kernel → Devices; developers work top-down (workload analysis), sysadmins work bottom-up (resource analysis)]
    Sunday, February 24, 13
    
slide 11:
    Kernel Internals
    • Eventually you’ll need to know some kernel internals
    [Figure: the operating system diagram divided into user-level (Applications – DBs, all server types, ... – and System Libraries) and kernel-level (System Call Interface, VFS, ext3/..., ZFS, LVM, Block Device Interface, Sockets, TCP/UDP, Ethernet, Scheduler, Virtual Memory, Device Drivers)]
    Sunday, February 24, 13
    
slide 12:
    Common System Metrics
    $ iostat
    Linux 3.2.6-3.fc16.x86_64 (node104)    02/20/2013    _x86_64_    (1 CPU)

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                ...

    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    vda                ...
    vdb                ...
    • It’s also worth studying common system metrics (iostat, ...),
    even if you intend to use a monitoring product. Monitoring
    products often use the same metrics, read from /proc.
    Sunday, February 24, 13
    
slide 13:
    Analysis and Tools
    Sunday, February 24, 13
    
slide 14:
    Analysis and Tools
    • A quick tour of tools, to show what can be done
    • Then, some methodologies for applying them
    Sunday, February 24, 13
    
slide 15:
    Analysis and Tools
    [Figure: the same operating system and hardware diagram, repeated as a map for the tools that follow]
    Sunday, February 24, 13
    
slide 16:
    Analysis and Tools
    [Figure: the system diagram annotated with observability tools – strace, perf, dtrace, stap at the system call interface and applications; netstat, tcpdump, ip, nicstat, ping, dstat for the network stack and network controller; top, pidstat, mpstat, dstat, perf for the CPUs; vmstat, slabtop, dstat, free, top for virtual memory and DRAM; iostat, iotop, blktrace, dtrace, perf for block I/O and disks; sar and /proc for various]
    Sunday, February 24, 13
    
slide 17:
    Tools: Basic
    • uptime
    • top or htop
    • mpstat
    • iostat
    • vmstat
    • free
    • ping
    • nicstat
    • dstat
    Sunday, February 24, 13
    
slide 18:
    uptime
    • Shows load averages, which are also shown by other tools:
    $ uptime
    16:23:34 up 126 days,
    1:03,
    1 user,
    load average: 5.09, 2.12, 1.82
    • This counts runnable threads (tasks), on-CPU, or, runnable
    and waiting. Linux includes tasks blocked on disk I/O.
    • These are exponentially-damped moving averages, with time
    constants of 1, 5 and 15 minutes. With three values you can
    see if load is increasing, steady, or decreasing.
    • If the load is greater than the CPU count, it might mean the
    CPUs are saturated (100% utilized), and threads are suffering
    scheduler latency. Might. There’s that disk I/O factor too.
    • This is only useful as a clue. Use other tools to investigate!
    Sunday, February 24, 13
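    Not from the slides: a minimal sketch of acting on that clue by comparing the 1-minute load average against the CPU count (nproc and /proc/loadavg are standard on Linux):

      #!/bin/bash
      # Compare the 1-minute load average with the CPU count (a clue, not a verdict).
      cpus=$(nproc)
      load1=$(awk '{ print $1 }' /proc/loadavg)
      echo "1-min load: ${load1}, CPUs: ${cpus}"
      if [ "${load1%.*}" -ge "${cpus}" ]; then
          echo "load >= CPU count: check CPU saturation and disk I/O with other tools"
      fi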
    
slide 19:
    top
    • System-wide and per-process summaries:
    $ top
    top - 01:38:11 up 63 days, 1:17, 2 users, load average: 1.57, 1.81, 1.77
    Tasks: 256 total,   2 running, 254 sleeping,   0 stopped,   0 zombie
    Cpu(s): 2.0%us, 3.6%sy, 0.0%ni, 94.2%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
    Mem:  49548744k total, 16746572k used, 32802172k free,   182900k buffers
    Swap: 100663292k total,        0k used, 100663292k free, 14925240k cached

      PID USER    NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
    11721 web      0  623m  50m 4984 R   93  0.1   0:59.50 node
    11715 web      0  619m  20m 4916 S   25  0.0   0:07.52 node
       10 root     0              S      1  0.0 248:52.56 ksoftirqd/2
       51 root     0              S      0  0.0   0:35.66 events/0
    11724 admin    0 19412 1444  960 R    0  0.0   0:00.07 top
        1 root     0 23772 1948 1296 S    0  0.0   0:04.35 init
    [...]
    • %CPU = interval sum for all CPUs (varies on other OSes)
    • top can consume CPU (syscalls to read /proc)
    • Straight-forward. Or is it?
    Sunday, February 24, 13
    
slide 20:
    top, cont.
    • Interview questions:
    • 1. Does it show all CPU consumers?
    • 2. A process has high %CPU – next steps for analysis?
    Sunday, February 24, 13
    
slide 21:
    top, cont.
    • 1. top can miss:
    • short-lived processes
    • kernel threads (tasks), unless included (see top options)
    • 2. analyzing high CPU processes:
    • identify why – profile code path
    • identify what – execution or stall cycles
    • High %CPU time may be stall cycles on memory I/O –
    upgrading to faster CPUs doesn’t help!
    Sunday, February 24, 13
    
slide 22:
    htop
    • Super top. Super configurable. Eg, basic CPU visualization:
    Sunday, February 24, 13
    
slide 23:
    mpstat
    • Check for hot threads, unbalanced workloads:
    $ mpstat -P ALL 1
    02:47:49     CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest  %idle
    02:47:50     all    ...
    02:47:50       3 100.00
    02:47:50       4 100.00
    02:47:50       5 100.00
    02:47:50       6 100.00
    02:47:50       8 100.00
    [...]
    • Columns are summarized system-wide in top(1)’s header
    Sunday, February 24, 13
    
slide 24:
    iostat
    • Disk I/O statistics. 1st output is summary since boot.
    $ iostat -xkdz 1
    Linux 2.6.35-32-server (prod21)   02/20/13   _x86_64_   (16 CPU)

    Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s  ...      <- workload input
    sda         ...
    sdb         ...

    Device:  ...  avgqu-sz   await r_await w_await   svctm   %util  <- resulting performance
    sdb         ...
    [...]
    Sunday, February 24, 13
    
slide 25:
    iostat, cont.
    • %util: usefulness depends on target – virtual devices backed
    by multiple disks may accept more work at 100% utilization
    • Also calculate I/O controller stats by summing their devices
    • One nit: would like to see disk errors too. Add a “-e”?
    Sunday, February 24, 13
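    A sketch of the "summing their devices" idea (not from the talk; the device names and the rkB/s/wkB/s field positions are assumptions, and iostat column layouts vary by sysstat version):

      # Approximate I/O controller throughput by summing per-device iostat rates.
      iostat -xkd 1 | awk '
          /^Device/ { if (n++) printf "controller: %.1f rkB/s, %.1f wkB/s\n", r, w; r = w = 0 }
          $1 ~ /^sd[ab]$/ { r += $6; w += $7 }   # rkB/s, wkB/s for sda + sdb'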
    
slide 26:
    vmstat
    • Virtual-Memory statistics, and other high-level summaries:
    $ vmstat 1
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
     r  b   swpd     free    buff   cache   si   so    bi    bo   in    cs us sy id wa
    15  0   2852 46686812 279456 1401196  ...     0     0     0 100  0
    16  0   2852 46685192 279456 1401196  ...     0  2136 36607 56 33 11  0
    15  0   2852 46685952 279456 1401196  ...    56  2150 36905 54 35 11  0
    15  0   2852 46685960 279456 1401196  ...     0  2173 36645 54 33 13  0
    [...]
    • First line of output includes some summary-since-boot values
    • “r” = total number of runnable threads, including those running
    • Swapping (aka paging) allows over-subscription of main
    memory by swapping pages to disk, but costs performance
    Sunday, February 24, 13
    
slide 27:
    free
    • Memory usage summary (Kbytes default):
    $ free
                 total       used       free     shared    buffers     cached
    Mem:           ...        ...        ...        ...        ...        ...
    -/+ buffers/cache:        ...        ...
    Swap:          ...        ...        ...

    • buffers: block device I/O cache
    • cached: virtual page cache
    Sunday, February 24, 13
    
slide 28:
    ping
    • Simple network test (ICMP):
    $ ping www.hilton.com
    PING a831.b.akamai.net (63.234.226.9): 56 data bytes
    64 bytes from 63.234.226.9: icmp_seq=0 ttl=56 time=737.737 ms
    Request timeout for icmp_seq 1
    64 bytes from 63.234.226.9: icmp_seq=2 ttl=56 time=819.457 ms
    64 bytes from 63.234.226.9: icmp_seq=3 ttl=56 time=897.835 ms
    64 bytes from 63.234.226.9: icmp_seq=4 ttl=56 time=669.052 ms
    64 bytes from 63.234.226.9: icmp_seq=5 ttl=56 time=799.932 ms
    --- a831.b.akamai.net ping statistics ---
    6 packets transmitted, 5 packets received, 16.7% packet loss
    round-trip min/avg/max/stddev = 669.052/784.803/897.835/77.226 ms
    • Used to measure network latency. Actually kernel <-> kernel
    IP stack latency, including how the network handles ICMP.
    • Tells us some, but not a lot (above is an exception).
    Lots of other/better tools for this (eg, hping). Try using TCP.
    Sunday, February 24, 13
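    As a rough illustration of the "try using TCP" suggestion (a sketch, not from the slides; the host and port are placeholders), bash can time a TCP connect via its /dev/tcp feature:

      #!/bin/bash
      # Time a TCP connect to host:port; GNU date provides nanosecond timestamps.
      host=www.example.com port=80
      t0=$(date +%s%N)
      exec 3<>"/dev/tcp/${host}/${port}" && exec 3>&-
      t1=$(date +%s%N)
      echo "TCP connect latency: $(( (t1 - t0) / 1000000 )) ms"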
    
slide 29:
    nicstat
    • Network statistics tool, ver 1.92 on Linux:
    # nicstat -z 1
        Time      Int    rKB/s    wKB/s    rPk/s    wPk/s     rAvs     wAvs  %Util    Sat
    01:20:58     eth0      ...      ...      ...      ...      ...    64.81   0.00    ...
    01:20:58     eth4      ...      ...      ...      ...      ...    80.11   0.00    ...
    01:20:58  vlan123      ...      ...      ...      ...      ...    64.81   0.00    ...
    01:20:58      br0      ...      ...      ...      ...      ...    42.07   0.00    ...
        Time      Int    rKB/s    wKB/s    rPk/s    wPk/s     rAvs     wAvs  %Util    Sat
    01:20:59     eth4  42376.0    974.5  28589.4  14002.1      ...    71.27   35.5    ...
        Time      Int    rKB/s    wKB/s    rPk/s    wPk/s     rAvs     wAvs  %Util    Sat
    01:21:00     eth0      ...      ...      ...      ...      ...     0.00   0.00    ...
    01:21:00     eth4  41834.7    977.9  28221.5  14058.3      ...    71.23   35.1    ...
        Time      Int    rKB/s    wKB/s    rPk/s    wPk/s     rAvs     wAvs  %Util    Sat
    01:21:01     eth4  42017.9    979.0  28345.0  14073.0      ...    71.24   35.2    ...
    [...]
    • This was the tool I wanted, and finally wrote it out of frustration
    (Tim Cook ported and enhanced it on Linux)
    • Calculate network controller stats by summing interfaces
    Sunday, February 24, 13
    
slide 30:
    dstat
    • A better vmstat-like tool. Does coloring (FWIW).
    Sunday, February 24, 13
    
slide 31:
    Tools: Basic, recap
    • uptime
    • top or htop
    • mpstat
    • iostat
    • vmstat
    • free
    • ping
    • nicstat
    • dstat
    Sunday, February 24, 13
    
slide 32:
    Tools: Basic, recap
    [Figure: the system diagram annotated with the basic tools – top, mpstat, dstat at the CPUs; vmstat, dstat, free, top at virtual memory and DRAM; iostat at the block device interface; nicstat, ping, dstat at the network interfaces; I/O controller, disk, and network controller stats are inferred]
    Sunday, February 24, 13
    
slide 33:
    Tools: Intermediate
    • sar
    • netstat
    • pidstat
    • strace
    • tcpdump
    • blktrace
    • iotop
    • slabtop
    • sysctl
    • /proc
    Sunday, February 24, 13
    
slide 34:
    sar
    • System Activity Reporter. Eg, paging statistics -B:
    $ sar -B 1
    Linux 3.2.6-3.fc16.x86_64 (node104)    02/20/2013    _x86_64_    (1 CPU)

    05:24:34 PM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
    05:24:35 PM       ...
    05:24:36 PM       ...
    05:24:37 PM       ...
    05:24:38 PM       ...
    05:24:39 PM       ...
    05:24:40 PM       ...
    [...]
    • Configure to archive statistics from cron
    • Many, many statistics available:
    • -d: block device statistics, -q: run queue statistics, ...
    • Same statistics as shown by other tools (vmstat, iostat, ...)
    Sunday, February 24, 13
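    A sketch of the cron idea (the sa1 path is an assumption: distros install the sysstat collector under /usr/lib/sa or /usr/lib64/sa, and usually ship an equivalent entry already):

      # /etc/cron.d/sysstat: record system activity every 10 minutes
      */10 * * * * root /usr/lib64/sa/sa1 1 1
      # later, report from the archive, eg, paging statistics:
      # sar -B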
    
slide 35:
    netstat
    • Various network protocol statistics using -s:
    $ netstat -s
    [...]
    Tcp:
    127116 active connections openings
    165223 passive connection openings
    12904 failed connection attempts
    19873 connection resets received
    20 connections established
    662889209 segments received
    354923419 segments send out
    405146 segments retransmited
    6 bad segments received.
    26379 resets sent
    [...]
    TcpExt:
    2142 invalid SYN cookies received
    3350 resets received for embryonic SYN_RECV sockets
    7460 packets pruned from receive queue because of socket buffer overrun
    2932 ICMP packets dropped because they were out-of-window
    96670 TCP sockets finished time wait in fast timer
    86 time wait sockets recycled by time stamp
    1007 packets rejects in established connections because of timestamp
    [...many...]
    Sunday, February 24, 13
    
slide 36:
    pidstat
    • Very useful process breakdowns:
    # pidstat 1
    Linux 3.2.6-3.fc16.x86_64 (node107)    02/20/2013    _x86_64_ (1 CPU)

    05:55:18 PM       PID    %usr %system  %guest    %CPU   CPU  Command
    05:55:19 PM       ...                                        pidstat
    05:55:19 PM       ...                                        cksum

    05:55:19 PM       PID    %usr %system  %guest    %CPU   CPU  Command
    05:55:20 PM       ...                                        cksum
    [...]

    # pidstat -d 1
    Linux 3.2.6-3.fc16.x86_64 (node107)    02/20/2013    _x86_64_ (1 CPU)

    05:55:22 PM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
    05:55:23 PM       ...                                jbd2/vda2-8
    05:55:23 PM     12643 151985.71       ...       ...  cksum

    05:55:23 PM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
    05:55:24 PM       ...                                cksum
    [...]
    disk I/O (yay!)
    Sunday, February 24, 13
    
slide 37:
    strace
    • System call tracer:
    $ strace -tttT -p 12670
    1361424797.229550 read(3, "REQUEST 1888 CID 2"..., 65536) = 959 <...>
    1361424797.239053 read(3, "", 61440)            = 0 <...>
    1361424797.239406 close(3)                      = 0 <...>
    1361424797.239738 munmap(0x7f8b22684000, 4096)  = 0 <...>
    1361424797.240145 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 <...>
    [...]
    • -ttt: microsecond timestamp since epoch (left column)
    • -T: time spent in syscall (the trailing <...> values)
    • -p: PID to trace (or provide a command)
    • Useful – high application latency often caused by resource I/O,
    and most resource I/O is performed by syscalls
    Sunday, February 24, 13
    
slide 38:
    strace, cont.
    • -c: print summary:
    # strace -c dd if=/dev/zero of=/dev/null bs=512 count=1024k
    [...]
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     51.32         ...         ...       ...           read
       ...         ...         ...       ...           write
       ...         ...         ...       ...           open
    [...]
    • This is also a (worst case) demo of the strace overhead:
    # time dd if=/dev/zero of=/dev/null bs=512 count=1024k
    [...]
    536870912 bytes (537 MB) copied, 0.35226 s, 1.5 GB/s
    real 0m0.355s
    user 0m0.021s
    sys 0m0.022s
    # time strace -c dd if=/dev/zero of=/dev/null bs=512 count=1024k
    [...]
    536870912 bytes (537 MB) copied, 71.9565 s, 7.5 MB/s
    real 1m11.969s
    user 0m3.179s
    sys 1m6.346s
    200x slower
    Sunday, February 24, 13
    
slide 39:
    tcpdump
    • Sniff network packets, dump to output files for post analysis:
    # tcpdump -i eth4 -w /tmp/out.tcpdump
    tcpdump: listening on eth4, link-type EN10MB (Ethernet), capture size 65535
    bytes
    ^C33651 packets captured
    34160 packets received by filter
    508 packets dropped by kernel
    # tcpdump -nr /tmp/out.tcpdump
    reading from file /tmp/out.tcpdump, link-type EN10MB (Ethernet)
    06:24:43.908732 IP 10.2.0.2.55502 > 10.2.203.2.22: Flags [.], ack ...
    06:24:43.908922 IP 10.2.0.2.55502 > 10.2.203.2.22: Flags [.], ack ...
    06:24:43.908943 IP 10.2.203.2.22 > 10.2.0.2.55502: Flags [.], seq ...
    06:24:43.909061 IP 10.2.0.2.55502 > 10.2.203.2.22: Flags [.], ack ...
    • Output has timestamps with microsecond resolution
    • Study odd network latency packet-by-packet
    • Import file into other tools (wireshark)
    Sunday, February 24, 13
    
slide 40:
    tcpdump, cont.
    • Does have overhead in terms of CPU and storage; previous
    example dropped packets
    • Should be using socket ring buffers to reduce overhead
    • Can use filter expressions to also reduce overhead
    • Could still be problematic for busy interfaces
    Sunday, February 24, 13
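    For example (a sketch; the interface, port, and packet count are placeholders), a filter expression plus a packet limit keeps capture overhead down:

      # Capture only TCP SYN/FIN/RST on port 80, stop after 1000 packets:
      tcpdump -i eth4 -c 1000 -w /tmp/filtered.tcpdump \
          'tcp port 80 and tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0'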
    
slide 41:
    blktrace
    • Block device I/O event tracing. Launch using btrace, eg:
    # btrace /dev/sdb
      8,16 ...  0.429604145 20442 ...
      8,16 ...  0.429604569 20442 ...
      8,16 ...  0.429606014 20442 ...
      8,16 ...  0.429607624 20442 ...
      8,16 ...  0.429608804 20442 ...
      8,16 ...  0.429610501 20442 ...
      8,16 ...  0.429611912 20442 ...
      8,16 ...  R 184773879 + 8 ...
    [...]
slide 42:
    iotop
    • Disk I/O by process:
    # iotop -bod5
    Total DISK READ: 35.38 M/s | Total DISK WRITE: 39.50 K/s
      TID  PRIO  USER    DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
    12824  be/4  root    35.35 M/s    0.00 B/s  0.00 %  80.59 %   cksum ...
      279  be/3  root     0.00 B/s   27.65 K/s  0.00 %   2.21 %   [jbd2/vda2-8]
    12716  be/4  root    28.44 K/s    0.00 B/s  2.35 %   0.00 %   sshd: root@pts/0
    12816  be/4  root     6.32 K/s    0.00 B/s  0.89 %   0.00 %   python /usr/bin/iotop -bod5
    [...]
    • IO: time thread was waiting on I/O (this is even more useful
    than pidstat’s Kbytes)
    • Needs CONFIG_TASK_IO_ACCOUNTING or something
    similar enabled to work.
    Sunday, February 24, 13
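    A quick way to check for that kernel support (a sketch; the config file path varies by distro, and some kernels expose /proc/config.gz instead):

      grep -E 'CONFIG_TASK_IO_ACCOUNTING|CONFIG_TASK_DELAY_ACCT' /boot/config-$(uname -r)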
    
slide 43:
    slabtop
    • Kernel slab allocator usage top:
    # slabtop -sc
     Active / Total Objects (% used)    : 900356 / 1072416 (84.0%)
     Active / Total Slabs (% used)      : 29085 / 29085 (100.0%)
     Active / Total Caches (% used)     : 68 / 91 (74.7%)
     Active / Total Size (% used)       : 237067.98K / 260697.24K (90.9%)
     Minimum / Average / Maximum Object : 0.01K / 0.24K / 10.09K

      OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
    112035 110974  99%    0.91K    ...      ...   102432K ext4_inode_cache
    726660 579946  79%    0.11K    ...      ...    80740K buffer_head
       ...    ...  96%    4.00K    ...      ...    18432K kmalloc-4096
     83496  76878  92%    0.19K    ...      ...    15904K dentry
     23809  23693  99%    0.55K    ...      ...    13136K radix_tree_node
       ...    ...  86%    0.62K    ...      ...     6912K proc_inode_cache
       ...    ...  77%    1.00K    ...      ...     3488K kmalloc-1024
       ...    ...  84%    5.73K    ...      ...     3264K task_struct
       ...    ...  82%    0.17K    ...      ...     1904K vm_area_struct
       ...    ...  74%    0.58K    ...      ...     1504K inode_cache
    [...]
    • Shows where kernel memory is consumed
    Sunday, February 24, 13
    
slide 44:
    sysctl
    • System settings:
    # sysctl -a
    [...]
    net.ipv4.tcp_fack = 1
    net.ipv4.tcp_reordering = 3
    net.ipv4.tcp_ecn = 2
    net.ipv4.tcp_dsack = 1
    net.ipv4.tcp_mem = 24180
    net.ipv4.tcp_wmem = 4096
    net.ipv4.tcp_rmem = 4096
    [...]
    • Static performance tuning: check the config of the system
    Sunday, February 24, 13
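    Static performance tuning can start with just reading the tunables of interest (a sketch; the tunables shown are examples):

      # Read selected tunables, or search for a subsystem:
      sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
      sysctl -a | grep tcp_reordering
      # sysctl -w changes a value at runtime; /etc/sysctl.conf makes it persistent.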
    
slide 45:
    /proc
    • Read statistic sources directly:
    $ cat /proc/meminfo
    MemTotal:        8181740 kB
    MemFree:           71632 kB
    Buffers:          163288 kB
    Cached:          4518600 kB
    SwapCached:         7036 kB
    Active:          4765476 kB
    Inactive:        2866016 kB
    Active(anon):    2480336 kB
    Inactive(anon):   478580 kB
    Active(file):    2285140 kB
    Inactive(file):  2387436 kB
    Unevictable:           0 kB
    Mlocked:               0 kB
    SwapTotal:       2932728 kB
    SwapFree:        2799568 kB
    Dirty:                76 kB
    Writeback:             0 kB
    [...]
    • Also see /proc/vmstat
    Sunday, February 24, 13
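    Not from the slides: a sketch of consuming /proc directly, reducing the same file with awk (here summing memory that is free or reclaimable page cache):

      awk '/^(MemFree|Buffers|Cached):/ { kb += $2 }
           END { printf "free + buffers + cached: %.0f MB\n", kb / 1024 }' /proc/meminfo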
    
slide 46:
    Tools: Intermediate, recap.
    • sar
    • netstat
    • pidstat
    • strace
    • tcpdump
    • blktrace
    • iotop
    • slabtop
    • sysctl
    • /proc
    Sunday, February 24, 13
    
slide 47:
    Tools: Advanced
    • perf
    • DTrace
    • SystemTap
    • and more ...
    Sunday, February 24, 13
    
slide 48:
    perf
    • Originally Performance Counters for Linux (PCL), focusing on
    CPU performance counters (programmable registers)
    • Now a collection of profiling and tracing tools, with numerous
    subcommands, including:
    kmem      Trace/measure kernel memory (slab) properties
    kvm       Trace/measure KVM guest OS
    list      List available events (targets of instrumentation)
    lock      Analyze lock events
    probe     Create dynamic probe points (dynamic tracing!)
    record    Run a command and record profile data (as perf.data)
    report    Read perf.data and summarize, has an interactive mode
    sched     Trace/measure kernel scheduler statistics
    stat      Run a command, gather, and report perf counter stats
    Sunday, February 24, 13
    
slide 49:
    perf: Performance Counters
    • Key performance counter summary:
    $ perf stat gzip file1

     Performance counter stats for 'gzip file1':

          ...  task-clock-msecs      #     0.901 CPUs
          ...  context-switches      #     0.000 M/sec
          ...  CPU-migrations        #     0.000 M/sec
          ...  page-faults           #     0.000 M/sec
          ...  cycles                #  2395.230 M/sec
          ...  instructions          #     2.221 IPC      <- yay
          ...  branches              #   550.641 M/sec
          ...  branch-misses         #     1.032 %
          ...  cache-references      #     2.059 M/sec
          ...  cache-misses          #     1.211 M/sec

          ...  seconds time elapsed
    • Low IPC (
slide 50:
    perf: Performance Counters, cont.
    • Can choose different counters:
    $ perf list | grep Hardware
      cpu-cycles OR cycles                             [Hardware event]
      stalled-cycles-frontend OR idle-cycles-frontend  [Hardware event]
      stalled-cycles-backend OR idle-cycles-backend    [Hardware event]
      instructions                                     [Hardware event]
      cache-references                                 [Hardware event]
    [...]

    $ perf stat -e instructions,cycles,L1-dcache-load-misses,LLC-load-misses,dTLB-load-misses gzip file1

     Performance counter stats for 'gzip file1':

          ...  instructions          #     2.199 IPC
          ...  cycles
          ...  L1-dcache-load-misses
          ...  LLC-load-misses
          ...  dTLB-load-misses

          ...  seconds time elapsed

    • Supports additional custom counters (in hex or a desc) for
    whatever the processor supports. Examine bus events.
    Sunday, February 24, 13
    
slide 51:
    perf: Performance Counters, cont.
    [Figure: the system diagram with the hardware highlighted – CPUs, CPU interconnect, memory bus, DRAM, I/O bus, I/O bridge, expander interconnect, I/O controller, disks, network controller, ports – as targets for perf stat; advanced activity: refer to the processor manuals]
    Sunday, February 24, 13
    
slide 52:
    perf: Profiling
    • Profiling (sampling) CPU activity:
    # perf record -a -g -F 997 sleep 10
    [ perf record: Woken up 44 times to write data ]
    • -a: all CPUs
    • -g: call stacks
    • -F: Hertz
    • sleep 10: duration to sample (dummy command)
    • Generates a perf.data file
    • Can profile other hardware events too, with call stacks
    Sunday, February 24, 13
    
slide 53:
    perf: Profiling, cont.
    • Reading perf.data, forcing non-interactive mode (--stdio):
    # perf report --stdio
    [...]
    # Overhead  Command      Shared Object      Symbol
    # ........  ...........  .................  ..............................
        72.98%  swapper      [kernel.kallsyms]  [k] native_safe_halt
                --- native_safe_halt
                    default_idle
                    cpu_idle
                    rest_init
                    start_kernel
                    x86_64_start_reservations
                    x86_64_start_kernel

         9.43%  dd           [kernel.kallsyms]  [k] acpi_pm_read
                --- acpi_pm_read
                    ktime_get_ts
                    |--87.75%-- __delayacct_blkio_start
                    |           io_schedule_timeout
                    |           balance_dirty_pages_ratelimited_nr
                    |           generic_file_buffered_write
    [...]
    Sunday, February 24, 13
    
slide 54:
    perf: Profiling, cont.
    • Flame Graphs support perf profiling data:
    • Interactive SVG. Navigate to quantify and compare code paths
    Sunday, February 24, 13
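    A sketch of generating one from the perf.data recorded earlier, using the FlameGraph scripts from https://github.com/brendangregg/FlameGraph (the ./FlameGraph checkout path is an assumption):

      perf script > out.perf
      ./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
      ./FlameGraph/flamegraph.pl out.folded > flame.svg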
    
slide 55:
    perf: Static Tracing
    • Listing static tracepoints for block I/O:
    $ perf list | grep block:
      block:block_rq_abort        [Tracepoint event]
      block:block_rq_requeue      [Tracepoint event]
      block:block_rq_complete     [Tracepoint event]
      block:block_rq_insert       [Tracepoint event]
      block:block_rq_issue        [Tracepoint event]
      block:block_bio_bounce      [Tracepoint event]
      block:block_bio_complete    [Tracepoint event]
      block:block_bio_backmerge   [Tracepoint event]
      block:block_bio_frontmerge  [Tracepoint event]
      block:block_bio_queue       [Tracepoint event]
      block:block_getrq           [Tracepoint event]
      block:block_sleeprq         [Tracepoint event]
      block:block_plug            [Tracepoint event]
      block:block_unplug          [Tracepoint event]
      block:block_split           [Tracepoint event]
      block:block_bio_remap       [Tracepoint event]
      block:block_rq_remap        [Tracepoint event]
    • Many useful probes already provided for kernel tracing:
    $ perf list | grep Tracepoint | wc -l
    Sunday, February 24, 13
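    For example (a sketch), those block tracepoints can be counted system-wide without writing any tracing code:

      # Count block I/O issues and completions across all CPUs for 10 seconds:
      perf stat -e block:block_rq_issue -e block:block_rq_complete -a sleep 10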
    
slide 56:
    perf: Static Tracepoints
    [Figure: the system diagram annotated with tracepoint categories – syscalls: at the system call interface; net:, sock:, skb: for the network stack; ext4: for the file system; block: at the block device interface; sched: for the scheduler; vmscan:, kmem: for virtual memory; scsi:, irq: for device drivers; ... more can be added as needed; device stats can be inferred]
    Sunday, February 24, 13
    
slide 57:
    perf: Dynamic Tracing
    • Define custom probes from kernel code; eg, tcp_sendmsg():
    # perf probe --add='tcp_sendmsg'
    Add new event:
      probe:tcp_sendmsg    (on tcp_sendmsg)
    [...]
    # perf record -e probe:tcp_sendmsg -aR -g sleep 5
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.091 MB perf.data (~3972 samples) ]
    # perf report --stdio
    [...]
    # Overhead  Command  Shared Object      Symbol
    # ........  .......  .................  ...............
       100.00%  sshd     [kernel.kallsyms]  [k] tcp_sendmsg
                --- tcp_sendmsg
                    sock_aio_write
                    do_sync_write
                    vfs_write
                    sys_write
                    system_call
                    __GI___libc_write

    active traced call stacks from arbitrary kernel locations!
    Sunday, February 24, 13
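    A sketch of the full workflow, including cleanup (recording function arguments, as in the commented line, assumes kernel debuginfo is available):

      perf probe --add='tcp_sendmsg'           # create the dynamic probe
      # perf probe --add='tcp_sendmsg size'    # with an argument, if debuginfo is installed
      perf record -e probe:tcp_sendmsg -aR -g sleep 5
      perf report --stdio
      perf probe --del=probe:tcp_sendmsg       # remove the probe when done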
    
slide 58:
    perf: Dynamic Tracing, cont.
    perf probe --add
    [Figure: the system diagram showing that dynamic probes can be placed throughout the kernel – VFS, file systems, LVM, block device interface, sockets, TCP/UDP, Ethernet, scheduler, virtual memory, device drivers; advanced activity: refer to the kernel source code; device stats can be inferred]
    Sunday, February 24, 13
    
slide 59:
    perf: Dynamic Tracing, cont.
    • Fills in kernel observability gaps
    • Awesome capability
    • Takes some effort to use (waiting for the trace-dump-analyze
    cycle, and using post-processors to rework the
    output, or the post-scripting capability)
    • Would be the awesomest tool ever, if it wasn’t for ...
    Sunday, February 24, 13
    
slide 60:
    DTrace
    Sunday, February 24, 13
    
slide 61:
    DTrace
    • Programmable, real-time, dynamic and static tracing
    • Perf analysis and troubleshooting, without restarting anything
    • Used on Solaris, illumos/SmartOS, Mac OS X, FreeBSD, ...
    • Two ports in development for Linux (that we know of):
    • 1. dtrace4linux
    • Mostly by Paul Fox
    • 2. Oracle Enterprise Linux DTrace
    • Steady progress
    There are a couple of awesome books about DTrace too
    Sunday, February 24, 13
    
slide 62:
    DTrace: Installation
    • dtrace4linux version:
    1. https://github.com/dtrace4linux/dtrace
    2. README:
    tools/get-deps.pl           # if using Ubuntu
    tools/get-deps-fedora.sh    # RedHat/Fedora
    make all
    make install
    make load
    (need to be root or have sudo access)
    # make load
    tools/load.pl
    13:40:14 Syncing...
    13:40:14 Loading: build-3.2.6-3.fc16.x86_64/driver/dtracedrv.ko
    13:40:15 Preparing symbols...
    13:40:15 Probes available: 281887
    13:40:18 Time: 4s
    • WARNING: still a prototype, can panic/freeze kernels.
    I’m using it in the lab to solve replicated production perf issues
    Sunday, February 24, 13
    
slide 63:
    DTrace: Programming
    • Programming capabilities allow for powerful, efficient, one-liners and scripts. In-kernel custom filtering and aggregation.
    # dtrace -n 'fbt::tcp_sendmsg:entry /execname == "sshd"/ {
    @["bytes"] = quantize(arg3); }'
    dtrace: description 'fbt::tcp_sendmsg:entry ' matched 1 probe
    bytes
    value ------------- Distribution ------------- count
    16 |
    32 |@@@@@@@@@@@@@@@@
    64 |@@@@@@@@@@@@@
    128 |@@@
    256 |@@@@
    512 |@@@
    1024 |@
    2048 |
    4096 |
    8192 |
    • Example shows tcp_sendmsg() size dist for “sshd” PIDs
    Sunday, February 24, 13
    
slide 64:
    DTrace: Programming
    • Programming capabilities allow for powerful, efficient, one-liners and scripts. In-kernel custom filtering and aggregation.
    # dtrace -n 'fbt::tcp_sendmsg:entry /execname == "sshd"/ {
    @["bytes"] = quantize(arg3); }'
    dtrace: description 'fbt::tcp_sendmsg:entry ' matched 1 probe
    bytes
    filter
    aggregation (summarizes)
    value ------------- Distribution ------------- count
    16 |
    32 |@@@@@@@@@@@@@@@@
    64 |@@@@@@@@@@@@@
    128 |@@@
    256 |@@@@
    512 |@@@
    1024 |@
    2048 |
    4096 |
    8192 |
    kernel -> user transfers these numbers only (pre-summarized)
    • Example shows tcp_sendmsg() size dist for “sshd” PIDs
    these examples use dtrace4linux
    Sunday, February 24, 13
    
slide 65:
    DTrace: Real-Time
    • Multiple GUIs use DTrace for real-time statistics. Eg, Joyent
    Cloud Analytics, showing real-time cloud-wide syscall latency:
    Sunday, February 24, 13
    
slide 66:
    DTrace, cont.
    • Has advanced capabilities, but not necessarily difficult;
    You may just:
    • use one-liners (google “DTrace one-liners”)
    • use scripts (DTraceToolkit; DTrace book; google)
    • tweak one-liners or scripts a little
    • ask someone else to write the scripts you need
    • Ideally, you learn DTrace and write your own
    Sunday, February 24, 13
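    For instance, a classic one-liner of the kind those collections contain (a sketch; it uses the syscall provider, whose coverage may be incomplete in the Linux ports):

      # Count system calls by process name until Ctrl-C:
      dtrace -n 'syscall:::entry { @[execname] = count(); }'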
    
slide 67:
    DTrace: Scripts
    #!/usr/sbin/dtrace -s

    fbt::vfs_read:entry
    {
        self->start = timestamp;
    }

    fbt::vfs_read:return
    /self->start/
    {
        @[execname, "ns"] = quantize(timestamp - self->start);
        self->start = 0;
    }

    13 line script to time VFS reads by process name

    # ./vfsread.d
    dtrace: script './vfsread.d' matched 2 probes
      cksum                                             ns
               value  ------------- Distribution ------------- count
    [...]
              262144 |
              524288 |@@@@@@@@@@
             1048576 |
             2097152 |
             4194304 |
             8388608 |@
            16777216 |
            33554432 |

    read latency distribution, 0.5ms -> 33ms (disks)
    Sunday, February 24, 13
    
slide 68:
    DTrace: Basics
    • CLI syntax:
    dtrace -n ‘provider:module:function:name /predicate/ { action }’
    probe description
    optional
    filter
    do this when
    probe “fires”
    • provider – library of related probes
    • module:function – shows where probe is located (for debug)
    • name – name of probe
    • Online reference and tutorial: http://dtrace.org/guide
    Sunday, February 24, 13
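    Putting the pieces together (a sketch reusing probes from earlier slides): provider fbt, function vfs_read, name entry, a predicate on execname, and an aggregating action:

      dtrace -n 'fbt::vfs_read:entry /execname == "sshd"/ { @["vfs_read calls"] = count(); }'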
    
slide 69:
    DTrace: Providers
    [Figure: the system diagram, shown before the DTrace provider annotations on the next slide]
    Sunday, February 24, 13
    
slide 70:
    DTrace: Providers
    [Figure: the system diagram annotated with DTrace providers – syscall and plockstat above the system call interface; pid for user-level code; java, javascript, node, perl, python, php, ruby, erlang, objc, tcl, mysql, postgres, ... for applications; fbt and profile for the kernel; tcp, udp, ip for the network stack; sched, proc, profile for the scheduler; vminfo for virtual memory; cpc for CPU and memory counters; fbt and pid are dynamic; some device stats are inferred]
    Sunday, February 24, 13
    
slide 71:
    DTrace: Linux Examples
    • Following examples use fbt – kernel dynamic tracing
    Sunday, February 24, 13
    
slide 72:
    DTrace: ext4slower.d
    • Show me:
    • ext4 reads and writes
    • slower than a specified latency (milliseconds)
    • with time, process, direction, size, latency, and file name
    # ./ext4slower.d 10
    Tracing ext4 read/write slower than 10 ms
    TIME                 PROCESS          D    KB     ms FILE
    2013 Feb 22 17:17:02 cksum            ..   ...    35 100m
    2013 Feb 22 17:17:02 cksum            ..   ...    16 1m
    2013 Feb 22 17:17:03 cksum            ..   ...    18 data1
    2013 Feb 22 17:17:03 cksum            ..   ...    23 data1
    • I wrote this to answer: is ext4 to blame for latency outliers?
    • Argument is latency you are looking for: here, 10+ ms
    Sunday, February 24, 13
    
slide 73:
    DTrace: ext4slower.d, cont.
    • Extending vfs_read() example:
    #!/usr/sbin/dtrace -s

    #pragma D option quiet
    #pragma D option defaultargs
    #pragma D option switchrate=5

    dtrace:::BEGIN
    {
        min_ns = $1 * 1000000;
        printf("Tracing ext4 read/write slower than %d ms\n", $1);
        printf("%-20s %-16s %1s %4s %6s %s\n", "TIME", "PROCESS",
            "D", "KB", "ms", "FILE");
    }

    fbt::vfs_read:entry, fbt::vfs_write:entry
    {
        this->file = (struct file *)arg0;
        this->fs = this->file->f_path.dentry->d_inode->i_sb->s_type->name;
    }
    • ... continued:
    Sunday, February 24, 13
    
slide 74:
    DTrace: ext4slower.d, cont.
    fbt::vfs_read:entry, fbt::vfs_write:entry
    /stringof(this->fs) == "ext4"/
    {
        self->start = timestamp;
        self->name = this->file->f_path.dentry->d_name.name;
    }

    fbt::vfs_read:return, fbt::vfs_write:return
    /self->start && (this->delta = timestamp - self->start) > min_ns/
    {
        this->dir = probefunc == "vfs_read" ? "R" : "W";
        printf("%-20Y %-16s %1s %4d %6d %s\n", walltimestamp,
            execname, this->dir, arg1 / 1024, this->delta / 1000000,
            stringof(self->name));
    }

    fbt::vfs_read:return, fbt::vfs_write:return
    {
        self->start = 0;
        self->name = 0;
    }
    • Immediately exonerate or blame ext4.
    ... should add more vfs_*() calls; or trace ext4 funcs directly
    Sunday, February 24, 13
    
slide 75:
    DTrace: tcpretransmit.d
    • Show me:
    • TCP retransmits
    • destination IP address
    • kernel stack (shows why)
    • in real-time
    • Don’t sniff all packets – only trace retransmits, to minimize
    overhead
    Sunday, February 24, 13
    
slide 76:
    DTrace: tcpretransmit.d, cont.
    # ./tcpretransmit.d
    Tracing TCP retransmits... Ctrl-C to end.
    2013 Feb 23 18:24:11: retransmit to 10.2.124.2, by:
    kernel`tcp_retransmit_timer+0x1bd
    kernel`tcp_write_timer+0x188
    kernel`run_timer_softirq+0x12b
    kernel`tcp_write_timer
    kernel`__do_softirq+0xb8
    kernel`read_tsc+0x9
    kernel`sched_clock+0x9
    kernel`sched_clock_local+0x25
    kernel`call_softirq+0x1c
    kernel`do_softirq+0x65
    kernel`irq_exit+0x9e
    kernel`smp_apic_timer_interrupt+0x6e
    kernel`apic_timer_interrupt+0x6e
    [...]
    ... can trace those stack functions directly for more detail
    Sunday, February 24, 13
    
slide 77:
    DTrace: tcpretransmit.d, cont.
    • Source:
    #!/usr/sbin/dtrace -s

    #pragma D option quiet

    dtrace:::BEGIN { trace("Tracing TCP retransmits... Ctrl-C to end.\n"); }

    fbt::tcp_retransmit_skb:entry {
        this->so = (struct sock *)arg0;
        this->d = (unsigned char *)&this->so->__sk_common.skc_daddr;
        printf("%Y: retransmit to %d.%d.%d.%d, by:", walltimestamp,
            this->d[0], this->d[1], this->d[2], this->d[3]);
        stack(99);
    }
    Sunday, February 24, 13
    
slide 78:
    DTrace: Current State
    • This was demoed on a prototype DTrace for Linux
    • Right now (Feb 2013) not stable – will panic/freeze
    • Needs other handholding to work around nits/bugs
    • AFAIK, both DTrace ports welcome help (that means you!)
    • Those examples were also fbt-based:
    • Will probably need tweaks to match different kernels, since
    the API is dynamically built from the kernel code
    • DTrace stable providers solve that problem – but many
    aren’t there on Linux yet
    Sunday, February 24, 13
    
slide 79:
    DTrace: Trying it out
    • All providers are available to try on illumos/SmartOS
    • illumos is the on-going fork of the OpenSolaris kernel
    • SmartOS is Joyent’s illumos-based cloud OS (distro)
    • Rough translation guide:
    kernel: linux == illumos
    distros: {ubuntu|CentOS|Fedora} == {SmartOS|OmniOS|OpenIndiana}
    • DTrace implementation mature
    • Joyent uses SmartOS as a hypervisor for running KVM Linux
    on ZFS
    Sunday, February 24, 13
    
slide 80:
    DTrace: Other Capabilities
    • Trace short lived processes
    • Profile CPU usage
    • Time any thread blocking event
    • Investigate disk I/O latency
    • Investigate network I/O latency
    • Examine cache activity
    • Investigate memory allocation: growth or leaks
    • Investigate swapping (paging) in detail
    • Follow network packets through the stack
    • Examine lock contention
    • ...
    Sunday, February 24, 13
    
slide 81:
    SystemTap
    Sunday, February 24, 13
    
slide 82:
    SystemTap
    • Created when there
    wasn’t DTrace for
    Linux ports
    • Static and dynamic
    tracing, probes,
    tapsets, scripts, ...
    • I’ve used it a lot:
    • panics/freezes
    • slow startups
    • for Linux only
    • incompatible with D
    Sunday, February 24, 13
    
slide 83:
    Tools: Advanced, recap.
    [Figure: the full operating system and hardware diagram once more]
    Given the tools to see everything, how do you use them?
    Sunday, February 24, 13
    
slide 84:
    And More ...
    • Other observability tools at all levels include:
    • ps, pmap, traceroute, ntop, ss, lsof, oprofile, gprof,
    kcachegrind, valgrind, google profiler, nfsiostat, cifsiostat,
    latencytop, powertop, LTTng, ktap, ...
    • And many experimental tools: micro-benchmarks
    • So many tools it gets confusing – where do you start?
    Sunday, February 24, 13
    
slide 85:
    Methodologies
    • Selected four:
    • Streetlight Anti-Method
    • Workload Characterization Method
    • Drill-Down Analysis Method
    • USE Method
    • Methodologies give beginners a starting point, casual users a
    checklist, and experts a reminder
    Sunday, February 24, 13
    
slide 86:
    Streetlight Anti-Method
    Sunday, February 24, 13
    
slide 87:
    Streetlight Anti-Method
    • 1. Pick observability tools that are
    • familiar
    • found on the Internet
    • found at random
    • 2. Run tools
    • 3. Look for obvious issues
    • Included for comparison (don’t use this methodology)
    Sunday, February 24, 13
    
slide 88:
    Streetlight Anti-Method, cont.
    • Named after an observational bias called the streetlight effect
    A policeman sees a drunk looking under a streetlight,
    and asks what he is looking for.
    The drunk says he has lost his keys.
    The policeman can't find them either,
    and asks if he lost them under the streetlight.
    The drunk replies:
    “No, but this is where the light is best.”
    Sunday, February 24, 13
    
slide 89:
    Streetlight Anti-Method, cont.
    top - 15:09:38 up 255 days, 16:54, 10 users, load average: 0.00, 0.03, 0.00
    Tasks: 274 total,   1 running, 273 sleeping,   0 stopped,   0 zombie
    Cpu(s): 0.7%us, 0.0%sy, 0.0%ni, 99.1%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
    Mem:   8181740k total,  7654228k used,   527512k free,   405616k buffers
    Swap:  2932728k total,   125064k used,  2807664k free,  3826244k cached

      PID USER      NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
    16876 root       0 57596  17m 1972 S    4  0.2   3:00.60 python
     3947 brendan    0 19352 1552 1060 R    0  0.0   0:00.06 top
    15841 joshw      0 67144  23m  908 S    0  0.3 218:21.70 mosh-server
    16922 joshw      0 54924  11m  920 S    0  0.1 121:34.20 mosh-server
        1 root       0 23788 1432  736 S    0  0.0   0:18.15 init
        2 root       0               S      0  0.0   0:00.61 kthreadd
        3 root       0               S      0  0.0   0:00.11 migration/0
        4 root       0               S      0  0.0  18:43.09 ksoftirqd/0
        5 root       0               S      0  0.0   0:00.00 watchdog/0
    [...]
    • Why are you still running top?
    Sunday, February 24, 13
    
slide 90:
    Streetlight Anti-Method, cont.
    • Tools-based approach
    • Inefficient:
    • can take time before the right tool is found
    • can be wasteful when investigating false positives
    • Incomplete:
    • don’t find the right tool, or,
    • the right tool doesn’t exist
    Sunday, February 24, 13
    
slide 91:
    Workload Characterization Method
    Sunday, February 24, 13
    
slide 92:
    Workload Characterization Method
    • 1. Who
    • 2. Why
    • 3. What
    • 4. How
    Sunday, February 24, 13
    
slide 93:
    Workload Characterization Method
    • 1. Who is causing the load? PID, UID, IP addr, ...
    • 2. Why is the load called? code path
    • 3. What is the load? IOPS, tput, direction, type
    • 4. How is the load changing over time?
    Sunday, February 24, 13
    
slide 94:
    Workload Characterization Method, cont.
    • Identifies issues of load
    • Best performance wins are from eliminating unnecessary work
    • Don’t assume you know what the workload is – characterize
    • Many of the previous analysis tools included workload statistics
    Sunday, February 24, 13
    
slide 95:
    Workload Characterization Method, cont.
    • Pros:
    • Potentially largest wins
    • Cons:
    • Only solves a class of issues – load
    • Time consuming, and can be discouraging – most
    attributes examined will not be a problem
    Sunday, February 24, 13
    
slide 96:
    Drill-Down Analysis Method
    Sunday, February 24, 13
    
slide 97:
    Drill-Down Analysis Method
    • 1. Start at highest level
    • 2. Examine next-level details
    • 3. Pick most interesting breakdown
    • 4. If problem unsolved, go to 2
    Sunday, February 24, 13
    
slide 98:
    Drill-Down Analysis Method, cont.: Example
    • For example, ext4 – identify latency origin top-down:
    [Figure: drill-down analysis descending the stack: Applications → System Libraries (user-level), then System Call Interface → VFS → ext4 → Block Device Interface → Device Drivers (kernel-level)]
    Dynamic Tracing / DTrace is well suited for this,
    as it can dig through all layers with custom detail
    Sunday, February 24, 13
    
slide 99:
    Drill-Down Analysis: ext4
    • eg, ext4_readpages() latency distribution (microseconds):
    # dtrace -n 'fbt::ext4_readpages:entry { self->ts = timestamp; }
        fbt::ext4_readpages:return /self->ts/ {
        @["us"] = lquantize((timestamp - self->ts) / 1000, 0, 10000, 250);
        self->ts = 0; }'
    dtrace: description 'fbt::ext4_readpages:entry ' matched 2 probes
    [...]
               value  ------------- Distribution ------------- count
    Sunday, February 24, 13
    
slide 100:
    Drill-Down Analysis: ext4
    • ... can dig out more details as needed: file name, code path:
    # dtrace -n 'fbt::ext4_readpages:entry {
        this->file = (struct file *)arg0;
        this->name = this->file->f_path.dentry->d_name.name;
        @[stringof(this->name), stack()] = count(); }'
    dtrace: description 'fbt::ext4_readpages:entry ' matched 1 probe
    ^C[...]
      foo8
                  kernel`__do_page_cache_readahead+0x1c7
                  kernel`ra_submit+0x21
                  kernel`ondemand_readahead+0x115
                  kernel`page_cache_async_readahead+0x80
                  kernel`radix_tree_lookup_slot+0xe
                  kernel`find_get_page+0x1e
                  kernel`generic_file_aio_read+0x48b
                  kernel`vma_merge+0x121
                  kernel`do_sync_read+0xd2
                  kernel`__switch_to+0x132
                  kernel`security_file_permission+0x93
                  kernel`rw_verify_area+0x61
                  kernel`vfs_read+0xb0
                  kernel`sys_read+0x4a
                  kernel`system_call_fastpath+0x16
                  ...   <- # of occurrences
    Sunday, February 24, 13
    
slide 101:
    Drill-Down Analysis Method, cont.
    • Moves from higher- to lower-level details based on findings:
    environment-wide down to metal
    • Pros:
    • Will identify root cause(s)
    • Cons:
    • Time consuming – especially when drilling in the wrong
    direction
    Sunday, February 24, 13
    
slide 102:
    USE Method
    Sunday, February 24, 13
    
slide 103:
    USE Method
    • For every resource, check:
    • 1. Utilization
    • 2. Saturation
    • 3. Errors
    Sunday, February 24, 13
    
slide 104:
    USE Method, cont.
    • For every resource, check:
    • 1. Utilization: time resource was busy, or degree used
    • 2. Saturation: degree of queued extra work
    • 3. Errors: any errors
    [Figure: a resource annotated with Utilization, Saturation, and Errors]
    Sunday, February 24, 13
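    As a sketch of what a single-resource USE check can look like (the interface name is a placeholder; sar output layouts vary):

      #!/bin/bash
      # Mini USE check for one network interface.
      iface=eth0
      # Utilization: compare rxkB/s and txkB/s against the interface line rate:
      sar -n DEV 1 1 | grep -w "$iface"
      # Errors, and drops (a saturation clue):
      grep . /sys/class/net/"$iface"/statistics/{rx_errors,tx_errors,rx_dropped,tx_dropped}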
    
slide 105:
    USE Method, cont.
    • Hardware Resources:
    • CPUs
    • Main Memory
    • Network Interfaces
    • Storage Devices
    • Controllers
    • Interconnects
    • Find the functional diagram and examine every item in the
    data path...
    Sunday, February 24, 13
    
slide 106:
    USE Method, cont.: Functional Diagram
    [Figure: hardware functional diagram – CPUs and CPU interconnect, memory buses and DRAM, I/O bus, I/O bridge, expander interconnect, I/O controller, disks, network controller, interface transports, ports. For each, check: 1. Utilization, 2. Saturation, 3. Errors]
    Sunday, February 24, 13
    
slide 107:
    USE Method, cont.: Example Linux Checklist
    http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist
    Resource   Type          Metric
    CPU        Utilization   per-cpu: mpstat -P ALL 1, “%idle”; sar -P ALL, “%idle”; system-wide:
                             vmstat 1, “id”; sar -u, “%idle”; dstat -c, “idl”; per-process: top, “%CPU”;
                             htop, “CPU%”; ps -o pcpu; pidstat 1, “%CPU”; per-kernel-thread: top/htop
                             (“K” to toggle), where VIRT == 0 (heuristic). [1]
    CPU        Saturation    system-wide: vmstat 1, “r” > CPU count [2]; sar -q, “runq-sz” > CPU count;
                             dstat -p, “run” > CPU count; per-process: /proc/PID/schedstat 2nd field
                             (sched_info.run_delay); perf sched latency (shows “Average” and “Maximum”
                             delay per-schedule); dynamic tracing, eg, SystemTap schedtimes.stp
                             “queued(us)” [3]
    CPU        Errors        perf (LPE) if processor specific error events (CPC) are available; eg,
                             AMD64′s “04Ah Single-bit ECC Errors Recorded by Scrubber” [4]

    ... etc for all combinations (would fill a dozen slides)
    Sunday, February 24, 13
    Sunday, February 24, 13
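    The CPU saturation row above can be scripted directly; a minimal sketch (assumes default vmstat output, skipping the headers and the since-boot sample):

      #!/bin/bash
      # Flag samples where the run queue ("r") exceeds the CPU count.
      cpus=$(nproc)
      vmstat 1 5 | awk -v c="$cpus" 'NR > 3 && $1 > c { print "saturation clue: r =", $1, "> CPUs =", c }'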
    
slide 108:
    USE Method, cont.
    • Some software resources can also be studied:
    • Mutex Locks
    • Thread Pools
    • Process/Thread Capacity
    • File Descriptor Capacity
    • Consider possible USE metrics for each
    Sunday, February 24, 13
    
slide 109:
    USE Method, cont.
    • This process may reveal missing metrics – those not provided
    by your current toolset
    • They are your known unknowns
    • Much better than unknown unknowns
    • More tools can be installed and developed to help
    • So many top(1)s, but where is the interconnect-top?
    • Full USE Method checklist may, practically, only be used for
    critical issues
    Sunday, February 24, 13
    
slide 110:
    USE Method, cont.
    • Resource-based approach
    • Quick system health check, early in an investigation
    • Pros:
    • Complete: all resource bottlenecks and errors
    • Not limited in scope by your current toolset
    • No unknown unknowns – at least known unknowns
    • Efficient: picks three metrics for each resource –
    from what may be dozens available
    • Cons:
    • Limited to a class of issues
    Sunday, February 24, 13
    
slide 111:
    Other Methodologies
    • Include:
    • Blame-Someone-Else Anti-Method
    • Tools Method
    • Ad-Hoc Checklist Method
    • Problem Statement Method
    • Scientific Method
    • Latency Analysis
    • Stack Profile Method
    • http://dtrace.org/blogs/brendan/2012/12/13/usenix-lisa-2012-performance-analysis-methodology/
    Sunday, February 24, 13
    
slide 112:
    Challenges
    • Performance counter analysis (eg, bus or interconnect port
    analysis) is time consuming – would like tools for convenience
    • How about a “bustop” subcommand for perf?
    • DTrace for Linux ports still in progress – will be awesome
    when complete
    Sunday, February 24, 13
    
slide 113:
    Cloud Computing
    • Performance may be limited by cloud resource controls, rather
    than physical limits
    • Hardware Virtualization complicates things – as a guest you
    can’t analyze down to metal directly
    • Hopefully the cloud provider provides an API for accessing
    physical statistics, or does the analysis on your behalf
    • We do analysis at Joyent (and our hypervisors have DTrace!)
    • Free trial for new customers: good for $125 of usage value (~
    one Small 1GB SmartMachine for 60 days). All prices subject
    to change. Limited time only. Sign up at joyent.com
    Sunday, February 24, 13
    
slide 114:
    References
    • Linux man pages, source, /Documentation
    • USE Method: http://queue.acm.org/detail.cfm?id=2413037
    • http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist/
    • http://dtrace.org/blogs/brendan/2012/12/13/usenix-lisa-2012-performance-analysis-methodology/
    • https://github.com/dtrace4linux, http://www.dtracebook.com,
    http://illumos.org, http://smartos.org
    • Upcoming: “Systems Performance” (Prentice Hall)
    Sunday, February 24, 13
    
slide 115:
    Thank you!
    • email: brendan@joyent.com
    • twitter: @brendangregg
    • blog: http://dtrace.org/blogs/brendan
    • blog resources:
    • http://dtrace.org/blogs/brendan/tag/linux-2/
    • http://dtrace.org/blogs/brendan/2012/02/29/the-use-method/
    • http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist/
    • http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/
    • http://dtrace.org/blogs/brendan/2012/03/17/linux-kernel-performance-flame-graphs/
    • http://dtrace.org/blogs/brendan/2011/10/15/using-systemtap/
    Sunday, February 24, 13