
SCaLE11x (2013): Linux Performance Analysis and Tools

Video: http://www.youtube.com/watch?v=0yyorhl6IjM

Talk for the Southern California Linux Expo 11x (2013) by Brendan Gregg.

Description: "This talk for SCaLE11x covers system performance analysis methodologies and the Linux tools to support them, so that you can get the most out of your systems and solve performance issues quickly. This includes a wide variety of tools, including basics like top(1), advanced tools like perf, and new tools like the DTrace for Linux prototypes."


PDF: SCaLE_Linux_Performance2013.pdf

Keywords (from pdftotext):

slide 1:
    Linux Performance
    Analysis and Tools
    Brendan Gregg
    Lead Performance Engineer
    brendan@joyent.com
    @brendangregg
    Sunday, February 24, 13
    SCaLE11x
    February, 2013
    
slide 2:
    Find the Bottleneck
    [Figure: the full system diagram – Applications (DBs, all server types, ...), System Libraries, System Call Interface, the Linux kernel (VFS, ext3/..., ZFS, LVM, Block Device Interface, Sockets, TCP/UDP, Ethernet, Scheduler, Virtual Memory, Device Drivers), and the hardware (CPUs and CPU interconnect, memory bus and DRAM, I/O bus, I/O bridge, expander interconnect, I/O controller, disks, network controller, interface transports and ports)]
    Sunday, February 24, 13
    
slide 3:
    whoami
    • Lead Performance Engineer
    • Work/Research: tools, visualizations, methodologies
    • Was Brendan@Sun Microsystems, Oracle, now Joyent
    Sunday, February 24, 13
    
slide 4:
    Joyent
    • High-Performance Cloud Infrastructure
    • Compete on cloud instance/OS performance
    • Public/private cloud provider
    • OS-Virtualization for bare metal performance (Zones)
    • Core developers of SmartOS and node.js
    • KVM for Linux guests
    Sunday, February 24, 13
    
slide 5:
    SCaLE10x: Cloud Performance Analysis
    • Example perf issues, including new tools and visualizations:
    http://dtrace.org/blogs/brendan/2012/01/30/performance-analysis-talk-at-scale10x/
    Sunday, February 24, 13
    
slide 6:
    SCaLE11x: Linux Performance Analysis
    • The primary operating system for my next book:
    Systems Performance: Enterprise and the Cloud,
    Brendan Gregg, Prentice Hall, 2013
    (secondary is the OpenSolaris-illumos-based SmartOS)
    Sunday, February 24, 13
    
slide 7:
    Agenda
    • Background
    • Linux Analysis and Tools
    • Basic
    • Intermediate
    • Advanced
    • Methodologies
    • Challenges
    Sunday, February 24, 13
    
slide 8:
    Performance
    • Why do performance analysis?
    • Reduce IT spend – find and eliminate waste, find areas to
    tune, and do more with less
    • Build scalable architectures – understand system limits
    and develop around them
    • Solve issues – locate bottlenecks and latency outliers
    Sunday, February 24, 13
    
slide 9:
    Systems Performance
    • Why study the operating system?
    • Find and fix kernel-based perf issues
    • 2-20% wins: I/O or buffer size tuning, NUMA config, etc
    • 2-200x wins: bugs, disabled features, perturbations causing
    latency outliers
    • Kernels change, new devices are added, workloads scale, and
    new perf issues are encountered.
    • Analyze application perf from kernel/system context
    • 2-2000x wins: identifying and eliminating unnecessary work
    Sunday, February 24, 13
    
slide 10:
    Perspectives
    • System analysis can be top-down, or bottom-up:
    [Figure: the software stack from Workload → Application → System Libraries → System Calls → Kernel → Devices; developers work top-down (workload analysis), sysadmins work bottom-up (resource analysis)]
    Sunday, February 24, 13
    
slide 11:
    Kernel Internals
    • Eventually you’ll need to know some kernel internals
    [Figure: the operating system diagram divided into user-level (Applications – DBs, all server types, ... – and System Libraries) and kernel-level (System Call Interface, VFS, ext3/..., ZFS, LVM, Block Device Interface, Sockets, TCP/UDP, Ethernet, Scheduler, Virtual Memory, Device Drivers)]
    Sunday, February 24, 13
    
slide 12:
    Common System Metrics
    $ iostat
    Linux 3.2.6-3.fc16.x86_64 (node104)    02/20/2013    _x86_64_    (1 CPU)

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                ...

    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    vda                ...
    vdb                ...
    • It’s also worth studying common system metrics (iostat, ...),
    even if you intend to use a monitoring product. Monitoring
    products often use the same metrics, read from /proc.
    Sunday, February 24, 13
    
slide 13:
    Analysis and Tools
    Sunday, February 24, 13
    
slide 14:
    Analysis and Tools
    • A quick tour of tools, to show what can be done
    • Then, some methodologies for applying them
    Sunday, February 24, 13
    
slide 15:
    Analysis and Tools
    [Figure: the same operating system and hardware diagram, repeated as a map for the tools that follow]
    Sunday, February 24, 13
    
slide 16:
    Analysis and Tools
    [Figure: the system diagram annotated with observability tools – strace, perf, dtrace, stap at the system call interface and applications; netstat, tcpdump, ip, nicstat, ping, dstat for the network stack and network controller; top, pidstat, mpstat, dstat, perf for the CPUs; vmstat, slabtop, dstat, free, top for virtual memory and DRAM; iostat, iotop, blktrace, dtrace, perf for block I/O and disks; sar and /proc for various]
    Sunday, February 24, 13
    
slide 17:
    Tools: Basic
    • uptime
    • top or htop
    • mpstat
    • iostat
    • vmstat
    • free
    • ping
    • nicstat
    • dstat
    Sunday, February 24, 13
    
slide 18:
    uptime
    • Shows load averages, which are also shown by other tools:
    $ uptime
    16:23:34 up 126 days,
    1:03,
    1 user,
    load average: 5.09, 2.12, 1.82
    • This counts runnable threads (tasks), on-CPU, or, runnable
    and waiting. Linux includes tasks blocked on disk I/O.
    • These are exponentially-damped moving averages, with time
    constants of 1, 5 and 15 minutes. With three values you can
    see if load is increasing, steady, or decreasing.
    • If the load is greater than the CPU count, it might mean the
    CPUs are saturated (100% utilized), and threads are suffering
    scheduler latency. Might. There’s that disk I/O factor too.
    • This is only useful as a clue. Use other tools to investigate!
    Sunday, February 24, 13
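    Not from the slides: a minimal sketch of acting on that clue by comparing the 1-minute load average against the CPU count (nproc and /proc/loadavg are standard on Linux):

      #!/bin/bash
      # Compare the 1-minute load average with the CPU count (a clue, not a verdict).
      cpus=$(nproc)
      load1=$(awk '{ print $1 }' /proc/loadavg)
      echo "1-min load: ${load1}, CPUs: ${cpus}"
      if [ "${load1%.*}" -ge "${cpus}" ]; then
          echo "load >= CPU count: check CPU saturation and disk I/O with other tools"
      fi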
    
slide 19:
    top
    • System-wide and per-process summaries:
    $ top
    top - 01:38:11 up 63 days, 1:17, 2 users, load average: 1.57, 1.81, 1.77
    Tasks: 256 total,   2 running, 254 sleeping,   0 stopped,   0 zombie
    Cpu(s): 2.0%us, 3.6%sy, 0.0%ni, 94.2%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
    Mem:  49548744k total, 16746572k used, 32802172k free,   182900k buffers
    Swap: 100663292k total,        0k used, 100663292k free, 14925240k cached

      PID USER    NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
    11721 web      0  623m  50m 4984 R   93  0.1   0:59.50 node
    11715 web      0  619m  20m 4916 S   25  0.0   0:07.52 node
       10 root     0              S      1  0.0 248:52.56 ksoftirqd/2
       51 root     0              S      0  0.0   0:35.66 events/0
    11724 admin    0 19412 1444  960 R    0  0.0   0:00.07 top
        1 root     0 23772 1948 1296 S    0  0.0   0:04.35 init
    [...]
    • %CPU = interval sum for all CPUs (varies on other OSes)
    • top can consume CPU (syscalls to read /proc)
    • Straight-forward. Or is it?
    Sunday, February 24, 13
    
slide 20:
    top, cont.
    • Interview questions:
    • 1. Does it show all CPU consumers?
    • 2. A process has high %CPU – next steps for analysis?
    Sunday, February 24, 13
    
slide 21:
    top, cont.
    • 1. top can miss:
    • short-lived processes
    • kernel threads (tasks), unless included (see top options)
    • 2. analyzing high CPU processes:
    • identify why – profile code path
    • identify what – execution or stall cycles
    • High %CPU time may be stall cycles on memory I/O –
    upgrading to faster CPUs doesn’t help!
    Sunday, February 24, 13
    
slide 22:
    htop
    • Super top. Super configurable. Eg, basic CPU visualization:
    Sunday, February 24, 13
    
slide 23:
    mpstat
    • Check for hot threads, unbalanced workloads:
    $ mpstat -P ALL 1
    02:47:49     CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest  %idle
    02:47:50     all    ...
    02:47:50       3 100.00
    02:47:50       4 100.00
    02:47:50       5 100.00
    02:47:50       6 100.00
    02:47:50       8 100.00
    [...]
    • Columns are summarized system-wide in top(1)’s header
    Sunday, February 24, 13
    
slide 24:
    iostat
    • Disk I/O statistics. 1st output is summary since boot.
    $ iostat -xkdz 1
    Linux 2.6.35-32-server (prod21)   02/20/13   _x86_64_   (16 CPU)

    Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s  ...      <- workload input
    sda         ...
    sdb         ...

    Device:  ...  avgqu-sz   await r_await w_await   svctm   %util  <- resulting performance
    sdb         ...
    [...]
    Sunday, February 24, 13
    
slide 25:
    iostat, cont.
    • %util: usefulness depends on target – virtual devices backed
    by multiple disks may accept more work at 100% utilization
    • Also calculate I/O controller stats by summing their devices
    • One nit: would like to see disk errors too. Add a “-e”?
    Sunday, February 24, 13
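    A sketch of the "summing their devices" idea (not from the talk; the device names and the rkB/s/wkB/s field positions are assumptions, and iostat column layouts vary by sysstat version):

      # Approximate I/O controller throughput by summing per-device iostat rates.
      iostat -xkd 1 | awk '
          /^Device/ { if (n++) printf "controller: %.1f rkB/s, %.1f wkB/s\n", r, w; r = w = 0 }
          $1 ~ /^sd[ab]$/ { r += $6; w += $7 }   # rkB/s, wkB/s for sda + sdb'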
    
slide 26:
    vmstat
    • Virtual-Memory statistics, and other high-level summaries:
    $ vmstat 1
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
     r  b   swpd     free    buff   cache   si   so    bi    bo   in    cs us sy id wa
    15  0   2852 46686812 279456 1401196  ...     0     0     0 100  0
    16  0   2852 46685192 279456 1401196  ...     0  2136 36607 56 33 11  0
    15  0   2852 46685952 279456 1401196  ...    56  2150 36905 54 35 11  0
    15  0   2852 46685960 279456 1401196  ...     0  2173 36645 54 33 13  0
    [...]
    • First line of output includes some summary-since-boot values
    • “r” = total number of runnable threads, including those running
    • Swapping (aka paging) allows over-subscription of main
    memory by swapping pages to disk, but costs performance
    Sunday, February 24, 13
    
slide 27:
    free
    • Memory usage summary (Kbytes default):
    $ free
                 total       used       free     shared    buffers     cached
    Mem:           ...        ...        ...        ...        ...        ...
    -/+ buffers/cache:        ...        ...
    Swap:          ...        ...        ...

    • buffers: block device I/O cache
    • cached: virtual page cache
    Sunday, February 24, 13
    
slide 28:
    ping
    • Simple network test (ICMP):
    $ ping www.hilton.com
    PING a831.b.akamai.net (63.234.226.9): 56 data bytes
    64 bytes from 63.234.226.9: icmp_seq=0 ttl=56 time=737.737 ms
    Request timeout for icmp_seq 1
    64 bytes from 63.234.226.9: icmp_seq=2 ttl=56 time=819.457 ms
    64 bytes from 63.234.226.9: icmp_seq=3 ttl=56 time=897.835 ms
    64 bytes from 63.234.226.9: icmp_seq=4 ttl=56 time=669.052 ms
    64 bytes from 63.234.226.9: icmp_seq=5 ttl=56 time=799.932 ms
    --- a831.b.akamai.net ping statistics ---
    6 packets transmitted, 5 packets received, 16.7% packet loss
    round-trip min/avg/max/stddev = 669.052/784.803/897.835/77.226 ms
    • Used to measure network latency. Actually kernel <-> kernel
    IP stack latency, including how the network handles ICMP.
    • Tells us some, but not a lot (above is an exception).
    Lots of other/better tools for this (eg, hping). Try using TCP.
    Sunday, February 24, 13
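    As a rough illustration of the "try using TCP" suggestion (a sketch, not from the slides; the host and port are placeholders), bash can time a TCP connect via its /dev/tcp feature:

      #!/bin/bash
      # Time a TCP connect to host:port; GNU date provides nanosecond timestamps.
      host=www.example.com port=80
      t0=$(date +%s%N)
      exec 3<>"/dev/tcp/${host}/${port}" && exec 3>&-
      t1=$(date +%s%N)
      echo "TCP connect latency: $(( (t1 - t0) / 1000000 )) ms"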
    
slide 29:
    nicstat
    • Network statistics tool, ver 1.92 on Linux:
    # nicstat -z 1
        Time      Int    rKB/s    wKB/s    rPk/s    wPk/s     rAvs     wAvs  %Util    Sat
    01:20:58     eth0      ...      ...      ...      ...      ...    64.81   0.00    ...
    01:20:58     eth4      ...      ...      ...      ...      ...    80.11   0.00    ...
    01:20:58  vlan123      ...      ...      ...      ...      ...    64.81   0.00    ...
    01:20:58      br0      ...      ...      ...      ...      ...    42.07   0.00    ...
        Time      Int    rKB/s    wKB/s    rPk/s    wPk/s     rAvs     wAvs  %Util    Sat
    01:20:59     eth4  42376.0    974.5  28589.4  14002.1      ...    71.27   35.5    ...
        Time      Int    rKB/s    wKB/s    rPk/s    wPk/s     rAvs     wAvs  %Util    Sat
    01:21:00     eth0      ...      ...      ...      ...      ...     0.00   0.00    ...
    01:21:00     eth4  41834.7    977.9  28221.5  14058.3      ...    71.23   35.1    ...
        Time      Int    rKB/s    wKB/s    rPk/s    wPk/s     rAvs     wAvs  %Util    Sat
    01:21:01     eth4  42017.9    979.0  28345.0  14073.0      ...    71.24   35.2    ...
    [...]
    • This was the tool I wanted, and finally wrote it out of frustration
    (Tim Cook ported and enhanced it on Linux)
    • Calculate network controller stats by summing interfaces
    Sunday, February 24, 13
    
slide 30:
    dstat
    • A better vmstat-like tool. Does coloring (FWIW).
    Sunday, February 24, 13
    
slide 31:
    Tools: Basic, recap
    • uptime
    • top or htop
    • mpstat
    • iostat
    • vmstat
    • free
    • ping
    • nicstat
    • dstat
    Sunday, February 24, 13
    
slide 32:
    Tools: Basic, recap
    [Figure: the system diagram annotated with the basic tools – top, mpstat, dstat at the CPUs; vmstat, dstat, free, top at virtual memory and DRAM; iostat at the block device interface; nicstat, ping, dstat at the network interfaces; I/O controller, disk, and network controller stats are inferred]
    Sunday, February 24, 13
    
slide 33:
    Tools: Intermediate
    • sar
    • netstat
    • pidstat
    • strace
    • tcpdump
    • blktrace
    • iotop
    • slabtop
    • sysctl
    • /proc
    Sunday, February 24, 13
    
slide 34:
    sar
    • System Activity Reporter. Eg, paging statistics -B:
    $ sar -B 1
    Linux 3.2.6-3.fc16.x86_64 (node104)    02/20/2013    _x86_64_    (1 CPU)

    05:24:34 PM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
    05:24:35 PM       ...
    05:24:36 PM       ...
    05:24:37 PM       ...
    05:24:38 PM       ...
    05:24:39 PM       ...
    05:24:40 PM       ...
    [...]
    • Configure to archive statistics from cron
    • Many, many statistics available:
    • -d: block device statistics, -q: run queue statistics, ...
    • Same statistics as shown by other tools (vmstat, iostat, ...)
    Sunday, February 24, 13
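    A sketch of the cron idea (the sa1 path is an assumption: distros install the sysstat collector under /usr/lib/sa or /usr/lib64/sa, and usually ship an equivalent entry already):

      # /etc/cron.d/sysstat: record system activity every 10 minutes
      */10 * * * * root /usr/lib64/sa/sa1 1 1
      # later, report from the archive, eg, paging statistics:
      # sar -B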
    
slide 35:
    netstat
    • Various network protocol statistics using -s:
    $ netstat -s
    [...]
    Tcp:
    127116 active connections openings
    165223 passive connection openings
    12904 failed connection attempts
    19873 connection resets received
    20 connections established
    662889209 segments received
    354923419 segments send out
    405146 segments retransmited
    6 bad segments received.
    26379 resets sent
    [...]
    TcpExt:
    2142 invalid SYN cookies received
    3350 resets received for embryonic SYN_RECV sockets
    7460 packets pruned from receive queue because of socket buffer overrun
    2932 ICMP packets dropped because they were out-of-window
    96670 TCP sockets finished time wait in fast timer
    86 time wait sockets recycled by time stamp
    1007 packets rejects in established connections because of timestamp
    [...many...]
    Sunday, February 24, 13
    
slide 36:
    pidstat
    • Very useful process breakdowns:
    # pidstat 1
    Linux 3.2.6-3.fc16.x86_64 (node107)    02/20/2013    _x86_64_ (1 CPU)

    05:55:18 PM       PID    %usr %system  %guest    %CPU   CPU  Command
    05:55:19 PM       ...                                        pidstat
    05:55:19 PM       ...                                        cksum

    05:55:19 PM       PID    %usr %system  %guest    %CPU   CPU  Command
    05:55:20 PM       ...                                        cksum
    [...]

    # pidstat -d 1
    Linux 3.2.6-3.fc16.x86_64 (node107)    02/20/2013    _x86_64_ (1 CPU)

    05:55:22 PM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
    05:55:23 PM       ...                                jbd2/vda2-8
    05:55:23 PM     12643 151985.71       ...       ...  cksum

    05:55:23 PM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
    05:55:24 PM       ...                                cksum
    [...]
    disk I/O (yay!)
    Sunday, February 24, 13
    
slide 37:
    strace
    • System call tracer:
    $ strace -tttT -p 12670
    1361424797.229550 read(3, "REQUEST 1888 CID 2"..., 65536) = 959 <...>
    1361424797.239053 read(3, "", 61440)            = 0 <...>
    1361424797.239406 close(3)                      = 0 <...>
    1361424797.239738 munmap(0x7f8b22684000, 4096)  = 0 <...>
    1361424797.240145 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 <...>
    [...]
    • -ttt: microsecond timestamp since epoch (left column)
    • -T: time spent in syscall (the trailing <...> values)
    • -p: PID to trace (or provide a command)
    • Useful – high application latency often caused by resource I/O,
    and most resource I/O is performed by syscalls
    Sunday, February 24, 13
    
slide 38:
    strace, cont.
    • -c: print summary:
    # strace -c dd if=/dev/zero of=/dev/null bs=512 count=1024k
    [...]
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     51.32         ...         ...       ...           read
       ...         ...         ...       ...           write
       ...         ...         ...       ...           open
    [...]
    • This is also a (worst case) demo of the strace overhead:
    # time dd if=/dev/zero of=/dev/null bs=512 count=1024k
    [...]
    536870912 bytes (537 MB) copied, 0.35226 s, 1.5 GB/s
    real 0m0.355s
    user 0m0.021s
    sys 0m0.022s
    # time strace -c dd if=/dev/zero of=/dev/null bs=512 count=1024k
    [...]
    536870912 bytes (537 MB) copied, 71.9565 s, 7.5 MB/s
    real 1m11.969s
    user 0m3.179s
    sys 1m6.346s
    200x slower
    Sunday, February 24, 13
    
slide 39:
    tcpdump
    • Sniff network packets, dump to output files for post analysis:
    # tcpdump -i eth4 -w /tmp/out.tcpdump
    tcpdump: listening on eth4, link-type EN10MB (Ethernet), capture size 65535
    bytes
    ^C33651 packets captured
    34160 packets received by filter
    508 packets dropped by kernel
    # tcpdump -nr /tmp/out.tcpdump
    reading from file /tmp/out.tcpdump, link-type EN10MB (Ethernet)
    06:24:43.908732 IP 10.2.0.2.55502 > 10.2.203.2.22: Flags [.], ack ...
    06:24:43.908922 IP 10.2.0.2.55502 > 10.2.203.2.22: Flags [.], ack ...
    06:24:43.908943 IP 10.2.203.2.22 > 10.2.0.2.55502: Flags [.], seq ...
    06:24:43.909061 IP 10.2.0.2.55502 > 10.2.203.2.22: Flags [.], ack ...
    • Output has timestamps with microsecond resolution
    • Study odd network latency packet-by-packet
    • Import file into other tools (wireshark)
    Sunday, February 24, 13
    
slide 40:
    tcpdump, cont.
    • Does have overhead in terms of CPU and storage; previous
    example dropped packets
    • Should be using socket ring buffers to reduce overhead
    • Can use filter expressions to also reduce overhead
    • Could still be problematic for busy interfaces
    Sunday, February 24, 13
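    For example (a sketch; the interface, port, and packet count are placeholders), a filter expression plus a packet limit keeps capture overhead down:

      # Capture only TCP SYN/FIN/RST on port 80, stop after 1000 packets:
      tcpdump -i eth4 -c 1000 -w /tmp/filtered.tcpdump \
          'tcp port 80 and tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0'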
    
slide 41:
    blktrace
    • Block device I/O event tracing. Launch using btrace, eg:
    # btrace /dev/sdb
      8,16 ...  0.429604145 20442 ...
      8,16 ...  0.429604569 20442 ...
      8,16 ...  0.429606014 20442 ...
      8,16 ...  0.429607624 20442 ...
      8,16 ...  0.429608804 20442 ...
      8,16 ...  0.429610501 20442 ...
      8,16 ...  0.429611912 20442 ...
      8,16 ...  R 184773879 + 8 ...
    [...]
slide 42:
    iotop
    • Disk I/O by process:
    # iotop -bod5
    Total DISK READ: 35.38 M/s | Total DISK WRITE: 39.50 K/s
      TID  PRIO  USER    DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
    12824  be/4  root    35.35 M/s    0.00 B/s  0.00 %  80.59 %   cksum ...
      279  be/3  root     0.00 B/s   27.65 K/s  0.00 %   2.21 %   [jbd2/vda2-8]
    12716  be/4  root    28.44 K/s    0.00 B/s  2.35 %   0.00 %   sshd: root@pts/0
    12816  be/4  root     6.32 K/s    0.00 B/s  0.89 %   0.00 %   python /usr/bin/iotop -bod5
    [...]
    • IO: time thread was waiting on I/O (this is even more useful
    than pidstat’s Kbytes)
    • Needs CONFIG_TASK_IO_ACCOUNTING or something
    similar enabled to work.
    Sunday, February 24, 13
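    A quick way to check for that kernel support (a sketch; the config file path varies by distro, and some kernels expose /proc/config.gz instead):

      grep -E 'CONFIG_TASK_IO_ACCOUNTING|CONFIG_TASK_DELAY_ACCT' /boot/config-$(uname -r)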
    
slide 43:
    slabtop
    • Kernel slab allocator usage top:
    # slabtop -sc
     Active / Total Objects (% used)    : 900356 / 1072416 (84.0%)
     Active / Total Slabs (% used)      : 29085 / 29085 (100.0%)
     Active / Total Caches (% used)     : 68 / 91 (74.7%)
     Active / Total Size (% used)       : 237067.98K / 260697.24K (90.9%)
     Minimum / Average / Maximum Object : 0.01K / 0.24K / 10.09K

      OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
    112035 110974  99%    0.91K    ...      ...   102432K ext4_inode_cache
    726660 579946  79%    0.11K    ...      ...    80740K buffer_head
       ...    ...  96%    4.00K    ...      ...    18432K kmalloc-4096
     83496  76878  92%    0.19K    ...      ...    15904K dentry
     23809  23693  99%    0.55K    ...      ...    13136K radix_tree_node
       ...    ...  86%    0.62K    ...      ...     6912K proc_inode_cache
       ...    ...  77%    1.00K    ...      ...     3488K kmalloc-1024
       ...    ...  84%    5.73K    ...      ...     3264K task_struct
       ...    ...  82%    0.17K    ...      ...     1904K vm_area_struct
       ...    ...  74%    0.58K    ...      ...     1504K inode_cache
    [...]
    • Shows where kernel memory is consumed
    Sunday, February 24, 13
    
slide 44:
    sysctl
    • System settings:
    # sysctl -a
    [...]
    net.ipv4.tcp_fack = 1
    net.ipv4.tcp_reordering = 3
    net.ipv4.tcp_ecn = 2
    net.ipv4.tcp_dsack = 1
    net.ipv4.tcp_mem = 24180
    net.ipv4.tcp_wmem = 4096
    net.ipv4.tcp_rmem = 4096
    [...]
    • Static performance tuning: check the config of the system
    Sunday, February 24, 13
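    Static performance tuning can start with just reading the tunables of interest (a sketch; the tunables shown are examples):

      # Read selected tunables, or search for a subsystem:
      sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
      sysctl -a | grep tcp_reordering
      # sysctl -w changes a value at runtime; /etc/sysctl.conf makes it persistent.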
    
slide 45:
    /proc
    • Read statistic sources directly:
    $ cat /proc/meminfo
    MemTotal:        8181740 kB
    MemFree:           71632 kB
    Buffers:          163288 kB
    Cached:          4518600 kB
    SwapCached:         7036 kB
    Active:          4765476 kB
    Inactive:        2866016 kB
    Active(anon):    2480336 kB
    Inactive(anon):   478580 kB
    Active(file):    2285140 kB
    Inactive(file):  2387436 kB
    Unevictable:           0 kB
    Mlocked:               0 kB
    SwapTotal:       2932728 kB
    SwapFree:        2799568 kB
    Dirty:                76 kB
    Writeback:             0 kB
    [...]
    • Also see /proc/vmstat
    Sunday, February 24, 13
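    Not from the slides: a sketch of consuming /proc directly, reducing the same file with awk (here summing memory that is free or reclaimable page cache):

      awk '/^(MemFree|Buffers|Cached):/ { kb += $2 }
           END { printf "free + buffers + cached: %.0f MB\n", kb / 1024 }' /proc/meminfo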
    
slide 46:
    Tools: Intermediate, recap.
    • sar
    • netstat
    • pidstat
    • strace
    • tcpdump
    • blktrace
    • iotop
    • slabtop
    • sysctl
    • /proc
    Sunday, February 24, 13
    
slide 47:
    Tools: Advanced
    • perf
    • DTrace
    • SystemTap
    • and more ...
    Sunday, February 24, 13
    
slide 48:
    perf
    • Originally Performance Counters for Linux (PCL), focusing on
    CPU performance counters (programmable registers)
    • Now a collection of profiling and tracing tools, with numerous
    subcommands, including:
    kmem      Trace/measure kernel memory (slab) properties
    kvm       Trace/measure KVM guest OS
    list      List available events (targets of instrumentation)
    lock      Analyze lock events
    probe     Create dynamic probe points (dynamic tracing!)
    record    Run a command and record profile data (as perf.data)
    report    Read perf.data and summarize, has an interactive mode
    sched     Trace/measure kernel scheduler statistics
    stat      Run a command, gather, and report perf counter stats
    Sunday, February 24, 13
    
slide 49:
    perf: Performance Counters
    • Key performance counter summary:
    $ perf stat gzip file1

     Performance counter stats for 'gzip file1':

          ...  task-clock-msecs      #     0.901 CPUs
          ...  context-switches      #     0.000 M/sec
          ...  CPU-migrations        #     0.000 M/sec
          ...  page-faults           #     0.000 M/sec
          ...  cycles                #  2395.230 M/sec
          ...  instructions          #     2.221 IPC      <- yay
          ...  branches              #   550.641 M/sec
          ...  branch-misses         #     1.032 %
          ...  cache-references      #     2.059 M/sec
          ...  cache-misses          #     1.211 M/sec

          ...  seconds time elapsed
    • Low IPC (
slide 50:
    perf: Performance Counters, cont.
    • Can choose different counters:
    $ perf list | grep Hardware
      cpu-cycles OR cycles                             [Hardware event]
      stalled-cycles-frontend OR idle-cycles-frontend  [Hardware event]
      stalled-cycles-backend OR idle-cycles-backend    [Hardware event]
      instructions                                     [Hardware event]
      cache-references                                 [Hardware event]
    [...]

    $ perf stat -e instructions,cycles,L1-dcache-load-misses,LLC-load-misses,dTLB-load-misses gzip file1

     Performance counter stats for 'gzip file1':

          ...  instructions          #     2.199 IPC
          ...  cycles
          ...  L1-dcache-load-misses
          ...  LLC-load-misses
          ...  dTLB-load-misses

          ...  seconds time elapsed

    • Supports additional custom counters (in hex or a desc) for
    whatever the processor supports. Examine bus events.
    Sunday, February 24, 13
    
slide 51:
    perf: Performance Counters, cont.
    [Figure: the system diagram with the hardware highlighted – CPUs, CPU interconnect, memory bus, DRAM, I/O bus, I/O bridge, expander interconnect, I/O controller, disks, network controller, ports – as targets for perf stat; advanced activity: refer to the processor manuals]
    Sunday, February 24, 13
    
slide 52:
    perf: Profiling
    • Profiling (sampling) CPU activity:
    # perf record -a -g -F 997 sleep 10
    [ perf record: Woken up 44 times to write data ]
    • -a: all CPUs
    • -g: call stacks
    • -F: Hertz
    • sleep 10: duration to sample (dummy command)
    • Generates a perf.data file
    • Can profile other hardware events too, with call stacks
    Sunday, February 24, 13
    
slide 53:
    perf: Profiling, cont.
    • Reading perf.data, forcing non-interactive mode (--stdio):
    # perf report --stdio
    [...]
    # Overhead  Command      Shared Object      Symbol
    # ........  ...........  .................  ..............................
        72.98%  swapper      [kernel.kallsyms]  [k] native_safe_halt
                --- native_safe_halt
                    default_idle
                    cpu_idle
                    rest_init
                    start_kernel
                    x86_64_start_reservations
                    x86_64_start_kernel

         9.43%  dd           [kernel.kallsyms]  [k] acpi_pm_read
                --- acpi_pm_read
                    ktime_get_ts
                    |--87.75%-- __delayacct_blkio_start
                    |           io_schedule_timeout
                    |           balance_dirty_pages_ratelimited_nr
                    |           generic_file_buffered_write
    [...]
    Sunday, February 24, 13
    
slide 54:
    perf: Profiling, cont.
    • Flame Graphs support perf profiling data:
    • Interactive SVG. Navigate to quantify and compare code paths
    Sunday, February 24, 13
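    A sketch of generating one from the perf.data recorded earlier, using the FlameGraph scripts from https://github.com/brendangregg/FlameGraph (the ./FlameGraph checkout path is an assumption):

      perf script > out.perf
      ./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
      ./FlameGraph/flamegraph.pl out.folded > flame.svg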
    
slide 55:
    perf: Static Tracing
    • Listing static tracepoints for block I/O:
    $ perf list | grep block:
      block:block_rq_abort        [Tracepoint event]
      block:block_rq_requeue      [Tracepoint event]
      block:block_rq_complete     [Tracepoint event]
      block:block_rq_insert       [Tracepoint event]
      block:block_rq_issue        [Tracepoint event]
      block:block_bio_bounce      [Tracepoint event]
      block:block_bio_complete    [Tracepoint event]
      block:block_bio_backmerge   [Tracepoint event]
      block:block_bio_frontmerge  [Tracepoint event]
      block:block_bio_queue       [Tracepoint event]
      block:block_getrq           [Tracepoint event]
      block:block_sleeprq         [Tracepoint event]
      block:block_plug            [Tracepoint event]
      block:block_unplug          [Tracepoint event]
      block:block_split           [Tracepoint event]
      block:block_bio_remap       [Tracepoint event]
      block:block_rq_remap        [Tracepoint event]
    • Many useful probes already provided for kernel tracing:
    $ perf list | grep Tracepoint | wc -l
    Sunday, February 24, 13
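    For example (a sketch), those block tracepoints can be counted system-wide without writing any tracing code:

      # Count block I/O issues and completions across all CPUs for 10 seconds:
      perf stat -e block:block_rq_issue -e block:block_rq_complete -a sleep 10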
    
slide 56:
    perf: Static Tracepoints
    [Figure: the system diagram annotated with tracepoint categories – syscalls: at the system call interface; net:, sock:, skb: for the network stack; ext4: for the file system; block: at the block device interface; sched: for the scheduler; vmscan:, kmem: for virtual memory; scsi:, irq: for device drivers; ... more can be added as needed; device stats can be inferred]
    Sunday, February 24, 13
    
slide 57:
    perf: Dynamic Tracing
    • Define custom probes from kernel code; eg, tcp_sendmsg():
    # perf probe --add='tcp_sendmsg'
    Add new event:
      probe:tcp_sendmsg    (on tcp_sendmsg)
    [...]
    # perf record -e probe:tcp_sendmsg -aR -g sleep 5
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.091 MB perf.data (~3972 samples) ]
    # perf report --stdio
    [...]
    # Overhead  Command  Shared Object      Symbol
    # ........  .......  .................  ...............
       100.00%  sshd     [kernel.kallsyms]  [k] tcp_sendmsg
                --- tcp_sendmsg
                    sock_aio_write
                    do_sync_write
                    vfs_write
                    sys_write
                    system_call
                    __GI___libc_write

    active traced call stacks from arbitrary kernel locations!
    Sunday, February 24, 13
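    A sketch of the full workflow, including cleanup (recording function arguments, as in the commented line, assumes kernel debuginfo is available):

      perf probe --add='tcp_sendmsg'           # create the dynamic probe
      # perf probe --add='tcp_sendmsg size'    # with an argument, if debuginfo is installed
      perf record -e probe:tcp_sendmsg -aR -g sleep 5
      perf report --stdio
      perf probe --del=probe:tcp_sendmsg       # remove the probe when done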
    
slide 58:
    perf: Dynamic Tracing, cont.
    perf probe --add
    [Figure: the system diagram showing that dynamic probes can be placed throughout the kernel – VFS, file systems, LVM, block device interface, sockets, TCP/UDP, Ethernet, scheduler, virtual memory, device drivers; advanced activity: refer to the kernel source code; device stats can be inferred]
    Sunday, February 24, 13
    
slide 59:
    perf: Dynamic Tracing, cont.
    • Fills in kernel observability gaps
    • Awesome capability
    • Takes some effort to use (waiting for the trace-dump-analyze
    cycle, and using post-processors to rework the
    output, or the post-scripting capability)
    • Would be the awesomest tool ever, if it wasn’t for ...
    Sunday, February 24, 13
    
slide 60:
    DTrace
    Sunday, February 24, 13
    
slide 61:
    DTrace
    • Programmable, real-time, dynamic and static tracing
    • Perf analysis and troubleshooting, without restarting anything
    • Used on Solaris, illumos/SmartOS, Mac OS X, FreeBSD, ...
    • Two ports in development for Linux (that we know of):
    • 1. dtrace4linux
    • Mostly by Paul Fox
    • 2. Oracle Enterprise Linux DTrace
    • Steady progress
    There are a couple of awesome books about DTrace too
    Sunday, February 24, 13
    
slide 62:
    DTrace: Installation
    • dtrace4linux version:
    1. https://github.com/dtrace4linux/dtrace
    2. README:
    tools/get-deps.pl           # if using Ubuntu
    tools/get-deps-fedora.sh    # RedHat/Fedora
    make all
    make install
    make load
    (need to be root or have sudo access)
    # make load
    tools/load.pl
    13:40:14 Syncing...
    13:40:14 Loading: build-3.2.6-3.fc16.x86_64/driver/dtracedrv.ko
    13:40:15 Preparing symbols...
    13:40:15 Probes available: 281887
    13:40:18 Time: 4s
    • WARNING: still a prototype, can panic/freeze kernels.
    I’m using it in the lab to solve replicated production perf issues
    Sunday, February 24, 13
    
slide 63:
    DTrace: Programming
    • Programming capabilities allow for powerful, efficient, one-liners and scripts. In-kernel custom filtering and aggregation.
    # dtrace -n 'fbt::tcp_sendmsg:entry /execname == "sshd"/ {
    @["bytes"] = quantize(arg3); }'
    dtrace: description 'fbt::tcp_sendmsg:entry ' matched 1 probe
    bytes
    value ------------- Distribution ------------- count
    16 |
    32 |@@@@@@@@@@@@@@@@
    64 |@@@@@@@@@@@@@
    128 |@@@
    256 |@@@@
    512 |@@@
    1024 |@
    2048 |
    4096 |
    8192 |
    • Example shows tcp_sendmsg() size dist for “sshd” PIDs
    Sunday, February 24, 13
    
slide 64:
    DTrace: Programming
    • Programming capabilities allow for powerful, efficient, one-liners and scripts. In-kernel custom filtering and aggregation.
    # dtrace -n 'fbt::tcp_sendmsg:entry /execname == "sshd"/ {
    @["bytes"] = quantize(arg3); }'
    dtrace: description 'fbt::tcp_sendmsg:entry ' matched 1 probe
    bytes
    filter
    aggregation (summarizes)
    value ------------- Distribution ------------- count
    16 |
    32 |@@@@@@@@@@@@@@@@
    64 |@@@@@@@@@@@@@
    128 |@@@
    256 |@@@@
    512 |@@@
    1024 |@
    2048 |
    4096 |
    8192 |
    kernel -> user transfers these numbers only (pre-summarized)
    • Example shows tcp_sendmsg() size dist for “sshd” PIDs
    these examples use dtrace4linux
    Sunday, February 24, 13
    
slide 65:
    DTrace: Real-Time
    • Multiple GUIs use DTrace for real-time statistics. Eg, Joyent
    Cloud Analytics, showing real-time cloud-wide syscall latency:
    Sunday, February 24, 13
    
slide 66:
    DTrace, cont.
    • Has advanced capabilities, but not necessarily difficult;
    You may just:
    • use one-liners (google “DTrace one-liners”)
    • use scripts (DTraceToolkit; DTrace book; google)
    • tweak one-liners or scripts a little
    • ask someone else to write the scripts you need
    • Ideally, you learn DTrace and write your own
    Sunday, February 24, 13
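    For instance, a classic one-liner of the kind those collections contain (a sketch; it uses the syscall provider, whose coverage may be incomplete in the Linux ports):

      # Count system calls by process name until Ctrl-C:
      dtrace -n 'syscall:::entry { @[execname] = count(); }'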
    
slide 67:
    DTrace: Scripts
    #!/usr/sbin/dtrace -s

    fbt::vfs_read:entry
    {
        self->start = timestamp;
    }

    fbt::vfs_read:return
    /self->start/
    {
        @[execname, "ns"] = quantize(timestamp - self->start);
        self->start = 0;
    }

    13 line script to time VFS reads by process name

    # ./vfsread.d
    dtrace: script './vfsread.d' matched 2 probes
      cksum                                             ns
               value  ------------- Distribution ------------- count
    [...]
              262144 |
              524288 |@@@@@@@@@@
             1048576 |
             2097152 |
             4194304 |
             8388608 |@
            16777216 |
            33554432 |

    read latency distribution, 0.5ms -> 33ms (disks)
    Sunday, February 24, 13
    
slide 68:
    DTrace: Basics
    • CLI syntax:
    dtrace -n ‘provider:module:function:name /predicate/ { action }’
    probe description
    optional
    filter
    do this when
    probe “fires”
    • provider – library of related probes
    • module:function – shows where probe is located (for debug)
    • name – name of probe
    • Online reference and tutorial: http://dtrace.org/guide
    Sunday, February 24, 13
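    Putting the pieces together (a sketch reusing probes from earlier slides): provider fbt, function vfs_read, name entry, a predicate on execname, and an aggregating action:

      dtrace -n 'fbt::vfs_read:entry /execname == "sshd"/ { @["vfs_read calls"] = count(); }'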
    
slide 69:
    DTrace: Providers
    [Figure: the system diagram, shown before the DTrace provider annotations on the next slide]
    Sunday, February 24, 13
    
slide 70:
    DTrace: Providers
    [Figure: the system diagram annotated with DTrace providers – syscall and plockstat above the system call interface; pid for user-level code; java, javascript, node, perl, python, php, ruby, erlang, objc, tcl, mysql, postgres, ... for applications; fbt and profile for the kernel; tcp, udp, ip for the network stack; sched, proc, profile for the scheduler; vminfo for virtual memory; cpc for CPU and memory counters; fbt and pid are dynamic; some device stats are inferred]
    Sunday, February 24, 13
    
slide 71:
    DTrace: Linux Examples
    • Following examples use fbt – kernel dynamic tracing
    Sunday, February 24, 13
    
slide 72:
    DTrace: ext4slower.d
    • Show me:
    • ext4 reads and writes
    • slower than a specified latency (milliseconds)
    • with time, process, direction, size, latency, and file name
    # ./ext4slower.d 10
    Tracing ext4 read/write slower than 10 ms
    TIME                 PROCESS          D    KB     ms FILE
    2013 Feb 22 17:17:02 cksum            ..   ...    35 100m
    2013 Feb 22 17:17:02 cksum            ..   ...    16 1m
    2013 Feb 22 17:17:03 cksum            ..   ...    18 data1
    2013 Feb 22 17:17:03 cksum            ..   ...    23 data1
    • I wrote this to answer: is ext4 to blame for latency outliers?
    • Argument is latency you are looking for: here, 10+ ms
    Sunday, February 24, 13
    
slide 73:
    DTrace: ext4slower.d, cont.
    • Extending vfs_read() example:
    #!/usr/sbin/dtrace -s

    #pragma D option quiet
    #pragma D option defaultargs
    #pragma D option switchrate=5

    dtrace:::BEGIN
    {
        min_ns = $1 * 1000000;
        printf("Tracing ext4 read/write slower than %d ms\n", $1);
        printf("%-20s %-16s %1s %4s %6s %s\n", "TIME", "PROCESS",
            "D", "KB", "ms", "FILE");
    }

    fbt::vfs_read:entry, fbt::vfs_write:entry
    {
        this->file = (struct file *)arg0;
        this->fs = this->file->f_path.dentry->d_inode->i_sb->s_type->name;
    }
    • ... continued:
    Sunday, February 24, 13
    
slide 74:
    DTrace: ext4slower.d, cont.
    fbt::vfs_read:entry, fbt::vfs_write:entry
    /stringof(this->fs) == "ext4"/
    {
        self->start = timestamp;
        self->name = this->file->f_path.dentry->d_name.name;
    }

    fbt::vfs_read:return, fbt::vfs_write:return
    /self->start && (this->delta = timestamp - self->start) > min_ns/
    {
        this->dir = probefunc == "vfs_read" ? "R" : "W";
        printf("%-20Y %-16s %1s %4d %6d %s\n", walltimestamp,
            execname, this->dir, arg1 / 1024, this->delta / 1000000,
            stringof(self->name));
    }

    fbt::vfs_read:return, fbt::vfs_write:return
    {
        self->start = 0;
        self->name = 0;
    }
    • Immediately exonerate or blame ext4.
    ... should add more vfs_*() calls; or trace ext4 funcs directly
    Sunday, February 24, 13
    
slide 75:
    DTrace: tcpretransmit.d
    • Show me:
    • TCP retransmits
    • destination IP address
    • kernel stack (shows why)
    • in real-time
    • Don’t sniff all packets – only trace retransmits, to minimize
    overhead
    Sunday, February 24, 13
    
slide 76:
    DTrace: tcpretransmit.d, cont.
    # ./tcpretransmit.d
    Tracing TCP retransmits... Ctrl-C to end.
    2013 Feb 23 18:24:11: retransmit to 10.2.124.2, by:
    kernel`tcp_retransmit_timer+0x1bd
    kernel`tcp_write_timer+0x188
    kernel`run_timer_softirq+0x12b
    kernel`tcp_write_timer
    kernel`__do_softirq+0xb8
    kernel`read_tsc+0x9
    kernel`sched_clock+0x9
    kernel`sched_clock_local+0x25
    kernel`call_softirq+0x1c
    kernel`do_softirq+0x65
    kernel`irq_exit+0x9e
    kernel`smp_apic_timer_interrupt+0x6e
    kernel`apic_timer_interrupt+0x6e
    [...]
    ... can trace those stack functions directly for more detail
    Sunday, February 24, 13
    
slide 77:
    DTrace: tcpretransmit.d, cont.
    • Source:
    #!/usr/sbin/dtrace -s

    #pragma D option quiet

    dtrace:::BEGIN { trace("Tracing TCP retransmits... Ctrl-C to end.\n"); }

    fbt::tcp_retransmit_skb:entry {
        this->so = (struct sock *)arg0;
        this->d = (unsigned char *)&this->so->__sk_common.skc_daddr;
        printf("%Y: retransmit to %d.%d.%d.%d, by:", walltimestamp,
            this->d[0], this->d[1], this->d[2], this->d[3]);
        stack(99);
    }
    Sunday, February 24, 13
    
slide 78:
    DTrace: Current State
    • This was demoed on a prototype DTrace for Linux
    • Right now (Feb 2013) not stable – will panic/freeze
    • Needs other handholding to work around nits/bugs
    • AFAIK, both DTrace ports welcome help (that means you!)
    • Those examples were also fbt-based:
    • Will probably need tweaks to match different kernels, since
    the API is dynamically built from the kernel code
    • DTrace stable providers solve that problem – but many
    aren’t there on Linux yet
    Sunday, February 24, 13
    
slide 79:
    DTrace: Trying it out
    • All providers are available to try on illumos/SmartOS
    • illumos is the on-going fork of the OpenSolaris kernel
    • SmartOS is Joyent’s illumos-based cloud OS (distro)
    • Rough translation guide:
    kernel: linux == illumos
    distros: {ubuntu|CentOS|Fedora} == {SmartOS|OmniOS|OpenIndiana}
    • DTrace implementation mature
    • Joyent uses SmartOS as a hypervisor for running KVM Linux
    on ZFS
    Sunday, February 24, 13
    
slide 80:
    DTrace: Other Capabilities
    • Trace short lived processes
    • Profile CPU usage
    • Time any thread blocking event
    • Investigate disk I/O latency
    • Investigate network I/O latency
    • Examine cache activity
    • Investigate memory allocation: growth or leaks
    • Investigate swapping (paging) in detail
    • Follow network packets through the stack
    • Examine lock contention
    • ...
    Sunday, February 24, 13
    
slide 81:
    SystemTap
    Sunday, February 24, 13
    
slide 82:
    SystemTap
    • Created when there
    wasn’t DTrace for
    Linux ports
    • Static and dynamic
    tracing, probes,
    tapsets, scripts, ...
    • I’ve used it a lot:
    • panics/freezes
    • slow startups
    • for Linux only
    • incompatible with D
    Sunday, February 24, 13
    
slide 83:
    Tools: Advanced, recap.
    [Figure: the full operating system and hardware diagram once more]
    Given the tools to see everything, how do you use them?
    Sunday, February 24, 13
    
slide 84:
    And More ...
    • Other observability tools at all levels include:
    • ps, pmap, traceroute, ntop, ss, lsof, oprofile, gprof,
    kcachegrind, valgrind, google profiler, nfsiostat, cifsiostat,
    latencytop, powertop, LTTng, ktap, ...
    • And many experimental tools: micro-benchmarks
    • So many tools it gets confusing – where do you start?
    Sunday, February 24, 13
    
slide 85:
    Methodologies
    • Selected four:
    • Streetlight Anti-Method
    • Workload Characterization Method
    • Drill-Down Analysis Method
    • USE Method
    • Methodologies give beginners a starting point, casual users a
    checklist, and experts a reminder
    Sunday, February 24, 13
    
slide 86:
    Streetlight Anti-Method
    Sunday, February 24, 13
    
slide 87:
    Streetlight Anti-Method
    • 1. Pick observability tools that are
    • familiar
    • found on the Internet
    • found at random
    • 2. Run tools
    • 3. Look for obvious issues
    • Included for comparison (don’t use this methodology)
    Sunday, February 24, 13
    
slide 88:
    Streetlight Anti-Method, cont.
    • Named after an observational bias called the streetlight effect
    A policeman sees a drunk looking under a streetlight,
    and asks what he is looking for.
    The drunk says he has lost his keys.
    The policeman can't find them either,
    and asks if he lost them under the streetlight.
    The drunk replies:
    “No, but this is where the light is best.”
    Sunday, February 24, 13
    
slide 89:
    Streetlight Anti-Method, cont.
    top - 15:09:38 up 255 days, 16:54, 10 users, load average: 0.00, 0.03, 0.00
    Tasks: 274 total,   1 running, 273 sleeping,   0 stopped,   0 zombie
    Cpu(s): 0.7%us, 0.0%sy, 0.0%ni, 99.1%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
    Mem:   8181740k total,  7654228k used,   527512k free,   405616k buffers
    Swap:  2932728k total,   125064k used,  2807664k free,  3826244k cached

      PID USER      NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
    16876 root       0 57596  17m 1972 S    4  0.2   3:00.60 python
     3947 brendan    0 19352 1552 1060 R    0  0.0   0:00.06 top
    15841 joshw      0 67144  23m  908 S    0  0.3 218:21.70 mosh-server
    16922 joshw      0 54924  11m  920 S    0  0.1 121:34.20 mosh-server
        1 root       0 23788 1432  736 S    0  0.0   0:18.15 init
        2 root       0               S      0  0.0   0:00.61 kthreadd
        3 root       0               S      0  0.0   0:00.11 migration/0
        4 root       0               S      0  0.0  18:43.09 ksoftirqd/0
        5 root       0               S      0  0.0   0:00.00 watchdog/0
    [...]
    • Why are you still running top?
    Sunday, February 24, 13
    
slide 90:
    Streetlight Anti-Method, cont.
    • Tools-based approach
    • Inefficient:
    • can take time before the right tool is found
    • can be wasteful when investigating false positives
    • Incomplete:
    • don’t find the right tool, or,
    • the right tool doesn’t exist
    Sunday, February 24, 13
    
slide 91:
    Workload Characterization Method
    Sunday, February 24, 13
    
slide 92:
    Workload Characterization Method
    • 1. Who
    • 2. Why
    • 3. What
    • 4. How
    Sunday, February 24, 13
    
slide 93:
    Workload Characterization Method
    • 1. Who is causing the load? PID, UID, IP addr, ...
    • 2. Why is the load called? code path
    • 3. What is the load? IOPS, tput, direction, type
    • 4. How is the load changing over time?
    Sunday, February 24, 13
    
slide 94:
    Workload Characterization Method, cont.
    • Identifies issues of load
    • Best performance wins are from eliminating unnecessary work
    • Don’t assume you know what the workload is – characterize
    • Many of the previous analysis tools included workload statistics
    Sunday, February 24, 13
    
slide 95:
    Workload Characterization Method, cont.
    • Pros:
    • Potentially largest wins
    • Cons:
    • Only solves a class of issues – load
    • Time consuming, and can be discouraging – most
    attributes examined will not be a problem
    Sunday, February 24, 13
    
slide 96:
    Drill-Down Analysis Method
    Sunday, February 24, 13
    
slide 97:
    Drill-Down Analysis Method
    • 1. Start at highest level
    • 2. Examine next-level details
    • 3. Pick most interesting breakdown
    • 4. If problem unsolved, go to 2
    Sunday, February 24, 13
    
slide 98:
    Drill-Down Analysis Method, cont.: Example
    • For example, ext4 – identify latency origin top-down:
    [Figure: drill-down analysis descending the stack: Applications → System Libraries (user-level), then System Call Interface → VFS → ext4 → Block Device Interface → Device Drivers (kernel-level)]
    Dynamic Tracing / DTrace is well suited for this,
    as it can dig through all layers with custom detail
    Sunday, February 24, 13
    
slide 99:
    Drill-Down Analysis: ext4
    • eg, ext4_readpages() latency distribution (microseconds):
    # dtrace -n 'fbt::ext4_readpages:entry { self->ts = timestamp; }
        fbt::ext4_readpages:return /self->ts/ {
        @["us"] = lquantize((timestamp - self->ts) / 1000, 0, 10000, 250);
        self->ts = 0; }'
    dtrace: description 'fbt::ext4_readpages:entry ' matched 2 probes
    [...]
               value  ------------- Distribution ------------- count
    Sunday, February 24, 13
    
slide 100:
    Drill-Down Analysis: ext4
    • ... can dig out more details as needed: file name, code path:
    # dtrace -n 'fbt::ext4_readpages:entry {
        this->file = (struct file *)arg0;
        this->name = this->file->f_path.dentry->d_name.name;
        @[stringof(this->name), stack()] = count(); }'
    dtrace: description 'fbt::ext4_readpages:entry ' matched 1 probe
    ^C[...]
      foo8
                  kernel`__do_page_cache_readahead+0x1c7
                  kernel`ra_submit+0x21
                  kernel`ondemand_readahead+0x115
                  kernel`page_cache_async_readahead+0x80
                  kernel`radix_tree_lookup_slot+0xe
                  kernel`find_get_page+0x1e
                  kernel`generic_file_aio_read+0x48b
                  kernel`vma_merge+0x121
                  kernel`do_sync_read+0xd2
                  kernel`__switch_to+0x132
                  kernel`security_file_permission+0x93
                  kernel`rw_verify_area+0x61
                  kernel`vfs_read+0xb0
                  kernel`sys_read+0x4a
                  kernel`system_call_fastpath+0x16
                  ...   <- # of occurrences
    Sunday, February 24, 13
    
slide 101:
    Drill-Down Analysis Method, cont.
    • Moves from higher- to lower-level details based on findings:
    environment-wide down to metal
    • Pros:
    • Will identify root cause(s)
    • Cons:
    • Time consuming – especially when drilling in the wrong
    direction
    Sunday, February 24, 13
    
slide 102:
    USE Method
    Sunday, February 24, 13
    
slide 103:
    USE Method
    • For every resource, check:
    • 1. Utilization
    • 2. Saturation
    • 3. Errors
    Sunday, February 24, 13
    
slide 104:
    USE Method, cont.
    • For every resource, check:
    • 1. Utilization: time resource was busy, or degree used
    • 2. Saturation: degree of queued extra work
    • 3. Errors: any errors
    [Figure: a resource annotated with Utilization, Saturation, and Errors]
    Sunday, February 24, 13
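    As a sketch of what a single-resource USE check can look like (the interface name is a placeholder; sar output layouts vary):

      #!/bin/bash
      # Mini USE check for one network interface.
      iface=eth0
      # Utilization: compare rxkB/s and txkB/s against the interface line rate:
      sar -n DEV 1 1 | grep -w "$iface"
      # Errors, and drops (a saturation clue):
      grep . /sys/class/net/"$iface"/statistics/{rx_errors,tx_errors,rx_dropped,tx_dropped}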
    
slide 105:
    USE Method, cont.
    • Hardware Resources:
    • CPUs
    • Main Memory
    • Network Interfaces
    • Storage Devices
    • Controllers
    • Interconnects
    • Find the functional diagram and examine every item in the
    data path...
    Sunday, February 24, 13
    
slide 106:
    USE Method, cont.: Functional Diagram
    [Figure: hardware functional diagram – CPUs and CPU interconnect, memory buses and DRAM, I/O bus, I/O bridge, expander interconnect, I/O controller, disks, network controller, interface transports, ports. For each, check: 1. Utilization, 2. Saturation, 3. Errors]
    Sunday, February 24, 13
    
slide 107:
    USE Method, cont.: Example Linux Checklist
    http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist
    Resource   Type          Metric
    CPU        Utilization   per-cpu: mpstat -P ALL 1, “%idle”; sar -P ALL, “%idle”; system-wide:
                             vmstat 1, “id”; sar -u, “%idle”; dstat -c, “idl”; per-process: top, “%CPU”;
                             htop, “CPU%”; ps -o pcpu; pidstat 1, “%CPU”; per-kernel-thread: top/htop
                             (“K” to toggle), where VIRT == 0 (heuristic). [1]
    CPU        Saturation    system-wide: vmstat 1, “r” > CPU count [2]; sar -q, “runq-sz” > CPU count;
                             dstat -p, “run” > CPU count; per-process: /proc/PID/schedstat 2nd field
                             (sched_info.run_delay); perf sched latency (shows “Average” and “Maximum”
                             delay per-schedule); dynamic tracing, eg, SystemTap schedtimes.stp
                             “queued(us)” [3]
    CPU        Errors        perf (LPE) if processor specific error events (CPC) are available; eg,
                             AMD64′s “04Ah Single-bit ECC Errors Recorded by Scrubber” [4]

    ... etc for all combinations (would fill a dozen slides)
    Sunday, February 24, 13
    Sunday, February 24, 13
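    The CPU saturation row above can be scripted directly; a minimal sketch (assumes default vmstat output, skipping the headers and the since-boot sample):

      #!/bin/bash
      # Flag samples where the run queue ("r") exceeds the CPU count.
      cpus=$(nproc)
      vmstat 1 5 | awk -v c="$cpus" 'NR > 3 && $1 > c { print "saturation clue: r =", $1, "> CPUs =", c }'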
    
slide 108:
    USE Method, cont.
    • Some software resources can also be studied:
    • Mutex Locks
    • Thread Pools
    • Process/Thread Capacity
    • File Descriptor Capacity
    • Consider possible USE metrics for each
    Sunday, February 24, 13
    
slide 109:
    USE Method, cont.
    • This process may reveal missing metrics – those not provided
    by your current toolset
    • They are your known unknowns
    • Much better than unknown unknowns
    • More tools can be installed and developed to help
    • So many top(1)s, but where is the interconnect-top?
    • Full USE Method checklist may, practically, only be used for
    critical issues
    Sunday, February 24, 13
    
slide 110:
    USE Method, cont.
    • Resource-based approach
    • Quick system health check, early in an investigation
    • Pros:
    • Complete: all resource bottlenecks and errors
    • Not limited in scope by your current toolset
    • No unknown unknowns – at least known unknowns
    • Efficient: picks three metrics for each resource –
    from what may be dozens available
    • Cons:
    • Limited to a class of issues
    Sunday, February 24, 13
    
slide 111:
    Other Methodologies
    • Include:
    • Blame-Someone-Else Anti-Method
    • Tools Method
    • Ad-Hoc Checklist Method
    • Problem Statement Method
    • Scientific Method
    • Latency Analysis
    • Stack Profile Method
    • http://dtrace.org/blogs/brendan/2012/12/13/usenix-lisa-2012-performance-analysis-methodology/
    Sunday, February 24, 13
    
slide 112:
    Challenges
    • Performance counter analysis (eg, bus or interconnect port
    analysis) is time consuming – would like tools for convenience
    • How about a “bustop” subcommand for perf?
    • DTrace for Linux ports still in progress – will be awesome
    when complete
    Sunday, February 24, 13
    
slide 113:
    Cloud Computing
    • Performance may be limited by cloud resource controls, rather
    than physical limits
    • Hardware Virtualization complicates things – as a guest you
    can’t analyze down to metal directly
    • Hopefully the cloud provider provides an API for accessing
    physical statistics, or does the analysis on your behalf
    • We do analysis at Joyent (and our hypervisors have DTrace!)
    • Free trial for new customers: good for $125 of usage value (~
    one Small 1GB SmartMachine for 60 days). All prices subject
    to change. Limited time only. Sign up at joyent.com
    Sunday, February 24, 13
    
slide 114:
    References
    • Linux man pages, source, /Documentation
    • USE Method: http://queue.acm.org/detail.cfm?id=2413037
    • http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist/
    • http://dtrace.org/blogs/brendan/2012/12/13/usenix-lisa-2012-performance-analysis-methodology/
    • https://github.com/dtrace4linux, http://www.dtracebook.com,
    http://illumos.org, http://smartos.org
    • Upcoming: “Systems Performance” (Prentice Hall)
    Sunday, February 24, 13
    
slide 115:
    Thank you!
    • email: brendan@joyent.com
    • twitter: @brendangregg
    • blog: http://dtrace.org/blogs/brendan
    • blog resources:
    • http://dtrace.org/blogs/brendan/tag/linux-2/
    • http://dtrace.org/blogs/brendan/2012/02/29/the-use-method/
    • http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist/
    • http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/
    • http://dtrace.org/blogs/brendan/2012/03/17/linux-kernel-performance-flame-graphs/
    • http://dtrace.org/blogs/brendan/2011/10/15/using-systemtap/
    Sunday, February 24, 13