Velocity 2015: linux perf tools

Video: https://www.youtube.com/watch?v=FJW8nGV4jxY and https://www.youtube.com/watch?v=zrr2nUln9Kk

Tutorial slides for O'Reilly Velocity Santa Clara 2015, by Brendan Gregg.

Description: "There are many performance tools nowadays for Linux, but how do they all fit together, and when do we use them? This tutorial explains methodologies for using these tools, and provides a tour of four tool types: observability, benchmarking, tuning, and static tuning. Many tools will be discussed, including top, iostat, tcpdump, sar, perf_events, ftrace, SystemTap, sysdig, and others, as well as observability frameworks in the Linux kernel: PMCs, tracepoints, kprobes, and uprobes.

This tutorial updates and extends an earlier talk that summarized the Linux performance tool landscape. The value of this tutorial is not just learning that these tools exist and what they do, but hearing when and how they are used by a performance engineer to solve real world problems — important context that is typically not included in the standard documentation."


PDF: Velocity2015_LinuxPerfTools.pdf

Keywords (from pdftotext):

slide 1:
    Linux Performance Tools

    Brendan Gregg
    Senior Performance Architect
    Performance Engineering Team
    bgregg@netflix.com
    @brendangregg
    
slide 2:
    This Tutorial

    • A tour of many Linux performance tools
    – To show you what can be done
    – With guidance for how to do it
    • This includes objectives, discussion, live demos
    – See the video of this tutorial
    Observability
    Benchmarking
    Tuning
    Static Tuning
slide 3:
    • Massive AWS EC2 Linux cloud
    – 10s of thousands of cloud instances
    • FreeBSD for content delivery
    – ~33% of US Internet traffic at night
    • Over 50M subscribers
    – Recently launched in ANZ
    • Use Linux server tools as needed
    – After cloud monitoring (Atlas, etc.) and
    instance monitoring (Vector) tools
    
slide 4:
    Agenda	
      
    • Methodologies
    • Tools
    • Tool Types:
    – Observability
    – Benchmarking
    – Tuning
    – Static
    • Profiling
    • Tracing
    
slide 5:
    Methodologies	
      
    
slide 6:
    Methodologies	
      
    • Objectives:
    – Recognize the Streetlight Anti-Method
    – Perform the Workload Characterization Method
    – Perform the USE Method
    – Learn how to start with the questions, before using tools
    – Be aware of other methodologies
    
slide 7:
    My system is slow…

    DEMO
    DISCUSSION
    
slide 8:
    Methodologies	
      
    • There are dozens of performance tools for Linux
    – Packages: sysstat, procps, coreutils, …
    – Commercial products
    • Methodologies can provide guidance for choosing and
    using tools effectively
    • A starting point, a process, and an ending point
    
slide 9:
    Anti-Methodologies
      
    • The lack of a deliberate methodology…
    
slide 10:
    Streetlight Anti-Method
slide 11:
    Drunk Man Anti-Method
slide 12:
    Blame Someone Else Anti-Method
slide 13:
    Actual Methodologies

    Problem Statement Method
    Workload Characterization Method
    USE Method
    Off-CPU Analysis
    CPU Profile Method
    RTFM Method
    Active Benchmarking (covered later)
    Static Performance Tuning (covered later)
    
slide 14:
    Problem Statement Method

    1. What makes you think there is a performance problem?
    2. Has this system ever performed well?
    3. What has changed recently? (Software? Hardware?
    Load?)
    4. Can the performance degradation be expressed in
    terms of latency or run time?
    5. Does the problem affect other people or applications (or
    is it just you)?
    6. What is the environment? Software, hardware,
    instance types? Versions? Configuration?
    
slide 15:
    Workload Characterization Method
slide 16:
    The USE Method

    • For every resource, check:
    Utilization
    Saturation
    Errors
    [diagram: a resource, annotated with its utilization, saturation, and errors]
slide 17:
    USE Method for Hardware

    • For every resource, check:
    Utilization
    Saturation
    Errors
    • Including busses & interconnects
    
slide 18:
    Linux USE Method Example

    http://www.brendangregg.com/USEmethod/use-linux.html
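A minimal sketch of how that checklist maps to standard commands (the tool choices here are illustrative; the use-linux.html page above gives the full mapping):

    # CPU: utilization and saturation (run queue "r" vs CPU count)
    $ mpstat -P ALL 1
    $ vmstat 1
    # Memory: capacity utilization and swapping (si/so columns)
    $ free -m
    $ vmstat 1
    # Disks: utilization (%util) and saturation (avgqu-sz, await)
    $ iostat -xz 1
    # Network: per-interface throughput vs line rate
    $ sar -n DEV 1
    # Errors (examples)
    $ netstat -s | grep -i error
    $ dmesg | tail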
      
    
slide 19:
    Off-CPU Analysis

    On-CPU Profiling
    Off-CPU Profiling
    (everything else)
    Thread State Transitions
slide 20:
    CPU Profile Method

    1. Take a CPU profile
    2. Understand all software in profile > 1%
    • Discovers a wide range of performance issues by their
    CPU usage
    • Narrows software to study
    [figure: Flame Graph]
    
slide 21:
    RTFM Method

    • How to understand performance tools or metrics:
    Man pages
    Books
    Web search
    Co-workers
    Prior talk slides/video (this one!)
    Support services
    Source code
    Experimentation
    Social
    
slide 22:
    Tools	
      
    
slide 23:
    Tools	
      
    • Objectives:
    – Perform the USE Method for resource utilization
    – Perform Workload Characterization for disks, network
    – Perform the CPU Profile Method using flame graphs
    – Have exposure to various observability tools:
    • Basic: vmstat, iostat, mpstat, ps, top, …
    • Intermediate: tcpdump, netstat, nicstat, pidstat, sar, …
    • Advanced: ss, slabtop, perf_events, …
    – Perform Active Benchmarking
    – Understand tuning risks
    – Perform Static Performance Tuning
    
slide 24:
    Command Line Tools

    • Useful to study even if you never use them: GUIs and
    commercial products often use the same interfaces
    [diagram: tools reading from the kernel via /proc, /sys, …]
    $ vmstat 1
    procs -----------memory---------- ---swap-- …
     r  b   swpd     free    buff   cache    so …
     9  0      0 29549320  29252 9299060       …
     2  0      0 29547876  29252 9299332       …
     4  0      0 29548124  29252 9299460       …
     5  0      0 29548840  29252 9299592       …
    
slide 25:
    Tool Types

    Type           Characteristic
    Observability  Watch activity. Safe, usually, depending on
                   resource overhead.
    Benchmarking   Load test. Caution: production tests can
                   cause issues due to contention.
    Tuning         Change. Danger: changes could hurt
                   performance, now or later with load.
    Static         Check configuration. Should be safe.
    
slide 26:
    Observability Tools
    
slide 27:
    How do you measure these?
    
slide 28:
    Observability Tools: Basic

    uptime
    top (or htop)
    vmstat
    iostat
    mpstat
    free
    
slide 29:
    uptime

    • Load average > # of CPUs, may mean CPU saturation
    – Don’t spend more than 5 seconds studying these
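A quick sketch of that check (the uptime line is the one from the top example on the next slide; nproc is an assumed way to get the CPU count):

    $ uptime
     18:50:26 up 7:43,  1 user,  load average: 4.11, 4.91, 5.22
    $ nproc
    # compare the 1-minute load average (4.11 here) against the CPU count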
    
slide 30:
    top (or htop)

    • System and per-process interval summary:
    $ top - 18:50:26 up 7:43, 1 user, load average: 4.11, 4.91, 5.22!
    Tasks: 209 total,
    1 running, 206 sleeping,
    0 stopped,
    2 zombie!
    Cpu(s): 47.1%us, 4.0%sy, 0.0%ni, 48.4%id, 0.0%wa, 0.0%hi, 0.3%si, 0.2%st!
    Mem: 70197156k total, 44831072k used, 25366084k free,
    36360k buffers!
    Swap:
    0k total,
    0k used,
    0k free, 11873356k cached!
    PID USER
    PR NI VIRT RES SHR S %CPU %MEM
    TIME+ COMMAND
    5738 apiprod
    0 62.6g 29g 352m S 417 44.2
    2144:15 java
    1386 apiprod
    0 17452 1388 964 R
    0 0.0
    0:00.02 top
    1 root
    0 24340 2272 1340 S
    0 0.0
    0:01.51 init
    2 root
    0 S
    0 0.0
    0:00.00 kthreadd
    […]!
    • %CPU is summed across all CPUs
    • Can miss short-lived processes (atop won’t)
    • Can consume noticeable CPU to read /proc
    
slide 31:
    htop	
      
    
slide 32:
    ps	
      
    • Process status listing (eg, “ASCII art forest”):
    $ ps -ef f!
    UID
    PID PPID
    […]!
    root
    root
    28261 4546
    prod
    28287 28261
    prod
    28288 28287
    prod
    3156 28288
    root
    root
    4969 4965
    […]!
    C STIME TTY
    STAT
    TIME CMD!
    0 11:08 ?
    0 17:24 ?
    0 17:24 ?
    0 17:24 pts/0
    0 19:15 pts/0
    0 11:08 ?
    0 11:08 ?
    0:00 /usr/sbin/sshd -D!
    0:00 \_ sshd: prod [priv]!
    0:00
    \_ sshd: prod@pts/0 !
    0:00
    \_ -bash!
    0:00
    \_ ps -ef f!
    0:00 /bin/sh /usr/bin/svscanboot!
    0:00 \_ svscan /etc/service!
    • Custom fields:
    $ ps -eo user,sz,rss,minflt,majflt,pcpu,args!
    USER
    RSS MINFLT MAJFLT %CPU COMMAND!
    root
    6085 2272 11928
    24 0.0 /sbin/init!
    […]!
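A further (assumed, not from the slide) use of custom fields: sorting to find the largest processes by resident memory:

    $ ps -eo pid,user,rss,vsz,pcpu,args --sort=-rss | head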
    
slide 33:
    vmstat	
      
    • Virtual memory statistics and more:
    $ vmstat –Sm 1!
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----!
    r b
    swpd
    free
    buff cache
    cs us sy id wa!
    8 0
    12 25 34 0 0!
    7 0
    0 205 186 46 13 0 0!
    8 0
    8 210 435 39 21 0 0!
    8 0
    0 218 219 42 17 0 0!
    […]!
    • USAGE: vmstat [interval [count]]
    • First output line has some summary since boot values
    – Should be all; partial is confusing
    • High level CPU summary
    – “r” is runnable tasks
    
slide 34:
    iostat	
      
    • Block I/O (disk) stats. 1st output is since boot.
    $ iostat -xmdz 1!
    Linux 3.13.0-29 (db001-eb883efa)
    08/18/2014
    _x86_64_
    Device:
    rrqm/s
    wrqm/s
    r/s
    w/s
    rMB/s
    xvda
    xvdb
    0.00 15299.00
    xvdc
    0.00 15271.00
    md0
    0.00 31082.00
    (16 CPU)!
    wMB/s \ ...!
    0.00 / ...!
    0.00 \ ...!
    0.01 / ...!
    0.01 \ ...!
    Workload	
      
    • Very useful
    set of stats
    ... \ avgqu-sz
    ... /
    ... \
    ... /
    ... \
    await r_await w_await
    Resulting Performance
slide 35:
    mpstat	
      
    • Multi-processor statistics, per-CPU:
    $ mpstat –P ALL 1!
    […]!
    08:06:43 PM CPU
    %usr
    08:06:44 PM all 53.45
    08:06:44 PM
    0 49.49
    08:06:44 PM
    1 51.61
    08:06:44 PM
    2 58.16
    08:06:44 PM
    3 54.55
    08:06:44 PM
    4 47.42
    08:06:44 PM
    5 65.66
    08:06:44 PM
    6 50.00
    […]!
    %nice
    %sys %iowait
    %irq
    %soft %steal %guest
    • Look for unbalanced workloads, hot CPUs.
    %idle!
    42.26!
    45.45!
    41.94!
    33.67!
    40.40!
    49.48!
    31.31!
    47.92!
    
slide 36:
    free	
      
    • Main memory usage:
    $ free -m!
    total
    Mem:
    -/+ buffers/cache:
    Swap:
    used
    free
    3313!
    • buffers: block device I/O cache
    • cached: virtual page cache
    shared
    buffers
    cached!
    527!
    
slide 37:
    Latency is now much higher…

    DEMO
    DISCUSSION
    
slide 38:
    Observability Tools: Basic
    
slide 39:
    Observability Tools: Intermediate

    strace
    tcpdump
    netstat
    nicstat
    pidstat
    swapon
    lsof
    sar (and collectl, dstat, etc.)
    
slide 40:
    strace	
      
    • System call tracer:
    $ strace –tttT –p 313!
    1408393285.779746 getgroups(0, NULL)   = 1 <…>
    1408393285.779873 getgroups(1, [0])    = 1 <…>
    1408393285.780797 close(3)             = 0 <…>
    1408393285.781338 write(1, "LinuxCon 2014!\n", 15LinuxCon 2014!
    ) = 15 <…>
    • Eg, -ttt: time (us) since epoch; -T: syscall time (s)
    • Translates syscall args
    – Very helpful for solving system usage issues
    • Currently has massive overhead (ptrace based)
    – Can slow the target by > 100x. Use extreme caution.
    
slide 41:
    tcpdump	
      
    • Sniff network packets for post analysis:
    $ tcpdump -i eth0 -w /tmp/out.tcpdump!
    tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes!
    ^C7985 packets captured!
    8996 packets received by filter!
    1010 packets dropped by kernel!
    # tcpdump -nr /tmp/out.tcpdump | head
    reading from file /tmp/out.tcpdump, link-type EN10MB (Ethernet)
    20:41:05.038437 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 18...!
    20:41:05.038533 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 48...!
    20:41:05.038584 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 96...!
    […]!
    • Study packet sequences with timestamps (us)
    • CPU overhead optimized (socket ring buffers), but can
    still be significant. Use caution.
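A hedged example of reducing that overhead by capturing only the traffic of interest with a BPF filter expression (the port here is illustrative):

    $ tcpdump -i eth0 -w /tmp/out.tcpdump 'tcp port 5433'
    $ tcpdump -nr /tmp/out.tcpdump | head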
    
slide 42:
    netstat	
      
    • Various network protocol statistics using -s:
    • A multi-tool:
    $ netstat –s!
    -i: interface stats
    -r: route table
    default: list conns
    • netstat -p: shows
    process details!
    • Per-second interval
    with -c
    […]!
    Tcp:!
    736455 active connections openings!
    176887 passive connection openings!
    33 failed connection attempts!
    1466 connection resets received!
    3311 connections established!
    91975192 segments received!
    180415763 segments send out!
    223685 segments retransmited!
    2 bad segments received.!
    39481 resets sent!
    […]!
    TcpExt:!
    12377 invalid SYN cookies received!
    2982 delayed acks sent!
    […]!
    
slide 43:
    nicstat	
      
    • Network interface stats, iostat-like output:
    $ ./nicstat 1!
    Time
    Int
    21:21:43
    21:21:43
    eth0
    Time
    Int
    21:21:44
    21:21:44
    eth0
    Time
    Int
    21:21:45
    21:21:45
    eth0
    […]!
    rKB/s
    rKB/s
    rKB/s
    wKB/s
    wKB/s
    wKB/s
    rPk/s
    rPk/s
    rPk/s
    wPk/s
    wPk/s
    wPk/s
    rAvs
    wAvs %Util
    4915.4 4915.4 0.00
    139.8 0.00
    rAvs
    wAvs %Util
    0.00 0.00
    58.76 40441.3 0.00
    rAvs
    wAvs %Util
    4400.8 4400.8 0.00
    54.99 2979.1 0.00
    • Check network throughput and interface %util
    • I wrote this years ago; Tim Cook ported to Linux
    Sat!
    0.00!
    0.00!
    Sat!
    0.00!
    0.00!
    Sat!
    0.00!
    0.00!
    
slide 44:
    pidstat	
      
    • Very useful process stats. eg, by-thread, disk I/O:
    $ pidstat -t 1!
    Linux 3.2.0-54 (db002-91befe03) !08/18/2014
    !_x86_64_ !(8 CPU)!
    08:57:52 PM
    TGID
    TID
    %usr %system %guest
    %CPU
    08:57:54 PM
    - 484.75
    0.00 524.58
    08:57:54 PM
    08:57:54 PM
    08:57:54 PM
    08:57:54 PM
    […]!
    $ pidstat -d 1!
    […]!
    08:58:27 PM
    PID
    kB_rd/s
    kB_wr/s kB_ccwr/s Command!
    08:58:28 PM
    0.00 java!
    […]!
    • I usually prefer this over top(1)
    CPU
    Command!
    java!
    |__java!
    |__java!
    |__java!
    |__java!
    
slide 45:
    swapon	
      
    • Show swap device usage:
    $ swapon -s!
    Filename
    /dev/sda3
    Type
    partition
    Size
    • If you have swap enabled…
    Used
    Priority!
    -1!
    
slide 46:
    lsof	
      
    • More a debug tool, lsof(8) shows file descriptor usage,
    which for some apps, equals current active network
    connections:
    # lsof -iTCP -sTCP:ESTABLISHED!
    COMMAND
    PID USER FD TYPE DEVICE SIZE/OFF NODE NAME!
    sshd
    755 root 3r IPv4 13576887
    0t0 TCP bgregg-test-i-f106:ssh->prod100.netflix.com:
    15241 (ESTABLISHED)!
    platforms 2614 app1 8u IPv4
    0t0 TCP localhost:33868->localhost:5433 (ESTABLISHED)!
    postgres 2648 app1 7u IPv4
    0t0 TCP localhost:5433->localhost:33868 (ESTABLISHED)!
    epic_plug 2857 app1 7u IPv4
    0t0 TCP localhost:33885->localhost:5433 (ESTABLISHED)!
    postgres 2892 app1 7u IPv4
    0t0 TCP localhost:5433->localhost:33885 (ESTABLISHED)!
    […]!
    • I’d prefer to: echo /proc/PID/fd | wc -l!
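A sketch of that /proc-based count (PID stands in for the target process; this counts fd entries directly instead of parsing lsof output):

    $ ls /proc/PID/fd | wc -l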
    
slide 47:
    sar	
      
    • System Activity Reporter. Many stats, eg:
    $ sar -n TCP,ETCP,DEV 1!
    Linux 3.2.55 (test-e4f1a80b)
    !08/18/2014
    !_x86_64_ !(8 CPU)!
    09:10:43 PM IFACE rxpck/s txpck/s
    rxkB/s
    txkB/s rxcmp/s txcmp/s
    09:10:44 PM
    09:10:44 PM
    eth0 4114.00 4186.00 4537.46 28513.24
    09:10:43 PM active/s passive/s
    iseg/s
    oseg/s!
    09:10:44 PM
    4107.00 22511.00!
    09:10:43 PM atmptf/s estres/s retrans/s isegerr/s
    orsts/s!
    09:10:44 PM
    1.00!
    […]!
    • Archive or live mode: (interval [count])
    • Well designed. Header naming convention, logical
    groups: TCP, ETCP, DEV, EDEV, …
    rxmcst/s!
    0.00!
    0.00!
    
slide 48:
    Observability: sar
    
slide 49:
    Other Tools

    • You may also use collectl, atop, dstat, or another
    measure-all tool
    • The tool isn’t important – it’s important to have a way to
    measure everything
    • In cloud environments, you are probably using a
    monitoring product, developed in-house or commercial.
    – We develop Atlas for cloud-wide monitoring, and Vector for
    instance-level analysis (both NetflixOSS)
    – Same method applies…
    
slide 50:
    How does your monitoring tool measure these?
    
slide 51:
    App is taking forever…

    DEMO
    DISCUSSION
    
slide 52:
    Observability Tools: Intermediate
    
slide 53:
    Advanced Observability Tools

    • Misc:
    – ltrace, ss, iptraf, ethtool, snmpget, lldptool, iotop, blktrace,
    slabtop, /proc, pcstat
    • CPU Performance Counters:
    – perf_events, tiptop, rdmsr
    • Advanced Tracers:
    – perf_events, ftrace, eBPF, SystemTap, ktap, LTTng,
    dtrace4linux, sysdig
    • Some selected demos…
    
slide 54:
    ss	
      
    • More socket statistics:
    $ ss -mop!
    State
    Recv-Q Send-Q
    Local Address:Port
    Peer Address:Port
    CLOSE-WAIT 1
    127.0.0.1:42295
    127.0.0.1:28527
    users:(("apacheLogParser",2702,3))!
    ! mem:(r1280,w0,f2816,t0)!
    ESTAB
    127.0.0.1:5433
    127.0.0.1:41312
    timer:(keepalive,36min,0) users:(("postgres",2333,7))!
    ! mem:(r0,w0,f0,t0)!
    […]!
    $ ss –i!
    State
    Recv-Q Send-Q
    Local Address:Port
    Peer Address:Port
    CLOSE-WAIT 1
    127.0.0.1:42295
    127.0.0.1:28527
    cubic wscale:6,6 rto:208 rtt:9/6 ato:40 cwnd:10 send 145.6Mbps rcv_space:32792!
    ESTAB
    10.144.107.101:ssh
    10.53.237.72:4532
    cubic wscale:4,6 rto:268 rtt:71.5/3 ato:40 cwnd:10 send 1.5Mbps rcv_rtt:72
    rcv_space:14480!
    […]!
    
slide 55:
    iptraf	
      
    
slide 56:
    iotop	
      
    • Block device I/O (disk) by process:
    $ iotop!
    Total DISK READ:
    TID PRIO USER
    959 be/4 root
    6641 be/4 root
    1 be/4 root
    2 be/4 root
    3 be/4 root
    4 be/4 root
    5 be/4 root
    6 rt/4 root
    […]!
    50.47 M/s | Total DISK WRITE:
    59.21 M/s!
    DISK READ DISK WRITE SWAPIN
    IO>
    COMMAND
    0.00 B/s
    0.00 B/s 0.00 % 99.99 % [flush-202:1]!
    50.47 M/s
    82.60 M/s 0.00 % 32.51 % java –Dnop –X!
    0.00 B/s
    0.00 B/s 0.00 % 0.00 % init!
    0.00 B/s
    0.00 B/s 0.00 % 0.00 % [kthreadd]!
    0.00 B/s
    0.00 B/s 0.00 % 0.00 % [ksoftirqd/0]!
    0.00 B/s
    0.00 B/s 0.00 % 0.00 % [kworker/0:0]!
    0.00 B/s
    0.00 B/s 0.00 % 0.00 % [kworker/u:0]!
    0.00 B/s
    0.00 B/s 0.00 % 0.00 % [migration/0]!
    • Needs kernel support enabled
    – CONFIG_TASK_IO_ACCOUNTING
    
slide 57:
    slabtop	
      
    • Kernel slab allocator memory usage:
    $ slabtop!
    Active / Total Objects (% used)
    : 4692768 / 4751161 (98.8%)!
    Active / Total Slabs (% used)
    : 129083 / 129083 (100.0%)!
    Active / Total Caches (% used)
    : 71 / 109 (65.1%)!
    Active / Total Size (% used)
    : 729966.22K / 738277.47K (98.9%)!
    Minimum / Average / Maximum Object : 0.01K / 0.16K / 8.00K!
    OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
    3565575 3565575 100%
    0.10K 91425
    365700K buffer_head!
    314916 314066 99%
    0.19K 14996
    59984K dentry!
    184192 183751 99%
    0.06K
    11512K kmalloc-64!
    138618 138618 100%
    0.94K
    130464K xfs_inode!
    138602 138602 100%
    0.21K
    29968K xfs_ili!
    102116 99012 96%
    0.55K
    58352K radix_tree_node!
    97482 49093 50%
    0.09K
    9284K kmalloc-96!
    22695 20777 91%
    0.05K
    1068K shared_policy_node!
    21312 21312 100%
    0.86K
    18432K ext4_inode_cache!
    16288 14601 89%
    0.25K
    4072K kmalloc-256!
    […]!
    
slide 58:
    pcstat	
      
    • Show page cache residency by file:
    # ./pcstat data0*!
    |----------+----------------+------------+-----------+---------|!
    | Name
    | Size
    | Pages
    | Cached
    | Percent |!
    |----------+----------------+------------+-----------+---------|!
    | data00
    | 104857600
    | 25600
    | 25600
    | 100.000 |!
    | data01
    | 104857600
    | 25600
    | 25600
    | 100.000 |!
    | data02
    | 104857600
    | 25600
    | 4080
    | 015.938 |!
    | data03
    | 104857600
    | 25600
    | 25600
    | 100.000 |!
    | data04
    | 104857600
    | 25600
    | 16010
    | 062.539 |!
    | data05
    | 104857600
    | 25600
    | 0
    | 000.000 |!
    |----------+----------------+------------+-----------+---------|!
    • Uses the mincore(2) syscall. Useful for database
    performance analysis.
    
slide 59:
    perf_events	
      
    • Provides the "perf" command
    • In Linux source code: tools/perf
    – Usually pkg added by linux-tools-common, etc.
    • Multi-tool with many capabilities
    – CPU profiling
    – PMC profiling
    – Static & dynamic tracing
    • Covered later in Profiling & Tracing
    
slide 60:
slide 61:
    rdmsr	
      
    • Model Specific Registers (MSRs), unlike PMCs, can be
    read by default in Xen guests
    – Timestamp clock, temp, power, …
    – Use rdmsr(1) from the msr-tools package to read them
    – From https://github.com/brendangregg/msr-cloud-tools:
    ec2-guest# ./showboost!
    [...]!
    TIME
    C0_MCYC
    C0_ACYC
    UTIL RATIO
    06:11:35
    51%
    116%
    06:11:40
    50%
    115%
    06:11:45
    49%
    115%
    [...]!
    ec2-guest# ./cputemp 1!
    CPU1 CPU2 CPU3 CPU4!
    61 61 60 59!
    CPU	
      Temperature	
      
    60 61 60 60!
    [...]!
    MHz!
    2900!
    2899!
    2899!
    Real	
      CPU	
      MHz	
      
    
slide 62:
    More Advanced Tools…

    • Some others worth mentioning:
    Tool       Description
    ltrace     Library call tracer
    ethtool    Mostly interface tuning; some stats
    snmpget    SNMP network host statistics
slide 63:
    Advanced Tracers

    • Many options on Linux:
    – perf_events, ftrace, eBPF, SystemTap, ktap, LTTng,
    dtrace4linux, sysdig
    • Most can do static and dynamic tracing
    – Static: pre-defined events (tracepoints)
    – Dynamic: instrument any software (kprobes, uprobes).
    Custom metrics on-demand. Catch all.
    • Many are in-development
    
slide 64:
    Linux Observability Tools
    
slide 65:
    Linux Observability Tools
    
slide 66:
    Benchmarking Tools
    
slide 67:
    Benchmarking Tools

    • Multi:
    – UnixBench, lmbench, sysbench, perf bench
    • FS/disk:
    – dd, hdparm, fio
    • App/lib:
    – ab, wrk, jmeter, openssl
    • Networking:
    – ping, hping3, iperf, ttcp, traceroute, mtr, pchar
    
slide 68:
    Benchmarking	
      
    • ~100% of benchmarks are wrong
    • Results are usually misleading:
    you benchmark A, but actually measure B, and conclude
    you measured C
    • Common mistakes:
    – Testing the wrong target: eg, FS cache instead of disk
    – Choosing the wrong target: eg, disk instead of FS cache
    … doesn’t resemble real world usage
    – Invalid results: eg, bugs
    • The energy needed to refute benchmarks is multiple
    orders of magnitude bigger than to run them
    
slide 69:
    Active Benchmarking
slide 70:
    lmbench	
      
    • CPU, memory, and kernel micro-benchmarks
    • Eg, memory latency by stride size:
    $ lat_mem_rd 100m 128 > out.latencies
    some R processing…
    [plot: memory access latency vs size, showing L1 cache, L2 cache, L3 cache, and main memory levels]
    
slide 71:
    fio	
      
    • FS or disk I/O micro-benchmarks
    $ fio --name=seqwrite --rw=write --bs=128k --size=122374m!
    […]!
    seqwrite: (groupid=0, jobs=1): err= 0: pid=22321!
    write: io=122374MB, bw=840951KB/s, iops=6569 , runt=149011msec!
    clat (usec): min=41 , max=133186 , avg=148.26, stdev=1287.17!
    lat (usec): min=44 , max=133188 , avg=151.11, stdev=1287.21!
    bw (KB/s) : min=10746, max=1983488, per=100.18%, avg=842503.94,
    stdev=262774.35!
    cpu
    : usr=2.67%, sys=43.46%, ctx=14284, majf=1, minf=24!
    IO depths
    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%!
    submit
    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%!
    complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%!
    issued r/w/d: total=0/978992/0, short=0/0/0!
    lat (usec): 50=0.02%, 100=98.30%, 250=1.06%, 500=0.01%, 750=0.01%!
    lat (usec): 1000=0.01%!
    lat (msec): 2=0.01%, 4=0.01%, 10=0.25%, 20=0.29%, 50=0.06%!
    lat (msec): 100=0.01%, 250=0.01%!
    • Results include basic latency distribution
    
slide 72:
    pchar	
      
    • Traceroute with bandwidth per hop!
    $ pchar 10.71.83.1!
    […]!
    4: 10.110.80.1 (10.110.80.1)!
    Partial loss:
    0 / 5 (0%)!
    Partial char:
    rtt = 9.351109 ms, (b = 0.004961 ms/B), r2 = 0.184105!
    stddev rtt = 4.967992, stddev b = 0.006029!
    Partial queueing: avg = 0.000000 ms (0 bytes)!
    Hop char:
    rtt = --.--- ms, bw = 1268.975773 Kbps!
    Hop queueing:
    avg = 0.000000 ms (0 bytes)!
    5: 10.193.43.181 (10.193.43.181)!
    Partial loss:
    0 / 5 (0%)!
    Partial char:
    rtt = 25.461597 ms, (b = 0.011934 ms/B), r2 = 0.228707!
    stddev rtt = 10.426112, stddev b = 0.012653!
    Partial queueing: avg = 0.000000 ms (0 bytes)!
    Hop char:
    rtt = 16.110487 ms, bw = 1147.210397 Kbps!
    Hop queueing:
    avg = 0.000000 ms (0 bytes)!
    […]!
    • Needs love. Based on pathchar (Linux 2.0.30).
    
slide 73:
    Benchmarking Tools
    
slide 74:
    Tuning Tools
    
slide 75:
    Tuning Tools

    • Generic interfaces:
    – sysctl, /sys
    • Many areas have custom tuning tools:
    – Applications: their own config
    – CPU/scheduler: nice, renice, taskset, ulimit, chcpu
    – Storage I/O: tune2fs, ionice, hdparm, blockdev, …
    – Network: ethtool, tc, ip, route
    – Dynamic patching: stap, kpatch
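As a minimal sketch of the generic sysctl interface listed above (the parameter is only an example; remember the tuning risks noted elsewhere in this tutorial):

    $ sysctl net.ipv4.tcp_congestion_control
    net.ipv4.tcp_congestion_control = cubic
    # sysctl -w net.ipv4.tcp_congestion_control=cubic          # set for the running kernel
    # echo cubic > /proc/sys/net/ipv4/tcp_congestion_control   # equivalent via /proc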
    
slide 76:
    Tuning Methods

    • Scientific Method:
    Question
    Hypothesis
    Prediction
    Test
    Analysis
    • Any observational or benchmarking tests you can try
    before tuning?
    • Consider risks, and see previous tools
    
slide 77:
    Tuning Tools
    
slide 78:
    Static Performance Tuning
slide 79:
    Static Performance Tuning
slide 80:
    CPU Types & Flags

    $ more /proc/cpuinfo !
    processor!
    !: 0!
    vendor_id!
    !: GenuineIntel!
    cpu family
    !: 6!
    model
    !: 42!
    model name
    !: Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz!
    stepping !
    !: 7!
    microcode!
    !: 0x1a!
    cpu MHz !
    !: 1600.000!
    cache size
    !: 6144 KB!
    physical id
    !: 0!
    siblings !
    !: 4!
    core id !
    !: 0!
    cpu cores!
    !: 4!
    apicid
    !: 0!
    initial apicid!: 0!
    fpu !
    !: yes!
    fpu_exception !: yes!
    cpuid level
    !: 13!
    wp !
    !: yes!
    flags
    !: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
    clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc ar!
    ch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni
    pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2
    x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts
    dtherm tpr_shadow vnmi flexpriority ept vpid!
    […]!
    CPU	
      speed	
      s
slide 81:
    CPU Frequency Scaling

    • Kernel may be configured to dynamically modify CPU
    frequency:
    # cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies !
    3101000 3100000 2900000 2700000 2500000 2300000 2100000 1900000 1700000 1600000 !
    # cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor !
    ondemand!
    – See Documentation/cpu-freq/governors.txt,
    and scaling_governor == performance
    • Not to be confused with Intel Turbo Boost (which is H/W)
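A sketch of checking and (as root) switching the governor, using the sysfs paths from the slide; the tee-over-a-glob form is an assumption:

    $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    ondemand
    # echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor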
    
slide 82:
    Storage Devices

    # cat /proc/scsi/scsi!
    Attached devices:!
    Host: scsi0 Channel: 00 Id: 00 Lun: 00!
    Vendor: ATA
    Model: ST3320413AS
    Rev: JC65!
    Type:
    Direct-Access
    ANSI SCSI revision: 05!
    Host: scsi1 Channel: 00 Id: 00 Lun: 00!
    Vendor: PLDS
    Model: DVD-RW DH16ABSH Rev: YL32!
    Type:
    CD-ROM
    ANSI SCSI revision: 05!
    # lsscsi!
    [0:0:0:0]
    disk
    ATA
    ST3320413AS
    JC65 /dev/sda !
    [1:0:0:0]
    cd/dvd PLDS
    DVD-RW DH16ABSH YL32 /dev/sr0 !
    • Micro-benchmarking disks (not file systems!) is also
    useful for understanding their characteristics
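One possible (illustrative, not from the slide) device micro-benchmark is a direct read of the raw block device, bypassing the file system and page cache; read-only, but still use caution on busy production systems:

    # dd if=/dev/sda of=/dev/null bs=1M count=1024 iflag=direct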
    
slide 83:
    Routing Table
slide 84:
    etc…	
      
    System messages: dmesg!
    Network interface config: ifconfig –a; ip link
    File system capacity: df -h!
    Volume config: mdadm --misc -D /dev/md0 !
    Storage device info: smartctl!
    NUMA config: numactl -s; numactl -H!
    PCI info: lspci
    Installed kernel modules: lsmod!
    Root crontab config: crontab –l!
    Services: service --status-all!
    
slide 85:
    Static Performance Tuning Tools
slide 86:
    Profiling	
      
    
slide 87:
    Profiling	
      
    • Objectives:
    – Profile CPU usage by stack sampling
    – Generate CPU flame graphs
    – Understand gotchas with stacks & symbols
    
slide 88:
    CPU Profiling

    • Record stacks at a timed interval: simple and effective
    – Pros: Low (deterministic) overhead
    – Cons: Coarse accuracy, but usually sufficient
    [diagram: stack samples over time, on-CPU vs off-CPU (syscall, block, interrupt)]
    
slide 89:
    perf_events	
      
    • Introduced earlier: multi-tool, profiler. Provides "perf".
    usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]!
    The most commonly used perf commands are:!
    annotate
    Read perf.data (created by perf record) and display annotated code!
    archive
    Create archive with object files with build-ids found in perf.data file!
    bench
    General framework for benchmark suites!
    buildid-cache
    Manage build-id cache.!
    buildid-list
    List the buildids in a perf.data file!
    data
    Data file related processing!
    diff
    Read perf.data files and display the differential profile!
    evlist
    List the event names in a perf.data file!
    inject
    Filter to augment the events stream with additional information!
    kmem
    Tool to trace/measure kernel memory(slab) properties!
    kvm
    Tool to trace/measure kvm guest os!
    list
    List all symbolic event types!
    lock
    Analyze lock events!
    mem
    Profile memory accesses!
    record
    Run a command and record its profile into perf.data!
    report
    Read perf.data (created by perf record) and display the profile!
    sched
    Tool to trace/measure scheduler properties (latencies)!
    script
    Read perf.data (created by perf record) and display trace output!
    stat
    Run a command and gather performance counter statistics!
    test
    Runs sanity tests.!
    timechart
    Tool to visualize total system behavior during a workload!
    top
    System profiling tool.!
    trace
    strace inspired tool!
    probe
    Define new dynamic tracepoints!
    See 'perf help COMMAND' for more information on a specific command.!
    
slide 90:
    perf_events: CPU profiling

    • Sampling full stack traces at 99 Hertz, for 30 secs:
    # perf record -F 99 -ag -- sleep 30!
    [ perf record: Woken up 9 times to write data ]!
    [ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]!
    # perf report -n --stdio!
    1.40%
    java [kernel.kallsyms]
    [k] _raw_spin_lock
    --- _raw_spin_lock!
    |--63.21%-- try_to_wake_up!
    |--63.91%-- default_wake_function!
    |--56.11%-- __wake_up_common!
    __wake_up_locked!
    ep_poll_callback!
    __wake_up_common!
    __wake_up_sync_key!
    |--59.19%-- sock_def_readable!
    […78,000 lines truncated…]!
    
slide 91:
    perf_events: Full "report" Output
    
slide 92:
    … as a Flame Graph
    
slide 93:
    perf_events: Flame Graphs

    git clone --depth 1 https://github.com/brendangregg/FlameGraph
    cd FlameGraph
    perf record -F 99 -a -g -- sleep 30
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
    • Flame Graphs:
    – x-axis: alphabetical stack sort, to maximize merging
    – y-axis: stack depth
    – color: random, or hue can be a dimension (eg, diff)
    • Interpretation:
    – Top edge is on-CPU, beneath it is ancestry
    • Currently made from Perl + JavaScript & SVG
    • Easy to get working
    – http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
    
slide 94:
    Mysterious CPU consumer…

    DEMO
    DISCUSSION
    
slide 95:
    perf_events: Workflow

    [diagram: typical workflow]
    list events:    perf list
    count events:   perf stat
    capture stacks: perf record -> perf.data
    text UI / dump profile: perf report
    flame graph visualization
slide 96:
    perf_events: CPU Flame Graph

    [flame graph annotations: Kernel, TCP/IP, broken Java stacks (missing frame pointer),
     GC, Locks, Time, Idle thread, epoll]
    
slide 97:
    perf_events: Mixed-Mode
    CPU Flame Graph
    [flame graph annotations: Kernel, JVM (C++), (C), Fixed Java Stacks!]
    
slide 98:
    perf_events: Gotchas

    • Stack traces and symbols often don't work.
    – Can be a significant project to fix them. It is worth it!
    • C:
    – stacks: -fno-omit-frame-pointer
    • Java:
    – stacks on x86: -XX:+PreserveFramePointer
    (JDK-8068945 for JDK9, JDK-8072465 for JDK8)
    – symbols: perf-map-agent (other solutions exist)
    • Node.js:
    – symbols: run v8 with --perf_basic_prof
    http://www.brendangregg.com/blog/2015-02-27/linux-profiling-at-netflix.html
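A minimal sketch of applying those fixes (the flags are from this slide; the exact build and run commands are illustrative):

    $ gcc -O2 -fno-omit-frame-pointer -o myapp myapp.c    # C: keep frame pointers
    $ java -XX:+PreserveFramePointer ...                  # Java, per the JDK bugs above
    $ node --perf_basic_prof app.js                       # Node.js: v8 emits a JIT symbol map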
    
slide 99:
    perf_events: Counters

    • Performance Monitoring Counters (PMCs):
    $ perf list | grep –i hardware!
    cpu-cycles OR cycles
    stalled-cycles-frontend OR idle-cycles-frontend
    stalled-cycles-backend OR idle-cycles-backend
    instructions
    […]!
    branch-misses
    bus-cycles
    L1-dcache-loads
    L1-dcache-load-misses
    […]!
    rNNN (see 'perf list --help' on how to encode it)
    mem:<addr>[:access]
    [Hardware event]!
    [Hardware event]!
    [Hardware event]!
    [Hardware event]!
    [Hardware event]!
    [Hardware event]!
    [Hardware cache event]!
    [Hardware cache event]!
    [Raw hardware event … !
    [Hardware breakpoint]!
    • Identify CPU cycle breakdowns, esp. stall types
    • PMCs not enabled by-default in clouds (yet)
    • Can be time-consuming to use (CPU manuals)
    – Please develop front-ends. Eg, tiptop.
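A hedged first pass at a cycle breakdown using these counters (system-wide for 10 seconds; the event list is an example, and may be unavailable where PMCs are disabled):

    # perf stat -a -- sleep 10
    # perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend -a -- sleep 10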
    
slide 100:
    Tracing	
      
    
slide 101:
    Tracing	
      
    • Objectives:
    – Understand frameworks: tracepoints, kprobes, uprobes
    – Understand mainline tracers: ftrace, perf_events, eBPF
    – Awareness of other tracers: SystemTap, LTTng, ktap, sysdig
    – Awareness of what tracing can accomplish (eg, perf-tools)
    
slide 102:
    Tracing Frameworks: Tracepoints

    • Statically placed at logical places in the kernel
    • Provides key event details as a “format” string
    
slide 103:
    Tracing Frameworks: + probes

    • kprobes: dynamic kernel tracing
    – function calls, returns, line numbers
    • uprobes: dynamic user-level tracing
    
slide 104:
    Linux Tracing Tools

    ftrace
    perf_events
    LTTng
    ktap
    eBPF
    dtrace4linux
    OEL DTrace
    SystemTap
    sysdig
    • Many choices (too many?), all still in development
    
slide 105:
    Linux Tracing is Magic!

    ftrace
    perf_events
    LTTng
    ktap
    eBPF
    dtrace4linux
    OEL DTrace
    SystemTap
    sysdig
    (Thanks Deirdré Straughan & General Zoi's Pony Creator)
      
    
slide 106:
    Choosing a Tracer

    • Some companies standardize on one tracer
    – eg, SystemTap, LTTng, …
    
slide 107:
    Choosing a Tracer

    • My approach is:
    [flowchart: study what Linux already has built-in (perf_events, ftrace, eBPF?) ->
     is it sufficient? -> what is the purpose? (live tracing, counting, …)]
slide 108:
    ftrace
    
slide 109:
    ftrace

    • Added by Steven Rostedt and others since 2.6.27
    • Already enabled on our servers (3.2+)
    – CONFIG_FTRACE, CONFIG_FUNCTION_PROFILER, …
    – Use directly via /sys/kernel/debug/tracing:
    linux-4.0.0+# ls /sys/kernel/debug/tracing/!
    available_events
    max_graph_depth
    available_filter_functions options
    available_tracers
    per_cpu
    buffer_size_kb
    printk_formats
    buffer_total_size_kb
    README
    current_tracer
    saved_cmdlines
    dyn_ftrace_total_info
    saved_cmdlines_size
    enabled_functions
    set_event
    events
    set_ftrace_filter
    free_buffer
    set_ftrace_notrace
    function_profile_enabled
    set_ftrace_pid
    instances
    set_graph_function
    kprobe_events
    set_graph_notrace
    kprobe_profile
    snapshot!
    stack_max_size!
    stack_trace!
    stack_trace_filter!
    trace!
    trace_clock!
    trace_marker!
    trace_options!
    trace_pipe!
    trace_stat!
    tracing_cpumask!
    tracing_max_latency!
    tracing_on!
    tracing_thresh!
    – See Linux source: Documentation/trace/ftrace.txt
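A minimal sketch of driving that directory directly (standard ftrace usage per ftrace.txt; the traced function is just an example):

    # cd /sys/kernel/debug/tracing
    # echo do_sys_open > set_ftrace_filter       # limit tracing to one function
    # echo function_graph > current_tracer
    # cat trace_pipe                             # watch live events; Ctrl-C to stop
    # echo nop > current_tracer                  # disable tracing
    # echo > set_ftrace_filter                   # clear the filter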
    
slide 110:
    ftrace Front-Ends

    • Steven wrote a front-end: trace-cmd
    – Multi-tool, works well
    • I've developed the "perf-tools" front-ends
    – https://github.com/brendangregg/perf-tools
    – Both single & multi-purpose, Unix-like
    – Unsupported hacks: see WARNINGs
    • perf-tools:
    – single-purpose: iosnoop, iolatency, opensnoop
    – multi-tools: funccount, funcgraph, kprobe
    
slide 111:
    iosnoop	
      
    • Block I/O (disk) events with latency:
    # ./iosnoop –ts!
    Tracing block I/O. Ctrl-C to end.!
    STARTs
    ENDs
    COMM
    5982800.302061 5982800.302679 supervise
    5982800.302423 5982800.302842 supervise
    5982800.304962 5982800.305446 supervise
    5982800.305250 5982800.305676 supervise
    […]!
    PID
    TYPE DEV
    202,1
    202,1
    202,1
    202,1
    BLOCK
    BYTES LATms!
    0.62!
    0.42!
    0.48!
    0.43!
    # ./iosnoop –h!
    USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name] [duration]!
    -d device
    # device string (eg, "202,1)!
    -i iotype
    # match type (eg, '*R*' for all reads)!
    -n name
    # process name to match on I/O issue!
    -p PID
    # PID to match on I/O issue!
    # include queueing time in LATms!
    # include start time of I/O (s)!
    # include completion time of I/O (s)!
    # this usage message!
    duration
    # duration seconds, and use buffers!
    […]!
    
slide 112:
    iolatency	
      
    • Block I/O (disk) latency distributions:
    # ./iolatency
    Tracing block I/O. Output every 1 seconds. Ctrl-C to end.

    >=(ms) .. <(ms)   : I/O      |Distribution                          |
         0 -> 1       : 2104     |######################################|
         1 -> 2       : 280      |######                                |
         2 -> 4       : 2        |                                      |
         4 -> 8       : 0        |                                      |
         8 -> 16      : 202      |####                                  |

    >=(ms) .. <(ms)   : I/O      |Distribution                          |
         0 -> 1       : 1144     |######################################|
         1 -> 2       : 267      |#########                             |
         2 -> 4       : 10       |                                      |
         4 -> 8       : 5        |                                      |
         8 -> 16      : 248      |#########                             |
        16 -> 32      : 601      |####################                  |
        32 -> 64      : 117      |####                                  |
    […]
    
slide 113:
    opensnoop	
      
    • Trace open() syscalls showing filenames:
    # ./opensnoop -t!
    Tracing open()s. Ctrl-C to end.!
    TIMEs
    COMM
    PID
    postgres
    postgres
    postgres
    postgres
    postgres
    postgres
    postgres
    svstat
    svstat
    stat
    stat
    stat
    stat
    stat
    stat
    […]!
    FD FILE!
    0x8 /proc/self/oom_adj!
    0x5 global/pg_filenode.map!
    0x5 global/pg_internal.init!
    0x5 base/16384/PG_VERSION!
    0x5 base/16384/pg_filenode.map!
    0x5 base/16384/pg_internal.init!
    0x5 base/16384/11725!
    0x4 supervise/ok!
    0x4 supervise/status!
    0x3 /etc/ld.so.cache!
    0x3 /lib/x86_64-linux-gnu/libselinux…!
    0x3 /lib/x86_64-linux-gnu/libc.so.6!
    0x3 /lib/x86_64-linux-gnu/libdl.so.2!
    0x3 /proc/filesystems!
    0x3 /etc/nsswitch.conf!
    
slide 114:
    tpoint	
      
    • Who is creating disk I/O, and of what type?
    # ./tpoint -H block:block_rq_insert!
    tracepoint	
      
    Tracing block:block_rq_insert. Ctrl-C to end.!
    type	
      
    size	
      (sectors)	
      
    # tracer: nop!
    dev	
      
    offset	
      
    TASK-PID
    CPU#
    TIMESTAMP FUNCTION!
    | |
    flush-9:0-9318 [013] 1936182.007914: block_rq_insert: 202,16 W 0 () 160186560 + 8 [flush-9:0]!
    flush-9:0-9318 [013] 1936182.007939: block_rq_insert: 202,16 W 0 () 280100936 + 8 [flush-9:0]!
    java-9469 [014] 1936182.316184: block_rq_insert: 202,1 R 0 () 1319592 + 72 [java]!
    java-9469 [000] 1936182.331270: block_rq_insert: 202,1 R 0 () 1125744 + 8 [java]!
    java-9469 [000] 1936182.341418: block_rq_insert: 202,1 R 0 () 2699008 + 88 [java]!
    java-9469 [000] 1936182.341419: block_rq_insert: 202,1 R 0 () 2699096 + 88 [java]!
    java-9469 [000] 1936182.341419: block_rq_insert: 202,1 R 0 () 2699184 + 32 [java]!
    java-9469 [000] 1936182.345870: block_rq_insert: 202,1 R 0 () 1320304 + 24 [java]!
    java-9469 [000] 1936182.351590: block_rq_insert: 202,1 R 0 () 1716848 + 16 [java]!
    ^C!
    Ending tracing...!
    • tpoint traces a given tracepoint. -H prints the header.
    
slide 115:
    tpoint -l

    # ./tpoint -l!
    block:block_bio_backmerge!
    block:block_bio_bounce!
    block:block_bio_complete!
    block:block_bio_frontmerge!
    block:block_bio_queue!
    block:block_bio_remap!
    block:block_getrq!
    block:block_plug!
    block:block_rq_abort!
    block:block_rq_complete!
    block:block_rq_insert!
    block:block_rq_issue!
    block:block_rq_remap!
    block:block_rq_requeue!
    […]!
    # ./tpoint –l | wc –l!
    1257!
    Listing tracepoints
slide 116:
    funccount	
      
    • Count a kernel function call rate:
    # ./funccount -i 1 'bio_*'!
    Tracing "bio_*"... Ctrl-C to end.!
    FUNC
    COUNT!
    bio_attempt_back_merge
    26!
    bio_get_nr_vecs
    361!
    bio_alloc
    536!
    bio_alloc_bioset
    536!
    bio_endio
    536!
    bio_free
    536!
    bio_fs_destructor
    536!
    bio_init
    536!
    bio_integrity_enabled
    536!
    bio_put
    729!
    bio_add_page
    1004!
    [...]!
    Counts are in-kernel, for low overhead
    – -i: set an output interval (seconds), otherwise until Ctrl-C
    
slide 117:
    funcgraph	
      
    • Trace a graph of kernel code flow:
    # ./funcgraph -Htp 5363 vfs_read!
    Tracing "vfs_read" for PID 5363... Ctrl-C to end.!
    # tracer: function_graph!
    TIME
    CPU DURATION
    FUNCTION CALLS!
    4346366.073832 |
    | vfs_read() {!
    4346366.073834 |
    rw_verify_area() {!
    4346366.073834 |
    security_file_permission() {!
    4346366.073834 |
    apparmor_file_permission() {!
    4346366.073835 |
    0.153 us
    common_file_perm();!
    4346366.073836 |
    0.947 us
    4346366.073836 |
    0.066 us
    __fsnotify_parent();!
    4346366.073836 |
    0.080 us
    fsnotify();!
    4346366.073837 |
    2.174 us
    4346366.073837 |
    2.656 us
    4346366.073837 |
    tty_read() {!
    4346366.073837 |
    0.060 us
    tty_paranoia_check();!
    […]!
    
slide 118:
    kprobe	
      
    • Dynamically trace a kernel function call or return, with
    variables, and in-kernel filtering:
    # ./kprobe 'p:open do_sys_open filename=+0(%si):string' 'filename ~ "*stat"'!
    Tracing kprobe myopen. Ctrl-C to end.!
    postgres-1172 [000] d... 6594028.787166: open: (do_sys_open
    +0x0/0x220) filename="pg_stat_tmp/pgstat.stat"!
    postgres-1172 [001] d... 6594028.797410: open: (do_sys_open
    +0x0/0x220) filename="pg_stat_tmp/pgstat.stat"!
    postgres-1172 [001] d... 6594028.797467: open: (do_sys_open
    +0x0/0x220) filename="pg_stat_tmp/pgstat.stat”!
    ^C!
    Ending tracing...!
    • Add -s for stack traces; -p for PID filter in-kernel.
    • Quickly confirm kernel behavior; eg: did a tunable take
    effect?
    
slide 119:
    perf-tools (so far…)
    
slide 120:
    perf-tools (so far…)
    
slide 121:
    perf_events	
      
    
slide 122:
    perf_events	
      
    • Powerful profiler (covered earlier)
    • … and tracer:
    – User-level and kernel dynamic tracing
    – Kernel line tracing and local variables (debuginfo)
    – Kernel filtering expressions
    – Efficient in-kernel counts (perf stat)
    • Not very programmable, yet
    – Limited kernel summaries. Should improve with eBPF.
    
slide 123:
    perf_events: Listing Events
slide 124:
    Linux Event Sources
    
slide 125:
    perf_events	
      Tracing	
      Tracepoints	
      
    # perf record -e block:block_rq_complete -a sleep 10!
    [ perf record: Woken up 1 times to write data ]!
    [ perf record: Captured and wrote 0.428 MB perf.data (~18687 samples) ]!
    # perf script!
    run 30339 [000] 2083345.722767: block:block_rq_complete: 202,1 W () 12984648 + 8 [0]!
    run 30339 [000] 2083345.722857: block:block_rq_complete: 202,1 W () 12986336 + 8 [0]!
    run 30339 [000] 2083345.723180: block:block_rq_complete: 202,1 W () 12986528 + 8 [0]!
    swapper
    0 [000] 2083345.723489: block:block_rq_complete: 202,1 W () 12986496 + 8 [0]!
    swapper
    0 [000] 2083346.745840: block:block_rq_complete: 202,1 WS () 1052984 + 144 [0]!
    supervise 30342 [000] 2083346.746571: block:block_rq_complete: 202,1 WS () 1053128 + 8 [0]!
    supervise 30342 [000] 2083346.746663: block:block_rq_complete: 202,1 W () 12986608 + 8 [0]!
    run 30342 [000] 2083346.747003: block:block_rq_complete: 202,1 W () 12986832 + 8 [0]!
    [...]!
    • If -g is used in "perf record", stack traces are included
    • If "perf script" output is too verbose, try "perf report",
    or making a flame graph
    
slide 126:
    perf_events Report

    # perf record –e skb:consume_skb -ag!
    ^C[ perf record: Woken up 1 times to write data ]!
    [ perf record: Captured and wrote 0.065 MB perf.data (~2851 samples) ]!
    # perf report -n --stdio!
    [...]!
    74.42% swapper [kernel.kallsyms] [k] consume_skb!
    --- consume_skb!
    arp_process!
    arp_rcv!
    (annotation: summarizing stack traces for a tracepoint)
    __netif_receive_skb_core!
    __netif_receive_skb!
    netif_receive_skb!
    virtnet_poll!
    net_rx_action!
    __do_softirq!
    irq_exit!
    do_IRQ!
    ret_from_intr!
    default_idle!
    cpu_idle!
    start_secondary!
    […]!
    
slide 127:
    eBPF	
      
    
slide 128:
    eBPF	
      
    • Extended BPF: programs on tracepoints
    – High performance filtering: JIT
    – In-kernel summaries: maps
    – Developed by Alexei Starovoitov (Plumgrid)
    • Currently being integrated in parts (Linux 3.18, 4.1, …)
    # ./bitehist
    Tracing block device I/O... Interval 5 secs. Ctrl-C to end.

         kbytes      : count     distribution
         0 -> 1      : 3         |                                      |
         2 -> 3      : 0         |                                      |
         4 -> 7      : 3395      |************************************* |
         8 -> 15     : 1         |                                      |
        16 -> 31     : 2         |                                      |
        32 -> 63     : 738       |*******                               |
        64 -> 127    : 3         |                                      |
       128 -> 255    : 1         |                                      |
    (annotation: in-kernel summary)
    
slide 129:
    eBPF	
      
    • Example in-kernel latency heat map:
    
slide 130:
    Other Tracers
    
slide 131:
    SystemTap	
      
    • Fully programmable, fully featured
    – Including access to user-level tracepoints
    • Compiles tracing programs into kernel modules
    – Needs a compiler, and takes time
    • “Works great on Red Hat”
    – I keep trying on other distros and have hit trouble in the past
    – Make sure you are on the latest version (>= 2.6)
    • "Works great with kernel debuginfo"
    – Suited for kernel developers who have it handy
    – A difficult requirement for our cloud environment
    
slide 132:
    "Lightweight"	
      SystemTap	
      
    • SystemTap can be used without kernel debuginfo
    – Makes life harder, but some tasks are still possible
    – providers: nd_syscall, kprobe.function, kernel.trace, …
    – args via: int_arg(), uint_arg(), pointer_arg(), user_string()
    • Something I've experimented with. Examples:
    – https://github.com/brendangregg/systemtap-lwtools/
    # stap -e 'global a; probe nd_syscall.write { a 
slide 133:
    ktap	
      
    • Was a very promising new Linux tracer:
    – Sampling, static & dynamic tracing
    – Lightweight, simple. Uses bytecode.
    – Suited for embedded devices
    • Development suspended while eBPF integrates
    • Will it restart?
    
slide 134:
    sysdig	
      
    • sysdig: Innovative new tracer. Simple expressions:
    sysdig fd.type=file and evt.failed=true!
    sysdig evt.type=open and fd.name contains /etc!
    sysdig -p"%proc.name %fd.name" "evt.type=accept and proc.name!=httpd”!
    • Replacement for strace? (or “perf trace” will)
    • Programmable “chisels”. Eg, one of mine:
    # sysdig -c fileslower 1!
    TIME
    PROCESS
    2014-04-13 20:40:43.973 cksum
    2014-04-13 20:40:44.187 cksum
    2014-04-13 20:40:44.689 cksum
    […]!
    TYPE
    read
    read
    read
    LAT(ms) FILE!
    2 /mnt/partial.0.0!
    1 /mnt/partial.0.0!
    2 /mnt/partial.0.0!
    • Currently syscalls and user-level processing only
    – I'm not sure it can be optimized enough for kernel tracing,
    unless it adopts eBPF for in-kernel processing & summaries
    
slide 135:
    Present & Future

    • Present:
    – ftrace & perf_events solving many needs today:
    • PMC profiling, stack profiling, tracepoint & dynamic tracing, …
    • Expected Future:
    – eBPF for kernel summaries & advanced programs
    – eBPF perf integration to improve ease of use
    • Possible Future:
    – eBPF high level language (ktap?)
    – ftrace/eBPF integration
    – Other tracer eBPF integration (SystemTap, LTTng, sysdig?)
    – One of the other tracers going mainline?
    
slide 136:
    The Tracing Landscape, May 2015

    (my opinion)
    [chart: ease of use (brutal … less brutal) vs scope & capability, with stage of
     development (alpha … mature); plotting sysdig, perf, ftrace, stap, eBPF, ktap, dtrace4L.]
    
slide 137:
    In Summary…
    
slide 138:
    Methodologies Summary

    • Objectives:
    – Recognize the Streetlight Anti-Method
    – Perform the Workload Characterization Method
    – Perform the USE Method
    – Be aware of other methodologies
    Try to start with the questions (methodology), to help guide
    your use of the tools
    
slide 139:
    Tools Summary

    • Objectives:
    – Perform the USE Method for resource utilization
    – Perform Workload Characterization for disks, network
    – Have exposure to various observability tools:
    • Basic: vmstat, iostat, mpstat, ps, top, …
    • Intermediate: tcpdump, netstat, nicstat, pidstat, sar, …
    • Advanced: ss, slabtop, perf_events, …
    – Perform Active Benchmarking
    – Understand tuning risks
    – Perform Static Performance Tuning
    Print out the tools diagrams for your office wall
    
slide 140:
    Profiling & Tracing Summary

    • Objectives:
    – Profile CPU usage by stack sampling
    – Generate CPU flame graphs
    – Understand gotchas with stacks & symbols
    – Understand frameworks: tracepoints, kprobes, uprobes
    – Understand mainline tracers: ftrace, perf_events, eBPF
    – Awareness of other tracers: SystemTap, LTTng, ktap, sysdig
    – Awareness of what tracing can accomplish (eg, perf-tools)
    I've hopefully turned some unknown unknowns into known
    unknowns
    
slide 141:
    References & Links

    Systems Performance: Enterprise and the Cloud, Prentice Hall
slide 142:
    Thanks	
      
    Questions?
    http://slideshare.net/brendangregg
    http://www.brendangregg.com
    bgregg@netflix.com
    @brendangregg