SREcon_2016_perf_checklists.pdf

USENIX SREcon 2016: Performance Checklists for SREs

Talk from SREcon2016 by Brendan Gregg.

Video: https://www.youtube.com/watch?v=zxCWXNigDpA

Video: https://www.usenix.org/conference/srecon16/program/presentation/gregg

Description: "There's limited time for performance analysis in the emergency room. When there is a performance-related site outage, the SRE team must analyze and solve complex performance issues as quickly as possible, and under pressure. Many performance tools and techniques are designed for a different environment: an engineer analyzing their system over the course of hours or days, and given time to try dozens of tools: profilers, tracers, monitoring tools, benchmarks, as well as different tunings and configurations. But when Netflix is down, minutes matter, and there's little time for such traditional systems analysis. As with aviation emergencies, short checklists and quick procedures can be applied by the on-call SRE staff to help solve performance issues as quickly as possible.

In this talk, I'll cover a checklist for Linux performance analysis in 60 seconds, as well as other methodology-derived checklists and procedures for cloud computing, with examples of performance issues for context. Whether you are solving crises in the SRE war room, or just have limited time for performance engineering, these checklists and approaches should help you find some quick performance wins. Safe flying."


PDF: SREcon_2016_perf_checklists.pdf

Keywords (from pdftotext):

slide 1:

Performance Checklists for SREs
Brendan Gregg
Senior Performance Architect

slide 2:

Performance Checklists
per instance:
1. uptime
2. dmesg -T | tail
3. vmstat 1
4. mpstat -P ALL 1
5. pidstat 1
6. iostat -xz 1
7. free -m
8. sar -n DEV 1
9. sar -n TCP,ETCP 1
10. top
cloud wide:
1. RPS, CPU
2. Volume
3. Instances
4. Scaling
5. CPU/RPS
6. Load Avg
7. Java Heap
8. ParNew
9. Latency
10. 99th percentile

slide 3:

slide 4:

Brendan the SRE
• On the Perf Eng team & primary on-call rotation for Core: our central SRE team
– we get paged on SPS dips (starts per second) & more
• In this talk I'll condense some perf engineering into SRE timescales (minutes) using checklists

slide 5:

Performance Engineering != SRE Performance Incident Response

slide 6:

Performance Engineering
• Aim: best price/performance possible
– Can be endless: continual improvement
• Fixes can take hours, days, weeks, months
– Time to read docs & source code, experiment
– Can take on large projects no single team would staff
• Usually no prior "good" state
– No spot the difference. No starting point.
– Is now "good" or "bad"? Experience/instinct helps
• Solo/team work
At Netflix: The Performance Engineering team, with help from
developers

slide 7:

Performance Engineering

slide 8:

Performance Engineering
stat tools
tracers
benchmarks
monitoring
dashboards
documentation
source code
tuning
PMCs
profilers
flame graphs

slide 9:

SRE Perf Incident Response
• Aim: resolve issue in minutes
– Quick resolution is king. Can scale up, roll back, redirect traffic.
– Must cope under pressure, and at 3am
• Previously was in a "good" state
– Spot the difference with historical graphs
• Get immediate help from all staff
– Must be social
• Reliability & perf issues often related
At Netflix, the Core team (5 SREs), with immediate help
from developers and performance engineers

slide 10:

SRE Perf Incident Response

slide 11:

SRE Perf Incident Response
custom dashboards
central event logs
distributed system tracing
chat rooms
pager
ticket system

slide 12:

Netflix Cloud Analysis Process
In summary… Example SRE response path enumerated:
1. Check Issue
2. Check Events
3. Drill Down
4. Check Dependencies
5. Root Cause
[Diagram labels: Atlas Alerts; ICE; Cost; Atlas Dashboards; Chronos; Create New Alert; Redirected to a new Target; Atlas Metrics; Mogul; SSH, instance tools; Salp]
Plus some other tools not pictured

slide 13:

The Need for Checklists
Speed
Completeness
A Starting Point
An Ending Point
Reliability
Training
Perf checklists have historically
been created for perf engineering
(hours) not SRE response (minutes)
More on checklists: Gawande, A.,
The Checklist Manifesto. Metropolitan
Books, 2008
[Figure: Boeing 707 Emergency Checklist (1969)]

slide 14:

SRE Checklists at Netflix
• Some shared docs
– PRE Triage Methodology
– go/triage: a checklist of dashboards
• Most "checklists" are really custom dashboards
– Selected metrics for both reliability and performance
• I maintain my own per-service and per-device checklists

slide 15:

SRE Performance Checklists
The following are:
• Cloud performance checklists/dashboards
• SSH/Linux checklists (lowest common denominator)
• Methodologies for deriving cloud/instance checklists
Ad Hoc
Methodology
Checklists
Dashboards
Including aspirational: what we want to do & build as dashboards

slide 16:

1. PRE Triage Checklist
Our initial checklist
Netflix specific

slide 17:

PRE Triage Checklist
• Performance and Reliability Engineering checklist
– Shared doc with a hierarchical checklist with 66 steps total
1. Initial Impact
record timestamp
quantify: SPS, signups, support calls
check impact: regional or global?
check devices: device specific?
2. Time Correlations
1. pretriage dashboard
1. check for suspect NIWS client: error rates
2. check for source of error/request rate change
3. […dashboard specifics…]
Confirms, quantifies, & narrows problem.
Helps you reason about the cause.

slide 18:

PRE Triage Checklist, cont.
• 3. Evaluate Service Health
– perfvitals dashboard
– mogul dependency correlation
– by cluster/asg/node:
• latency: avg, 90 percentile
• request rate
• CPU: utilization, sys/user
• Java heap: GC rate, leaks
• memory
• load average
• thread contention (from Java)
• JVM crashes
• network: tput, sockets
• […]
custom dashboards

slide 19:

2. predash
Initial dashboard
Netflix specific

slide 20:

predash
Performance and Reliability Engineering dashboard
A list of selected dashboards suited for incident response

slide 21:

predash
List of dashboards is its own checklist:
1. Overview
2. Client stats
3. Client errors & retries
4. NIWS HTTP errors
5. NIWS Errors by code
6. DRM request overview
7. DoS attack metrics
8. Push map
9. Cluster status

slide 22:

3. perfvitals
Service dashboard

slide 23:

perfvitals
1. RPS, CPU
2. Volume
3. Instances
4. Scaling
5. CPU/RPS
6. Load Avg
7. Java Heap
8. ParNew
9. Latency
10. 99th percentile

slide 24:

4. Cloud Application Performance Dashboard
A generic example

slide 25:

Cloud App Perf Dashboard
1. Load
2. Errors
3. Latency
4. Saturation
5. Instances

slide 26:

Cloud App Perf Dashboard
1. Load          problem of load applied? req/sec, by type
2. Errors        errors, timeouts, retries
3. Latency       response time average, 99th percentile, distribution
4. Saturation    CPU load averages, queue length/time
5. Instances     scale up/down? count, state, version
All time series, for every application, and dependencies.
Draw a functional diagram with the entire data path.
Same as Google's "Four Golden Signals" (Latency, Traffic,
Errors, Saturation), with instances added due to cloud
– Beyer, B., Jones, C., Petoff, J., Murphy, N. Site Reliability Engineering.
O'Reilly, Apr 2016

slide 27:

5. Bad Instance Dashboard
An Anti-Methodology

slide 28:

Bad Instance Dashboard
Plot request time per-instance
Find the bad instance
Terminate bad instance
Someone else’s problem now!
In SRE incident response, if it works, do it.
[Figure: 95th percentile latency (Atlas Exploder), annotated "Bad instance" and "Terminate!"]
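
The dashboard above relies on a person spotting the outlier in a per-instance plot. A minimal sketch of the same idea in Python follows. The instance IDs, the latency values, and the 3x-median threshold are all hypothetical; none of them come from the talk or from Atlas.

```python
from statistics import median


def find_bad_instances(latency_ms, factor=3.0):
    """Return instance IDs whose latency exceeds `factor` times the fleet median.

    `latency_ms` maps an instance ID to a latency figure, for example a
    95th percentile request time. The threshold rule is an illustrative
    assumption, not a rule taken from the slides.
    """
    fleet_median = median(latency_ms.values())
    return sorted(
        instance
        for instance, latency in latency_ms.items()
        if latency > factor * fleet_median
    )


# Hypothetical per-instance 95th percentile latencies, in milliseconds.
p95_latency_ms = {
    "i-0a1": 41.0,
    "i-0b2": 39.5,
    "i-0c3": 44.2,
    "i-0d4": 612.0,  # the outlier a per-instance plot would make obvious
    "i-0e5": 40.8,
}

for instance in find_bad_instances(p95_latency_ms):
    print("candidate for termination:", instance)
```

Comparing against the median rather than the mean keeps a single very slow instance from dragging the baseline up and hiding itself.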

slide 29:

Lots More Dashboards
We have countless more, mostly app specific and reliability focused
• Most reliability incidents involve time correlation with a central log system
Sometimes, dashboards & monitoring aren't enough. Time for SSH.
[Figure: NIWS HTTP errors; labels: Error Types, Regions, Apps, Time]

slide 30:

6. Linux Performance Analysis in 60,000 milliseconds

slide 31:

Linux Perf Analysis in 60s
1. uptime
2. dmesg -T | tail
3. vmstat 1
4. mpstat -P ALL 1
5. pidstat 1
6. iostat -xz 1
7. free -m
8. sar -n DEV 1
9. sar -n TCP,ETCP 1
10. top

slide 32:

Linux Perf Analysis in 60s
1. uptime                 load averages
2. dmesg -T | tail        kernel errors
3. vmstat 1               overall stats by time
4. mpstat -P ALL 1        CPU balance
5. pidstat 1              process usage
6. iostat -xz 1           disk I/O
7. free -m                memory usage
8. sar -n DEV 1           network I/O
9. sar -n TCP,ETCP 1      TCP stats
10. top                   check overview
http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html

slide 33:

60s: uptime, dmesg, vmstat
$ uptime
23:51:26 up 21:31, 1 user, load average: 30.02, 26.43, 19.02
$ dmesg | tail
[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[...]
[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request. Check SNMP counters.
$ vmstat 1
procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b swpd      free  buff  cache  si so bi    bo    in    cs us sy id wa st
34  0    0 200889792 73708 591828  [...]    [...] [...]    10 96  1  3  0  0
32  0    0 200889920 73708 591860  [...]      592 13284  4282 98  1  1  0  0
32  0    0 200890112 73708 591860  [...]        0  9501  2154 99  1  0  0  0
32  0    0 200889568 73712 591856  [...]       48 11900  2459 99  0  0  0  0
32  0    0 200890208 73712 591860  [...]        0 15898  4840 98  1  1  0  0
[columns marked [...] were lost in the text extraction]
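
The three load averages in the uptime output on this slide (1, 5 and 15 minute) can be compared to see whether load is rising or falling. A rough sketch follows, assuming the output format shown here. The CPU count of 32 is an assumption borrowed from the "(32 CPU)" headers on the mpstat and pidstat slides, and the rising/falling rule is a simplification.

```python
import re


def parse_load_averages(uptime_line):
    """Extract the 1, 5 and 15 minute load averages from a line of uptime output."""
    match = re.search(r"load average: ([\d.]+), ([\d.]+), ([\d.]+)", uptime_line)
    if match is None:
        raise ValueError("no load averages found in: " + uptime_line)
    return tuple(float(value) for value in match.groups())


def load_trend(one, five, fifteen):
    """Classify the trend by comparing the three averages (a simplification)."""
    if one > five > fifteen:
        return "rising"
    if one < five < fifteen:
        return "falling"
    return "mixed"


line = "23:51:26 up 21:31, 1 user, load average: 30.02, 26.43, 19.02"
one, five, fifteen = parse_load_averages(line)

cpu_count = 32  # assumption: taken from the "(32 CPU)" headers on later slides
print("trend:", load_trend(one, five, fifteen))
print("1-minute load per CPU:", round(one / cpu_count, 2))
```

On Linux the load averages include tasks in uninterruptible sleep (commonly disk I/O), so a high value indicates demand but not necessarily CPU demand; the later commands in the checklist narrow that down.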

slide 34:

60s: mpstat
$ mpstat -P ALL 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)
07:38:49 PM  CPU  %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
07:38:50 PM  all  [...]
07:38:50 PM  [...]
[...]
[per-CPU rows and all values were lost in the text extraction]

slide 35:

60s: pidstat
$ pidstat 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)
07:41:02 PM    UID   PID     %usr %system  %guest     %CPU   CPU  Command
07:41:03 PM  [...]  6521  1596.23   [...]    0.00  1598.11  [...] [...]
07:41:03 PM  [...]  6564  1571.70   [...]    0.00  1579.25  [...] [...]
[other 07:41:03 PM rows lost in extraction; commands listed: rcuos/0, mesos-slave, java (x3), pidstat (UID 60004)]
07:41:03 PM    UID   PID     %usr %system  %guest     %CPU   CPU  Command
07:41:04 PM  [...]  6521  1590.00   [...]    0.00  1591.00  [...] [...]
07:41:04 PM  [...]  6564  1573.00   [...]    0.00  1583.00  [...] [...]
[other 07:41:04 PM rows lost in extraction; commands listed: mesos-slave, java (x2), snmp-pass, pidstat (UID 60004)]

slide 36:

60s: iostat
$ iostat -xmdz 1
Linux 3.13.0-29 (db001-eb883efa)  08/18/2014  _x86_64_  (16 CPU)
Device:  rrqm/s  wrqm/s  r/s  w/s  rMB/s  wMB/s \ ...          [annotated: Workload]
... \ avgqu-sz  await  r_await  w_await  svctm  %util          [annotated: Resulting Performance]
[Device rows: xvda, xvdb, xvdc, md0. Most values were lost in the text extraction; surviving fragments: "0.00 15299.00", "0.00 15271.00", "0.00 31082.00", and a wMB/s column of 0.00, 0.00, 0.01, 0.01]

slide 37:

60s: free, sar -n DEV
$ free -m
                    total  used  free  shared  buffers  cached
Mem:                [...]
-/+ buffers/cache:  [...]
Swap:               [...]
$ sar -n DEV 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)
12:16:48 AM  IFACE    rxpck/s   txpck/s    rxkB/s  txkB/s  rxcmp/s  txcmp/s  rxmcst/s  %ifutil
12:16:49 AM  eth0    18763.00   5032.00  20686.42  [...]
12:16:49 AM  docker0  [...]
12:16:49 AM  [...]
12:16:49 AM  IFACE    rxpck/s   txpck/s    rxkB/s  txkB/s  rxcmp/s  txcmp/s  rxmcst/s  %ifutil
12:16:50 AM  eth0    19763.00   5101.00  21999.10  [...]
12:16:50 AM  docker0  [...]
12:16:50 AM  [...]
[values marked [...] were lost in the text extraction]

slide 38:

60s: sar -n TCP,ETCP
$ sar -n TCP,ETCP 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)
12:17:19 AM  active/s  passive/s  iseg/s  oseg/s
12:17:20 AM  [...]
12:17:19 AM  atmptf/s  estres/s  retrans/s  isegerr/s  orsts/s
12:17:20 AM  [...]
12:17:20 AM  active/s  passive/s  iseg/s  oseg/s
12:17:21 AM  [...]
12:17:20 AM  atmptf/s  estres/s  retrans/s  isegerr/s  orsts/s
12:17:21 AM  [...]
[values were lost in the text extraction]

slide 39:

60s: top
$ top
top - 00:15:40 up 21:56, 1 user, load average: 31.09, 29.87, 29.92
Tasks: 871 total, 1 running, 868 sleeping, 0 stopped, 2 zombie
%Cpu(s): 96.8 us, 0.4 sy, 0.0 ni, 2.7 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 25190241+total, 24921688 used, 22698073+free, 60448 buffers
KiB Swap: 0 total, 0 used, 0 free. 554208 cached Mem
  PID USER      PR NI VIRT RES    SHR S  %CPU %MEM     TIME+ COMMAND
20248 root      [...]           18748 S  3090  5.2  29812:58 java
 4213 root      [...]           44232 S  23.5  0.0 233:35.37 mesos-slave
66128 titancl+  [...]            1172 R   1.0  0.0   0:00.07 top
 5235 root      [...]           49996 S   0.7  0.2   2:02.74 java
 4299 root      [...]           16836 S   0.3  1.1  33:14.42 java
    1 root      [...]            1496 S   0.0  0.0   0:03.82 init
    2 root      [...]               0 S   0.0  0.0   0:00.02 kthreadd
    3 root      [...]               0 S   0.0  0.0   0:05.35 ksoftirqd/0
    5 root      [...]               0 S   0.0  0.0   0:00.00 kworker/0:0H
    6 root      [...]               0 S   0.0  0.0   0:06.94 kworker/u256:0
    8 root      [...]               0 S   0.0  0.0   2:38.05 rcu_sched
[PR, NI, VIRT and RES were only partially captured by the text extraction; surviving fragments: "0 0.227t 0.012t", "0 2722544 64640", "0 38.227g 547004", "0 20.015g 2.682g", "0 -20"]

slide 40:

Other Analysis in 60s
• We need such checklists for:
– Java
– Cassandra
– MySQL
– Nginx
– etc…
• Can follow a methodology:
– Process of elimination
– Workload characterization
– Differential diagnosis
– Some summaries: http://www.brendangregg.com/methodology.html
• Turn checklists into dashboards (many do exist)

slide 41:

7. Linux Disk Checklist

slide 42:

slide 43:

Linux Disk Checklist
iostat -xnz 1
vmstat 1
df -h
ext4slower 10
bioslower 10
ext4dist 1
biolatency 1
cat /sys/devices/…/ioerr_cnt
smartctl -l error /dev/sda1

slide 44:

Linux Disk Checklist
iostat -xnz 1                   any disk I/O? if not, stop looking
vmstat 1                        is this swapping? or, high sys time?
df -h                           are file systems nearly full?
ext4slower 10                   (zfs*, xfs*, etc.) slow file system I/O?
bioslower 10                    if so, check disks
ext4dist 1                      check distribution and rate
biolatency 1                    if interesting, check disks
cat /sys/devices/…/ioerr_cnt    (if available) errors
smartctl -l error /dev/sda1     (if available) errors
Another short checklist. Won't solve everything. FS focused.
ext4slower/dist, bioslower, are from bcc/BPF tools.

slide 45:

ext4slower
• ext4 operations slower than the threshold:
# ./ext4slower 1
Tracing ext4 operations slower than 1 ms
TIME     COMM   PID    T BYTES  OFF_KB  LAT(ms) FILENAME
06:49:17 bash   [...]  R 128    [...]      7.75 cksum
06:49:17 cksum  [...]  R 39552  [...]      1.34 [
06:49:17 cksum  [...]  R 96     [...]      5.36 2to3-2.7
06:49:17 cksum  [...]  R 96     [...]     14.94 2to3-3.4
06:49:17 cksum  [...]  R 10320  [...]      6.82 411toppm
06:49:17 cksum  [...]  R 65536  [...]      4.01 a2p
06:49:17 cksum  [...]  R 55400  [...]      8.77 ab
06:49:17 cksum  [...]  R 36792  [...]     16.34 aclocal-1.14
06:49:17 cksum  [...]  R 15008  [...]     19.31 acpi_listen
[…]
[PID and OFF_KB values were lost in the text extraction]
• Better indicator of application pain than disk I/O
• Measures & filters in-kernel for efficiency using BPF
– From https://github.com/iovisor/bcc

slide 46:

BPF is coming…
Free your mind

slide 47:

BPF
• That file system checklist should be a dashboard:
– FS & disk latency histograms, heatmaps, IOPS, outlier log
• Now possible with enhanced BPF (Berkeley Packet Filter)
– Built into Linux 4.x: dynamic tracing, filters, histograms
System dashboards of 2017+ should look very different
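
The latency histograms mentioned on this slide are typically bucketed by powers of two, which keeps the bucket count small across a wide range of values. The sketch below does that bucketing in user space with Python purely as an illustration. It is not how bcc or BPF implement it (there the counting happens in the kernel), and the sample latencies are made up.

```python
from collections import Counter


def log2_histogram(latencies_us):
    """Count latencies into power-of-two buckets, keyed by each bucket's lower bound.

    A value v >= 1 lands in the bucket [2**k, 2**(k+1)) that contains it;
    values below 1 land in bucket 0.
    """
    buckets = Counter()
    for value in latencies_us:
        if value < 1:
            lower = 0
        else:
            lower = 1 << (int(value).bit_length() - 1)
        buckets[lower] += 1
    return dict(sorted(buckets.items()))


# Hypothetical block I/O latencies in microseconds, including one slow outlier.
samples_us = [90, 110, 120, 130, 250, 4100]

for lower, count in log2_histogram(samples_us).items():
    upper = lower * 2 - 1 if lower else 0
    print(f"{lower:>6} -> {upper:<6} : {count:<3} {'*' * count}")
```

A histogram like this shows what an average hides: most I/O here is fast, and a single slow outlier sits in its own distant bucket.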

slide 48:

8. Linux Network Checklist

slide 49:

Linux Network Checklist
1. sar -n DEV,EDEV 1
2. sar -n TCP,ETCP 1
3. cat /etc/resolv.conf
4. mpstat -P ALL 1
5. tcpretrans
6. tcpconnect
7. tcpaccept
8. netstat -rnv
9. check firewall config
10. netstat -s

slide 50:

Linux Network Checklist
1. sar -n DEV,EDEV 1        at interface limits? or use nicstat
2. sar -n TCP,ETCP 1        active/passive load, retransmit rate
3. cat /etc/resolv.conf     it's always DNS
4. mpstat -P ALL 1          high kernel time? single hot CPU?
5. tcpretrans               what are the retransmits? state?
6. tcpconnect               connecting to anything unexpected?
7. tcpaccept                unexpected workload?
8. netstat -rnv             any inefficient routes?
9. check firewall config    anything blocking/throttling?
10. netstat -s              play 252 metric pickup
tcp*, are from bcc/BPF tools

slide 51:

tcpretrans
• Just trace kernel TCP retransmit functions for efficiency:
# ./tcpretrans
TIME     PID  IP LADDR:LPORT         T> RADDR:RPORT          STATE
01:55:05 0    4  10.153.223.157:22   R> 69.53.245.40:34619   ESTABLISHED
01:55:05 0    4  10.153.223.157:22   R> 69.53.245.40:34619   ESTABLISHED
01:55:17 0    4  10.153.223.157:22   R> 69.53.245.40:22957   ESTABLISHED
[…]
• From either bcc (BPF) or perf-tools (ftrace, older kernels)

slide 52:

9. Linux CPU Checklist

slide 53:

(too many lines – should be a utilization heat map)

slide 54:

http://www.brendangregg.com/HeatMaps/subsecondoffset.html

slide 55:

$ perf script
[…]
java 14327 [022] 252764.179741: cycles:  7f36570a4932 SpinPause (/usr/lib/jvm/java-8
java 14315 [014] 252764.183517: cycles:  7f36570a4932 SpinPause (/usr/lib/jvm/java-8
java 14310 [012] 252764.185317: cycles:  7f36570a4932 SpinPause (/usr/lib/jvm/java-8
java 14332 [015] 252764.188720: cycles:  7f3658078350 pthread_cond_wait@@GLIBC_2.3.2
java 14341 [019] 252764.191307: cycles:  7f3656d150c8 ClassLoaderDataGraph::do_unloa
java 14341 [019] 252764.198825: cycles:  7f3656d140b8 ClassLoaderData::free_dealloca
java 14341 [019] 252764.207057: cycles:  7f3657192400 nmethod::do_unloading(BoolObje
java 14341 [019] 252764.215962: cycles:  7f3656ba807e Assembler::locate_operand(unsi
java 14341 [019] 252764.225141: cycles:  7f36571922e8 nmethod::do_unloading(BoolObje
java 14341 [019] 252764.234578: cycles:  7f3656ec4960 CodeHeap::block_start(void*) c
[…]

slide 56:

Linux CPU Checklist
uptime
vmstat 1
mpstat -P ALL 1
pidstat 1
CPU flame graph
CPU subsecond offset heat map
perf stat -a -- sleep 10

slide 57:

Linux CPU Checklist
uptime                           load averages
vmstat 1                         system-wide utilization, run q length
mpstat -P ALL 1                  CPU balance
pidstat 1                        per-process CPU
CPU flame graph                  CPU profiling
CPU subsecond offset heat map    look for gaps
perf stat -a -- sleep 10         IPC, LLC hit ratio
htop can do 1-4
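
perf stat reports raw counter totals; IPC and the LLC hit ratio are derived from them. A small sketch of the arithmetic follows, with hypothetical counter values. The numbers and function names are illustrative, and the event names a given CPU exposes for LLC references and misses may differ.

```python
def instructions_per_cycle(instructions, cycles):
    """IPC: retired instructions divided by CPU cycles over the same interval."""
    return instructions / cycles


def llc_hit_ratio(references, misses):
    """Fraction of last-level cache references that did not miss."""
    return 1.0 - misses / references


# Hypothetical totals, as might be read off a 10 second system-wide sample.
counters = {
    "cycles": 20_000_000_000,
    "instructions": 14_000_000_000,
    "llc_references": 500_000_000,
    "llc_misses": 50_000_000,
}

ipc = instructions_per_cycle(counters["instructions"], counters["cycles"])
hit_ratio = llc_hit_ratio(counters["llc_references"], counters["llc_misses"])

print(f"IPC: {ipc:.2f}")
print(f"LLC hit ratio: {hit_ratio:.1%}")
```

As a rough reading, a low IPC suggests the CPUs are stalled (often on memory) rather than executing instructions, which the utilization figures alone cannot show.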

slide 58:

htop

slide 59:

CPU Flame Graph

slide 60:

perf_events CPU Flame Graphs
• We have this automated in Netflix Vector:
git clone --depth 1 https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 99 -a -g -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
• Flame graph interpretation:
– x-axis: alphabetical stack sort, to maximize merging
– y-axis: stack depth
– color: random, or hue can be a dimension (eg, diff)
– Top edge is on-CPU, beneath it is ancestry
• Can also do Java & Node.js. Differentials.
• We're working on a d3 version for Vector
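
The pipeline on this slide first collapses each sampled stack into a single line and counts duplicates; the flame graph is drawn from those counts. A simplified sketch of that folding step in Python follows. It illustrates the idea only and is not the stackcollapse-perf.pl implementation. SpinPause and nmethod::do_unloading appear in the perf script slide, while the parent frames and sample counts are hypothetical.

```python
from collections import Counter


def fold_stacks(stacks):
    """Collapse sampled stacks into 'root;...;leaf' keys with occurrence counts.

    Each stack is a list of frame names ordered leaf first, which is the
    order perf script prints them.
    """
    folded = Counter()
    for frames in stacks:
        folded[";".join(reversed(frames))] += 1
    return folded


def format_folded(folded):
    """Render one 'stack count' line per unique stack, sorted alphabetically."""
    return [f"{stack} {count}" for stack, count in sorted(folded.items())]


# Hypothetical samples (leaf first); parent frame names are made up.
samples = [
    ["SpinPause", "GCTaskThread::run", "java_start"],
    ["SpinPause", "GCTaskThread::run", "java_start"],
    ["nmethod::do_unloading", "VMThread::run", "java_start"],
]

folded = fold_stacks(samples)
for line in format_folded(folded):
    print(line)
```

Sorting the folded stacks alphabetically is what lets identical prefixes merge into wide frames, which is why the x-axis of a flame graph carries no time meaning.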

slide 61:

10. Tools Method
An Anti-Methodology

slide 62:

Tools Method
1. RUN EVERYTHING AND HOPE FOR THE BEST
For SRE response: a mental checklist to see what might
have been missed (no time to run them all)

slide 63:

Linux Perf Observability Tools

slide 64:

Linux Static Performance Tools

slide 65:

Linux perf-tools (ftrace, perf)

slide 66:

Linux bcc tools (BPF)
Needs Linux 4.x
CONFIG_BPF_SYSCALL=y

slide 67:

11. USE Method
A Methodology

slide 68:

The USE Method
• For every resource, check:
Utilization
Saturation
Errors
[Figure labels: X, Resource, Utilization (%)]
• Definitions:
– Utilization: busy time
– Saturation: queue length or queued time
– Errors: easy to interpret (objective)
Used to generate checklists. Starts with the questions,
then finds the tools.
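
Because the USE method starts from questions, a checklist can be generated mechanically by crossing a resource list with the three metric types, and the tool for each question filled in afterwards. A minimal sketch follows; the resource list is a hypothetical example, not the full list from the USE method page.

```python
from itertools import product

# Hypothetical resource list; a real one would also cover buses and interconnects.
RESOURCES = ["CPU", "memory", "disk", "network"]
METRICS = ["utilization", "saturation", "errors"]


def use_checklist(resources, metrics=METRICS):
    """Return one question per (resource, metric) pair, grouped by resource."""
    return [
        f"{resource}: check {metric}"
        for resource, metric in product(resources, metrics)
    ]


checklist = use_checklist(RESOURCES)
for number, item in enumerate(checklist, start=1):
    print(f"{number}. {item}")
```

Some of the generated questions will have no easy tool to answer them, which is useful to know: those are the known unknowns.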

slide 69:

USE Method for Hardware
• For every resource, check:
Utilization
Saturation
Errors
• Including busses & interconnects

slide 70:

(http://www.brendangregg.com/USEmethod/use-linux.html)

slide 71:

USE Method for Distributed Systems
• Draw a service diagram, and for every service:
Utilization: resource usage (CPU, network)
Saturation: request queueing, timeouts
Errors
• Turn into a dashboard

slide 72:

Netflix Vector
• Real time instance analysis tool
– https://github.com/netflix/vector
– http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
• USE method-inspired metrics
– More in development, incl. flame graphs

slide 73:

Netflix Vector

slide 74:

Netflix Vector
[Screenshot annotations for CPU, Network, Memory and Disk: four "utilization" labels, four "saturation" labels and two "load" labels; which label belongs to which resource is not recoverable from the text extraction]

slide 75:

12. Bonus: External Factor Checklist

slide 76:

External Factor Checklist
1. Sports ball?
2. Power outage?
3. Snow storm?
4. Internet/ISP down?
5. Vendor firmware update?
6. Public holiday/celebration?
7. Chaos Kong?
Social media searches (Twitter) often useful
– Can also be NSFW

slide 77:

Take Aways
• Checklists are great
– Speed, Completeness, Starting/Ending Point, Training
– Can be ad hoc, or from a methodology (USE method)
• Service dashboards
– Serve as checklists
– Metrics: Load, Errors, Latency, Saturation, Instances
• System dashboards with Linux BPF
– Latency histograms & heatmaps, etc. Free your mind.
Please create and share more checklists

slide 78:

References
Netflix Tech Blog:
http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html
http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
Linux Performance & BPF tools:
http://www.brendangregg.com/linuxperf.html
https://github.com/iovisor/bcc#tools
USE Method Linux:
http://www.brendangregg.com/USEmethod/use-linux.html
Flame Graphs:
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
Heat maps:
http://cacm.acm.org/magazines/2010/7/95062-visualizing-system-latency/fulltext
http://www.brendangregg.com/heatmaps.html
Books:
Beyer, B., et al. Site Reliability Engineering. O'Reilly, Apr 2016
Gawande, A. The Checklist Manifesto. Metropolitan Books, 2008
Gregg, B. Systems Performance. Prentice Hall, 2013 (more checklists & methods!)
Thanks: Netflix Perf & Core teams for predash, pretriage, Vector, etc

slide 79:

Thanks
http://slideshare.net/brendangregg
http://www.brendangregg.com
bgregg@netflix.com
@brendangregg
Netflix is hiring SREs!