PerconaLive 2016: Linux Systems Performance

Talk for PerconaLive 2016 by Brendan Gregg.

Video: https://www.youtube.com/watch?v=CbmEDXq7es0

Description: "Systems performance provides a different perspective for analysis and tuning, and can help you find performance wins for your databases, applications, and the kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes six important areas of Linux systems performance in 50 minutes: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events), static tracing (tracepoints), and dynamic tracing (kprobes, uprobes), and much advice about what is and isn't important to learn. This talk is aimed at everyone: DBAs, developers, operations, etc, and in any environment running Linux, bare-metal or the cloud."


PDF: Percona2016_LinuxSystemsPerf.pdf

Keywords (from pdftotext):

slide 1:
    Linux Systems Performance
    Apr 2016
    Brendan Gregg
    Senior Performance Architect
    
slide 2:
    Systems Performance in 50 mins
    
slide 3:
slide 4:
    Agenda	
      
    A brief discussion of 6 facets of Linux performance:
    Observability
    Methodologies
    Benchmarking
    Profiling
    Tracing
    Tuning
    Audience: Everyone (DBAs, developers, operations, …)
    
slide 5:
    1. Observability
    
slide 6:
    How do you measure these?
    
slide 7:
    Linux Observability Tools
    
slide 8:
    Observability Tools
    • Tools showcase common metrics
    – Learning Linux tools is useful even if you never use them:
    the same metrics are in GUIs
    • We usually use these metrics via:
    – Netflix Atlas: cloud-wide monitoring
    – Netflix Vector: instance analysis
    • Linux has many tools
    – Plus many extra kernel sources of data that lack tools, are
    harder to use, and are practically undocumented
    • Some tool examples…
    
slide 9:
    uptime
    • One way to print load averages:
    $ uptime
     07:42:06 up 8:16, 1 user, load average: 2.27, 2.84, 2.91
    • A measure of resource demand: CPUs + disks
    – Other OSes only show CPUs: easier to interpret
    • Exponentially-damped moving averages
    • Time constants of 1, 5, and 15 minutes
    – Historic trend without the line graph
    • Load > # of CPUs may mean CPU saturation
    – Don’t spend more than 5 seconds studying these
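The exponentially-damped moving average behind these numbers can be sketched in Python (a simplified model for illustration, not the kernel's fixed-point implementation; the 5-second sampling interval below is an assumption):

```python
import math

def damped_average(samples, interval_s, period_s):
    """Exponentially-damped moving average, as used for load averages.

    samples: instantaneous demand (count of runnable + uninterruptible
    tasks), sampled every interval_s seconds; period_s is 60, 300, or
    900 for the 1-, 5-, and 15-minute averages.
    """
    decay = math.exp(-interval_s / period_s)
    avg = 0.0
    for s in samples:
        # New sample is blended in; old history decays exponentially.
        avg = avg * decay + s * (1.0 - decay)
    return avg

# A constant demand of 2.0 tasks converges toward 2.0 over time:
one_minute = damped_average([2.0] * 1000, interval_s=5, period_s=60)
```

The shorter the period, the faster the average tracks current demand, which is why comparing the three numbers reveals the trend.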
    
slide 10:
    top (or htop)
    • System and per-process interval summary:
    $ top - 18:50:26 up 7:43, 1 user, load average: 4.11, 4.91, 5.22
    Tasks: 209 total, 1 running, 206 sleeping, 0 stopped, 2 zombie
    Cpu(s): 47.1%us, 4.0%sy, 0.0%ni, 48.4%id, 0.0%wa, 0.0%hi, 0.3%si, 0.2%st
    Mem:  70197156k total, 44831072k used, 25366084k free,    36360k buffers
    Swap:        0k total,        0k used,        0k free, 11873356k cached
      PID USER     NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
     5738 apiprod   0 62.6g  29g 352m S  417 44.2  2144:15 java
     1386 apiprod   0 17452 1388  964 R    0  0.0  0:00.02 top
        1 root      0 24340 2272 1340 S    0  0.0  0:01.51 init
        2 root      0     0    0    0 S    0  0.0  0:00.00 kthreadd
    […]
    • %CPU is summed across all CPUs
    • Can miss short-lived processes (atop won’t)
    • Can consume noticeable CPU to read /proc
    
slide 11:
    htop
    
slide 12:
    vmstat
    • Virtual memory statistics and more:
    $ vmstat -Sm 1
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
     r  b  swpd  free  buff  cache   si   so    bi    bo   in   cs us sy id wa
     8  0  […]
     7  0  […]
     8  0  […]
     8  0  […]
    […]
    • USAGE: vmstat [interval [count]]
    • First output line has some summary since boot values
    – Should be all; partial is confusing
    • High level CPU summary
    – “r” is runnable tasks
    
slide 13:
    iostat
    • Block I/O (disk) stats. 1st output is since boot.
    $ iostat -xmdz 1
    Linux 3.13.0-29 (db001-eb883efa)  08/18/2014  _x86_64_  (16 CPU)
    Device:  rrqm/s  wrqm/s  r/s  w/s  rMB/s  wMB/s ... avgqu-sz await r_await w_await svctm %util
    xvda  […]
    xvdb  […]
    xvdc  […]
    md0   […]
    • Very useful set of stats
    (the left columns describe the workload; avgqu-sz, await, and %util show
    the resulting performance)
    
slide 14:
    free
    • Main memory usage:
    $ free -m
                 total   used   free  shared  buffers  cached
    Mem:           […]
    -/+ buffers/cache:   […]
    Swap:          […]
    • buffers: block device I/O cache
    • cached: virtual page cache
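The "-/+ buffers/cache" line is simple arithmetic over the Mem: row, since buffers and cached pages can be reclaimed when applications need memory; a sketch (the sample values are illustrative, not taken from the output above):

```python
def adjust_for_caches(used_mb, free_mb, buffers_mb, cached_mb):
    """Recompute used/free the way free(1)'s '-/+ buffers/cache' line
    does: buffers and cached pages are reclaimable, so treat them as
    free rather than used."""
    adjusted_used = used_mb - buffers_mb - cached_mb
    adjusted_free = free_mb + buffers_mb + cached_mb
    return adjusted_used, adjusted_free

# Illustrative numbers only:
used, free = adjust_for_caches(used_mb=44000, free_mb=25000,
                               buffers_mb=35, cached_mb=11500)
```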
    
slide 15:
    strace
    • System call tracer:
    $ strace -tttT -p 313
    1408393285.779746 getgroups(0, NULL)    = 1 <…>
    1408393285.779873 getgroups(1, [0])     = 1 <…>
    1408393285.780797 close(3)              = 0 <…>
    1408393285.781338 write(1, "LinuxCon 2014!\n", 15) = 15 <…>
    • Eg, -ttt: time (us) since epoch; -T: syscall time (s)
    • Translates syscall args
    – Very helpful for solving system usage issues
    • Currently has massive overhead (ptrace based)
    – Can slow the target by > 100x. Use extreme caution.
    
slide 16:
    tcpdump	
      
    • Sniff network packets for post analysis:
    $ tcpdump -i eth0 -w /tmp/out.tcpdump
    tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
    ^C7985 packets captured
    8996 packets received by filter
    1010 packets dropped by kernel
    # tcpdump -nr /tmp/out.tcpdump | head
    reading from file /tmp/out.tcpdump, link-type EN10MB (Ethernet)
    20:41:05.038437 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 18...
    20:41:05.038533 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 48...
    20:41:05.038584 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 96...
    […]
    • Study packet sequences with timestamps (us)
    • CPU overhead optimized (socket ring buffers), but can
    still be significant. Use caution.
    
slide 17:
    netstat	
      
    • A multi-tool:
    – -i: interface stats
    – -r: route table
    – default: list conns
    – -p: shows process details
    – -c: per-second interval
    • Various network protocol statistics using -s:
    $ netstat -s
    […]
    Tcp:
    736455 active connections openings
    176887 passive connection openings
    33 failed connection attempts
    1466 connection resets received
    3311 connections established
    91975192 segments received
    180415763 segments send out
    223685 segments retransmited
    2 bad segments received.
    39481 resets sent
    […]
    TcpExt:
    12377 invalid SYN cookies received
    2982 delayed acks sent
    […]
    
slide 18:
    slabtop	
      
    • Kernel slab allocator memory usage:
    $ slabtop
    Active / Total Objects (% used)    : 4692768 / 4751161 (98.8%)
    Active / Total Slabs (% used)      : 129083 / 129083 (100.0%)
    Active / Total Caches (% used)     : 71 / 109 (65.1%)
    Active / Total Size (% used)       : 729966.22K / 738277.47K (98.9%)
    Minimum / Average / Maximum Object : 0.01K / 0.16K / 8.00K
       OBJS  ACTIVE  USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
    3565575 3565575 100%    0.10K 91425     […]     365700K buffer_head
     314916  314066  99%    0.19K 14996     […]      59984K dentry
     184192  183751  99%    0.06K   […]     […]      11512K kmalloc-64
     138618  138618 100%    0.94K   […]     […]     130464K xfs_inode
     138602  138602 100%    0.21K   […]     […]      29968K xfs_ili
     102116   99012  96%    0.55K   […]     […]      58352K radix_tree_node
      97482   49093  50%    0.09K   […]     […]       9284K kmalloc-96
      22695   20777  91%    0.05K   […]     […]       1068K shared_policy_node
      21312   21312 100%    0.86K   […]     […]      18432K ext4_inode_cache
      16288   14601  89%    0.25K   […]     […]       4072K kmalloc-256
    […]
    
slide 19:
    pcstat	
      
    • Show page cache residency by file:
    # ./pcstat data0*
    |----------+----------------+------------+-----------+---------|
    | Name     | Size           | Pages      | Cached    | Percent |
    |----------+----------------+------------+-----------+---------|
    | data00   | 104857600      | 25600      | 25600     | 100.000 |
    | data01   | 104857600      | 25600      | 25600     | 100.000 |
    | data02   | 104857600      | 25600      | 4080      | 015.938 |
    | data03   | 104857600      | 25600      | 25600     | 100.000 |
    | data04   | 104857600      | 25600      | 16010     | 062.539 |
    | data05   | 104857600      | 25600      | 0         | 000.000 |
    |----------+----------------+------------+-----------+---------|
    • Uses the mincore(2) syscall. Useful for database
    performance analysis.
    
slide 20:
    perf_events	
      
    • Provides the "perf" command
    • In Linux source code: tools/perf
    – Usually pkg added by linux-tools-common, etc.
    • Multi-tool with many capabilities
    – CPU profiling
    – PMC profiling
    – Static & dynamic tracing
    • Covered later in Profiling & Tracing
    
slide 21:
    Where do you start? ...and stop?
    
slide 22:
    2. Methodologies
    
slide 23:
    Anti-Methodologies
    • The lack of a deliberate methodology…
    • Street Light Anti-Method:
    – 1. Pick observability tools that are
    • Familiar
    • Found on the Internet
    • Found at random
    – 2. Run tools
    – 3. Look for obvious issues
    • Drunk Man Anti-Method:
    – Tune things at random until the problem goes away
    
slide 24:
    Methodologies	
      
    Linux Performance Analysis in 60 seconds
    The USE method
    CPU Profile Method
    Resource Analysis
    Workload Analysis
    Others include:
    – Workload characterization
    – Drill-down analysis
    – Off-CPU analysis
    – Static performance tuning
    – 5 whys
    – …
    
slide 25:
    Linux Perf Analysis in 60s
    1. uptime
    2. dmesg | tail
    3. vmstat 1
    4. mpstat -P ALL 1
    5. pidstat 1
    6. iostat -xz 1
    7. free -m
    8. sar -n DEV 1
    9. sar -n TCP,ETCP 1
    10. top
    
slide 26:
    Linux Perf Analysis in 60s
    1. uptime                 → load averages
    2. dmesg | tail           → kernel errors
    3. vmstat 1               → overall stats by time
    4. mpstat -P ALL 1        → CPU balance
    5. pidstat 1              → process usage
    6. iostat -xz 1           → disk I/O
    7. free -m                → memory usage
    8. sar -n DEV 1           → network I/O
    9. sar -n TCP,ETCP 1      → TCP stats
    10. top                   → check overview
    http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
    
slide 27:
    The USE Method
    • For every resource, check:
    Utilization
    Saturation
    Errors
    • Definitions:
    – Utilization: busy time
    – Saturation: queue length or queued time
    – Errors: easy to interpret (objective)
    • Helps if you have a functional (block) diagram of your
    system / software / environment, showing all resources
    Start with the questions, then find the tools
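Utilization here is just busy time over elapsed time; a generic sketch of the calculation from any pair of cumulative busy/total counters (a model of the method, not any specific tool):

```python
def use_utilization(busy_t0, total_t0, busy_t1, total_t1):
    """Percent utilization over an interval, from cumulative busy-time
    and total-time counters sampled at t0 and t1 (e.g. jiffies from
    /proc/stat for a CPU, or io ticks for a disk)."""
    delta_total = total_t1 - total_t0
    if delta_total <= 0:
        return 0.0  # no elapsed time between samples
    return 100.0 * (busy_t1 - busy_t0) / delta_total

# 50 busy ticks out of 100 elapsed ticks -> 50% utilized:
util = use_utilization(busy_t0=100, total_t0=1000,
                       busy_t1=150, total_t1=1100)
```

Saturation needs a different counter (queue length or wait time), which is why the method asks for all three per resource.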
    
slide 28:
    USE Method for Hardware
    • For every resource, check:
    Utilization
    Saturation
    Errors
    • Including busses & interconnects
    
slide 29:
    (http://www.brendangregg.com/USEmethod/use-linux.html)
    
slide 30:
    CPU Profile Method
    1. Take a CPU profile
    2. Understand all software in profile > 1%
    • Discovers a wide range of issues by their CPU usage
    – Directly: CPU consumers
    – Indirectly: initialization of I/O, locks, times, ...
    • Narrows target of study to only its running components
    [Flame Graph]
    
slide 31:
    Resource Analysis
    • Typical approach for system performance analysis:
    begin with system tools & metrics
    • Pros:
    – Generic
    – Aids resource perf tuning
    • Cons:
    – Uneven coverage
    – False positives
    [diagram: analysis begins at the System, System Calls, Kernel, and
    Hardware layers, below the Workload, Application, and Libraries]
    
slide 32:
    Workload Analysis
    • Begin with application metrics & context
    • Pros:
    – Accurate, proportional metrics
    – App context
    • Cons:
    – App specific
    – Difficult to dig from app to resource
    [diagram: analysis begins at the Workload and Application, above the
    Libraries, System Calls, Kernel, and Hardware]
    
slide 33:
    3. Benchmarking
    
slide 34:
    ~100% of benchmarks are
    wrong
    
slide 35:
    Benchmarking	
      
    • Apart from observational analysis, benchmarking is a
    useful form of experimental analysis
    – Try observational first; benchmarks can perturb
    • However, benchmarking is error prone:
    – Testing the wrong target: eg, FS cache instead of disk
    – Choosing the wrong target: eg, disk instead of FS cache
    … doesn’t resemble real world usage
    – Invalid results: eg, bugs
    – Misleading results: you benchmark A, but actually measure B,
    and conclude you measured C
    • The energy needed to refute benchmarks is multiple
    orders of magnitude bigger than to run them
    
slide 36:
    Benchmark Examples
    • Micro benchmarks:
    – File system maximum cached read operations/sec
    – Network maximum throughput
    • Macro (application) benchmarks:
    – Simulated application maximum request rate
    • Bad benchmarks:
    – getpid() in a tight loop
    – Context switch timing
    
slide 37:
    The Benchmark Paradox
    • Benchmarking is used for product evaluations
    – Eg, evaluating cloud vendors
    • The Benchmark Paradox:
    – If your product’s chances of winning a benchmark are 50/50,
    you’ll usually lose
    – http://www.brendangregg.com/blog/2014-05-03/the-benchmark-paradox.html
    • Solving this seeming paradox (and benchmarking in
    general)…
    
slide 38:
    For any given benchmark result,
    ask:
    why isn’t it 10x?
    
slide 39:
    Active Benchmarking
    • Root cause performance analysis while the benchmark is
    still running
    – Use the earlier observability tools
    – Identify the limiter (or suspected limiter) and include it with the
    benchmark results
    – Answer: why not 10x?
    • This takes time, but uncovers most mistakes
    
slide 40:
    4. Profiling
    
slide 41:
    Profiling	
      
    • Can you do this?
    “As an experiment to investigate the performance of the
    resulting TCP/IP implementation ... the 11/750 is CPU
    saturated, but the 11/780 has about 30% idle time. The time
    spent in the system processing the data is spread out among
    handling for the Ethernet (20%), IP packet processing
    (10%), TCP processing (30%), checksumming (25%), and
    user system call handling (15%), with no single part of the
    handling dominating the time in the system.”
    
slide 42:
    Profiling	
      
    • Can you do this?
    “As an experiment to investigate the performance of the
    resulting TCP/IP implementation ... the 11/750 is CPU
    saturated, but the 11/780 has about 30% idle time. The time
    spent in the system processing the data is spread out among
    handling for the Ethernet (20%), IP packet processing
    (10%), TCP processing (30%), checksumming (25%), and
    user system call handling (15%), with no single part of the
    handling dominating the time in the system.”
    – Bill Joy, 1981, TCP-IP Digest, Vol 1 #6
    https://www.rfc-editor.org/rfc/museum/tcp-ip-digest/tcp-ip-digest.v1n6.1
    
slide 43:
    perf_events	
      
    • Introduced earlier: multi-tool, profiler. Provides "perf".
    usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]
    The most commonly used perf commands are:
      annotate        Read perf.data (created by perf record) and display annotated code
      archive         Create archive with object files with build-ids found in perf.data file
      bench           General framework for benchmark suites
      buildid-cache   Manage build-id cache.
      buildid-list    List the buildids in a perf.data file
      data            Data file related processing
      diff            Read perf.data files and display the differential profile
      evlist          List the event names in a perf.data file
      inject          Filter to augment the events stream with additional information
      kmem            Tool to trace/measure kernel memory(slab) properties
      kvm             Tool to trace/measure kvm guest os
      list            List all symbolic event types
      lock            Analyze lock events
      mem             Profile memory accesses
      record          Run a command and record its profile into perf.data
      report          Read perf.data (created by perf record) and display the profile
      sched           Tool to trace/measure scheduler properties (latencies)
      script          Read perf.data (created by perf record) and display trace output
      stat            Run a command and gather performance counter statistics
      test            Runs sanity tests.
      timechart       Tool to visualize total system behavior during a workload
      top             System profiling tool.
      trace           strace inspired tool
      probe           Define new dynamic tracepoints
    See 'perf help COMMAND' for more information on a specific command.
    
slide 44:
    perf_events: CPU profiling
    • Sampling full stack traces at 99 Hertz, for 30 secs:
    # perf record -F 99 -ag -- sleep 30
    [ perf record: Woken up 9 times to write data ]
    [ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]
    # perf report -n --stdio
    1.40%  java  [kernel.kallsyms]  [k] _raw_spin_lock
           --- _raw_spin_lock
               |--63.21%-- try_to_wake_up
               |--63.91%-- default_wake_function
               |--56.11%-- __wake_up_common
               |           __wake_up_locked
               |           ep_poll_callback
               |           __wake_up_common
               |           __wake_up_sync_key
               |--59.19%-- sock_def_readable
    […78,000 lines truncated…]
    
slide 45:
    perf_events: Full "report" Output
      
    
slide 46:
    … as a Flame Graph
    
slide 47:
    perf_events: Flame Graphs
    git clone --depth 1 https://github.com/brendangregg/FlameGraph
    cd FlameGraph
    perf record -F 99 -a -g -- sleep 30
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
    • Flame Graphs:
    – x-axis: alphabetical stack sort, to maximize merging
    – y-axis: stack depth
    – color: random, or hue can be a dimension (eg, diff)
    • Interpretation:
    – top edge is on-CPU, beneath it is ancestry
    • Easy to get working
    – http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
    • Also in Mar/Apr issue of ACMQ
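The stackcollapse step folds identical stacks into single "frame1;frame2;… count" lines, which flamegraph.pl then renders; a minimal sketch of that folding (the input format here is simplified from what stackcollapse-perf.pl actually parses out of perf script output):

```python
from collections import Counter

def collapse(stacks):
    """Fold a list of stack traces (each a list of frames, root first)
    into sorted 'root;child;leaf count' lines, one per unique stack."""
    counts = Counter(";".join(stack) for stack in stacks)
    # Sorting alphabetically is what maximizes frame merging on the
    # flame graph's x-axis.
    return ["%s %d" % (stack, n) for stack, n in sorted(counts.items())]

folded = collapse([
    ["main", "parse", "read"],
    ["main", "parse", "read"],
    ["main", "emit"],
])
```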
    
slide 48:
    • The first ever flame graph was generated for MySQL
    • This background is the output of a DTrace CPU profile,
    which only printed unique stacks with counts
    [annotation: the size of one stack]
    
slide 49:
    First Flame Graph: MySQL
    … same data
    
slide 50:
    perf_events: Workflow
    Typical workflow:
    list events:     perf list
    count events:    perf stat
    capture stacks:  perf record  →  perf.data
    text UI:         perf report
    dump profile:    perf script
    flame graph visualization:
                     perf script | stackcollapse-perf.pl | flamegraph.pl
    
slide 51:
    perf_events: Counters
    • Performance Monitoring Counters (PMCs):
    $ perf list | grep -i hardware
      cpu-cycles OR cycles                              [Hardware event]
      stalled-cycles-frontend OR idle-cycles-frontend   [Hardware event]
      stalled-cycles-backend OR idle-cycles-backend     [Hardware event]
      instructions                                      [Hardware event]
      […]
      branch-misses                                     [Hardware event]
      bus-cycles                                        [Hardware event]
      L1-dcache-loads                                   [Hardware cache event]
      L1-dcache-load-misses                             [Hardware cache event]
      […]
      rNNN (see 'perf list --help' on how to encode it) [Raw hardware event …]
      mem:<addr>[:access]                               [Hardware breakpoint]
    • Identify CPU cycle breakdowns, esp. stall types
    • PMCs not enabled by default in clouds (yet)
    • Can be time-consuming to use (CPU manuals)
    
slide 52:
    5. Tracing
    
slide 53:
    Linux Event Sources
    
slide 54:
    Tracing Stack
    add-on tools:             trace-cmd, perf-tools, bcc, …
    front-end tools:          perf
    tracing frameworks:       ftrace, perf_events, BPF       (in Linux)
    back-end instrumentation: tracepoints, kprobes, uprobes  (in Linux)
    
slide 55:
    ftrace
    • Added by Steven Rostedt and others since 2.6.27
    – CONFIG_FTRACE, CONFIG_FUNCTION_PROFILER, …
    – See Linux source: Documentation/trace/ftrace.txt
    – A collection of powerful features, good for hacking
    • Use directly via /sys/kernel/debug/tracing (not easy):
    linux-4.0.0+# ls /sys/kernel/debug/tracing/
    available_events            max_graph_depth   stack_max_size
    available_filter_functions  options           stack_trace
    available_tracers           per_cpu           stack_trace_filter
    buffer_size_kb              printk_formats    trace
    […]
    • Or via front-ends:
    – Steven's trace-cmd
    – my perf-tools: iosnoop, iolatency, funccount, kprobe, …
    
slide 56:
    ftrace: perf-tools iosnoop
    • Block I/O (disk) events with latency:
    # ./iosnoop -ts
    Tracing block I/O. Ctrl-C to end.
    STARTs          ENDs            COMM       PID  TYPE DEV    BLOCK  BYTES  LATms
    5982800.302061  5982800.302679  supervise  […]  […]  202,1  […]
    5982800.302423  5982800.302842  supervise  […]  […]  202,1  […]
    5982800.304962  5982800.305446  supervise  […]  […]  202,1  […]
    5982800.305250  5982800.305676  supervise  […]  […]  202,1  […]
    […]
    # ./iosnoop -h
    USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name] [duration]
       -d device   # device string (eg, "202,1)
       -i iotype   # match type (eg, '*R*' for all reads)
       -n name     # process name to match on I/O issue
       -p PID      # PID to match on I/O issue
       -Q          # include queueing time in LATms
       -s          # include start time of I/O (s)
       -t          # include completion time of I/O (s)
       -h          # this usage message
       duration    # duration seconds, and use buffers
    […]
    
slide 57:
    ftrace: perf-tools iolatency
    • Block I/O (disk) latency distributions:
    # ./iolatency
    Tracing block I/O. Output every 1 seconds. Ctrl-C to end.
    >=(ms) .. <(ms)  : I/O  |Distribution                          |
          0 -> 1     : 2104 |######################################|
          1 -> 2     : 280  |######                                |
          2 -> 4     : 2    |#                                     |
          4 -> 8     : 0    |                                      |
          8 -> 16    : 202  |####                                  |
    >=(ms) .. <(ms)  : I/O  |Distribution                          |
          0 -> 1     : 1144 |######################################|
          1 -> 2     : 267  |#########                             |
          2 -> 4     : 10   |#                                     |
          4 -> 8     : 5    |#                                     |
          8 -> 16    : 248  |#########                             |
         16 -> 32    : 601  |####################                  |
         32 -> 64    : 117  |####                                  |
    […]
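Power-of-two latency buckets like these can be produced with a simple log2 histogram; a sketch that mirrors the bucket edges above (the tool itself counts in-kernel via ftrace, so this only models the binning):

```python
def log2_histogram(latencies_ms, max_bucket=64):
    """Count latencies into power-of-two millisecond buckets:
    [0,1), [1,2), [2,4), [4,8), ... as in iolatency's output.
    Returns (low_edge, high_edge, count) tuples."""
    edges = [0, 1]
    while edges[-1] < max_bucket:
        edges.append(edges[-1] * 2)
    counts = [0] * (len(edges) - 1)
    for lat in latencies_ms:
        for i in range(len(edges) - 1):
            if edges[i] <= lat < edges[i + 1]:
                counts[i] += 1
                break
    return list(zip(edges[:-1], edges[1:], counts))

hist = log2_histogram([0.5, 1.5, 3, 3, 9])
```

Logarithmic buckets keep the output compact while still exposing multi-modal distributions (e.g. cache hits vs disk reads) that an average would hide.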
    
slide 58:
    ftrace: perf-tools funccount
    • Count a kernel function call rate:
    # ./funccount -i 1 'bio_*'
    Tracing "bio_*"... Ctrl-C to end.
    FUNC                      COUNT
    bio_attempt_back_merge      […]
    bio_get_nr_vecs             […]
    bio_alloc                   […]
    bio_alloc_bioset            […]
    bio_endio                   […]
    bio_free                    […]
    bio_fs_destructor           […]
    bio_init                    […]
    bio_integrity_enabled       […]
    bio_put                     […]
    bio_add_page                […]
    [...]
    (counts are in-kernel, for low overhead)
    – -i: set an output interval (seconds), otherwise until Ctrl-C
    
slide 59:
    ftrace: perf-tools uprobe
    • Dynamically trace user-level functions; eg, MySQL:
    # ./uprobe 'p:dispatch_command /opt/mysql/bin/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcj
    +0(%dx):string'
    Tracing uprobe dispatch_command (p:dispatch_command /opt/mysql/bin/mysqld:0x2dbd40 +0(%dx):string).
    Ctrl-C to end.
    mysqld-2855 [001] d... 19956674.509085: dispatch_command: (0x6dbd40) arg1="show tables"
    mysqld-2855 [001] d... 19956675.541155: dispatch_command: (0x6dbd40) arg1="SELECT * FROM
    numbers where number > 32000"
    – Filter on string match; eg, "SELECT":
    # ./uprobe 'p:dispatch_command /opt/mysql/bin/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcj
    cmd=+0(%dx):string' 'cmd ~ "SELECT*"'
    Tracing uprobe dispatch_command (p:dispatch_command /opt/mysql/bin/mysqld:0x2dbd40 cmd=
    +0(%dx):string). Ctrl-C to end.
    mysqld-2855 [001] d... 19956754.619958: dispatch_command: (0x6dbd40) cmd="SELECT * FROM
    numbers where number > 32000"
    mysqld-2855 [001] d... 19956755.060125: dispatch_command: (0x6dbd40) cmd="SELECT * FROM
    numbers where number > 32000"
    • Ok for hacking, but not friendly; need perf_events/BPF
    
slide 60:
    perf_events	
      
    • Powerful profiler (covered earlier) and tracer:
    – User-level and kernel dynamic tracing
    – Kernel line tracing and local variables (debuginfo)
    – Kernel filtering expressions
    – Efficient in-kernel counts (perf stat)
    • Intended as the official Linux tracer/profiler
    • Becoming more programmable with BPF support (2016)
    – Search lkml for "perf" and "BPF"
    
slide 61:
    perf_events: Listing Events
    # perf list
      cpu-cycles OR cycles             [Hardware event]
      instructions                     [Hardware event]
      cache-references                 [Hardware event]
      cache-misses                     [Hardware event]
      branch-instructions OR branches  [Hardware event]
      […]
      skb:kfree_skb                    [Tracepoint event]
      skb:consume_skb                  [Tracepoint event]
      skb:skb_copy_datagram_iovec      [Tracepoint event]
      net:net_dev_xmit                 [Tracepoint event]
      net:net_dev_queue                [Tracepoint event]
      net:netif_receive_skb            [Tracepoint event]
      net:netif_rx                     [Tracepoint event]
      […]
      block:block_rq_abort             [Tracepoint event]
      block:block_rq_requeue           [Tracepoint event]
      block:block_rq_complete          [Tracepoint event]
      block:block_rq_insert            [Tracepoint event]
      block:block_rq_issue             [Tracepoint event]
      block:block_bio_bounce           [Tracepoint event]
      block:block_bio_complete         [Tracepoint event]
      […]
    
slide 62:
    perf_events: Tracing Tracepoints
    # perf record -e block:block_rq_complete -a sleep 10
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.428 MB perf.data (~18687 samples) ]
    # perf script
    run 30339 [000] 2083345.722767: block:block_rq_complete: 202,1 W () 12984648 + 8 [0]
    run 30339 [000] 2083345.722857: block:block_rq_complete: 202,1 W () 12986336 + 8 [0]
    run 30339 [000] 2083345.723180: block:block_rq_complete: 202,1 W () 12986528 + 8 [0]
    swapper     0 [000] 2083345.723489: block:block_rq_complete: 202,1 W () 12986496 + 8 [0]
    swapper     0 [000] 2083346.745840: block:block_rq_complete: 202,1 WS () 1052984 + 144 [0]
    supervise 30342 [000] 2083346.746571: block:block_rq_complete: 202,1 WS () 1053128 + 8 [0]
    supervise 30342 [000] 2083346.746663: block:block_rq_complete: 202,1 W () 12986608 + 8 [0]
    run 30342 [000] 2083346.747003: block:block_rq_complete: 202,1 W () 12986832 + 8 [0]
    [...]
    • If -g is used in "perf record", stack traces are included
    • If "perf script" output is too verbose, try "perf report",
    or making a flame graph
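Each of these output lines has a fixed shape, so post-processing with a script is another option; a sketch that extracts the device, sector, and I/O size fields (the regex is an assumption fitted to the sample lines above, not a general perf script parser):

```python
import re

# Matches lines like:
#   run 30339 [000] 2083345.722767: block:block_rq_complete: 202,1 W () 12984648 + 8 [0]
LINE = re.compile(
    r"(?P<comm>\S+)\s+(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<ts>[\d.]+):\s+block:block_rq_complete:\s+"
    r"(?P<dev>\d+,\d+)\s+(?P<rwbs>\S+)\s+\(\)\s+"
    r"(?P<sector>\d+)\s+\+\s+(?P<nr_sectors>\d+)")

def parse_complete(line):
    """Return a dict of fields from one block_rq_complete line,
    or None if the line doesn't match."""
    m = LINE.match(line.strip())
    return m.groupdict() if m else None

ev = parse_complete(
    "run 30339 [000] 2083345.722767: block:block_rq_complete: "
    "202,1 W () 12984648 + 8 [0]")
```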
    
slide 63:
    BPF	
      
    • Enhanced Berkeley Packet Filter, now just "BPF"
    – Enhancements added in Linux 3.15, 3.19, 4.1, 4.5, …
    • Provides programmatic tracing
    – measure latency, custom histograms, …
    System dashboards of 2017+ should look very different
    
slide 64:
    BPF: bcc ext4slower
    • ext4 operations slower than the threshold:
    # ./ext4slower 1
    Tracing ext4 operations slower than 1 ms
    TIME     COMM   PID  T BYTES  OFF_KB  LAT(ms) FILENAME
    06:49:17 bash   […]  R 128    […]        7.75 cksum
    06:49:17 cksum  […]  R 39552  […]        1.34 [
    06:49:17 cksum  […]  R 96     […]        5.36 2to3-2.7
    06:49:17 cksum  […]  R 96     […]       14.94 2to3-3.4
    06:49:17 cksum  […]  R 10320  […]        6.82 411toppm
    06:49:17 cksum  […]  R 65536  […]        4.01 a2p
    06:49:17 cksum  […]  R 55400  […]        8.77 ab
    06:49:17 cksum  […]  R 36792  […]       16.34 aclocal-1.14
    06:49:17 cksum  […]  R 15008  […]       19.31 acpi_listen
    […]
    • Better indicator of application pain than disk I/O
    • Measures & filters in-kernel for efficiency using BPF
    – https://github.com/iovisor/bcc
    
slide 65:
    BPF: bcc tools (early 2016)
    
slide 66:
    6. Tuning
    
slide 67:
    Ubuntu Trusty Tuning: Early 2016 (1/2)
    CPU
    schedtool -B PID
    disable Ubuntu apport (crash reporter)
    also upgrade to Xenial (better HT scheduling)
    Virtual Memory
    vm.swappiness = 0                  # from 60
    kernel.numa_balancing = 0          # temp workaround
    Huge Pages
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    File System
    vm.dirty_ratio = 80                # from 40
    vm.dirty_background_ratio = 5      # from 10
    vm.dirty_expire_centisecs = 12000  # from 3000
    mount -o defaults,noatime,discard,nobarrier …
    Storage I/O
    /sys/block/*/queue/rq_affinity
    /sys/block/*/queue/scheduler    noop
    /sys/block/*/queue/nr_requests
    
slide 68:
    Ubuntu Trusty Tuning: Early 2016 (2/2)
    Storage (continued)
    /sys/block/*/queue/read_ahead_kb 256
    mdadm --chunk=64 …
    Networking
    net.core.somaxconn = 1000
    net.core.netdev_max_backlog = 5000
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.ipv4.tcp_wmem = 4096 12582912 16777216
    net.ipv4.tcp_rmem = 4096 12582912 16777216
    net.ipv4.tcp_max_syn_backlog = 8096
    net.ipv4.tcp_slow_start_after_idle = 0
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.ip_local_port_range = 10240 65535
    net.ipv4.tcp_abort_on_overflow = 1   # maybe
    Hypervisor (Xen)
    echo tsc > /sys/devices/…/current_clocksource
    Plus PVHVM (HVM), SR-IOV, …
    
slide 69:
    Summary	
      
    A brief discussion of 6 facets of Linux performance:
    Observability
    Methodologies
    Benchmarking
    Profiling
    Tracing
    Tuning
    
slide 70:
    Takeaways	
      
    Some things to print out for your office wall:
    uptime
    dmesg -T | tail
    vmstat 1
    mpstat -P ALL 1
    pidstat 1
    iostat -xz 1
    free -m
    sar -n DEV 1
    sar -n TCP,ETCP 1
    top
    
slide 71:
    More Links
    Netflix Tech Blog on Linux:
    http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
    http://techblog.netflix.com/2015/08/netflix-at-velocity-2015-linux.html
    Linux Performance:
    http://www.brendangregg.com/linuxperf.html
    USE Method Linux:
    http://www.brendangregg.com/USEmethod/use-linux.html
    Linux ftrace:
    https://www.kernel.org/doc/Documentation/trace/ftrace.txt
    https://github.com/brendangregg/perf-tools
    Linux perf_events:
    https://perf.wiki.kernel.org/index.php/Main_Page
    http://www.brendangregg.com/perf.html
    Flame Graphs:
    http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
    http://queue.acm.org/detail.cfm?id=2927301
    
slide 72:
    Thanks	
      
    Questions?
    http://slideshare.net/brendangregg
    http://www.brendangregg.com
    bgregg@netflix.com
    @brendangregg