YOW! 2018: Cloud Performance Root Cause Analysis at Netflix
Keynote by Brendan Gregg for YOW! 2018.
Video: https://www.youtube.com/watch?v=tAY8PnfrS_k
Description: "At Netflix, improving the performance of our cloud means happier customers and lower costs, and involves root cause analysis of applications, runtimes, operating systems, and hypervisors, in an environment of 150k cloud instances that undergo numerous production changes each week. Apart from the developers who regularly optimize their own code, we also have a dedicated performance team to help with any issue across the cloud, and to build tooling to aid in this analysis. In this session we will summarize the Netflix environment, procedures, and tools we use and build to do root cause analysis on cloud performance issues. The analysis performed may be cloud-wide, using self-service GUIs such as our open source Atlas tool, or focused on individual instances, and use our open source Vector tool, flame graphs, Java debuggers, and tooling that uses Linux perf, ftrace, and bcc/eBPF. You can use these open source tools in the same way to find performance wins in your own environment."
PDF: YOW2018_CloudPerfRCANetflix.pdf
Keywords (from pdftotext):
slide 1:
Cloud Performance Root Cause Analysis at Netflix Brendan Gregg Senior Performance Architect Cloud and Platform Engineering YOW! Conference Australia Nov-Dec 2018
slide 2:
Experience: CPU Dips
slide 3:
slide 4:
# perf record -F99 -a
# perf script
[…]
java 14327 [022] 252764.179741: cycles:  7f36570a4932 SpinPause (/usr/lib/jvm/java-8
java 14315 [014] 252764.183517: cycles:  7f36570a4932 SpinPause (/usr/lib/jvm/java-8
java 14310 [012] 252764.185317: cycles:  7f36570a4932 SpinPause (/usr/lib/jvm/java-8
java 14332 [015] 252764.188720: cycles:  7f3658078350 pthread_cond_wait@@GLIBC_2.3.2
java 14341 [019] 252764.191307: cycles:  7f3656d150c8 ClassLoaderDataGraph::do_unloa
java 14341 [019] 252764.198825: cycles:  7f3656d140b8 ClassLoaderData::free_dealloca
java 14341 [019] 252764.207057: cycles:  7f3657192400 nmethod::do_unloading(BoolObje
java 14341 [019] 252764.215962: cycles:  7f3656ba807e Assembler::locate_operand(unsi
java 14341 [019] 252764.225141: cycles:  7f36571922e8 nmethod::do_unloading(BoolObje
java 14341 [019] 252764.234578: cycles:  7f3656ec4960 CodeHeap::block_start(void*) c
[…]
slide 5:
slide 6:
slide 7:
Observability Methodology Velocity
slide 8:
Root Cause Analysis at Netflix Devices gRPC Zuul 2 Load Ribbon Hystrix Eureka Service Tomcat JVM Instances (Linux) AZ 3 AZ 1 AZ 2 ASG 1 ELB ASG Cluster Application Netflix Roots ASG 2 Atlas Chronos Zipkin Vector sar, *stat ftrace bcc/eBPF bpftrace PMCs, MSRs
slide 9:
Agenda 1. The Netflix Cloud 2. Methodology 3. Cloud Analysis 4. Instance Analysis
slide 10:
Since 2014 Asgard → Spinnaker Spinnaker Salp → Spinnaker Zipkin gRPC adoption New Atlas UI & Lumen Java frame pointer eBPF bcc & bpftrace PMCs in EC2 From Clouds to Roots (2014 presentation): Old Atlas UI
slide 11:
>150k AWS EC2 server instances ~34% US Internet traffic at night >130M members Performance is customer satisfaction & Netflix costs
slide 12:
Acronyms AWS: Amazon Web Services EC2: AWS Elastic Compute 2 (cloud instances) S3: AWS Simple Storage Service (object store) ELB: AWS Elastic Load Balancers SQS: AWS Simple Queue Service SES: AWS Simple Email Service CDN: Content Delivery Network OCA: Netflix Open Connect Appliance (streaming CDN) QoS: Quality of Service AMI: Amazon Machine Image (instance image) ASG: Auto Scaling Group AZ: Availability Zone NIWS: Netflix Internal Web Service framework (Ribbon) gRPC: gRPC Remote Procedure Calls MSR: Model Specific Register (CPU info register) PMC: Performance Monitoring Counter (CPU perf counter) eBPF: extended Berkeley Packet Filter (kernel VM)
slide 13:
1. The Netflix Cloud Overview
slide 14:
The Netflix Cloud EC2 ELB Cassandra Applications (Services) Elasticsearch EVCache SES SQS
slide 15:
Netflix Microservices Authentication Web Site API User Data Personalization EC2 Client Devices Streaming API Viewing Hist. DRM QoS Logging OCA CDN CDN Steering Encoding
slide 16:
Freedom and Responsibility Culture deck memo is true https://jobs.netflix.com/culture Deployment freedom Purchase and use cloud instances without approvals Netflix environment changes fast!
slide 17:
Cloud Technologies Usually open source Linux, Java, Cassandra, Node.js, … http://netflix.github.io/
slide 18:
Cloud Instances Linux (Ubuntu) Optional Apache, memcached, non-Java apps (incl. Node.js, golang) Atlas monitoring, S3 log rotation, ftrace, perf, bcc/eBPF Java (JDK 8) GC and thread dump logging Tomcat Application war files, base servlet, platform, hystrix, health check, metrics (Servo) Typical BaseAMI
slide 19:
5 Key Issues And How the Netflix Cloud is Architected to Solve Them
slide 20:
1. Load Increases → Spinnaker Auto Scaling Groups Instances automatically added or removed by a custom scaling policy Alerts & monitoring used to check scaling is sane Good for customers: Fast workaround Good for engineers: Fix later, 9-5 ASG CloudWatch, Servo Scaling Policy loadavg, latency, … Instance Instance Instance Instance
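As a generic illustration of a scaling policy (not the Netflix tooling, which drives this through Spinnaker; the group and policy names here are made up), a target-tracking policy can be attached to an ASG with the AWS CLI:
$ aws autoscaling put-scaling-policy \
    --auto-scaling-group-name myservice-v011 \
    --policy-name keep-cpu-near-60 \
    --policy-type TargetTrackingScaling \
    --target-tracking-configuration '{"PredefinedMetricSpecification":
      {"PredefinedMetricType": "ASGAverageCPUUtilization"}, "TargetValue": 60.0}'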
slide 21:
2. Bad Push → Spinnaker ASG Cluster Rollback ASG red black clusters: how code versions are deployed Fast rollback for issues Traffic managed by Elastic Load Balancers (ELBs) Automated Canary Analysis (ACA) for testing ASG Cluster prod1 ELB Canary ASG-v010 ASG-v011 Instance Instance Instance Instance Instance Instance
slide 22:
3. Instance Failure → Spinnaker Hystrix Timeouts Hystrix: latency and fault tolerance for dependency services Fallbacks, degradation, fast fail and rapid recovery, timeouts, load shedding, circuit breaker, realtime monitoring Plus Ribbon or gRPC for more fault tolerance Tomcat Application get A Hystrix >100ms Dependency Dependency
slide 23:
4. Region failure → Spinnaker Zuul 2 Reroute Traffic All device traffic goes through the Zuul 2 proxy: dynamic routing, monitoring, resiliency, security Region or AZ failure: reroute traffic to another region Zuul 2, DNS Region 1 Region 2 Monitoring Region 3
slide 24:
5. Overlooked Issue → Spinnaker Chaos Engineering Instances: termination (Resilience) Availability Zones: artificial failures Latency: artificial delays by ChAP Conformity: kills non-best-practices instances Doctor: health checks Janitor: kills unused instances Security: checks violations 10-18: geographic issues
slide 25:
A Resilient Architecture Devices gRPC Zuul 2 Load Ribbon Hystrix Eureka Service Tomcat JVM Instances (Linux) AZ 3 AZ 1 AZ 2 ASG 1 ELB Some services vary: - Apache Web Server - Node.js & Prana - golang ASG Cluster Application Netflix Chaos Engineering ASG 2
slide 26:
2. Methodology Cloud & Instance
slide 27:
Why Do Root Cause Perf Analysis? Netflix Application ASG Cluster … ASG 2 Often for: High latency Growth Upgrades ELB ASG 1 AZ 3 AZ 2 AZ 1 Instances (Linux) JVM Tomcat Service
slide 28:
Cloud Methodologies Resource Analysis Metric and event correlations Latency Drilldowns RED Method For each microservice, check: - Rate - Errors - Duration Service A Service C Service B Service D
slide 29:
Instance Methodologies Log Analysis Micro-benchmarking Drill-down analysis USE Method For each resource, check: - Utilization - Saturation - Errors CPU Memory Disk Controller Network Controller Disk Net Disk Net
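As a rough sketch, the USE method maps onto stock Linux tools along these lines (one common mapping, not the only one):
# CPU: utilization, saturation (run queue length), errors
$ mpstat -P ALL 1
$ vmstat 1            # "r" column: runnable threads (saturation)
# Memory: utilization, saturation (swapping), errors
$ free -m
$ vmstat 1            # "si"/"so" columns: swap-ins/outs (saturation)
# Disk: utilization, saturation (queueing), errors
$ iostat -xz 1        # %util, avgqu-sz, await
# Network: utilization, saturation, errors
$ sar -n DEV,EDEV 1   # throughput plus error/drop counters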
slide 30:
Bad Instance Anti-Method 1. Plot request latency per-instance 2. Find the bad instance 3. Terminate it 4. Someone else’s problem now! Bad instance latency Terminate! Could be an early warning of a bigger issue
slide 31:
3. Cloud Analysis Atlas, Lumen, Chronos, ...
slide 32:
Netflix Cloud Analysis Process Example path enumerated Atlas Alerts PICSOU Slack Cost Chat 1. Check Issue Atlas/Lumen Dashboards 2. Check Events Chronos Create New Alert Plus some other tools not pictured Redirected to a new Target 3. Drill Down Atlas Metrics 4. Check Dependencies 5. Root Cause Instance Analysis Slalom Zipkin
slide 33:
Atlas: Alerts Custom alerts on streams per second (SPS) changes, CPU usage, latency, ASG growth, client errors, …
slide 34:
slide 35:
Winston: Automated Diagnostics & Remediation Links to Atlas Dashboards & Metrics Chronos: Possible Related Events
slide 36:
Atlas: Dashboards
slide 37:
Atlas: Dashboards Netflix perf vitals dashboard 1. RPS, CPU 2. Volume 3. Instances 4. Scaling 5. CPU/RPS 6. Load avg 7. Java heap 8. ParNew 9. Latency 10. 99th tile
slide 38:
Atlas & Lumen: Custom Dashboards Dashboards are a checklist methodology: what to show first, second, third... Starting point for issues 1. Confirm and quantify issue 2. Check historic trend 3. Atlas metrics to drill down Lumen: more flexible dashboards eg, go/burger
slide 39:
Atlas: Metrics
slide 40:
Atlas: Metrics Region Application Metrics Presentation Interactive graph Summary statistics Time range
slide 41:
Atlas: Metrics All metrics in one system System metrics: CPU usage, disk I/O, memory, … Application metrics: latency percentiles, errors, … Filters or breakdowns by region, application, ASG, metric, instance URL has session state: shareable
slide 42:
Chronos: Change Tracking
slide 43:
Chronos: Change Tracking Scope Time Range Event Log
slide 44:
Slalom: Dependency Graphing
slide 45:
Slalom: Dependency Graphing Dependency App Traffic Volume
slide 46:
Zipkin UI: Dependency Tracing Dependency Latency
slide 47:
PICSOU: AWS Usage Breakdowns Cost per hour Details (redacted)
slide 48:
Slack: Chat Latency is high in us-east-1 Sorry We just did a bad push
slide 49:
Netflix Cloud Analysis Process Example path enumerated Atlas Alerts PICSOU Slack Cost Chat 1. Check Issue Atlas/Lumen Dashboards 2. Check Events Chronos Create New Alert Plus some other tools not pictured Redirected to a new Target 3. Drill Down Atlas Metrics 4. Check Dependencies 5. Root Cause Instance Analysis Slalom Zipkin
slide 50:
Generic Cloud Analysis Process Example path enumerated Alerts Usage Reports 1. Check Issue Cost Custom Dashboards 2. Check Events Change Tracking Create New Alert Plus other tools as needed Messaging Redirected to a new Target 3. Drill Down Metric Analysis 4. Check Dependencies 5. Root Cause Instance Analysis Chat Dependency Analysis
slide 51:
4. Instance Analysis 1. Statistics 2. Profiling 3. Tracing 4. Processor Analysis
slide 52:
slide 53:
slide 54:
1. Statistics
slide 55:
Linux Tools vmstat, pidstat, sar, etc, used mostly normally
$ sar -n TCP,ETCP,DEV 1
Linux 4.15.0-1027-aws (xxx)  12/03/2018  _x86_64_  (48 CPU)
09:43:53 PM  IFACE  rxpck/s  txpck/s  rxkB/s  txkB/s  rxcmp/s  txcmp/s  rxmcst/s  %ifutil
09:43:54 PM  eth0  33744.00  19361.43  28065.36
09:43:53 PM  active/s  passive/s  iseg/s  oseg/s
09:43:54 PM
09:43:53 PM  atmptf/s  estres/s  retrans/s  isegerr/s  orsts/s
09:43:54 PM
[…]
Micro-benchmarking can be used to investigate hypervisor behavior that can’t be observed directly
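For example, a crude syscall-rate micro-benchmark can expose hypervisor or KPTI overhead by comparison across instance types (dd here is just a convenient syscall generator):
$ perf stat -- dd if=/dev/zero of=/dev/null bs=1 count=1000000
Compare cycles, instructions, and elapsed time for the same command on the instance types in question.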
slide 56:
Exception: Containers Most Linux tools are still not container aware From the container, will show the full host We expose cgroup metrics in our cloud GUIs: Vector
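A minimal sketch of reading container (cgroup) metrics directly, assuming cgroup v1 mount points (paths differ under cgroup v2 and by container runtime):
# CPU throttling caused by the container's CPU quota; fields: nr_periods, nr_throttled, throttled_time
$ cat /sys/fs/cgroup/cpu,cpuacct/cpu.stat
# Per-cgroup memory usage and limit
$ cat /sys/fs/cgroup/memory/memory.usage_in_bytes /sys/fs/cgroup/memory/memory.limit_in_bytes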
slide 57:
Vector: Instance/Container Analysis
slide 58:
2. Profiling
slide 59:
Experience: “ZFS is eating my CPUs”
slide 60:
CPU Mixed-Mode Flame Graph Application (truncated) 38% kernel time (why?)
slide 61:
Zoomed
slide 62:
2014: Java Profiling Java Profilers System Profilers
slide 63:
2018: Java Profiling Kernel Java JVM CPU Mixed-mode Flame Graph
slide 64:
CPU Flame Graph
slide 65:
CPU Flame Chart (same data)
slide 66:
CPU Flame Graphs g() e() f() d() c() i() b() h() a()
slide 67:
CPU Flame Graphs Y-axis: stack depth 0 at bottom 0 at top == icicle graph X-axis: alphabet Top edge: Who is running on CPU And how much (width) Time == flame chart Color: random g() Hues often used for language types Can be a dimension eg, CPI e() Ancestry f() d() c() i() b() h() a()
slide 68:
Application Profiling Primary approach: CPU mixed-mode flame graphs (eg, via Linux perf) May need frame pointers (eg, Java -XX:+PreserveFramePointer) May need a symbol file (eg, Java perf-map-agent, Node.js --perf-basic-prof) Secondary: Application profiler (eg, via Lightweight Java Profiler) Application logs
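A sketch of that primary approach for Java, assuming the open source perf-map-agent and FlameGraph repositories are checked out locally (paths and the PID are placeholders):
# 1. Run the JVM with frame pointers so perf can walk Java stacks
$ java -XX:+PreserveFramePointer ...
# 2. Sample all CPUs at 99 Hertz for 30 seconds
$ sudo perf record -F 99 -a -g -- sleep 30
# 3. Dump JIT symbols (writes /tmp/perf-<pid>.map), then render the SVG
$ ./perf-map-agent/bin/create-java-perf-map.sh <java_pid>
$ sudo perf script | ./FlameGraph/stackcollapse-perf.pl | \
    ./FlameGraph/flamegraph.pl --color=java > flame.svg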
slide 69:
Vector: Push-button Flame Graphs
slide 70:
Future: eBPF-based Profiling
Linux 2.6: perf record → perf.data → perf script → stackcollapse-perf.pl → flamegraph.pl
Linux 4.9: profile.py → flamegraph.pl
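For the Linux 4.9+ path, a minimal sketch using the bcc profile tool, which aggregates stacks in kernel context and emits folded output (option letters can vary between bcc versions):
$ sudo /usr/share/bcc/tools/profile -F 99 -adf 30 > out.folded
$ ./FlameGraph/flamegraph.pl --color=java out.folded > flame.svg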
slide 71:
3. Tracing
slide 72:
slide 73:
Core Linux Tracers Ftrace 2.6.27+ Tracing views Plus other kernel tech: kprobes, uprobes perf 2.6.31+ Official profiler & tracer eBPF 4.9+ Programmatic engine bcc Complex tools bpftrace - Short scripts
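As a quick orientation, one example invocation per front-end (tool locations are typical packaging paths and may differ):
# ftrace, via the perf-tools front-ends
$ sudo /apps/perf-tools/bin/funccount 'vfs_*'
# perf tracepoints
$ sudo perf record -e block:block_rq_issue -a -- sleep 10; sudo perf script
# bcc/eBPF canned tool
$ sudo /usr/share/bcc/tools/execsnoop
# bpftrace one-liner
$ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'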
slide 74:
Experience: Disk %Busy
slide 75:
# iostat -x 1
[…]
avg-cpu:  %user %nice %system %iowait %steal %idle
Device:  rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda
xvdb
xvdj
0.00 139.00
0.00 1056.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.30 87.60
[…]
slide 76:
# /apps/perf-tools/bin/iolatency 10
Tracing block I/O. Output every 10 seconds. Ctrl-C to end.
>=(ms) .. <(ms)  : I/O  |Distribution                          |
slide 77:
 0 -> 1          : 421  |######################################|
 1 -> 2          : 95   |#########
 2 -> 4          : 48   |#####
 4 -> 8          : 108  |##########
 8 -> 16         : 363  |#################################
16 -> 32         : 66   |######
32 -> 64         : 3
64 -> 128        : 7
# /apps/perf-tools/bin/iosnoop
Tracing block I/O. Ctrl-C to end.
COMM  PID  TYPE  DEV  BLOCK  BYTES  LATms
java 30603 RM 202,144 1670768496
cat 202,0
cat 202,0
cat 202,0
java 30603 RM 202,144 620864512
java 30603 RM 202,144 584767616
java 30603 RM 202,144 601721984
java 30603 RM 202,144 603721568
java 30603 RM 202,144 61067936
java 30603 RM 202,144 1678557024
java 30603 RM 202,144 55299456
java 30603 RM 202,144 1625084928
java 30603 RM 202,144 618895408
java 30603 RM 202,144 581318480
java 30603 RM 202,144 1167348016
java 30603 RM 202,144 51561280
[...]
slide 78:
# perf record -e block:block_rq_issue --filter 'rwbs ~ "*M*"' -g -a
# perf report -n --stdio
[...]
# Overhead  Samples  Command  Shared Object  Symbol
# ........  ............  ............  .................  ....................
70.70%  java  [kernel.kallsyms]  [k] blk_peek_request
    --- blk_peek_request
    do_blkif_request
    __blk_run_queue
    queue_unplugged
    blk_flush_plug_list
    blk_finish_plug
    _xfs_buf_ioapply
    xfs_buf_iorequest
    |--88.84%-- _xfs_buf_read
    xfs_buf_read_map
    |--87.89%-- xfs_trans_read_buf_map
    |--97.96%-- xfs_imap_to_bp
    xfs_iread
    xfs_iget
    xfs_lookup
    xfs_vn_lookup
    lookup_real
    __lookup_hash
    lookup_slow
    path_lookupat
    filename_lookup
    user_path_at_empty
    user_path_at
    vfs_fstatat
    |--99.48%-- SYSC_newlstat
    sys_newlstat
    system_call_fastpath
    __lxstat64
    Lsun/nio/fs/UnixNativeDispatcher;.lstat0
    0x7f8f963c847c
slide 79:
slide 80:
# /usr/share/bcc/tools/biosnoop
TIME(s)  COMM  PID  DISK  SECTOR  BYTES  LAT(ms)
tar  xvda
tar  xvda
tar  xvda
[...]
slide 81:
eBPF
slide 82:
eBPF: extended Berkeley Packet Filter User-Defined BPF Programs SDN Configuration DDoS Mitigation Kernel Runtime Event Targets verifier sockets Intrusion Detection Container Security kprobes BPF Observability Firewalls (bpfilter) Device Drivers uprobes tracepoints BPF actions perf_events
slide 83:
slide 84:
bcc
# /usr/share/bcc/tools/tcplife
PID  COMM  LADDR  LPORT  RADDR  RPORT  TX_KB  RX_KB  MS
2509 java  8078 100.82.130.159  0 5.44
2509 java  8078 100.82.78.215  0 135.32
2509 java  60778 100.82.207.252  13 15126.87
2509 java  38884 100.82.208.178  0 15568.25
2509 java  4243 127.0.0.1  0 0.61
12030 upload-mes 127.0.0.1  34020 127.0.0.1  0 3.38
12030 upload-mes 127.0.0.1  21196 127.0.0.1  0 12.61
3964 mesos-slav 127.0.0.1  7101 127.0.0.1  0 12.64
12021 upload-sys 127.0.0.1  34022 127.0.0.1  0 15.28
2509 java  8078 127.0.0.1  372 15.31
2235 dockerd  13730 100.82.136.233  4 18.50
2235 dockerd  34314 100.82.64.53  8 56.73
[...]
slide 85:
bpftrace
# biolatency.bt
Attaching 3 probes...
Tracing block device I/O... Hit Ctrl-C to end.
@usecs:
[256, 512)      2 |                                                    |
[512, 1K)      10 |@                                                   |
[1K, 2K)      426 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K)      230 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[4K, 8K)        9 |@                                                   |
[8K, 16K)     128 |@@@@@@@@@@@@@@@                                     |
[16K, 32K)     68 |@@@@@@@@                                            |
[32K, 64K)      0 |                                                    |
[64K, 128K)     0 |                                                    |
[128K, 256K)   10 |@                                                   |
slide 86:
bpftrace: biolatency.bt
#!/usr/local/bin/bpftrace

BEGIN
{
	printf("Tracing block device I/O... Hit Ctrl-C to end.\n");
}

kprobe:blk_account_io_start
{
	@start[arg0] = nsecs;
}

kprobe:blk_account_io_completion
/@start[arg0]/
{
	@usecs = hist((nsecs - @start[arg0]) / 1000);
	delete(@start[arg0]);
}
slide 87:
Future: eBPF GUIs
slide 88:
4. Processor Analysis
slide 89:
What “90% CPU Utilization” might suggest: What it typically means on the Netflix cloud:
slide 90:
PMCs Performance Monitoring Counters help you analyze stalls Some instances (eg. Xen-based m4.16xl) have the architectural set:
slide 91:
Instructions Per Cycle (IPC) “good*” >2.0 Instruction bound IPC “bad” * probably; exception: spin locks
slide 92:
PMCs: EC2 Xen Hypervisor
# perf stat -a -- sleep 30
Performance counter stats for 'system wide':
1,103,112  189,173  4,044  2,057,164,531,949
slide 93:
1,357,979,592,699  243,244,156,173  4,391,259,112
task-clock (msec)  context-switches  cpu-migrations  page-faults  cycles  stalled-cycles-frontend  stalled-cycles-backend  instructions  branches  branch-misses
64.034 CPUs utilized  0.574 K/sec  0.098 K/sec  0.002 K/sec  1.071 GHz  (100.00%) (100.00%) (100.00%)  0.66 insns per cycle  126.617 M/sec  1.81% of all branches  (75.01%) (74.99%) (75.00%) (75.00%)
30.001112466 seconds time elapsed
# ./pmcarch 1
CYCLES  INSTRUCTIONS  IPC  BR_RETIRED  BR_MISPRED  BMR%  LLCREF  LLCMISS  LLC%
[...]
0.66 4692322525  0.65 5286747667  0.70 4616980753  0.69 5055959631
1.95 780435112  1.81 751335355  1.87 709841242  1.83 787333902
PMCs: EC2 Nitro Hypervisor
Some instance types (large, Nitro-based) support most PMCs! Meltdown KPTI patch TLB miss analysis on a c5.9xl:
nopti:
# tlbstat -C0 1
K_CYCLES  K_INSTR  IPC  DTLB_WALKS  ITLB_WALKS  K_DTLBCYC  K_ITLBCYC  DTLB%  ITLB%
[...]
0.86 565  0.86 950  0.86 396  0.00 0.00  0.00 0.00  0.00 0.00
pti, nopcid:
# tlbstat -C0 1
K_CYCLES  K_INSTR  IPC  DTLB_WALKS  ITLB_WALKS  K_DTLBCYC  K_ITLBCYC  DTLB%  ITLB%
[...]
0.10 89709496  0.10 88829158  0.10 89683045  0.10 79055465  27.40 22.63  27.28 22.52  27.29 22.55  27.40 22.63
worst case
slide 94:
MSRs Model Specific Registers System config info, including current clock rate:
# showboost
Base CPU MHz : 2500
Set CPU MHz : 2500
Turbo MHz(s) : 3100 3200 3300 3500
Turbo Ratios : 124% 128% 132% 140%
CPU 0 summary every 1 seconds...
TIME  C0_MCYC  C0_ACYC  UTIL  RATIO  MHz
23:39:07  64%
23:39:08  70%
23:39:09  99%
slide 95:
Summary Take-aways
slide 96:
Take Aways
1. Get push-button CPU flame graphs: kernel & user
2. Check out eBPF perf tools: bcc, bpftrace
3. Measure IPC as well as CPU utilization using PMCs
90% CPU busy: … really means:
slide 97:
Observability Methodology Velocity
slide 98:
Observability Statistics, Flame Graphs, eBPF Tracing, Cloud PMCs Methodology USE method, RED method, Drill-down Analysis, … Velocity Self-service GUIs: Vector, FlameScope, …
slide 99:
Resources
2014 talk From Clouds to Roots:
http://www.slideshare.net/brendangregg/netflix-from-clouds-to-roots
http://www.youtube.com/watch?v=H-E0MQTID0g
Chaos: https://medium.com/netflix-techblog/chap-chaos-automation-platform-53e6d528371f
https://principlesofchaos.org/
Atlas: https://github.com/Netflix/Atlas
Atlas: https://medium.com/netflix-techblog/introducing-atlas-netflixs-primary-telemetry-platform-bd31f4d8ed9a
RED method: https://thenewstack.io/monitoring-microservices-red-method/
USE method: https://queue.acm.org/detail.cfm?id=2413037
Winston: https://medium.com/netflix-techblog/introducing-winston-event-driven-diagnostic-and-remediation-platform-46ce39aa81cc
Lumen: https://medium.com/netflix-techblog/lumen-custom-self-service-dashboarding-for-netflix-8c56b541548c
Flame graphs: http://www.brendangregg.com/flamegraphs.html
Java flame graphs: https://medium.com/netflix-techblog/java-in-flames-e763b3d32166
Vector: http://vectoross.io https://github.com/Netflix/Vector
FlameScope: https://github.com/Netflix/FlameScope
Tracing ponies: thanks Deirdré Straughan & General Zoi's Pony Creator
ftrace: http://lwn.net/Articles/608497/ - usually already in your kernel
perf: http://www.brendangregg.com/perf.html - perf is usually packaged in linux-tools-common
tcplife: https://github.com/iovisor/bcc - often available as a bcc or bcc-tools package
bpftrace: https://github.com/iovisor/bpftrace
pmcarch: https://github.com/brendangregg/pmc-cloud-tools
showboost: https://github.com/brendangregg/msr-cloud-tools - also try turbostat
slide 100:
Netflix Tech Blog
slide 101:
Thank you. Brendan Gregg @brendangregg