South Bay SRE meetup 2017: Netflix Performance Engineering

Talk by the Netflix Performance Engineering team for SBSRE 2017.

Video: https://www.youtube.com/watch?v=i5Ml9uY2rBw

Description: "A look into how Netflix measures and tunes performance for our clients and the streaming service."

This includes a section starting on slide 61 by Brendan Gregg titled "Netflix PMCs on the Cloud" showing low-level CPU performance analysis using PMCs on AWS EC2. (PMCs are performance monitoring counters from the performance monitoring unit [PMU]).


PDF: SBSRE_perf_meetup_aug2017.pdf

Keywords (from pdftotext):

slide 1:
    Netflix
    Performance Meetup
    
slide 2:
    Global Client Performance
    Fast Metrics
    
slide 3:
    3G in Kazakhstan
    
slide 4:
    Making the Internet fast
    is slow.
    Global Internet:
    faster (better networking)
    slower (broader reach, congestion)
    Don't wait for it, measure it and deal
    Working app > feature-rich app
    
slide 5:
    We need to know what the Internet looks like,
    without averages, seeing the full distribution.
    
slide 6:
    Logging Anti-Patterns
    Averages
    Sampling
    Can't see the distribution
    Missed data
    Outliers heavily distort
    Rare events
    ∞, 0, negatives, errors
    Problems aren't equal across the population
    Instead, use the client as a map-reducer and send up aggregated
    data, less often.
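
To make the "client as a map-reducer" idea concrete, here is a minimal sketch (Python for illustration; the field names and bucketing scheme are assumptions, since the talk shows no client code) of aggregating latency samples into a histogram on the client and uploading the aggregate at a low frequency, instead of logging every sample:

    import json
    import math
    import time

    # Sketch: aggregate latency samples client-side into power-of-two
    # histogram buckets, then upload the whole histogram infrequently,
    # preserving the distribution without per-sample logging.
    class ClientHistogram:
        def __init__(self):
            self.buckets = {}                     # bucket floor (ms) -> count

        def record(self, latency_ms):
            bound = 2 ** int(math.log2(latency_ms)) if latency_ms >= 1 else 0
            self.buckets[bound] = self.buckets.get(bound, 0) + 1

        def flush(self):
            payload = json.dumps({"ts": int(time.time()), "hist": self.buckets})
            self.buckets = {}
            return payload                        # hand to the app's upload path

    hist = ClientHistogram()
    for sample_ms in (3, 5, 48, 51, 49, 1200):    # made-up latencies
        hist.record(sample_ms)
    print(hist.flush())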
    
slide 7:
    Sizing up the Internet.
    
slide 8:
    Infinite (free) compute power!
    
slide 9:
slide 10:
    Get median, 95th, etc.
    Calculate the inverse empirical cumulative
    distribution function by math...
    ...or just use R, which is free and knows how
    to do it already:
    > library(HistogramTools)
    > iecdf
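
The same calculation in Python, for readers without R: a minimal nearest-rank inverse-ECDF sketch over raw samples (the slide's HistogramTools approach does the equivalent over histogram buckets):

    import math

    # Nearest-rank inverse empirical CDF: f(q) returns the value at or
    # below which a fraction q of the samples fall.
    def iecdf(samples):
        ordered = sorted(samples)
        def f(q):                                  # q in (0, 1]
            return ordered[max(0, math.ceil(q * len(ordered)) - 1)]
        return f

    latencies = [12, 15, 18, 22, 30, 45, 80, 120, 400, 900]   # made-up, ms
    f = iecdf(latencies)
    print(f(0.50), f(0.95))                        # median and 95th percentile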
slide 12:
slide 13:
    Data > Opinions.
    
slide 14:
    Better than debating opinions.
    "We live in a
    "No one really minds the
    50ms world!"
    spinner."
    "Why should we spend
    time on that instead of
    COOLFEATURE?"
    "There's no way that the
    client makes that many
    requests.”
    Architecture is hard. Make it cheap to experiment where your users really are.
    
slide 15:
    We built Daedalus
    [Visualization: request latency from Fast to Slow, broken into DNS Time vs. Elsewhere]
    
slide 16:
    Interpret the data
    Visual → Numerical: need the IECDF for
    percentiles:
    ƒ(0.50) = 50th (median)
    ƒ(0.95) = 95th
    Cluster to get similar experiences (pretty colors):
    k-means, hierarchical, etc.
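
To illustrate the clustering step, a sketch using scikit-learn's k-means (the slide names the algorithm, not a library; the latency pairs are made up), grouping per-region (median, 95th) profiles into similar experiences:

    # Cluster (median, 95th) latency pairs into "similar experience" groups.
    from sklearn.cluster import KMeans

    # Each row: [f(0.50), f(0.95)] in ms for one region/network (made up).
    profiles = [[50, 120], [55, 140], [300, 900], [320, 1100], [48, 130]]
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(profiles)
    print(labels)        # e.g. [0 0 1 1 0]: a fast cluster and a slow cluster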
    
slide 17:
slide 18:
slide 19:
slide 20:
slide 21:
    Practical Teleportation.
    Go there!
    Abstract analysis - hard
    Feeling reality is much simpler than looking at graphs. Build!
    
slide 22:
    Make a Reality Lab.
    
slide 23:
slide 24:
    Don't guess.
    Developing a model based on
    production data, without missing the
    distribution of samples (network, render,
    responsiveness) will lead to better
    software.
    Global reach doesn't need to be scary.
    @gcirino42 http://blogofsomeguy.com
    
slide 25:
    Icarus
    Martin Spier
    @spiermar
    Performance Engineering @ Netflix
    
slide 26:
slide 27:
    Problem & Motivation
    Real-user performance monitoring solution
    More insight into the App performance
    (as perceived by real users)
    Too many variables to trust synthetic
    tests and labs
    Prioritize work around App performance
    Track App improvement progress over time
    Detect issues, internal and external
    
slide 28:
    Device Diversity
    ● Netflix runs on all sorts of devices
    ● Smart TVs, Gaming Consoles, Mobile Phones, Cable TV boxes, ...
    ● Consistently evaluate performance
    
slide 29:
slide 30:
    What are we monitoring?
    User Actions
    (or things users do in the App)
    App Startup
    User Navigation
    Playing a Title
    Internal App metrics
    
slide 31:
    What are we measuring?
    ● When does the timer start and stop?
    ● Time-to-Interactive (TTI)
    ○ Interactive, even if
    some items were not fully
    loaded and rendered
    ● Time-to-Render (TTR)
    ○ Everything above the fold
    (visible without scrolling)
    is rendered
    ● Play Delay
    ● Meaningful for what we are monitoring
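
A minimal sketch of such per-action timers (hypothetical names; the real instrumentation is device-specific): the timer starts on the user action, marks TTR when above-the-fold content has rendered, and TTI when the view accepts input:

    import time

    # Hypothetical user-action timer: one start, multiple named marks.
    class UserAction:
        def __init__(self, name):
            self.name = name
            self.start = time.monotonic()
            self.marks = {}

        def mark(self, label):
            self.marks[label] = (time.monotonic() - self.start) * 1000  # ms

    action = UserAction("app_startup")
    # ... above-the-fold content rendered ...
    action.mark("ttr")
    # ... remaining items loaded, input handlers attached ...
    action.mark("tti")
    print(action.name, action.marks)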
    
slide 32:
    High-dimensional Data
    ● Complex device categorization
    ● Geo regions, subregions, countries
    ● Highly granular network
    classifications
    ● High volume of A/B tests
    ● Different facets of the same user action
    ○ Cold, suspended and backgrounded
    App startups
    ○ Target view/page on App startup
    
slide 33:
slide 34:
slide 35:
slide 36:
    Data Sketches
    ● Data structures that approximately
    resemble a much larger data set
    ● Preserve essential features!
    ● Significantly smaller!
    ● Faster to operate on!
    
slide 37:
    t-Digest
    ● t-Digest data structure
    ● Rank-based statistics
    (such as quantiles)
    ● Parallel friendly
    (can be merged!)
    ● Very fast!
    ● Really accurate!
    https://github.com/tdunning/t-digest
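
To show why merge-ability matters, here is a much cruder stand-in for a t-digest (a fixed-width-bucket histogram sketch; unlike t-digest its error is not adaptive near the tails): per-device sketches merge by addition, and the merged result still answers quantile queries:

    from collections import Counter

    # Fixed-width-bucket histogram sketch: small, merge-able, queryable.
    def sketch(samples, width=10):
        return Counter((s // width) * width for s in samples)

    def merge(a, b):
        return a + b                     # Counter addition merges sketches

    def quantile(sk, q):
        total, seen = sum(sk.values()), 0
        for bucket in sorted(sk):
            seen += sk[bucket]
            if seen >= q * total:
                return bucket
        return max(sk)

    node1 = sketch([12, 15, 22, 30, 45])             # made-up latencies, ms
    node2 = sketch([31, 80, 120, 400, 900])
    print(quantile(merge(node1, node2), 0.95))       # 95th across both nodes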
    
slide 38:
    + t-Digest sketches
    
slide 39:
slide 40:
    iOS Median Comparison, Break by Country
    
slide 41:
    iOS Median Comparison, Break by Country + iPhone 6S Plus
    
slide 42:
    CDFs by UI Version
    
slide 43:
    Warm Startup Rate
    
slide 44:
    A/B Cell Comparison
    
slide 45:
    Anomaly Detection
    
slide 46:
    Going Forward
    ● Resource utilization metrics
    ● Device profiling
    ○ Instrumenting client code
    ● Explore other visualizations
    ○ Frequency heat maps
    ● Connection between perceived
    performance, acquisition and
    retention
    @spiermar
    
slide 47:
    Netflix
    Autoscaling for experts
    Vadim
    
slide 48:
    Savings!
    ● Mid-tier stateless services are ~2/3 of the total
    ● Savings - 30% of mid-tier footprint (roughly 30K instances)
    ○ Higher savings if we break it down by region
    ○ Even higher savings on services that scale well
    
slide 49:
    Why we autoscale - philosophical reasons
    
slide 50:
    Why we autoscale - pragmatic reasons
    Encoding
    Precompute
    Failover
    Red/black pushes
    Curing cancer**
    And more...
    ** Hack-day project
    
slide 51:
    Should you autoscale?
    Benefits
    ● On-demand capacity: direct $$ savings
    ● RI capacity: re-purposing spare capacity
    However, for each server group, beware of
    ● Uneven distribution of traffic
    ● Sticky traffic
    ● Bursty traffic
    ● Small ASG sizes (
slide 52:
    Autoscaling impacts availability - true or false?
    ● Autoscaling is not a problem*
    ● The real problem is not knowing the performance characteristics of the
    service
    Under-provisioning, however, can impact availability
    * If done correctly
    
slide 53:
    AWS autoscaling mechanics
    Aggregated metric feed → CloudWatch alarm → notification → ASG scaling policy
    Tunables:
    ● Metric
    ● Threshold
    ● # of eval periods
    ● Scaling amount
    ● Warmup time
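
A sketch of wiring those pieces together with boto3 (the group name, metric, and thresholds are made up; this is a generic step-scaling example, not Netflix's actual configuration):

    import boto3

    autoscaling = boto3.client("autoscaling")
    cloudwatch = boto3.client("cloudwatch")

    # Scaling policy: holds the "scaling amount" and "warmup time" tunables.
    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName="myservice-asg",        # made-up name
        PolicyName="scale-up-on-rps",
        PolicyType="StepScaling",
        AdjustmentType="ChangeInCapacity",
        StepAdjustments=[{"MetricIntervalLowerBound": 0.0,
                          "ScalingAdjustment": 3}],  # scaling amount
        EstimatedInstanceWarmup=300,                 # warmup time, seconds
    )

    # CloudWatch alarm: holds the metric, threshold and eval-period tunables;
    # its notification triggers the policy above.
    cloudwatch.put_metric_alarm(
        AlarmName="myservice-high-rps",
        Namespace="MyService",                       # made-up custom metric
        MetricName="RequestsPerSecond",
        Statistic="Average",
        Period=60,
        EvaluationPeriods=3,                         # # of eval periods
        Threshold=1000.0,                            # threshold
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[policy["PolicyARN"]],
    )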
    
slide 54:
    What metric to scale on?
    Compared: resource utilization vs. throughput.
    Pros:
    Tracks a direct measure of work
    Linear scaling
    Predictable
    Requires less adjustment over time
    Cons:
    Thresholds tend to drift over time
    Prone to changes in request mixture
    Less predictable
    More oscillation / jitter
    
slide 55:
    Autoscaling on multiple metrics
    Proceed with caution
    ● Harder to reason about scaling behavior
    ● Different metrics might contradict each
    other, causing oscillation
    Typical Netflix configuration:
    ● Scale-up policy on throughput
    ● Scale-down policy on throughput
    ● Emergency scale-up policy on CPU, aka
    “the hammer rule”
    
slide 56:
    Well-behaved autoscaling
    
slide 57:
    Common mistakes - “no rush” scaling
    Problem: scaling amounts too
    small, cooldown too long
    Effect: scaling lags behind the
    traffic flow. Not enough
    capacity at peak, capacity
    wasted in trough
    Remedy: increase scaling
    amounts, migrate to step
    policies
    
slide 58:
    Common mistakes - twitchy scaling
    Problem: Scale-up policy is
    too aggressive
    Effect: unnecessary
    capacity churn
    Remedy: reduce scale-up
    amount, increase the # of
    eval periods
    
slide 59:
    Common mistakes - should I stay or should I go
    Problem: -up and -down
    thresholds are too close to each
    other
    Effect: constant capacity
    oscillation
    Remedy: move -up and -down
    thresholds farther apart
    
slide 60:
    AWS target tracking - your best bet!
    Think of it as a step policy with auto-steps
    You can also think of it as a thermostat
    Accounts for the rate of change in monitored metric
    Pick a metric, set the target value and warmup time - that’s it!
    [Chart: step vs. target-tracking scaling behavior]
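
For comparison, a boto3 sketch of a target-tracking policy (the ASG name and target value are made up): pick the metric, target, and warmup, and AWS derives the steps:

    import boto3

    # Target tracking: the "thermostat". No thresholds, steps or eval
    # periods to tune; AWS holds the metric at the target value.
    boto3.client("autoscaling").put_scaling_policy(
        AutoScalingGroupName="myservice-asg",        # made-up name
        PolicyName="track-cpu-50",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": 50.0,                     # made-up target
        },
        EstimatedInstanceWarmup=300,
    )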
    
slide 61:
    Netflix
    PMCs on the Cloud
    Brendan
    
slide 62:
    90% CPU utilization:
    Busy | Waiting ("idle")
    
slide 63:
    90% CPU utilization:
    Busy | Waiting ("idle")
    Reality:
    Busy | Waiting ("stalled") | Waiting ("idle")
    
slide 64:
    # perf stat -a -- sleep 10
     Performance counter stats for 'system wide':
               [...]   task-clock (msec)          #  8.000 CPUs utilized
               7,562   context-switches           #  0.095 K/sec            (100.00%)
               1,157   cpu-migrations             #  0.014 K/sec            (100.00%)
             109,734   page-faults                #  0.001 M/sec            (100.00%)
     <not supported>   cycles
     <not supported>   stalled-cycles-frontend
     <not supported>   stalled-cycles-backend
     <not supported>   instructions
     <not supported>   branches
     <not supported>   branch-misses
        10.001715965   seconds time elapsed
    Performance Monitoring Counters (PMCs): not supported in most clouds
    
slide 65:
    # perf stat -a -- sleep 10
     Performance counter stats for 'system wide':
       641320.173626   task-clock (msec)          #   64.122 CPUs utilized          [100.00%]
           1,047,222   context-switches           #    0.002 M/sec                  [100.00%]
              83,420   cpu-migrations             #    0.130 K/sec                  [100.00%]
              38,905   page-faults                #    0.061 K/sec
     655,419,788,755   cycles                     #    1.022 GHz                    [75.02%]
     <not supported>   stalled-cycles-frontend
     <not supported>   stalled-cycles-backend
     536,830,399,277   instructions               #    0.82  insns per cycle        [75.02%]
      97,103,651,128   branches                   #  151.412 M/sec                  [74.99%]
       1,230,478,597   branch-misses              #    1.27% of all branches        [75.02%]
        10.001622154   seconds time elapsed
    AWS EC2 m4.16xl
    
slide 66:
    Interpreting IPC & Actionable Items
    IPC: Instructions Per Cycle (inverse of CPI)
    ● IPC < 1.0: likely stall cycle bound
    ● IPC > 1.0: likely instruction bound
    Reduce code execution: eliminate unnecessary work, cache operations,
    improve algorithm order. Can analyze using CPU flame graphs.
    Faster CPUs.
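
Using the m4.16xl numbers from slide 65 as a worked example:

    # IPC from the slide 65 perf stat output: instructions / cycles.
    instructions = 536_830_399_277
    cycles = 655_419_788_755
    print(f"IPC = {instructions / cycles:.2f}")   # 0.82: below 1.0, so the
                                                  # workload is likely stall bound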
    
slide 67:
    Intel Architectural PMCs
    Event Name                  Umask  Event Select  Example Event Mask Mnemonic
    UnHalted Core Cycles        00H    3CH           CPU_CLK_UNHALTED.THREAD_P
    Instruction Retired         00H    C0H           INST_RETIRED.ANY_P
    UnHalted Reference Cycles   01H    3CH           CPU_CLK_THREAD_UNHALTED.REF_XCLK
    LLC Reference               4FH    2EH           LONGEST_LAT_CACHE.REFERENCE
    LLC Misses                  41H    2EH           LONGEST_LAT_CACHE.MISS
    Branch Instruction Retired  00H    C4H           BR_INST_RETIRED.ALL_BRANCHES
    Branch Misses Retired       00H    C5H           BR_MISP_RETIRED.ALL_BRANCHES
    Now available in AWS EC2 on full dedicated hosts (eg, m4.16xl, …)
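
Where the named events are unavailable, these umask/event-select pairs can be given to perf as raw events (rUUEE: umask byte, then event-select byte). A small helper that builds the encodings from the table above:

    # Build perf raw event encodings (rUUEE) from the architectural PMC table.
    events = {
        "LLC Reference": (0x4F, 0x2E),
        "LLC Misses":    (0x41, 0x2E),
        "Branches":      (0x00, 0xC4),
        "Branch Misses": (0x00, 0xC5),
    }
    for name, (umask, evsel) in events.items():
        print(f"{name:14s} -> perf stat -e r{umask:02x}{evsel:02x} -a sleep 10")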
    
slide 68:
    # pmcarch 1
    CYCLES  INSTRUCTIONS  IPC   BR_RETIRED    BR_MISPRED  BMR%  LLCREF      LLCMISS  LLC%
    [...]   [...]         0.71  11760496978   [...]       1.48  1542464817  [...]    [...]
    [...]   [...]         0.78  10665897008   [...]       1.48  1361315177  [...]    [...]
    [...]   [...]         0.82   9538082731   [...]       1.44  1272163733  [...]    [...]
    [...]   [...]         0.78  12672090735   [...]       1.43  1685112288  [...]    [...]
    [...]   [...]         0.67  10542795714   [...]       1.37  1204703117  [...]    [...]
    [...]
    https://github.com/brendangregg/pmc-cloud-tools

    tiptop - [root]  Tasks: 96 total, 3 displayed  screen 0: default
      PID  [ %CPU]  %SYS  Mcycle  Minstr  IPC  %MISS  %BMIS  %BUS  COMMAND
    1319+    35.3   28.5  [...]                               0.0  java
    [... nm-applet, dbus-daemo rows truncated ...]
    
slide 69:
    Netflix
    Performance Meetup
    
slide 70:
    Netflix
    Performance Meetup