South Bay SRE meetup 2017: Netflix Performance Engineering

Talk by the Netflix Performance Engineering team for SBSRE 2017.

Video: https://www.youtube.com/watch?v=i5Ml9uY2rBw

Description: "A look into how Netflix measures and tunes performance for our clients and the streaming service."

This includes a section starting on slide 61 by Brendan Gregg titled "Netflix PMCs on the Cloud" showing low-level CPU performance analysis using PMCs on AWS EC2. (PMCs are performance monitoring counters from the performance monitoring unit [PMU]).


PDF: SBSRE_perf_meetup_aug2017.pdf

Keywords (from pdftotext):

slide 1:
    Netflix
    Performance Meetup
    
slide 2:
    Global Client Performance
    Fast Metrics
    
slide 3:
    3G in Kazakhstan
    
slide 4:
    Making the Internet fast
    is slow.
    Global Internet:
    faster (better networking)
    slower (broader reach, congestion)
    Don't wait for it, measure it and deal
    Working app > feature-rich app
    
slide 5:
    We need to know what the Internet looks like,
    without averages, seeing the full distribution.
    
slide 6:
    Logging Anti-Patterns
    Averages
    Sampling
    Can't see the distribution
    Missed data
    Outliers heavily distort
    Rare events
    ∞, 0, negatives, errors
    Problems aren't equal across the population
    Instead, use the client as a map-reducer and send up aggregated
    data, less often.
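
To make the "client as a map-reducer" idea concrete, here is a minimal sketch (Python for illustration; the field names and bucketing scheme are assumptions, since the talk shows no client code) of aggregating latency samples into a histogram on the client and uploading the aggregate at a low frequency, instead of logging every sample:

    import json
    import math
    import time

    # Sketch: aggregate latency samples client-side into power-of-two
    # histogram buckets, then upload the whole histogram infrequently,
    # preserving the distribution without per-sample logging.
    class ClientHistogram:
        def __init__(self):
            self.buckets = {}                     # bucket floor (ms) -> count

        def record(self, latency_ms):
            bound = 2 ** int(math.log2(latency_ms)) if latency_ms >= 1 else 0
            self.buckets[bound] = self.buckets.get(bound, 0) + 1

        def flush(self):
            payload = json.dumps({"ts": int(time.time()), "hist": self.buckets})
            self.buckets = {}
            return payload                        # hand to the app's upload path

    hist = ClientHistogram()
    for sample_ms in (3, 5, 48, 51, 49, 1200):    # made-up latencies
        hist.record(sample_ms)
    print(hist.flush())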
    
slide 7:
    Sizing up the Internet.
    
slide 8:
    Infinite (free) compute power!
    
slide 9:
slide 10:
    Get median, 95th, etc.
    Calculate the inverse empirical cumulative
    distribution function by math...
    ...or just use R, which is free and knows how
    to do it already:
    > library(HistogramTools)
    > iecdf
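
The same calculation in Python, for readers without R: a minimal nearest-rank inverse-ECDF sketch over raw samples (the slide's HistogramTools approach does the equivalent over histogram buckets):

    import math

    # Nearest-rank inverse empirical CDF: f(q) returns the value at or
    # below which a fraction q of the samples fall.
    def iecdf(samples):
        ordered = sorted(samples)
        def f(q):                                  # q in (0, 1]
            return ordered[max(0, math.ceil(q * len(ordered)) - 1)]
        return f

    latencies = [12, 15, 18, 22, 30, 45, 80, 120, 400, 900]   # made-up, ms
    f = iecdf(latencies)
    print(f(0.50), f(0.95))                        # median and 95th percentile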
slide 12:
slide 13:
    Data > Opinions.
    
slide 14:
    Better than debating opinions.
    "We live in a
    "No one really minds the
    50ms world!"
    spinner."
    "Why should we spend
    time on that instead of
    COOLFEATURE?"
    "There's no way that the
    client makes that many
    requests.”
    Architecture is hard. Make it cheap to experiment where your users really are.
    
slide 15:
    We built Daedalus
    [Visualization: request latency from Fast to Slow, broken into DNS Time vs. Elsewhere]
    
slide 16:
    Interpret the data
    Visual → Numerical: need the IECDF for
    percentiles:
    ƒ(0.50) = 50th (median)
    ƒ(0.95) = 95th
    Cluster to get similar experiences (pretty colors):
    k-means, hierarchical, etc.
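
To illustrate the clustering step, a sketch using scikit-learn's k-means (the slide names the algorithm, not a library; the latency pairs are made up), grouping per-region (median, 95th) profiles into similar experiences:

    # Cluster (median, 95th) latency pairs into "similar experience" groups.
    from sklearn.cluster import KMeans

    # Each row: [f(0.50), f(0.95)] in ms for one region/network (made up).
    profiles = [[50, 120], [55, 140], [300, 900], [320, 1100], [48, 130]]
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(profiles)
    print(labels)        # e.g. [0 0 1 1 0]: a fast cluster and a slow cluster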
    
slide 17:
slide 18:
slide 19:
slide 20:
slide 21:
    Practical Teleportation.
    Go there!
    Abstract analysis - hard
    Feeling reality is much simpler than looking at graphs. Build!
    
slide 22:
    Make a Reality Lab.
    
slide 23:
slide 24:
    Don't guess.
    Developing a model based on
    production data, without missing the
    distribution of samples (network, render,
    responsiveness) will lead to better
    software.
    Global reach doesn't need to be scary.
    @gcirino42 http://blogofsomeguy.com
    
slide 25:
    Icarus
    Martin Spier
    @spiermar
    Performance Engineering @ Netflix
    
slide 26:
slide 27:
    Problem & Motivation
    Real-user performance monitoring solution
    More insight into the App performance
    (as perceived by real users)
    Too many variables to trust synthetic
    tests and labs
    Prioritize work around App performance
    Track App improvement progress over time
    Detect issues, internal and external
    
slide 28:
    Device Diversity
    ● Netflix runs on all sorts of devices
    ● Smart TVs, Gaming Consoles, Mobile Phones, Cable TV boxes, ...
    ● Consistently evaluate performance
    
slide 29:
slide 30:
    What are we monitoring?
    User Actions
    (or things users do in the App)
    App Startup
    User Navigation
    Playing a Title
    Internal App metrics
    
slide 31:
    What are we measuring?
    ● When does the timer start and stop?
    ● Time-to-Interactive (TTI)
    ○ Interactive, even if
    some items were not fully
    loaded and rendered
    ● Time-to-Render (TTR)
    ○ Everything above the fold
    (visible without scrolling)
    is rendered
    ● Play Delay
    ● Meaningful for what we are monitoring
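
A minimal sketch of such per-action timers (hypothetical names; the real instrumentation is device-specific): the timer starts on the user action, marks TTR when above-the-fold content has rendered, and TTI when the view accepts input:

    import time

    # Hypothetical user-action timer: one start, multiple named marks.
    class UserAction:
        def __init__(self, name):
            self.name = name
            self.start = time.monotonic()
            self.marks = {}

        def mark(self, label):
            self.marks[label] = (time.monotonic() - self.start) * 1000  # ms

    action = UserAction("app_startup")
    # ... above-the-fold content rendered ...
    action.mark("ttr")
    # ... remaining items loaded, input handlers attached ...
    action.mark("tti")
    print(action.name, action.marks)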
    
slide 32:
    High-dimensional Data
    ● Complex device categorization
    ● Geo regions, subregions, countries
    ● Highly granular network
    classifications
    ● High volume of A/B tests
    ● Different facets of the same user action
    ○ Cold, suspended and backgrounded
    App startups
    ○ Target view/page on App startup
    
slide 33:
slide 34:
slide 35:
slide 36:
    Data Sketches
    ● Data structures that approximately
    resemble a much larger data set
    ● Preserve essential features!
    ● Significantly smaller!
    ● Faster to operate on!
    
slide 37:
    t-Digest
    ● t-Digest data structure
    ● Rank-based statistics
    (such as quantiles)
    ● Parallel friendly
    (can be merged!)
    ● Very fast!
    ● Really accurate!
    https://github.com/tdunning/t-digest
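
To show why merge-ability matters, here is a much cruder stand-in for a t-digest (a fixed-width-bucket histogram sketch; unlike t-digest its error is not adaptive near the tails): per-device sketches merge by addition, and the merged result still answers quantile queries:

    from collections import Counter

    # Fixed-width-bucket histogram sketch: small, merge-able, queryable.
    def sketch(samples, width=10):
        return Counter((s // width) * width for s in samples)

    def merge(a, b):
        return a + b                     # Counter addition merges sketches

    def quantile(sk, q):
        total, seen = sum(sk.values()), 0
        for bucket in sorted(sk):
            seen += sk[bucket]
            if seen >= q * total:
                return bucket
        return max(sk)

    node1 = sketch([12, 15, 22, 30, 45])             # made-up latencies, ms
    node2 = sketch([31, 80, 120, 400, 900])
    print(quantile(merge(node1, node2), 0.95))       # 95th across both nodes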
    
slide 38:
    + t-Digest sketches
    
slide 39:
slide 40:
    iOS Median Comparison, Break by Country
    
slide 41:
    iOS Median Comparison, Break by Country + iPhone 6S Plus
    
slide 42:
    CDFs by UI Version
    
slide 43:
    Warm Startup Rate
    
slide 44:
    A/B Cell Comparison
    
slide 45:
    Anomaly Detection
    
slide 46:
    Going Forward
    ● Resource utilization metrics
    ● Device profiling
    ○ Instrumenting client code
    ● Explore other visualizations
    ○ Frequency heat maps
    ● Connection between perceived
    performance, acquisition and
    retention
    @spiermar
    
slide 47:
    Netflix
    Autoscaling for experts
    Vadim
    
slide 48:
    Savings!
    ● Mid-tier stateless services are ~2/3 of the total
    ● Savings - 30% of mid-tier footprint (roughly 30K instances)
    ○ Higher savings if we break it down by region
    ○ Even higher savings on services that scale well
    
slide 49:
    Why we autoscale - philosophical reasons
    
slide 50:
    Why we autoscale - pragmatic reasons
    Encoding
    Precompute
    Failover
    Red/black pushes
    Curing cancer**
    And more...
    ** Hack-day project
    
slide 51:
    Should you autoscale?
    Benefits
    ● On-demand capacity: direct $$ savings
    ● RI capacity: re-purposing spare capacity
    However, for each server group, beware of
    ● Uneven distribution of traffic
    ● Sticky traffic
    ● Bursty traffic
    ● Small ASG sizes (
slide 52:
    Autoscaling impacts availability - true or false?
    ● Autoscaling is not a problem*
    ● The real problem is not knowing the performance characteristics of the
    service
    Under-provisioning, however, can impact availability
    * If done correctly
    
slide 53:
    AWS autoscaling mechanics
    Aggregated metric feed → CloudWatch alarm → notification → ASG scaling policy
    Tunables:
    ● Metric
    ● Threshold
    ● # of eval periods
    ● Scaling amount
    ● Warmup time
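
A sketch of wiring those pieces together with boto3 (the group name, metric, and thresholds are made up; this is a generic step-scaling example, not Netflix's actual configuration):

    import boto3

    autoscaling = boto3.client("autoscaling")
    cloudwatch = boto3.client("cloudwatch")

    # Scaling policy: holds the "scaling amount" and "warmup time" tunables.
    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName="myservice-asg",        # made-up name
        PolicyName="scale-up-on-rps",
        PolicyType="StepScaling",
        AdjustmentType="ChangeInCapacity",
        StepAdjustments=[{"MetricIntervalLowerBound": 0.0,
                          "ScalingAdjustment": 3}],  # scaling amount
        EstimatedInstanceWarmup=300,                 # warmup time, seconds
    )

    # CloudWatch alarm: holds the metric, threshold and eval-period tunables;
    # its notification triggers the policy above.
    cloudwatch.put_metric_alarm(
        AlarmName="myservice-high-rps",
        Namespace="MyService",                       # made-up custom metric
        MetricName="RequestsPerSecond",
        Statistic="Average",
        Period=60,
        EvaluationPeriods=3,                         # # of eval periods
        Threshold=1000.0,                            # threshold
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[policy["PolicyARN"]],
    )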
    
slide 54:
    What metric to scale on?
    Compared: resource utilization vs. throughput.
    Pros:
    Tracks a direct measure of work
    Linear scaling
    Predictable
    Requires less adjustment over time
    Cons:
    Thresholds tend to drift over time
    Prone to changes in request mixture
    Less predictable
    More oscillation / jitter
    
slide 55:
    Autoscaling on multiple metrics
    Proceed with caution
    ● Harder to reason about scaling behavior
    ● Different metrics might contradict each
    other, causing oscillation
    Typical Netflix configuration:
    ● Scale-up policy on throughput
    ● Scale-down policy on throughput
    ● Emergency scale-up policy on CPU, aka
    “the hammer rule”
    
slide 56:
    Well-behaved autoscaling
    
slide 57:
    Common mistakes - “no rush” scaling
    Problem: scaling amounts too
    small, cooldown too long
    Effect: scaling lags behind the
    traffic flow. Not enough
    capacity at peak, capacity
    wasted in trough
    Remedy: increase scaling
    amounts, migrate to step
    policies
    
slide 58:
    Common mistakes - twitchy scaling
    Problem: Scale-up policy is
    too aggressive
    Effect: unnecessary
    capacity churn
    Remedy: reduce scale-up
    amount, increase the # of
    eval periods
    
slide 59:
    Common mistakes - should I stay or should I go
    Problem: -up and -down
    thresholds are too close to each
    other
    Effect: constant capacity
    oscillation
    Remedy: move -up and -down
    thresholds farther apart
    
slide 60:
    AWS target tracking - your best bet!
    Think of it as a step policy with auto-steps
    You can also think of it as a thermostat
    Accounts for the rate of change in monitored metric
    Pick a metric, set the target value and warmup time - that’s it!
    [Chart: step vs. target-tracking scaling behavior]
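
For comparison, a boto3 sketch of a target-tracking policy (the ASG name and target value are made up): pick the metric, target, and warmup, and AWS derives the steps:

    import boto3

    # Target tracking: the "thermostat". No thresholds, steps or eval
    # periods to tune; AWS holds the metric at the target value.
    boto3.client("autoscaling").put_scaling_policy(
        AutoScalingGroupName="myservice-asg",        # made-up name
        PolicyName="track-cpu-50",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": 50.0,                     # made-up target
        },
        EstimatedInstanceWarmup=300,
    )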
    
slide 61:
    Netflix
    PMCs on the Cloud
    Brendan
    
slide 62:
    90% CPU utilization:
    Busy | Waiting ("idle")
    
slide 63:
    90% CPU utilization:
    Busy | Waiting ("idle")
    Reality:
    Busy | Waiting ("stalled") | Waiting ("idle")
    
slide 64:
    # perf stat -a -- sleep 10
     Performance counter stats for 'system wide':
               [...]   task-clock (msec)          #  8.000 CPUs utilized
               7,562   context-switches           #  0.095 K/sec            (100.00%)
               1,157   cpu-migrations             #  0.014 K/sec            (100.00%)
             109,734   page-faults                #  0.001 M/sec            (100.00%)
     <not supported>   cycles
     <not supported>   stalled-cycles-frontend
     <not supported>   stalled-cycles-backend
     <not supported>   instructions
     <not supported>   branches
     <not supported>   branch-misses
        10.001715965   seconds time elapsed
    Performance Monitoring Counters (PMCs): not supported in most clouds
    
slide 65:
    # perf stat -a -- sleep 10
     Performance counter stats for 'system wide':
       641320.173626   task-clock (msec)          #   64.122 CPUs utilized          [100.00%]
           1,047,222   context-switches           #    0.002 M/sec                  [100.00%]
              83,420   cpu-migrations             #    0.130 K/sec                  [100.00%]
              38,905   page-faults                #    0.061 K/sec
     655,419,788,755   cycles                     #    1.022 GHz                    [75.02%]
     <not supported>   stalled-cycles-frontend
     <not supported>   stalled-cycles-backend
     536,830,399,277   instructions               #    0.82  insns per cycle        [75.02%]
      97,103,651,128   branches                   #  151.412 M/sec                  [74.99%]
       1,230,478,597   branch-misses              #    1.27% of all branches        [75.02%]
        10.001622154   seconds time elapsed
    AWS EC2 m4.16xl
    
slide 66:
    Interpreting IPC & Actionable Items
    IPC: Instructions Per Cycle (inverse of CPI)
    ● IPC < 1.0: likely stall cycle bound
    ● IPC > 1.0: likely instruction bound
    Reduce code execution: eliminate unnecessary work, cache operations,
    improve algorithm order. Can analyze using CPU flame graphs.
    Faster CPUs.
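
Using the m4.16xl numbers from slide 65 as a worked example:

    # IPC from the slide 65 perf stat output: instructions / cycles.
    instructions = 536_830_399_277
    cycles = 655_419_788_755
    print(f"IPC = {instructions / cycles:.2f}")   # 0.82: below 1.0, so the
                                                  # workload is likely stall bound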
    
slide 67:
    Intel Architectural PMCs
    Event Name                  Umask  Event Select  Example Event Mask Mnemonic
    UnHalted Core Cycles        00H    3CH           CPU_CLK_UNHALTED.THREAD_P
    Instruction Retired         00H    C0H           INST_RETIRED.ANY_P
    UnHalted Reference Cycles   01H    3CH           CPU_CLK_THREAD_UNHALTED.REF_XCLK
    LLC Reference               4FH    2EH           LONGEST_LAT_CACHE.REFERENCE
    LLC Misses                  41H    2EH           LONGEST_LAT_CACHE.MISS
    Branch Instruction Retired  00H    C4H           BR_INST_RETIRED.ALL_BRANCHES
    Branch Misses Retired       00H    C5H           BR_MISP_RETIRED.ALL_BRANCHES
    Now available in AWS EC2 on full dedicated hosts (eg, m4.16xl, …)
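
Where the named events are unavailable, these umask/event-select pairs can be given to perf as raw events (rUUEE: umask byte, then event-select byte). A small helper that builds the encodings from the table above:

    # Build perf raw event encodings (rUUEE) from the architectural PMC table.
    events = {
        "LLC Reference": (0x4F, 0x2E),
        "LLC Misses":    (0x41, 0x2E),
        "Branches":      (0x00, 0xC4),
        "Branch Misses": (0x00, 0xC5),
    }
    for name, (umask, evsel) in events.items():
        print(f"{name:14s} -> perf stat -e r{umask:02x}{evsel:02x} -a sleep 10")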
    
slide 68:
    # pmcarch 1
    CYCLES  INSTRUCTIONS  IPC   BR_RETIRED    BR_MISPRED  BMR%  LLCREF      LLCMISS  LLC%
    [...]   [...]         0.71  11760496978   [...]       1.48  1542464817  [...]    [...]
    [...]   [...]         0.78  10665897008   [...]       1.48  1361315177  [...]    [...]
    [...]   [...]         0.82   9538082731   [...]       1.44  1272163733  [...]    [...]
    [...]   [...]         0.78  12672090735   [...]       1.43  1685112288  [...]    [...]
    [...]   [...]         0.67  10542795714   [...]       1.37  1204703117  [...]    [...]
    [...]
    https://github.com/brendangregg/pmc-cloud-tools

    tiptop - [root]  Tasks: 96 total, 3 displayed  screen 0: default
      PID  [ %CPU]  %SYS  Mcycle  Minstr  IPC  %MISS  %BMIS  %BUS  COMMAND
    1319+    35.3   28.5  [...]                               0.0  java
    [... nm-applet, dbus-daemo rows truncated ...]
    
slide 69:
    Netflix
    Performance Meetup
    
slide 70:
    Netflix
    Performance Meetup