YOW! 2018: Cloud Performance Root Cause Analysis at Netflix
Keynote by Brendan Gregg for YOW! 2018.
Video: https://www.youtube.com/watch?v=tAY8PnfrS_k
Description: "At Netflix, improving the performance of our cloud means happier customers and lower costs, and involves root cause analysis of applications, runtimes, operating systems, and hypervisors, in an environment of 150k cloud instances that undergo numerous production changes each week. Apart from the developers who regularly optimize their own code, we also have a dedicated performance team to help with any issue across the cloud, and to build tooling to aid in this analysis. In this session we will summarize the Netflix environment, procedures, and tools we use and build to do root cause analysis on cloud performance issues. The analysis performed may be cloud-wide, using self-service GUIs such as our open source Atlas tool, or focused on individual instances, and use our open source Vector tool, flame graphs, Java debuggers, and tooling that uses Linux perf, ftrace, and bcc/eBPF. You can use these open source tools in the same way to find performance wins in your own environment."
PDF: YOW2018_CloudPerfRCANetflix.pdf
Keywords (from pdftotext):
slide 1:
Cloud Performance Root Cause Analysis at Netflix Brendan Gregg Senior Performance Architect Cloud and Platform Engineering YOW! Conference Australia Nov-Dec 2018
slide 2:
Experience: CPU Dips
slide 3:
slide 4:
# perf record -F99 -a
# perf script
[…]
java 14327 [022] 252764.179741: cycles:  7f36570a4932 SpinPause (/usr/lib/jvm/java-8
java 14315 [014] 252764.183517: cycles:  7f36570a4932 SpinPause (/usr/lib/jvm/java-8
java 14310 [012] 252764.185317: cycles:  7f36570a4932 SpinPause (/usr/lib/jvm/java-8
java 14332 [015] 252764.188720: cycles:  7f3658078350 pthread_cond_wait@@GLIBC_2.3.2
java 14341 [019] 252764.191307: cycles:  7f3656d150c8 ClassLoaderDataGraph::do_unloa
java 14341 [019] 252764.198825: cycles:  7f3656d140b8 ClassLoaderData::free_dealloca
java 14341 [019] 252764.207057: cycles:  7f3657192400 nmethod::do_unloading(BoolObje
java 14341 [019] 252764.215962: cycles:  7f3656ba807e Assembler::locate_operand(unsi
java 14341 [019] 252764.225141: cycles:  7f36571922e8 nmethod::do_unloading(BoolObje
java 14341 [019] 252764.234578: cycles:  7f3656ec4960 CodeHeap::block_start(void*) c
[…]
slide 5:
slide 6:
slide 7:
Observability Methodology Velocity
slide 8:
Root Cause Analysis at Netflix Devices gRPC Zuul 2 Load Ribbon Hystrix Eureka Service Tomcat JVM Instances (Linux) AZ 3 AZ 1 AZ 2 ASG 1 ELB ASG Cluster Application Netflix Roots ASG 2 Atlas Chronos Zipkin Vector sar, *stat ftrace bcc/eBPF bpftrace PMCs, MSRs
slide 9:
Agenda 1. The Netflix Cloud 2. Methodology 3. Cloud Analysis 4. Instance Analysis
slide 10:
Since 2014 Asgard → Spinnaker Spinnaker Salp → Spinnaker Zipkin gRPC adoption New Atlas UI & Lumen Java frame pointer eBPF bcc & bpftrace PMCs in EC2 From Clouds to Roots (2014 presentation): Old Atlas UI
slide 11:
>150k AWS EC2 server instances ~34% US Internet traffic at night >130M members Performance is customer satisfaction & Netflix costs
slide 12:
Acronyms AWS: Amazon Web Services EC2: AWS Elastic Compute 2 (cloud instances) S3: AWS Simple Storage Service (object store) ELB: AWS Elastic Load Balancers SQS: AWS Simple Queue Service SES: AWS Simple Email Service CDN: Content Delivery Network OCA: Netflix Open Connect Appliance (streaming CDN) QoS: Quality of Service AMI: Amazon Machine Image (instance image) ASG: Auto Scaling Group AZ: Availability Zone NIWS: Netflix Internal Web Service framework (Ribbon) gRPC: gRPC Remote Procedure Calls MSR: Model Specific Register (CPU info register) PMC: Performance Monitoring Counter (CPU perf counter) eBPF: extended Berkeley Packet Filter (kernel VM)
slide 13:
1. The Netflix Cloud Overview
slide 14:
The Netflix Cloud EC2 ELB Cassandra Applications (Services) Elasticsearch EVCache SES SQS
slide 15:
Netflix Microservices Authentication Web Site API User Data Personalization EC2 Client Devices Streaming API Viewing Hist. DRM QoS Logging OCA CDN CDN Steering Encoding
slide 16:
Freedom and Responsibility Culture deck memo is true https://jobs.netflix.com/culture Deployment freedom Purchase and use cloud instances without approvals Netflix environment changes fast!
slide 17:
Cloud Technologies Usually open source Linux, Java, Cassandra, Node.js, … http://netflix.github.io/
slide 18:
Cloud Instances Linux (Ubuntu) Optional Apache, memcached, non-Java apps (incl. Node.js, golang) Atlas monitoring, S3 log rotation, ftrace, perf, bcc/eBPF Java (JDK 8) GC and thread dump logging Tomcat Application war files, base servlet, platform, hystrix, health check, metrics (Servo) Typical BaseAMI
slide 19:
5 Key Issues And How the Netflix Cloud is Architected to Solve Them
slide 20:
1. Load Increases → Spinnaker Auto Scaling Groups Instances automatically added or removed by a custom scaling policy Alerts & monitoring used to check scaling is sane Good for customers: Fast workaround Good for engineers: Fix later, 9-5 ASG CloudWatch, Servo Scaling Policy loadavg, latency, … Instance Instance Instance Instance
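As a generic illustration of a scaling policy (not the Netflix tooling, which drives this through Spinnaker; the group and policy names here are made up), a target-tracking policy can be attached to an ASG with the AWS CLI:
$ aws autoscaling put-scaling-policy \
    --auto-scaling-group-name myservice-v011 \
    --policy-name keep-cpu-near-60 \
    --policy-type TargetTrackingScaling \
    --target-tracking-configuration '{"PredefinedMetricSpecification":
      {"PredefinedMetricType": "ASGAverageCPUUtilization"}, "TargetValue": 60.0}'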
slide 21:
2. Bad Push → Spinnaker ASG Cluster Rollback ASG red black clusters: how code versions are deployed Fast rollback for issues Traffic managed by Elastic Load Balancers (ELBs) Automated Canary Analysis (ACA) for testing ASG Cluster prod1 ELB Canary ASG-v010 ASG-v011 Instance Instance Instance Instance Instance Instance
slide 22:
3. Instance Failure → Spinnaker Hystrix Timeouts Hystrix: latency and fault tolerance for dependency services Fallbacks, degradation, fast fail and rapid recovery, timeouts, load shedding, circuit breaker, realtime monitoring Plus Ribbon or gRPC for more fault tolerance Tomcat Application get A Hystrix >100ms Dependency Dependency
slide 23:
4. Region failure → Spinnaker Zuul 2 Reroute Traffic All device traffic goes through the Zuul 2 proxy: dynamic routing, monitoring, resiliency, security Region or AZ failure: reroute traffic to another region Zuul 2, DNS Region 1 Region 2 Monitoring Region 3
slide 24:
5. Overlooked Issue → Spinnaker Chaos Engineering Instances: termination (Resilience) Availability Zones: artificial failures Latency: artificial delays by ChAP Conformity: kills non-best-practices instances Doctor: health checks Janitor: kills unused instances Security: checks violations 10-18: geographic issues
slide 25:
A Resilient Architecture Devices gRPC Zuul 2 Load Ribbon Hystrix Eureka Service Tomcat JVM Instances (Linux) AZ 3 AZ 1 AZ 2 ASG 1 ELB Some services vary: - Apache Web Server - Node.js & Prana - golang ASG Cluster Application Netflix Chaos Engineering ASG 2
slide 26:
2. Methodology Cloud & Instance
slide 27:
Why Do Root Cause Perf Analysis? Netflix Application ASG Cluster … ASG 2 Often for: High latency Growth Upgrades ELB ASG 1 AZ 3 AZ 2 AZ 1 Instances (Linux) JVM Tomcat Service
slide 28:
Cloud Methodologies Resource Analysis Metric and event correlations Latency Drilldowns RED Method For each microservice, check: - Rate - Errors - Duration Service A Service C Service B Service D
slide 29:
Instance Methodologies Log Analysis Micro-benchmarking Drill-down analysis USE Method For each resource, check: - Utilization - Saturation - Errors CPU Memory Disk Controller Network Controller Disk Net Disk Net
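As a rough sketch, the USE method maps onto stock Linux tools along these lines (one common mapping, not the only one):
# CPU: utilization, saturation (run queue length), errors
$ mpstat -P ALL 1
$ vmstat 1            # "r" column: runnable threads (saturation)
# Memory: utilization, saturation (swapping), errors
$ free -m
$ vmstat 1            # "si"/"so" columns: swap-ins/outs (saturation)
# Disk: utilization, saturation (queueing), errors
$ iostat -xz 1        # %util, avgqu-sz, await
# Network: utilization, saturation, errors
$ sar -n DEV,EDEV 1   # throughput plus error/drop counters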
slide 30:
Bad Instance Anti-Method 1. Plot request latency per-instance 2. Find the bad instance 3. Terminate it 4. Someone else’s problem now! Bad instance latency Terminate! Could be an early warning of a bigger issue
slide 31:
3. Cloud Analysis Atlas, Lumen, Chronos, ...
slide 32:
Netflix Cloud Analysis Process Example path enumerated Atlas Alerts PICSOU Slack Cost Chat 1. Check Issue Atlas/Lumen Dashboards 2. Check Events Chronos Create New Alert Plus some other tools not pictured Redirected to a new Target 3. Drill Down Atlas Metrics 4. Check Dependencies 5. Root Cause Instance Analysis Slalom Zipkin
slide 33:
Atlas: Alerts Custom alerts on streams per second (SPS) changes, CPU usage, latency, ASG growth, client errors, …
slide 34:
slide 35:
Winston: Automated Diagnostics & Remediation Links to Atlas Dashboards & Metrics Chronos: Possible Related Events
slide 36:
Atlas: Dashboards
slide 37:
Atlas: Dashboards Netflix perf vitals dashboard 1. RPS, CPU 2. Volume 3. Instances 4. Scaling 5. CPU/RPS 6. Load avg 7. Java heap 8. ParNew 9. Latency 10. 99th tile
slide 38:
Atlas & Lumen: Custom Dashboards Dashboards are a checklist methodology: what to show first, second, third... Starting point for issues 1. Confirm and quantify issue 2. Check historic trend 3. Atlas metrics to drill down Lumen: more flexible dashboards eg, go/burger
slide 39:
Atlas: Metrics
slide 40:
Atlas: Metrics Region Application Metrics Presentation Interactive graph Summary statistics Time range
slide 41:
Atlas: Metrics All metrics in one system System metrics: CPU usage, disk I/O, memory, … Application metrics: latency percentiles, errors, … Filters or breakdowns by region, application, ASG, metric, instance URL has session state: shareable
slide 42:
Chronos: Change Tracking
slide 43:
Chronos: Change Tracking Scope Time Range Event Log
slide 44:
Slalom: Dependency Graphing
slide 45:
Slalom: Dependency Graphing Dependency App Traffic Volume
slide 46:
Zipkin UI: Dependency Tracing Dependency Latency
slide 47:
PICSOU: AWS Usage Breakdowns Cost per hour Details (redacted)
slide 48:
Slack: Chat Latency is high in us-east-1 Sorry We just did a bad push
slide 49:
Netflix Cloud Analysis Process Example path enumerated Atlas Alerts PICSOU Slack Cost Chat 1. Check Issue Atlas/Lumen Dashboards 2. Check Events Chronos Create New Alert Plus some other tools not pictured Redirected to a new Target 3. Drill Down Atlas Metrics 4. Check Dependencies 5. Root Cause Instance Analysis Slalom Zipkin
slide 50:
Generic Cloud Analysis Process Example path enumerated Alerts Usage Reports 1. Check Issue Cost Custom Dashboards 2. Check Events Change Tracking Create New Alert Plus other tools as needed Messaging Redirected to a new Target 3. Drill Down Metric Analysis 4. Check Dependencies 5. Root Cause Instance Analysis Chat Dependency Analysis
slide 51:
4. Instance Analysis 1. Statistics 2. Profiling 3. Tracing 4. Processor Analysis
slide 52:
slide 53:
slide 54:
1. Statistics
slide 55:
Linux Tools vmstat, pidstat, sar, etc, used mostly normally
$ sar -n TCP,ETCP,DEV 1
Linux 4.15.0-1027-aws (xxx)  12/03/2018  _x86_64_  (48 CPU)
09:43:53 PM  IFACE  rxpck/s  txpck/s  rxkB/s  txkB/s  rxcmp/s  txcmp/s  rxmcst/s  %ifutil
09:43:54 PM  eth0  33744.00  19361.43  28065.36
09:43:53 PM  active/s  passive/s  iseg/s  oseg/s
09:43:54 PM
09:43:53 PM  atmptf/s  estres/s  retrans/s  isegerr/s  orsts/s
09:43:54 PM
[…]
Micro-benchmarking can be used to investigate hypervisor behavior that can’t be observed directly
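For example, a crude syscall-rate micro-benchmark can expose hypervisor or KPTI overhead by comparison across instance types (dd here is just a convenient syscall generator):
$ perf stat -- dd if=/dev/zero of=/dev/null bs=1 count=1000000
Compare cycles, instructions, and elapsed time for the same command on the instance types in question.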
slide 56:
Exception: Containers Most Linux tools are still not container aware From the container, will show the full host We expose cgroup metrics in our cloud GUIs: Vector
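A minimal sketch of reading container (cgroup) metrics directly, assuming cgroup v1 mount points (paths differ under cgroup v2 and by container runtime):
# CPU throttling caused by the container's CPU quota; fields: nr_periods, nr_throttled, throttled_time
$ cat /sys/fs/cgroup/cpu,cpuacct/cpu.stat
# Per-cgroup memory usage and limit
$ cat /sys/fs/cgroup/memory/memory.usage_in_bytes /sys/fs/cgroup/memory/memory.limit_in_bytes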
slide 57:
Vector: Instance/Container Analysis
slide 58:
2. Profiling
slide 59:
Experience: “ZFS is eating my CPUs”
slide 60:
CPU Mixed-Mode Flame Graph Application (truncated) 38% kernel time (why?)
slide 61:
Zoomed
slide 62:
2014: Java Profiling Java Profilers System Profilers
slide 63:
2018: Java Profiling Kernel Java JVM CPU Mixed-mode Flame Graph
slide 64:
CPU Flame Graph
slide 65:
CPU Flame Chart (same data)
slide 66:
CPU Flame Graphs g() e() f() d() c() i() b() h() a()
slide 67:
CPU Flame Graphs Y-axis: stack depth 0 at bottom 0 at top == icicle graph X-axis: alphabet Top edge: Who is running on CPU And how much (width) Time == flame chart Color: random g() Hues often used for language types Can be a dimension eg, CPI e() Ancestry f() d() c() i() b() h() a()
slide 68:
Application Profiling Primary approach: CPU mixed-mode flame graphs (eg, via Linux perf) May need frame pointers (eg, Java -XX:+PreserveFramePointer) May need a symbol file (eg, Java perf-map-agent, Node.js --perf-basic-prof) Secondary: Application profiler (eg, via Lightweight Java Profiler) Application logs
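A sketch of that primary approach for Java, assuming the open source perf-map-agent and FlameGraph repositories are checked out locally (paths and the PID are placeholders):
# 1. Run the JVM with frame pointers so perf can walk Java stacks
$ java -XX:+PreserveFramePointer ...
# 2. Sample all CPUs at 99 Hertz for 30 seconds
$ sudo perf record -F 99 -a -g -- sleep 30
# 3. Dump JIT symbols (writes /tmp/perf-<pid>.map), then render the SVG
$ ./perf-map-agent/bin/create-java-perf-map.sh <java_pid>
$ sudo perf script | ./FlameGraph/stackcollapse-perf.pl | \
    ./FlameGraph/flamegraph.pl --color=java > flame.svg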
slide 69:
Vector: Push-button Flame Graphs
slide 70:
Future: eBPF-based Profiling
Linux 2.6: perf record → perf.data → perf script → stackcollapse-perf.pl → flamegraph.pl
Linux 4.9: profile.py → flamegraph.pl
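For the Linux 4.9+ path, a minimal sketch using the bcc profile tool, which aggregates stacks in kernel context and emits folded output (option letters can vary between bcc versions):
$ sudo /usr/share/bcc/tools/profile -F 99 -adf 30 > out.folded
$ ./FlameGraph/flamegraph.pl --color=java out.folded > flame.svg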
slide 71:
3. Tracing
slide 72:
slide 73:
Core Linux Tracers Ftrace 2.6.27+ Tracing views Plus other kernel tech: kprobes, uprobes perf 2.6.31+ Official profiler & tracer eBPF 4.9+ Programmatic engine bcc Complex tools bpftrace - Short scripts
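As a quick orientation, one example invocation per front-end (tool locations are typical packaging paths and may differ):
# ftrace, via the perf-tools front-ends
$ sudo /apps/perf-tools/bin/funccount 'vfs_*'
# perf tracepoints
$ sudo perf record -e block:block_rq_issue -a -- sleep 10; sudo perf script
# bcc/eBPF canned tool
$ sudo /usr/share/bcc/tools/execsnoop
# bpftrace one-liner
$ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'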
slide 74:
Experience: Disk %Busy
slide 75:
# iostat -x 1
[…]
avg-cpu:  %user %nice %system %iowait %steal %idle
Device:  rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda
xvdb
xvdj
0.00 139.00
0.00 1056.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.30 87.60
[…]
slide 76:
# /apps/perf-tools/bin/iolatency 10
Tracing block I/O. Output every 10 seconds. Ctrl-C to end.
>=(ms) .. <(ms)  : I/O  |Distribution                          |
slide 77:
 0 -> 1          : 421  |######################################|
 1 -> 2          : 95   |#########
 2 -> 4          : 48   |#####
 4 -> 8          : 108  |##########
 8 -> 16         : 363  |#################################
16 -> 32         : 66   |######
32 -> 64         : 3
64 -> 128        : 7
# /apps/perf-tools/bin/iosnoop
Tracing block I/O. Ctrl-C to end.
COMM  PID  TYPE  DEV  BLOCK  BYTES  LATms
java 30603 RM 202,144 1670768496
cat 202,0
cat 202,0
cat 202,0
java 30603 RM 202,144 620864512
java 30603 RM 202,144 584767616
java 30603 RM 202,144 601721984
java 30603 RM 202,144 603721568
java 30603 RM 202,144 61067936
java 30603 RM 202,144 1678557024
java 30603 RM 202,144 55299456
java 30603 RM 202,144 1625084928
java 30603 RM 202,144 618895408
java 30603 RM 202,144 581318480
java 30603 RM 202,144 1167348016
java 30603 RM 202,144 51561280
[...]
slide 78:
# perf record -e block:block_rq_issue --filter 'rwbs ~ "*M*"' -g -a
# perf report -n --stdio
[...]
# Overhead  Samples  Command  Shared Object  Symbol
# ........  ............  ............  .................  ....................
70.70%  java  [kernel.kallsyms]  [k] blk_peek_request
    --- blk_peek_request
    do_blkif_request
    __blk_run_queue
    queue_unplugged
    blk_flush_plug_list
    blk_finish_plug
    _xfs_buf_ioapply
    xfs_buf_iorequest
    |--88.84%-- _xfs_buf_read
    xfs_buf_read_map
    |--87.89%-- xfs_trans_read_buf_map
    |--97.96%-- xfs_imap_to_bp
    xfs_iread
    xfs_iget
    xfs_lookup
    xfs_vn_lookup
    lookup_real
    __lookup_hash
    lookup_slow
    path_lookupat
    filename_lookup
    user_path_at_empty
    user_path_at
    vfs_fstatat
    |--99.48%-- SYSC_newlstat
    sys_newlstat
    system_call_fastpath
    __lxstat64
    Lsun/nio/fs/UnixNativeDispatcher;.lstat0
    0x7f8f963c847c
slide 79:
slide 80:
# /usr/share/bcc/tools/biosnoop
TIME(s)  COMM  PID  DISK  SECTOR  BYTES  LAT(ms)
tar  xvda
tar  xvda
tar  xvda
[...]
slide 81:
eBPF
slide 82:
eBPF: extended Berkeley Packet Filter User-Defined BPF Programs SDN Configuration DDoS Mitigation Kernel Runtime Event Targets verifier sockets Intrusion Detection Container Security kprobes BPF Observability Firewalls (bpfilter) Device Drivers uprobes tracepoints BPF actions perf_events
slide 83:
slide 84:
bcc
# /usr/share/bcc/tools/tcplife
PID  COMM  LADDR  LPORT  RADDR  RPORT  TX_KB  RX_KB  MS
2509 java  8078 100.82.130.159  0 5.44
2509 java  8078 100.82.78.215  0 135.32
2509 java  60778 100.82.207.252  13 15126.87
2509 java  38884 100.82.208.178  0 15568.25
2509 java  4243 127.0.0.1  0 0.61
12030 upload-mes 127.0.0.1  34020 127.0.0.1  0 3.38
12030 upload-mes 127.0.0.1  21196 127.0.0.1  0 12.61
3964 mesos-slav 127.0.0.1  7101 127.0.0.1  0 12.64
12021 upload-sys 127.0.0.1  34022 127.0.0.1  0 15.28
2509 java  8078 127.0.0.1  372 15.31
2235 dockerd  13730 100.82.136.233  4 18.50
2235 dockerd  34314 100.82.64.53  8 56.73
[...]
slide 85:
bpftrace
# biolatency.bt
Attaching 3 probes...
Tracing block device I/O... Hit Ctrl-C to end.
@usecs:
[256, 512)      2 |                                                    |
[512, 1K)      10 |@                                                   |
[1K, 2K)      426 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K)      230 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[4K, 8K)        9 |@                                                   |
[8K, 16K)     128 |@@@@@@@@@@@@@@@                                     |
[16K, 32K)     68 |@@@@@@@@                                            |
[32K, 64K)      0 |                                                    |
[64K, 128K)     0 |                                                    |
[128K, 256K)   10 |@                                                   |
slide 86:
bpftrace: biolatency.bt
#!/usr/local/bin/bpftrace

BEGIN
{
	printf("Tracing block device I/O... Hit Ctrl-C to end.\n");
}

kprobe:blk_account_io_start
{
	@start[arg0] = nsecs;
}

kprobe:blk_account_io_completion
/@start[arg0]/
{
	@usecs = hist((nsecs - @start[arg0]) / 1000);
	delete(@start[arg0]);
}
slide 87:
Future: eBPF GUIs
slide 88:
4. Processor Analysis
slide 89:
What “90% CPU Utilization” might suggest: What it typically means on the Netflix cloud:
slide 90:
PMCs Performance Monitoring Counters help you analyze stalls Some instances (eg. Xen-based m4.16xl) have the architectural set:
slide 91:
Instructions Per Cycle (IPC) “good*” >2.0 Instruction bound IPC “bad” * probably; exception: spin locks
slide 92:
PMCs: EC2 Xen Hypervisor
# perf stat -a -- sleep 30
Performance counter stats for 'system wide':
1,103,112  189,173  4,044  2,057,164,531,949
slide 93:
1,357,979,592,699  243,244,156,173  4,391,259,112
task-clock (msec)  context-switches  cpu-migrations  page-faults  cycles  stalled-cycles-frontend  stalled-cycles-backend  instructions  branches  branch-misses
64.034 CPUs utilized  0.574 K/sec  0.098 K/sec  0.002 K/sec  1.071 GHz  (100.00%) (100.00%) (100.00%)  0.66 insns per cycle  126.617 M/sec  1.81% of all branches  (75.01%) (74.99%) (75.00%) (75.00%)
30.001112466 seconds time elapsed
# ./pmcarch 1
CYCLES  INSTRUCTIONS  IPC  BR_RETIRED  BR_MISPRED  BMR%  LLCREF  LLCMISS  LLC%
[...]
0.66 4692322525  0.65 5286747667  0.70 4616980753  0.69 5055959631
1.95 780435112  1.81 751335355  1.87 709841242  1.83 787333902
PMCs: EC2 Nitro Hypervisor
Some instance types (large, Nitro-based) support most PMCs! Meltdown KPTI patch TLB miss analysis on a c5.9xl:
nopti:
# tlbstat -C0 1
K_CYCLES  K_INSTR  IPC  DTLB_WALKS  ITLB_WALKS  K_DTLBCYC  K_ITLBCYC  DTLB%  ITLB%
[...]
0.86 565  0.86 950  0.86 396  0.00 0.00  0.00 0.00  0.00 0.00
pti, nopcid:
# tlbstat -C0 1
K_CYCLES  K_INSTR  IPC  DTLB_WALKS  ITLB_WALKS  K_DTLBCYC  K_ITLBCYC  DTLB%  ITLB%
[...]
0.10 89709496  0.10 88829158  0.10 89683045  0.10 79055465  27.40 22.63  27.28 22.52  27.29 22.55  27.40 22.63
worst case
slide 94:
MSRs Model Specific Registers System config info, including current clock rate:
# showboost
Base CPU MHz : 2500
Set CPU MHz : 2500
Turbo MHz(s) : 3100 3200 3300 3500
Turbo Ratios : 124% 128% 132% 140%
CPU 0 summary every 1 seconds...
TIME  C0_MCYC  C0_ACYC  UTIL  RATIO  MHz
23:39:07  64%
23:39:08  70%
23:39:09  99%
slide 95:
Summary Take-aways
slide 96:
Take Aways
1. Get push-button CPU flame graphs: kernel & user
2. Check out eBPF perf tools: bcc, bpftrace
3. Measure IPC as well as CPU utilization using PMCs
90% CPU busy: … really means:
slide 97:
Observability Methodology Velocity
slide 98:
Observability Statistics, Flame Graphs, eBPF Tracing, Cloud PMCs Methodology USE method, RED method, Drill-down Analysis, … Velocity Self-service GUIs: Vector, FlameScope, …
slide 99:
Resources
2014 talk From Clouds to Roots:
http://www.slideshare.net/brendangregg/netflix-from-clouds-to-roots
http://www.youtube.com/watch?v=H-E0MQTID0g
Chaos: https://medium.com/netflix-techblog/chap-chaos-automation-platform-53e6d528371f
https://principlesofchaos.org/
Atlas: https://github.com/Netflix/Atlas
Atlas: https://medium.com/netflix-techblog/introducing-atlas-netflixs-primary-telemetry-platform-bd31f4d8ed9a
RED method: https://thenewstack.io/monitoring-microservices-red-method/
USE method: https://queue.acm.org/detail.cfm?id=2413037
Winston: https://medium.com/netflix-techblog/introducing-winston-event-driven-diagnostic-and-remediation-platform-46ce39aa81cc
Lumen: https://medium.com/netflix-techblog/lumen-custom-self-service-dashboarding-for-netflix-8c56b541548c
Flame graphs: http://www.brendangregg.com/flamegraphs.html
Java flame graphs: https://medium.com/netflix-techblog/java-in-flames-e763b3d32166
Vector: http://vectoross.io https://github.com/Netflix/Vector
FlameScope: https://github.com/Netflix/FlameScope
Tracing ponies: thanks Deirdré Straughan & General Zoi's Pony Creator
ftrace: http://lwn.net/Articles/608497/ - usually already in your kernel
perf: http://www.brendangregg.com/perf.html - perf is usually packaged in linux-tools-common
tcplife: https://github.com/iovisor/bcc - often available as a bcc or bcc-tools package
bpftrace: https://github.com/iovisor/bpftrace
pmcarch: https://github.com/brendangregg/pmc-cloud-tools
showboost: https://github.com/brendangregg/msr-cloud-tools - also try turbostat
slide 100:
Netflix Tech Blog
slide 101:
Thank you. Brendan Gregg @brendangregg