USENIX SREcon 2016: Performance Checklists for SREs
Talk from SREcon2016 by Brendan Gregg.
Video: https://www.youtube.com/watch?v=zxCWXNigDpA
Video: https://www.usenix.org/conference/srecon16/program/presentation/gregg
Description: "There's limited time for performance analysis in the emergency room. When there is a performance-related site outage, the SRE team must analyze and solve complex performance issues as quickly as possible, and under pressure. Many performance tools and techniques are designed for a different environment: an engineer analyzing their system over the course of hours or days, and given time to try dozens of tools: profilers, tracers, monitoring tools, benchmarks, as well as different tunings and configurations. But when Netflix is down, minutes matter, and there's little time for such traditional systems analysis. As with aviation emergencies, short checklists and quick procedures can be applied by the on-call SRE staff to help solve performance issues as quickly as possible.
In this talk, I'll cover a checklist for Linux performance analysis in 60 seconds, as well as other methodology-derived checklists and procedures for cloud computing, with examples of performance issues for context. Whether you are solving crises in the SRE war room, or just have limited time for performance engineering, these checklists and approaches should help you find some quick performance wins. Safe flying."
PDF: SREcon_2016_perf_checklists.pdf
Keywords (from pdftotext):
slide 1:
Performance Checklists for SREs
Brendan Gregg, Senior Performance Architect
slide 2:
Performance Checklists
per instance:
1. uptime
2. dmesg -T | tail
3. vmstat 1
4. mpstat -P ALL 1
5. pidstat 1
6. iostat -xz 1
7. free -m
8. sar -n DEV 1
9. sar -n TCP,ETCP 1
10. top
cloud wide:
1. RPS, CPU
2. Volume
3. Instances
4. Scaling
5. CPU/RPS
6. Load Avg
7. Java Heap
8. ParNew
9. Latency
10. 99th percentile
slide 3:
slide 4:
Brendan the SRE
• On the Perf Eng team & primary on-call rotation for Core: our central SRE team
– we get paged on SPS dips (starts per second) & more
• In this talk I'll condense some perf engineering into SRE timescales (minutes) using checklists
slide 5:
Performance Engineering != SRE Performance Incident Response
slide 6:
Performance Engineering
• Aim: best price/performance possible
– Can be endless: continual improvement
• Fixes can take hours, days, weeks, months
– Time to read docs & source code, experiment
– Can take on large projects no single team would staff
• Usually no prior "good" state
– No spot the difference. No starting point.
– Is now "good" or "bad"? Experience/instinct helps
• Solo/team work
At Netflix: The Performance Engineering team, with help from developers
slide 7:
Performance Engineering
slide 8:
Performance Engineering
stat tools tracers benchmarks monitoring dashboards documentation source code tuning PMCs profilers flame graphs
slide 9:
SRE Perf Incident Response
• Aim: resolve issue in minutes
– Quick resolution is king. Can scale up, roll back, redirect traffic.
– Must cope under pressure, and at 3am
• Previously was in a "good" state
– Spot the difference with historical graphs
• Get immediate help from all staff
– Must be social
• Reliability & perf issues often related
At Netflix, the Core team (5 SREs), with immediate help from developers and performance engineers
slide 10:
SRE Perf Incident Response
slide 11:
SRE Perf Incident Response
custom dashboards central event logs distributed system tracing chat rooms pager ticket system
slide 12:
Netflix Cloud Analysis Process
In summary… Example SRE response path enumerated
1. Check Issue
2. Check Events
3. Drill Down
4. Check Dependencies
5. Root Cause
Tools pictured: Atlas Alerts, ICE, Atlas Dashboards, Chronos, Atlas Metrics, Mogul, Salp, SSH and instance tools. Plus some other tools not pictured.
Other diagram labels: Cost, Create New Alert, Redirected to a new Target
slide 13:
The Need for Checklists
Speed
Completeness
A Starting Point
An Ending Point
Reliability
Training
Perf checklists have historically been created for perf engineering (hours) not SRE response (minutes)
More on checklists: Gawande, A., The Checklist Manifesto. Metropolitan Books, 2008
Boeing 707 Emergency Checklist (1969)
slide 14:
SRE Checklists at Netflix
• Some shared docs
– PRE Triage Methodology
– go/triage: a checklist of dashboards
• Most "checklists" are really custom dashboards
– Selected metrics for both reliability and performance
• I maintain my own per-service and per-device checklists
slide 15:
SRE Performance Checklists
The following are:
• Cloud performance checklists/dashboards
• SSH/Linux checklists (lowest common denominator)
• Methodologies for deriving cloud/instance checklists
Diagram labels: Ad Hoc, Methodology, Checklists, Dashboards
Including aspirational: what we want to do & build as dashboards
slide 16:
1. PRE Triage Checklist
Our initial checklist
Netflix specific
slide 17:
PRE Triage Checklist
• Performance and Reliability Engineering checklist
– Shared doc with a hierarchical checklist with 66 steps total
1. Initial Impact
   – record timestamp
   – quantify: SPS, signups, support calls
   – check impact: regional or global?
   – check devices: device specific?
2. Time Correlations
   1. pretriage dashboard
      1. check for suspect NIWS client: error rates
      2. check for source of error/request rate change
      3. […dashboard specifics…]
Confirms, quantifies, & narrows problem. Helps you reason about the cause.
slide 18:
PRE Triage Checklist, cont.
• 3. Evaluate Service Health
– perfvitals dashboard
– mogul dependency correlation
– by cluster/asg/node:
  • latency: avg, 90 percentile
  • request rate
  • CPU: utilization, sys/user
  • Java heap: GC rate, leaks
  • memory
  • load average
  • thread contention (from Java)
  • JVM crashes
  • network: tput, sockets
  • […]
custom dashboards
slide 19:
2. predash
Initial dashboard
Netflix specific
slide 20:
predash
Performance and Reliability Engineering dashboard
A list of selected dashboards suited for incident response
slide 21:
predash
List of dashboards is its own checklist:
1. Overview
2. Client stats
3. Client errors & retries
4. NIWS HTTP errors
5. NIWS Errors by code
6. DRM request overview
7. DoS attack metrics
8. Push map
9. Cluster status
slide 22:
3. perfvitals
Service dashboard
slide 23:
perfvitals
1. RPS, CPU
2. Volume
3. Instances
4. Scaling
5. CPU/RPS
6. Load Avg
7. Java Heap
8. ParNew
9. Latency
10. 99th percentile
slide 24:
4. Cloud Application Performance Dashboard
A generic example
slide 25:
Cloud App Perf Dashboard
1. Load
2. Errors
3. Latency
4. Saturation
5. Instances
slide 26:
Cloud App Perf Dashboard
1. Load          problem of load applied? req/sec, by type
2. Errors        errors, timeouts, retries
3. Latency       response time average, 99th percentile, distribution
4. Saturation    CPU load averages, queue length/time
5. Instances     scale up/down? count, state, version
All time series, for every application, and dependencies. Draw a functional diagram with the entire data path.
Same as Google's "Four Golden Signals" (Latency, Traffic, Errors, Saturation), with instances added due to cloud
– Beyer, B., Jones, C., Petoff, J., Murphy, N. Site Reliability Engineering. O'Reilly, Apr 2016
slide 27:
5. Bad Instance Dashboard
An Anti-Methodology
slide 28:
Bad Instance Dashboard
Plot request time per-instance
Find the bad instance
Terminate bad instance
Someone else's problem now!
In SRE incident response, if it works, do it.
Graph: 95th percentile latency (Atlas Exploder), annotated "Bad instance" and "Terminate!"
slide 29:
Lots More Dashboards
We have countless more, mostly app specific and reliability focused
• Most reliability incidents involve time correlation with a central log system
Sometimes, dashboards & monitoring aren't enough. Time for SSH.
Screenshot: NIWS HTTP errors, by Error Types, Regions, Apps, Time
slide 30:
6. Linux Performance Analysis in 60,000 milliseconds
slide 31:
Linux Perf Analysis in 60s
1. uptime
2. dmesg -T | tail
3. vmstat 1
4. mpstat -P ALL 1
5. pidstat 1
6. iostat -xz 1
7. free -m
8. sar -n DEV 1
9. sar -n TCP,ETCP 1
10. top
slide 32:
Linux Perf Analysis in 60s
1. uptime                 load averages
2. dmesg -T | tail        kernel errors
3. vmstat 1               overall stats by time
4. mpstat -P ALL 1        CPU balance
5. pidstat 1              process usage
6. iostat -xz 1           disk I/O
7. free -m                memory usage
8. sar -n DEV 1           network I/O
9. sar -n TCP,ETCP 1      TCP stats
10. top                   check overview
http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
slide 33:
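The ten commands of the 60-second checklist can be wrapped in a small script so that each step runs for a bounded time. This is a minimal sketch, not something shown in the talk: the names run_step, checklist_60s, and STEP_SECONDS are hypothetical, it assumes coreutils timeout(1) is available, and it swaps interactive top for batch mode (top -b -n 1) so the script can finish unattended.

```shell
#!/bin/sh
# Sketch only: run each checklist command for a bounded time.
# run_step, checklist_60s and STEP_SECONDS are illustrative names.

# run_step LABEL CMD [ARGS...]
# Print a header, skip CMD if it is not installed, otherwise run it under
# timeout(1) so interval tools such as "vmstat 1" stop on their own.
run_step() {
    label=$1
    shift
    printf '=== %s: %s ===\n' "$label" "$*"
    if ! command -v "$1" >/dev/null 2>&1; then
        printf 'skipped: %s not installed\n' "$1"
        return 0
    fi
    # A non-zero exit (including the timeout itself) is not treated as fatal.
    timeout "${STEP_SECONDS:-5}" "$@" 2>&1 || true
}

# The ten steps, labelled with the descriptions from the slide.
checklist_60s() {
    run_step "load averages"         uptime
    run_step "kernel errors"         sh -c 'dmesg -T | tail'
    run_step "overall stats by time" vmstat 1
    run_step "CPU balance"           mpstat -P ALL 1
    run_step "process usage"         pidstat 1
    run_step "disk I/O"              iostat -xz 1
    run_step "memory usage"          free -m
    run_step "network I/O"           sar -n DEV 1
    run_step "TCP stats"             sar -n TCP,ETCP 1
    run_step "check overview"        sh -c 'top -b -n 1 | head -n 20'
}
```

Calling checklist_60s runs all ten steps; with the default of 5 seconds per step, the seven interval tools account for roughly 35 seconds. Reading dmesg may require elevated privileges on some systems.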
60s: uptime, dmesg, vmstat
$ uptime
23:51:26 up 21:31, 1 user, load average: 30.02, 26.43, 19.02
$ dmesg | tail
[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[...]
[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request. Check SNMP counters.
$ vmstat 1
procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache cs us sy id wa st
34 0 0 200889792 73708 591828 10 96 1 3 0 0
32 0 0 200889920 73708 591860 592 13284 4282 98 1 1 0 0
32 0 0 200890112 73708 591860 0 9501 2154 99 1 0 0 0
32 0 0 200889568 73712 591856 48 11900 2459 99 0 0 0 0
32 0 0 200890208 73712 591860 0 15898 4840 98 1 1 0 0
slide 34:
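A load average is easier to judge next to the CPU count. The sketch below pulls the 1-minute figure out of uptime output and divides it by a CPU count; the helper names load1_from_uptime and load_per_cpu are hypothetical, and the parser assumes the Linux "load average: a, b, c" format shown in the uptime example.

```shell
#!/bin/sh
# Sketch only: relate the 1-minute load average to the number of CPUs.

# load1_from_uptime: read uptime(1) output on stdin, print the 1-minute load average.
load1_from_uptime() {
    sed -n 's/.*load average: \([0-9.]*\),.*/\1/p'
}

# load_per_cpu LOAD NCPU: print LOAD divided by NCPU, to two decimals.
load_per_cpu() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.2f\n", load / ncpu }'
}
```

On a live host this could be used as load_per_cpu "$(uptime | load1_from_uptime)" "$(nproc)". With the 30.02 from the uptime example and the 32 CPUs reported in the mpstat header (assuming both come from the same instance), the result is 0.94. Linux load averages also count tasks in uninterruptible sleep, so treat the ratio as a hint and confirm with vmstat and mpstat.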
60s: mpstat
$ mpstat -P ALL 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
07:38:49 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
07:38:50 PM all [...]
[...]
slide 35:
60s: pidstat
$ pidstat 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
Columns: UID PID %usr %system %guest %CPU CPU Command
Interval ending 07:41:03 PM: PID 6521 at 1596.23 %usr, 0.00 %guest, 1598.11 %CPU; PID 6564 at 1571.70 %usr, 0.00 %guest, 1579.25 %CPU
Commands listed: rcuos/0, mesos-slave, java, java, java, pidstat
Interval ending 07:41:04 PM: PID 6521 at 1590.00 %usr, 0.00 %guest, 1591.00 %CPU; PID 6564 at 1573.00 %usr, 0.00 %guest, 1583.00 %CPU
Commands listed: mesos-slave, java, java, snmp-pass, pidstat
slide 36:
60s: iostat
$ iostat -xmdz 1
Linux 3.13.0-29 (db001-eb883efa) 08/18/2014 _x86_64_ (16 CPU)
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s ... avgqu-sz await r_await w_await svctm %util
xvda [...]
xvdb [...]
xvdc [...]
md0 [...]
Slide annotations: Workload, Resulting Performance
slide 37:
60s: free, sar -n DEV
$ free -m
total used free shared buffers cached
Mem: [...]
-/+ buffers/cache: [...]
Swap: [...]
$ sar -n DEV 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
12:16:48 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
12:16:49 AM eth0 18763.00 5032.00 20686.42 [...]
12:16:49 AM docker0 [...]
12:16:49 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
12:16:50 AM eth0 19763.00 5101.00 21999.10 [...]
12:16:50 AM docker0 [...]
slide 38:
60s: sar -n TCP,ETCP
$ sar -n TCP,ETCP 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
12:17:19 AM active/s passive/s iseg/s oseg/s
12:17:20 AM [...]
12:17:19 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
12:17:20 AM [...]
12:17:20 AM active/s passive/s iseg/s oseg/s
12:17:21 AM [...]
12:17:20 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
12:17:21 AM [...]
slide 39:
60s: top
$ top
top - 00:15:40 up 21:56, 1 user, load average: 31.09, 29.87, 29.92
Tasks: 871 total, 1 running, 868 sleeping, 0 stopped, 2 zombie
%Cpu(s): 96.8 us, 0.4 sy, 0.0 ni, 2.7 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 25190241+total, 24921688 used, 22698073+free, 60448 buffers
KiB Swap: 0 total, 0 used, 0 free. 554208 cached Mem
PID USER SHR S %CPU %MEM TIME+ COMMAND
20248 root 18748 S 3090 5.2 29812:58 java
4213 root 44232 S 23.5 0.0 233:35.37 mesos-slave
66128 titancl+ 1172 R 1.0 0.0 0:00.07 top
5235 root 49996 S 0.7 0.2 2:02.74 java
4299 root 16836 S 0.3 1.1 33:14.42 java
1 root 1496 S 0.0 0.0 0:03.82 init
2 root 0 S 0.0 0.0 0:00.02 kthreadd
3 root 0 S 0.0 0.0 0:05.35 ksoftirqd/0
5 root 0 S 0.0 0.0 0:00.00 kworker/0:0H
6 root 0 S 0.0 0.0 0:06.94 kworker/u256:0
8 root 0 S 0.0 0.0 2:38.05 rcu_sched
(PR, NI, VIRT, and RES columns not recoverable)
slide 40:
Other Analysis in 60s
• We need such checklists for:
– Java
– Cassandra
– MySQL
– Nginx
– etc…
• Can follow a methodology:
– Process of elimination
– Workload characterization
– Differential diagnosis
– Some summaries: http://www.brendangregg.com/methodology.html
• Turn checklists into dashboards (many do exist)
slide 41:
7. Linux Disk Checklist
slide 42:
slide 43:
Linux Disk Checklist
iostat -xnz 1
vmstat 1
df -h
ext4slower 10
bioslower 10
ext4dist 1
biolatency 1
cat /sys/devices/…/ioerr_cnt
smartctl -l error /dev/sda1
slide 44:
Linux Disk Checklist
iostat -xnz 1                     any disk I/O? if not, stop looking
vmstat 1                          is this swapping? or, high sys time?
df -h                             are file systems nearly full?
ext4slower 10                     (zfs*, xfs*, etc.) slow file system I/O?
bioslower 10                      if so, check disks
ext4dist 1                        check distribution and rate
biolatency 1                      if interesting, check disks
cat /sys/devices/…/ioerr_cnt      (if available) errors
smartctl -l error /dev/sda1       (if available) errors
Another short checklist. Won't solve everything. FS focused.
ext4slower/dist, bioslower, are from bcc/BPF tools.
slide 45:
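The df step of the disk checklist ("are file systems nearly full?") is simple to automate. The following is a hedged sketch: nearly_full is a hypothetical helper, it reads POSIX-format df -P output so the columns are stable, and it assumes mount points contain no spaces.

```shell
#!/bin/sh
# Sketch only: flag file systems at or above a usage threshold.

# nearly_full THRESHOLD
# Read `df -P` output on stdin and print "mountpoint use%" for every
# file system whose capacity column is at or above THRESHOLD percent.
nearly_full() {
    awk -v limit="$1" 'NR > 1 {
        use = $5                 # capacity column, e.g. "95%"
        sub(/%$/, "", use)       # strip the trailing percent sign
        if (use + 0 >= limit + 0) print $6, $5
    }'
}
```

A typical call would be df -P | nearly_full 90. The 90 percent threshold is an arbitrary example and not a value from the talk.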
ext4slower
• ext4 operations slower than the threshold:
# ./ext4slower 1
Tracing ext4 operations slower than 1 ms
TIME COMM T BYTES LAT(ms) FILENAME
06:49:17 bash R 128 7.75 cksum
06:49:17 cksum R 39552 1.34 [
06:49:17 cksum R 96 5.36 2to3-2.7
06:49:17 cksum R 96 14.94 2to3-3.4
06:49:17 cksum R 10320 6.82 411toppm
06:49:17 cksum R 65536 4.01 a2p
06:49:17 cksum R 55400 8.77 ab
06:49:17 cksum R 36792 16.34 aclocal-1.14
06:49:17 cksum R 15008 19.31 acpi_listen
[…]
(PID and OFF_KB columns not recoverable)
• Better indicator of application pain than disk I/O
• Measures & filters in-kernel for efficiency using BPF
– From https://github.com/iovisor/bcc
slide 46:
BPF is coming…
Free your mind
slide 47:
BPF
• That file system checklist should be a dashboard:
– FS & disk latency histograms, heatmaps, IOPS, outlier log
• Now possible with enhanced BPF (Berkeley Packet Filter)
– Built into Linux 4.x: dynamic tracing, filters, histograms
System dashboards of 2017+ should look very different
slide 48:
8. Linux Network Checklist
slide 49:
Linux Network Checklist
1. sar -n DEV,EDEV 1
2. sar -n TCP,ETCP 1
3. cat /etc/resolv.conf
4. mpstat -P ALL 1
5. tcpretrans
6. tcpconnect
7. tcpaccept
8. netstat -rnv
9. check firewall config
10. netstat -s
slide 50:
Linux Network Checklist
1. sar -n DEV,EDEV 1          at interface limits? or use nicstat
2. sar -n TCP,ETCP 1          active/passive load, retransmit rate
3. cat /etc/resolv.conf       it's always DNS
4. mpstat -P ALL 1            high kernel time? single hot CPU?
5. tcpretrans                 what are the retransmits? state?
6. tcpconnect                 connecting to anything unexpected?
7. tcpaccept                  unexpected workload?
8. netstat -rnv               any inefficient routes?
9. check firewall config      anything blocking/throttling?
10. netstat -s                play 252 metric pickup
tcp*, are from bcc/BPF tools
slide 51:
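Two of the network checklist steps reduce to small calculations: the retransmit rate from sar -n TCP,ETCP and the resolver list from /etc/resolv.conf. The helpers below are an illustrative sketch; retrans_pct and nameservers are hypothetical names, and expressing retransmits as a percentage of sent segments (retrans/s over oseg/s) is one common way to normalize the figure, not a formula given in the talk.

```shell
#!/bin/sh
# Sketch only: small helpers for checklist steps 2 and 3.

# retrans_pct RETRANS_PER_SEC OSEG_PER_SEC
# Print retransmitted segments as a percentage of segments sent.
retrans_pct() {
    awk -v r="$1" -v o="$2" 'BEGIN {
        if (o + 0 == 0) { print "0.00"; exit }   # no traffic: avoid dividing by zero
        printf "%.2f\n", 100 * r / o
    }'
}

# nameservers: read resolv.conf text on stdin, print each resolver address.
nameservers() {
    awk '$1 == "nameserver" { print $2 }'
}
```

For example, retrans_pct 5 1000 prints 0.50, and nameservers < /etc/resolv.conf lists the resolvers the instance is configured to use.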
tcpretrans
• Just trace kernel TCP retransmit functions for efficiency:
# ./tcpretrans
TIME PID IP LADDR:LPORT T> RADDR:RPORT STATE
01:55:05 0 4 10.153.223.157:22 R> 69.53.245.40:34619 ESTABLISHED
01:55:05 0 4 10.153.223.157:22 R> 69.53.245.40:34619 ESTABLISHED
01:55:17 0 4 10.153.223.157:22 R> 69.53.245.40:22957 ESTABLISHED
[…]
• From either bcc (BPF) or perf-tools (ftrace, older kernels)
slide 52:
9. Linux CPU Checklist
slide 53:
(too many lines – should be a utilization heat map)
slide 54:
http://www.brendangregg.com/HeatMaps/subsecondoffset.html
slide 55:
$ perf script
[…]
java 14327 [022] 252764.179741: cycles: 7f36570a4932 SpinPause (/usr/lib/jvm/java-8
java 14315 [014] 252764.183517: cycles: 7f36570a4932 SpinPause (/usr/lib/jvm/java-8
java 14310 [012] 252764.185317: cycles: 7f36570a4932 SpinPause (/usr/lib/jvm/java-8
java 14332 [015] 252764.188720: cycles: 7f3658078350 pthread_cond_wait@@GLIBC_2.3.2
java 14341 [019] 252764.191307: cycles: 7f3656d150c8 ClassLoaderDataGraph::do_unloa
java 14341 [019] 252764.198825: cycles: 7f3656d140b8 ClassLoaderData::free_dealloca
java 14341 [019] 252764.207057: cycles: 7f3657192400 nmethod::do_unloading(BoolObje
java 14341 [019] 252764.215962: cycles: 7f3656ba807e Assembler::locate_operand(unsi
java 14341 [019] 252764.225141: cycles: 7f36571922e8 nmethod::do_unloading(BoolObje
java 14341 [019] 252764.234578: cycles: 7f3656ec4960 CodeHeap::block_start(void*)
[…]
slide 56:
Linux CPU Checklist
1. uptime
2. vmstat 1
3. mpstat -P ALL 1
4. pidstat 1
5. CPU flame graph
6. CPU subsecond offset heat map
7. perf stat -a -- sleep 10
slide 57:
Linux CPU Checklist
1. uptime                            load averages
2. vmstat 1                          system-wide utilization, run q length
3. mpstat -P ALL 1                   CPU balance
4. pidstat 1                         per-process CPU
5. CPU flame graph                   CPU profiling
6. CPU subsecond offset heat map     look for gaps
7. perf stat -a -- sleep 10          IPC, LLC hit ratio
htop can do 1-4
slide 58:
htop
slide 59:
CPU Flame Graph
slide 60:
perf_events CPU Flame Graphs
• We have this automated in Netflix Vector:
git clone --depth 1 https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 99 -a -g -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
• Flame graph interpretation:
– x-axis: alphabetical stack sort, to maximize merging
– y-axis: stack depth
– color: random, or hue can be a dimension (eg, diff)
– Top edge is on-CPU, beneath it is ancestry
• Can also do Java & Node.js. Differentials.
• We're working on a d3 version for Vector
slide 61:
10. Tools Method
An Anti-Methodology
slide 62:
Tools Method
1. RUN EVERYTHING AND HOPE FOR THE BEST
For SRE response: a mental checklist to see what might have been missed (no time to run them all)
slide 63:
Linux Perf Observability Tools
slide 64:
Linux Static Performance Tools
slide 65:
Linux perf-tools (ftrace, perf)
slide 66:
Linux bcc tools (BPF)
Needs Linux 4.x
CONFIG_BPF_SYSCALL=y
slide 67:
11. USE Method
A Methodology
slide 68:
The USE Method
• For every resource, check:
Utilization
Saturation
Errors
Diagram labels: Resource, Utilization (%), X
• Definitions:
– Utilization: busy time
– Saturation: queue length or queued time
– Errors: easy to interpret (objective)
Used to generate checklists. Starts with the questions, then finds the tools.
slide 69:
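The USE definition of utilization as busy time can be made concrete for the CPU resource by differencing two samples of the aggregate cpu line in /proc/stat. This is a sketch under stated assumptions rather than the method used in the talk: cpu_busy_total and util_pct are hypothetical names, idle and iowait are both treated as non-busy time, and the guest fields are left out because the kernel already includes them in user and nice.

```shell
#!/bin/sh
# Sketch only: CPU utilization as busy time over total time.

# cpu_busy_total: read /proc/stat text on stdin, print "busy total" (in
# clock ticks) for the aggregate "cpu" line.
cpu_busy_total() {
    awk '$1 == "cpu" {
        idle = $5 + $6                            # idle + iowait
        total = 0
        for (i = 2; i <= 9; i++) total += $i      # user .. steal
        print total - idle, total
    }'
}

# util_pct BUSY1 TOTAL1 BUSY2 TOTAL2: percent busy between two samples.
util_pct() {
    awk -v b1="$1" -v t1="$2" -v b2="$3" -v t2="$4" 'BEGIN {
        if (t2 - t1 == 0) { print "0.0"; exit }   # identical samples
        printf "%.1f\n", 100 * (b2 - b1) / (t2 - t1)
    }'
}
```

On a live system: s1=$(cpu_busy_total < /proc/stat); sleep 1; s2=$(cpu_busy_total < /proc/stat); util_pct $s1 $s2. Saturation for the same resource would come from a queue measure such as the run queue column (r) in vmstat, which the CPU checklist already covers.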
USE Method for Hardware
• For every resource, check:
Utilization
Saturation
Errors
• Including busses & interconnects
slide 70:
(http://www.brendangregg.com/USEmethod/use-linux.html)
slide 71:
USE Method for Distributed Systems
• Draw a service diagram, and for every service:
Utilization: resource usage (CPU, network)
Saturation: request queueing, timeouts
Errors
• Turn into a dashboard
slide 72:
Netflix Vector
• Real time instance analysis tool
– https://github.com/netflix/vector
– http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
• USE method-inspired metrics
– More in development, incl. flame graphs
slide 73:
Netflix Vector
slide 74:
Netflix Vector
Screenshot annotated per resource (CPU, Network, Memory, Disk) with load, utilization, and saturation labels
slide 75:
12. Bonus: External Factor Checklist
slide 76:
External Factor Checklist
1. Sports ball?
2. Power outage?
3. Snow storm?
4. Internet/ISP down?
5. Vendor firmware update?
6. Public holiday/celebration?
7. Chaos Kong?
Social media searches (Twitter) often useful
– Can also be NSFW
slide 77:
Take Aways
• Checklists are great
– Speed, Completeness, Starting/Ending Point, Training
– Can be ad hoc, or from a methodology (USE method)
• Service dashboards
– Serve as checklists
– Metrics: Load, Errors, Latency, Saturation, Instances
• System dashboards with Linux BPF
– Latency histograms & heatmaps, etc. Free your mind.
Please create and share more checklists
slide 78:
References
Netflix Tech Blog:
http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html
http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
Linux Performance & BPF tools:
http://www.brendangregg.com/linuxperf.html
https://github.com/iovisor/bcc#tools
USE Method Linux:
http://www.brendangregg.com/USEmethod/use-linux.html
Flame Graphs:
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
Heat maps:
http://cacm.acm.org/magazines/2010/7/95062-visualizing-system-latency/fulltext
http://www.brendangregg.com/heatmaps.html
Books:
Beyer, B., et al. Site Reliability Engineering. O'Reilly, Apr 2016
Gawande, A. The Checklist Manifesto. Metropolitan Books, 2008
Gregg, B. Systems Performance. Prentice Hall, 2013 (more checklists & methods!)
Thanks: Netflix Perf & Core teams for predash, pretriage, Vector, etc
slide 79:
Thanks
http://slideshare.net/brendangregg
http://www.brendangregg.com
bgregg@netflix.com
@brendangregg
Netflix is hiring SREs!