USENIX LISA 2012: Performance Analysis Methodology
Talk for USENIX/LISA12, San Diego, 2012, by Brendan Gregg. Video: https://www.youtube.com/watch?v=abLan0aXJkw and https://www.usenix.org/conference/lisa12/performance-analysis-methodology
Description: "Performance analysis methodologies provide guidance, save time, and can find issues that are otherwise overlooked. Example issues include hardware bus saturation, lock contention, recoverable device errors, kernel scheduling issues, and unnecessary workloads. The talk will focus on the USE Method: a simple strategy for all staff for performing a complete check of system performance health, identifying common bottlenecks and errors. Other analysis methods discussed include workload characterization, drill-down analysis, and latency analysis, with example applications from enterprise and cloud computing. Don’t just reach for tools—use a method!"
PDF: LISA2012_methodologies.pdf
Keywords (from pdftotext):
slide 1:
Performance Analysis Methodology
Brendan Gregg, Lead Performance Engineer
brendan@joyent.com  @brendangregg
LISA’12, December 2012
slide 2:
In particular, the USE Method
For each resource, check: 1. Utilization 2. Saturation 3. Errors
(system functional diagram labels: CPU Interconnect, Memory Bus, DRAM, CPU, I/O Bus, I/O Bridge, Expander Interconnect, I/O Controller, Network Controller, Interface, Transports, Disk, Port)
slide 3:
whoami
• Lead Performance Engineer
• Work/Research: tools, visualizations, methodologies
• Was Brendan@Sun Microsystems, Oracle, now Joyent
slide 4:
Joyent
• High-Performance Cloud Infrastructure
• Public/private cloud provider
• OS-Virtualization for bare metal performance
• KVM for Linux guests
• Core developers of SmartOS and node.js
slide 5:
LISA10: Performance Visualizations
• Included latency heat maps (features labeled on the heat map: Bulge, Beak, Head, Wing, Shoulders, Body, Neck)
http://dtrace.org/blogs/brendan/2012/12/10/usenix-lisa-2010-visualizations-for-performance-analysis/
slide 6:
LISA12: Performance Methodologies
• Also a focus of my next book: Systems Performance: Enterprise and the Cloud, Brendan Gregg, Prentice Hall, 2013
slide 7:
Agenda
• Performance Issue Example
• Ten Performance Methodologies and Anti-Methodologies:
• 1. Blame-Someone-Else Anti-Method
• 2. Streetlight Anti-Method
• 3. Ad Hoc Checklist Method
• 4. Problem Statement Method
• 5. Scientific Method
• 6. Workload Characterization Method
• 7. Drill-Down Analysis Method
• 8. Latency Analysis Method
• 9. USE Method
• 10. Stack Profile Method
slide 8:
Agenda, cont.
• Content based on:
• Thinking Methodically About Performance. ACMQ http://queue.acm.org/detail.cfm?id=2413037
• Systems Performance. Prentice Hall, 2013
• A focus on systems performance; also applicable to apps
slide 9:
Performance Issue
• An example cloud-based performance issue: “Database response time sometimes take multiple seconds. Is the network dropping packets?”
• They tested the network using traceroute, which showed some packet drops
slide 10:
Performance Issue, cont.
• Performance Analysis (escalation diagram: Top; 2nd Level; 1st Level; Customer: “network drops?”)
slide 11:
Performance Issue, cont.
• Performance Analysis (escalation diagram: Top, “my turn”; 2nd Level: “network looks ok, CPU also ok”; 1st Level: “ran traceroute, can’t reproduce”; Customer: “network drops?”)
slide 12:
Performance Issue, cont.
• Could try network packet sniffing
• tcpdump/snoop
• Performance overhead during capture (CPU, storage) and post-processing (wireshark, etc)
• Time consuming to analyze: not real-time
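As a rough illustration of that trade-off (this example is not from the slides; the interface, host, and port are made up), a filtered capture with a short snap length (-s 128) limits the CPU and storage cost during collection, but the result still has to be post-processed offline:

  # tcpdump -i eth0 -s 128 -w /tmp/db.pcap 'host 10.2.204.2 and tcp port 3306'
  $ tcpdump -nr /tmp/db.pcap | head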
slide 13:
Performance Issue, cont.
• Could try dynamic tracing
• Efficient: only drop/retransmit paths traced
• Context: kernel state readable
• Real-time: analysis and summaries
# ./tcplistendrop.d
TIME 2012 Jan 19 01:22:49 2012 Jan 19 01:22:49 2012 Jan 19 01:22:49 2012 Jan 19 01:22:49 2012 Jan 19 01:22:49 2012 Jan 19 01:22:49 2012 Jan 19 01:22:49 [...] SRC-IP PORT DST-IP 25691 -> 192.192.240.212 18423 -> 192.192.240.212 38883 -> 192.192.240.212 10739 -> 192.192.240.212 27988 -> 192.192.240.212 28824 -> 192.192.240.212 65070 -> 192.192.240.212 PORT
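The talk's tcplistendrop.d is its own script; as a minimal sketch of the same idea (assuming an illumos/Solaris kernel whose mib provider exposes a tcpListenDrop probe), a one-liner can count listen-queue drops in real time:

  # dtrace -n 'mib:::tcpListenDrop { @drops = count(); }
      tick-1s { printa("TCP listen drops/s: %@d\n", @drops); trunc(@drops); }'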
slide 14:
Performance Issue, cont.
• Instead of either, I began with the USE method
• In
slide 15:
Performance Issue, cont.
• Customer was surprised. These findings were then investigated using another methodology – latency analysis:
• memory: using both microstate accounting and dynamic tracing to confirm that anonymous page-ins were hurting the database; worst case app thread spent 97% of time blocked on disk (data faults).
• disk: using dynamic tracing to confirm synchronous latency at the application / file system interface; included up to 1000 ms fsync() calls.
• These confirmations took about 1 hour
slide 16:
Performance Issue, cont.
• Methodologies can help identify and root-cause issues
• Different methodologies can be used as needed; in this case:
• USE Method: quick system health
• Latency Analysis: root cause
• Faster resolution of issues, frees time for multiple teams
slide 17:
Performance Methodologies
• Not a tool
• Not a product
• Is a procedure (documentation)
slide 18:
Performance Methodologies, cont.
• Not a tool but tools can be written to help
• Not a product could be in monitoring solutions
• Is a procedure (documentation)
slide 19:
Performance Methodologies, cont.
• Audience
• Beginners: provides a starting point
• Experts: provides a reminder
• Casual users: provides a checklist
slide 20:
Performance Methodologies, cont.
• Operating system performance analysis circa ‘90s, metric-oriented:
• Vendor creates metrics and performance tools
• Users develop methods to interpret metrics
• Previously common methodologies:
• Ad hoc checklists: common tuning tips
• Tools-based checklists: for each tool, study useful metrics
• Study kernel internals, then develop your own
• Problematic: vendors often don’t provide the best metrics; can be blind to issue types
slide 21:
Performance Methodologies, cont.
• Operating systems now provide dynamic tracing
• See anything, not just what the vendor gave you
• Hardest part is knowing what questions to ask
• Methodologies can pose the questions
• What would previously be an academic exercise is now practical
slide 22:
Performance Methodologies, cont.
• Starting with some anti-methodologies for comparison...
slide 23:
Blame-Someone-Else Anti-Method
slide 24:
Blame-Someone-Else Anti-Method
• 1. Find a system or environment component you are not responsible for
• 2. Hypothesize that the issue is with that component
• 3. Redirect the issue to the responsible team
• 4. When proven wrong, go to 1
slide 25:
Blame-Someone-Else Anti-Method, cont.
"Maybe it's the network. Can you check with the network team if they have had dropped packets ... or something?"
slide 26:
Blame-Someone-Else Anti-Method, cont.
• 1. Find a system or environment component you are not responsible for
• 2. Hypothesize that the issue is with that component
• 3. Redirect the issue to the responsible team
• 4. When proven wrong, go to 1
... a colleague asked if I could make this into a flow chart
slide 27:
Blame-Someone-Else Anti-Method, cont.
(flow chart: Start -> Pick Someone Else’s Component -> Hypothesize -> Redirect -> Proven Wrong?)
slide 28:
Blame-Someone-Else Anti-Method, cont.
• Wasteful of other team resources • Identifiable by a lack of data analysis – or any data at all • Ask for screenshots, then take them for a 2nd opinion Thursday, December 13, 12slide 29:Streetlight Anti-Method Thursday, December 13, 12slide 30:Streetlight Anti-Method • 1. Pick observability tools that are • familiar • found on the Internet • found at random • 2. Run tools • 3. Look for obvious issues Thursday, December 13, 12slide 31:Streetlight Anti-Method, cont. • Named after an observational bias called the streetlight effect A policeman sees a drunk looking under a streetlight, and asks what he is looking for. The drunk says he has lost his keys. The policeman can't find them either, and asks if he lost them under the streetlight. The drunk replies: “No, but this is where the light is best.” Thursday, December 13, 12slide 32:Streetlight Anti-Method, cont. $ ping 10.2.204.2 PING 10.2.204.2 (10.2.204.2) 56(84) bytes of data. 64 bytes from 10.2.204.2: icmp_seq=1 ttl=254 time=0.654 ms 64 bytes from 10.2.204.2: icmp_seq=2 ttl=254 time=0.617 ms 64 bytes from 10.2.204.2: icmp_seq=3 ttl=254 time=0.660 ms 64 bytes from 10.2.204.2: icmp_seq=4 ttl=254 time=0.641 ms 64 bytes from 10.2.204.2: icmp_seq=5 ttl=254 time=0.629 ms 64 bytes from 10.2.204.2: icmp_seq=6 ttl=254 time=0.606 ms 64 bytes from 10.2.204.2: icmp_seq=7 ttl=254 time=0.588 ms 64 bytes from 10.2.204.2: icmp_seq=8 ttl=254 time=0.653 ms 64 bytes from 10.2.204.2: icmp_seq=9 ttl=254 time=0.618 ms 64 bytes from 10.2.204.2: icmp_seq=10 ttl=254 time=0.650 ms --- 10.2.204.2 ping statistics --10 packets transmitted, 10 received, 0% packet loss, time 8994ms rtt min/avg/max/mdev = 0.588/0.631/0.660/0.035 ms • Why were you running ping? Thursday, December 13, 12slide 33:Streetlight Anti-Method, cont. top - 15:09:38 up 255 days, 16:54, 10 users, load average: 0.00, 0.03, 0.00 Tasks: 274 total, 1 running, 273 sleeping, 0 stopped, 0 zombie Cpu(s): 0.7%us, 0.0%sy, 0.0%ni, 99.1%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 8181740k total, 7654228k used, 527512k free, 405616k buffers Swap: 2932728k total, 125064k used, 2807664k free, 3826244k cached PID USER 16876 root 3947 brendan 15841 joshw 16922 joshw 1 root 2 root 3 root 4 root 5 root [...] NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 0 57596 17m 1972 S 4 0.2 3:00.60 python 0 19352 1552 1060 R 0 0.0 0:00.06 top 0 67144 23m 908 S 0 0.3 218:21.70 mosh-server 0 54924 11m 920 S 0 0.1 121:34.20 mosh-server 0 23788 1432 736 S 0 0.0 0:18.15 init 0 S 0 0.0 0:00.61 kthreadd 0 S 0 0.0 0:00.11 migration/0 0 S 0 0.0 18:43.09 ksoftirqd/0 0 S 0 0.0 0:00.00 watchdog/0 • Why are you still running top? Thursday, December 13, 12slide 34:Streetlight Anti-Method, cont. • Tools-based approach • Inefficient: • can take time before the right tool is found • can be wasteful when investigating false positives • Incomplete: • tools are difficult to find or learn • tools are incomplete or missing Thursday, December 13, 12slide 35:Ad Hoc Checklist Method Thursday, December 13, 12slide 36:Ad Hoc Checklist Method • 1..N. Run A, if B, do C Thursday, December 13, 12slide 37:Ad Hoc Checklist Method, cont. • 1..N. Run A, if B, do C • Each item can include: • which tool to run • how to interpret output • suggested actions • Can cover common and recent issues Thursday, December 13, 12slide 38:Ad Hoc Checklist Method, cont. 
• Page 1 of Sun Performance and Tuning [Cockcroft 95], has “Quick Reference for Common Tuning Tips” • disk bottlenecks • run iostat with 30 second intervals; look for more than 30% busy disks with +50ms service times; increasing the inode cache size can help; stripe file systems over multiple disks • NFS response times • run nfsstat -m, follow similar strategy as with disk bottlenecks • memory checks • don’t worry about where RAM has gone, or page-ins and -outs; run vmstat and look at the page scanner: over 200 for 30 secs • etc. Thursday, December 13, 12slide 39:Ad Hoc Checklist Method, cont. • Pros: • Easy to follow • Can also be fast • Consistent check of all items – including egregious issues • Can be prescriptive • Cons: • Limited to items on list • Point-in-time recommendations – needs regular updates • Pragmatic: a process for all staff on a support team to check a minimum set of issues, and deliver a practical result. Thursday, December 13, 12slide 40:Problem Statement Method Thursday, December 13, 12slide 41:Problem Statement Method • 1. What makes you think there is a performance problem? • 2. Has this system ever performed well? • 3. What has changed recently? (Software? Hardware? Load?) • 4. Can the performance degradation be expressed in terms of latency or run time? • 5. Does the problem affect other people or applications (or is it just you)? • 6. What is the environment? What software and hardware is used? Versions? Configuration? Thursday, December 13, 12slide 42:Problem Statement Method, cont.: Examples • 1. What makes you think there is a performance problem? • “I saw 1000 disk IOPS” • 2. Has this system ever performed well? • “The system has never worked” • 3. What has changed recently? • “We’re on slashdot/HN/reddit right now” • 4. Can the performance degradation be expressed ... latency? • “Query time is 10%/10x slower” • 5. Does the problem affect other people or applications? • “All systems are offline” • 6. What is the environment? ... • “We are on an ancient software version” Thursday, December 13, 12slide 43:Problem Statement Method, cont.: Examples • 1. What makes you think there is a performance problem? • “I saw 1000 disk IOPS” – not a problem by itself • 2. Has this system ever performed well? • “The system has never worked” – good to know! • 3. What has changed recently? • “We’re on slashdot/HN/reddit right now” – scalability? • 4. Can the performance degradation be expressed ... latency? • “Query time is 10%/10x slower” – quantify • 5. Does the problem affect other people or applications? • “All systems are offline” – power/network? • 6. What is the environment? ... • “We are on an ancient software version” – known issue? Thursday, December 13, 12slide 44:Problem Statement Method, cont. • Often used by support staff for collecting information, and entered into a ticketing system • Can be used first before other methodologies • Pros: • Fast • Resolves a class of issues without further investigation • Cons: • Limited scope (but this is obvious) Thursday, December 13, 12slide 45:Scientific Method Thursday, December 13, 12slide 46:Scientific Method • 1. Question • 2. Hypothesis • 3. Prediction • 4. Test • 5. Analysis Thursday, December 13, 12slide 47:Scientific Method, cont. • Observation tests: • Run a tool, read a metric • Experimental tests: • Change a tunable parameter • Increase/decrease load Thursday, December 13, 12slide 48:Scientific Method, cont. 
• Experimental tests can either increase or decrease performance • Examples: • A) Observational • B) Observational • C) Experimental: increase • D) Experimental: decrease • E) Experimental: decrease Thursday, December 13, 12slide 49:Scientific Method, cont. • Example A, observational: • 1. Question: what is causing slow database queries? • 2. Hypothesis: noisy neighbors (cloud) performing disk I/O, contending with database disk I/O (via the file system) • 3. Prediction: Thursday, December 13, 12slide 50:Scientific Method, cont. • Example A, observational: • 1. Question: what is causing slow database queries? • 2. Hypothesis: noisy neighbors (cloud) performing disk I/O, contending with database disk I/O (via the file system) • 3. Prediction: if file system I/O latency is measured during a query, it will show that it is responsible for slow queries • 4. Test: dynamic tracing of database FS latency as a ratio of query latency shows less than 5% is FS • 5. Analysis: FS, and disks, are not responsible for slow queries. Go to 2 and develop a new hypothesis Thursday, December 13, 12slide 51:Scientific Method, cont. • Example B, observational: • 1. Question: why is an app slower after moving it to a multiprocessor system? • 2. Hypothesis: NUMA effects – remote memory I/O, CPU interconnect contention, less cache warmth, cross calls, ... • 3. Prediction: Thursday, December 13, 12slide 52:Scientific Method, cont. • Example B, observational: • 1. Question: why is an app slower after moving it to a multiprocessor system? • 2. Hypothesis: NUMA effects – remote memory I/O, CPU interconnect contention, less cache warmth, cross calls, ... • 3. Prediction: increase in memory stall cycles, an increase in CPI, and remote memory access • 4. Test: perf events / cpustat, quality time with the vendor processor manuals • 5. Analysis: consistent with predictions • time consuming; experimental? Thursday, December 13, 12slide 53:Scientific Method, cont. • Example C, experimental: • 1. Question: why is an app slower after moving it to a multiprocessor system? • 2. Hypothesis: NUMA effects – remote memory I/O, CPU interconnect contention, less cache warmth, cross calls, ... • 3. Prediction: Thursday, December 13, 12slide 54:Scientific Method, cont. • Example C, experimental: • 1. Question: why is an app slower after moving it to a multiprocessor system? • 2. Hypothesis: NUMA effects – remote memory I/O, CPU interconnect contention, less cache warmth, cross calls, ... • 3. Prediction: perf improved by disabling extra processors; partially improved by off-lining them (easier; still has remote memory I/O) • 4. Test: disabled all CPUs on extra processors, perf improved by 50% • 5. Analysis: magnitude consistent with perf reduction Thursday, December 13, 12slide 55:Scientific Method, cont. • Example D, experimental: • 1. Question: degraded file system perf as the cache grows • 2. Hypothesis: file system metadata overheads, relative to the record size – more records, more lock contention on hash tables for the record lists • 3. Prediction: Thursday, December 13, 12slide 56:Scientific Method, cont. • Example D, experimental: • 1. Question: degraded file system perf as the cache grows • 2. Hypothesis: file system metadata overheads, relative to the record size – more records, more lock contention on hash tables for the record lists • 3. Prediction: making the record size progressively smaller, and therefore more records in memory, should make perf progressively worse • 4. 
Test: same workload with record size /2, /4, /8, /16 • 5. Analysis: results consistent with prediction Thursday, December 13, 12slide 57:Scientific Method, cont. • Example E, experimental: • 1. Question: why did write throughput drop by 20%? • 2. Hypothesis: disk vibration by datacenter alarm • 3. Prediction: any loud noise will reduce throughput • 4. Test: ? Thursday, December 13, 12slide 58:Scientific Method, cont. • Test • Shouting in the Datacenter: http://www.youtube.com/watch?v=tDacjrSCeq4 Thursday, December 13, 12slide 59:Scientific Method, cont. • Analysis Thursday, December 13, 12slide 60:Scientific Method, cont. • Pros: • Good balance of theory and data • Generic methodology • Encourages thought, develops understanding • Cons: • Hypothesis requires expertise • Time consuming – more suited for harder issues Thursday, December 13, 12slide 61:Workload Characterization Method Thursday, December 13, 12slide 62:Workload Characterization Method • 1. Who is causing the load? PID, UID, IP addr, ... • 2. Why is the load called? code path • 3. What is the load? IOPS, tput, type • 4. How is the load changing over time? Thursday, December 13, 12slide 63:Workload Characterization Method, cont. • Example: • System log checker is much slower after system upgrade • Who: grep(1) is on-CPU for 8 minutes • Why: UTF8 encoding, as LANG=en_US.UTF-8 • LANG=C avoided UTF8 encoding – 2000x faster Thursday, December 13, 12slide 64:Workload Characterization Method, cont. • Identifies issues of load • Best performance wins are from eliminating unnecessary work • Don’t assume you know what the workload is – characterize Thursday, December 13, 12slide 65:Workload Characterization Method, cont. • Pros: • Potentially largest wins • Cons: • Only solves a class of issues – load • Time consuming, and can be discouraging – most attributes examined will not be a problem Thursday, December 13, 12slide 66:Drill-Down Analysis Method Thursday, December 13, 12slide 67:Drill-Down Analysis Method • 1. Start at highest level • 2. Examine next-level details • 3. Pick most interesting breakdown • 4. If problem unsolved, go to 2 Thursday, December 13, 12slide 68:Drill-Down Analysis Method, cont. • For a distributed environment [McDougall 06]: • 1. Monitoring: environment-wide, and identifying or alerting when systems have issues (eg, SNMP) • 2. Identification: given a system, examining resources and applications for location of issue (eg, mpstat) • 3. Analysis: given a suspected source, drilling down to identify root cause or causes (eg, dtrace) • Analysis stage was previously limited to the given toolset; now can be explored in arbitrary detail using dynamic tracing Thursday, December 13, 12slide 69:Drill-Down Analysis Method, cont.: Example • For example, ZFS Process User-Land Kernel Syscall Interface VFS ZFS Block Device Interface Device Drivers Disks Thursday, December 13, 12slide 70:Drill-Down Analysis Method, cont.: Example • For example, ZFS Drill-Down Analysis Process User-Land Kernel Syscall Interface VFS ZFS Block Device Interface Device Drivers Disks Thursday, December 13, 12slide 71:Drill-Down Analysis Method, cont.: Example • Using DTrace mysql_pid_fslatency.d Process User-Land Kernel Syscall Interface VFS ZFS Block Device Interface Device Drivers Disks Thursday, December 13, 12slide 72:Drill-Down Analysis Method, cont.: Example • Drill... 
mysql_pid_fslatency.d (at the Syscall Interface, syscalls with fi_fs == zfs)
(stack diagram: Process; User-Land / Kernel; Syscall Interface; VFS; ZFS; Block Device Interface; Device Drivers; Disks)
slide 73:
Drill-Down Analysis Method, cont.: Example
• Drill... (same stack diagram; adds *vfssnoop.d and fswho.d at the VFS layer)
slide 74:
Drill-Down Analysis Method, cont.: Example
• Drill... (adds zioslower.d at the ZFS layer)
slide 75:
Drill-Down Analysis Method, cont.: Example
• Drill... (adds spasync.d, ziosnoop.d, metaslab_free.d, arcaccess.d at the ZFS layer)
slide 76:
Drill-Down Analysis Method, cont.: Example
• Drill... (adds iostacks.d at the ZFS/Block Device Interface, and iosnoop, disklatency.d, seeksize.d, bitesize.d at the device/disk layer)
slide 77:
Drill-Down Analysis Method, cont.: Example
• Drill... (adds, for the kernel drivers: see DTrace Book Chap 4, eg, scsilatency.d)
slide 78:
Drill-Down Analysis Method, cont.
• Moves from higher- to lower-level details based on findings: environment-wide down to metal
• Peels away layers of software and hardware to locate cause
• Pros:
• Will identify root cause(s)
• Cons:
• Time consuming – especially when drilling in the wrong direction
slide 79:
Latency Analysis Method
slide 80:
Latency Analysis Method, cont.
• 1. Measure operation time (latency)
• 2. Divide into logical synchronous components
• 3. Continue division until latency origin is identified
• 4. Quantify: estimate speedup if problem fixed
slide 81:
Latency Analysis Method, cont.: Example
• Example, logging of slow query time with file system latency:
# ./mysqld_pid_fslatency_slowlog.d 29952
2011 May 16 23:34:00 filesystem I/O during query > 100 ms: query 538 ms, fs 509 ms, 83 I/O
2011 May 16 23:34:11 filesystem I/O during query > 100 ms: query 342 ms, fs 303 ms, 75 I/O
2011 May 16 23:34:38 filesystem I/O during query > 100 ms: query 479 ms, fs 471 ms, 44 I/O
2011 May 16 23:34:58 filesystem I/O during query > 100 ms: query 153 ms, fs 152 ms, 1 I/O
2011 May 16 23:35:09 filesystem I/O during query > 100 ms: query 383 ms, fs 372 ms, 72 I/O
2011 May 16 23:36:09 filesystem I/O during query > 100 ms: query 406 ms, fs 344 ms, 109 I/O
2011 May 16 23:36:44 filesystem I/O during query > 100 ms: query 343 ms, fs 319 ms, 75 I/O
2011 May 16 23:36:54 filesystem I/O during query > 100 ms: query 196 ms, fs 185 ms, 59 I/O
2011 May 16 23:37:10 filesystem I/O during query > 100 ms: query 254 ms, fs 209 ms, 83 I/O
[...]
(annotations on the output: Operation Time; FS Component)
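The script above attributes a file system component to each slow query. A much simpler sketch of the same idea (not the talk's mysqld_pid_fslatency_slowlog.d) sums time spent in read/write syscalls that land on ZFS for one process; PID 29952 is just the example PID from the output above, and per-query attribution needs the pid provider as in the real script:

  # dtrace -n '
      syscall::read:entry,syscall::write:entry
      /pid == 29952 && fds[arg0].fi_fs == "zfs"/ { self->ts = timestamp; }
      syscall::read:return,syscall::write:return
      /self->ts/ { @fs = sum(timestamp - self->ts); self->ts = 0; }
      tick-10s { printa("FS syscall time, last 10s (ns): %@d\n", @fs); trunc(@fs); }'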
slide 82:
Latency Analysis Method, cont.: Types
• Drill-down analysis of latency
• many of the previous ZFS examples were latency-based
• Latency binary search, eg:
• 1. Operation latency is A
• 2. Measure A
• 3. Measure synchronous components: B, C (can be sums)
• 4. if B > C, A = B. else A = C
• 5. If problem unsolved, go to 2
• Spot-the-outlier from multiple layers – correlate latency
slide 83:
Latency Analysis Method, cont.: Example
• Drill-down: Latency distributions
# ./zfsstacklatency.d
dtrace: script './zfsstacklatency.d' matched 25 probes
CPU FUNCTION:NAME :END
zfs_read time (ns) value ------------- Distribution ------------- count 512 | 1024 |@@@@ 2048 |@@@@@@@@ 4096 |@@@@ 8192 |@@@@@@@@@@@@@@@@ 16384 |@@@@@@@@ 32768 | 65536 | 131072 | 262144 | 524288 |
slide 84:
Latency Analysis Method, cont.: Example
• Drill-down: Latency distributions
zfs_write time (ns) value ------------- Distribution ------------- count 2048 | 4096 |@@@ 8192 |@@@@@@@@@@@@@@@@@@@ 16384 |@@@@@@@@@@@@@@@ 32768 |@@@ 65536 |@ 131072 | 262144 | 524288 |
slide 85:
Latency Analysis Method, cont.: Example
• Drill-down: Latency distributions
zio_wait time (ns) value ------------- Distribution ------------- count 512 | 1024 |@@@@@@@@@@@@@ 2048 |@@@@@@@@@@@@@@@@@@@@@@@ 4096 |@@@@ 8192 | 16384 | 32768 | 65536 | 131072 | 262144 | 524288 | 1048576 | 2097152 | 4194304 | 8388608 | 16777216 | 33554432 | 67108864 | 134217728 | 268435456 | 536870912 |
slide 86:
Latency Analysis Method, cont.: Example
• Drill-down: Latency distributions
zio_vdev_io_done time (ns) value ------------- Distribution ------------- count 2048 | 4096 |@ 8192 |@@@@ 16384 |@ 32768 |@ 65536 | 131072 |@@ 262144 |@@ 524288 |@@@ 1048576 |@@@ 2097152 | 4194304 | 8388608 | 16777216 | 33554432 |@@@ 67108864 |@@@@@@@@@@@@@@@@@@@@@ 134217728 | 268435456 | 536870912 |
slide 87:
Latency Analysis Method, cont.: Example
• Drill-down: Latency distributions
vdev_disk_io_done time (ns) value ------------- Distribution ------------- count 65536 | 131072 |@ 262144 |@@ 524288 |@@@@ 1048576 |@@@ 2097152 | 4194304 | 8388608 | 16777216 | 33554432 |@@@ 67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@ 134217728 | 268435456 | 536870912 |
slide 88:
Latency Analysis Method, cont.: Example
• Drill-down: Latency distributions
io:::start time (ns) value ------------- Distribution ------------- count 32768 | 65536 | 131072 |@@ 262144 |@@ 524288 |@@@@ 1048576 |@@@ 2097152 | 4194304 | 8388608 | 16777216 | 33554432 |@@@ 67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@ 134217728 | 268435456 | 536870912 |
slide 89:
Latency Analysis Method, cont.: Example
• Drill-down: Latency distributions
scsi time (ns) value ------------- Distribution ------------- count 16384 | 32768 | 65536 | 131072 |@ 262144 |@@ 524288 |@@@@ 1048576 |@@@ 2097152 | 4194304 | 8388608 | 16777216 | 33554432 |@@@ 67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@ 134217728 | 268435456 | 536870912 |
slide 90:
Latency Analysis Method, cont.: Example
• Drill-down: Latency distributions
mega_sas time (ns) value ------------- Distribution ------------- count 16384 | 32768 | 65536 | 131072 |@@ 262144 |@ 524288 |@@@@ 1048576 |@@@ 2097152 | 4194304 | 8388608 | 16777216 | 33554432 |@@@ 67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@ 134217728 | 268435456 | 536870912 |
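The zfsstacklatency.d script traces many layers at once to produce the distributions above. As a single-layer sketch (assuming fbt can instrument zfs_read on your kernel build), the same kind of power-of-two latency histogram can come from a one-liner:

  # dtrace -n 'fbt::zfs_read:entry { self->ts = timestamp; }
      fbt::zfs_read:return /self->ts/ { @["zfs_read (ns)"] = quantize(timestamp - self->ts); self->ts = 0; }'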
slide 91:
Latency Analysis Method, cont.
• Latency matters – potentially solve most issues
• Similar pros & cons as drill-down analysis
• Also see Method R: latency analysis initially developed for Oracle databases [Millsap 03]
slide 92:
USE Method
slide 93:
USE Method
• For every resource, check:
• 1. Utilization
• 2. Saturation
• 3. Errors
slide 94:
USE Method, cont.
• For every resource, check:
• 1. Utilization: time resource was busy, or degree used
• 2. Saturation: degree of queued extra work
• 3. Errors: any errors
(diagram: a resource, with Utilization, Saturation and Errors labeled)
slide 95:
USE Method, cont.
• Process: (flow chart: Choose Resource -> Errors? -> High Utilization? -> Saturation? -> A Problem Identified)
• Errors are often easier to interpret, and can be checked first
slide 96:
USE Method, cont.
• Hardware Resources:
• CPUs
• Main Memory
• Network Interfaces
• Storage Devices
• Controllers
• Interconnects
slide 97:
USE Method, cont.
• A great way to determine resources is to find or draw the server functional diagram
• Vendor hardware teams have these
• Analyze every component in the data path
slide 98:
USE Method, cont.: Functional Diagram
(functional diagram labels: CPU Interconnect, Memory Bus, DRAM, CPU, I/O Bus, I/O Bridge, Expander Interconnect, I/O Controller, Network Controller, Interface, Transports, Disk, Port)
slide 99:
USE Method, cont.
• Definition of utilization depends on the resource type:
• I/O resource (eg, disks) – utilization is time busy
• Capacity resource (eg, main memory) – utilization is space consumed
• Storage devices can act as both
slide 100:
USE Method, cont.
• Utilization
• 100% usually a bottleneck
• 60%+ often a bottleneck for I/O resources, especially when high priority work cannot easily interrupt lower priority work (eg, disks)
• Beware of time intervals. 60% utilized over 5 minutes may mean 100% utilized for 3 minutes then idle
• Best examined per-device (unbalanced workloads)
slide 101:
USE Method, cont.
• Saturation
• Any sustained non-zero value adds latency
• Errors
• Should be obvious
slide 102:
USE Method, cont.: Examples
Resource | Type | Metric
CPU | utilization |
CPU | saturation |
Memory | utilization |
Memory | saturation |
Network Interface | utilization |
Storage Device I/O | utilization |
Storage Device I/O | saturation |
Storage Device I/O | errors |
slide 103:
USE Method, cont.: Examples
Resource | Type | Metric
CPU | utilization | CPU utilization
CPU | saturation | run-queue length, sched lat.
Memory | utilization | available memory
Memory | saturation | paging or swapping
Network Interface | utilization | RX/TX tput/bandwidth
Storage Device I/O | utilization | device busy percent
Storage Device I/O | saturation | wait queue length
Storage Device I/O | errors | device errors
slide 104:
USE Method, cont.: Harder Examples
Resource | Type | Metric
CPU | errors |
Network | saturation |
Storage Controller | utilization |
CPU Interconnect | utilization |
Mem. Interconnect | saturation |
I/O Interconnect | utilization |
slide 105:
USE Method, cont.: Harder Examples
Resource | Type | Metric
CPU | errors | eg, correctable CPU cache ECC events
Network | saturation | “nocanputs”, buffering
Storage Controller | utilization | active vs max controller IOPS and tput
CPU Interconnect | utilization | per port tput / max bandwidth
Mem. Interconnect | saturation | memory stall cycles, high cycles-per-instruction (CPI)
I/O Interconnect | utilization | bus throughput / max bandwidth
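On Linux, a couple of the metrics in the tables above map onto standard tools; for example (a sketch only, and column names vary between sysstat versions):

  $ iostat -x 1     # storage device I/O: "%util" for utilization, "avgqu-sz" (or "aqu-sz") for saturation
  $ sar -n DEV 1    # network interface: rxkB/s + txkB/s, compared against the interface's negotiated bandwidth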
slide 106:
USE Method, cont.
• Some software resources can also be studied:
• Mutex Locks
• Thread Pools
• Process/Thread Capacity
• File Descriptor Capacity
• Consider possible USE metrics for each
slide 107:
USE Method, cont.
• This process may reveal missing metrics – those not provided by your current toolset
• They are your known unknowns
• Much better than unknown unknowns
• More tools can be installed and developed to help
• Please, no more top variants! unless it is interconnect-top or bus-top
slide 108:
USE Method, cont.: Example Linux Checklist
http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist
Resource | Type | Metric
CPU | Utilization | per-cpu: mpstat -P ALL 1, “%idle”; sar -P ALL, “%idle”; system-wide: vmstat 1, “id”; sar -u, “%idle”; dstat -c, “idl”; per-process: top, “%CPU”; htop, “CPU %”; ps -o pcpu; pidstat 1, “%CPU”; per-kernel-thread: top/htop (“K” to toggle), where VIRT == 0 (heuristic). [1]
CPU | Saturation | system-wide: vmstat 1, “r” > CPU count [2]; sar -q, “runq-sz” > CPU count; dstat -p, “run” > CPU count; per-process: /proc/PID/schedstat 2nd field (sched_info.run_delay); perf sched latency (shows “Average” and “Maximum” delay per-schedule); dynamic tracing, eg, SystemTap schedtimes.stp “queued(us)” [3]
CPU | Errors | perf (LPE) if processor specific error events (CPC) are available; eg, AMD64’s “04Ah Single-bit ECC Errors Recorded by Scrubber” [4]
... etc for all combinations (would fill a dozen slides)
slide 109:
USE Method, cont.: illumos/SmartOS Checklist
http://dtrace.org/blogs/brendan/2012/03/01/the-use-method-solaris-performance-checklist
Resource | Type | Metric
CPU | Utilization | per-cpu: mpstat 1, “idl”; system-wide: vmstat 1, “id”; per-process: prstat -c 1 (“CPU” == recent), prstat -mLc 1 (“USR” + “SYS”); per-kernel-thread: lockstat -Ii rate, DTrace profile stack()
CPU | Saturation | system-wide: uptime, load averages; vmstat 1, “r”; DTrace dispqlen.d (DTT) for a better “vmstat r”; per-process: prstat -mLc 1, “LAT”
CPU | Errors | fmadm faulty; cpustat (CPC) for whatever error counters are supported (eg, thermal throttling)
Memory | Saturation | system-wide: vmstat 1, “sr” (bad now), “w” (was very bad); vmstat -p 1, “api” (anon page ins == pain), “apo”; per-process: prstat -mLc 1, “DFL”; DTrace anonpgpid.d (DTT), vminfo:::anonpgin on execname
... etc for all combinations (would fill a dozen slides)
slide 110:
USE Method, cont.
• To be thorough, you will need to use:
• CPU performance counters (CPC)
• For bus and interconnect activity; eg, perf events, cpustat
• Dynamic Tracing
• For missing saturation and error metrics; eg, DTrace
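For the CPC case, one hedged example using Linux perf: cycles and instructions are generic event aliases (availability depends on the PMU), and a high cycles-per-instruction ratio under load is the CPI signal listed in the harder-examples table:

  $ perf stat -a -e cycles,instructions sleep 10
  (CPI = cycles / instructions; recent perf versions also print "insn per cycle" directly)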
slide 111:
USE Method, cont.: CPC Example
• Quad-processor AMD w/HyperTransport, functional diagram:
(diagram labels: DRAM, CPU Socket 0, CPU Socket 1, CPU Socket 2, CPU Socket 3, HT0, HT1, HT2, I/O MCP55, I/O IO55, PCIe; legend: HyperTransport, Memory Bus, PCIe Bus)
slide 112:
USE Method, cont.: CPC Example
• Per-port HyperTransport TX throughput:
# ./amd64htcpu 1
Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
[...]
• Decoder Matrix:
Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
(columns as extracted: HT0: CPU0-1, CPU1-0, CPU2-3, CPU3-2; HT1: MCP55, CPU1-3, CPU2-3, CPU3-1; HT2: CPU0-2, IO55, CPU2-0, CPU3-2)
slide 113:
USE Method, cont.: CPC Example
• Currently not that easy to write – takes time to study the processor manuals
• Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3B, page 535 of 1,026:
• I’ve written and shared CPC-based tools before. It takes a lot of maintenance to stay current; getting better with PAPI.
slide 114:
USE Method, cont.: Products
• Supported products can be developed to help
• Joyent Cloud Analytics includes metrics to support USE
slide 115:
USE Method, cont.: Products
• Supported products can be developed to help
• Joyent Cloud Analytics includes metrics to support USE
(screenshot: Cloud-wide Per-CPU Utilization Heat Map; annotations: Hostname, Hot CPU)
slide 116:
USE Method, cont.: Products
• Do you develop a monitoring product?
• Suggestion: add USE Method wizard
• For docs, refer to this talk and: http://queue.acm.org/detail.cfm?id=2413037
• Do you pay for a monitoring product?
• Ask for the USE Method
slide 117:
USE Method, cont.
• Resource-based approach
• Quick system health check, early in an investigation
• Pros:
• Complete: all resource bottlenecks and errors
• Not limited in scope by your current toolset
• No unknown unknowns – at least known unknowns
• Efficient: picks three metrics for each resource – from what may be dozens available
• Cons:
• Limited to a class of issues
slide 118:
Stack Profile Method
slide 119:
Stack Profile Method
• 1. Profile thread stack traces (on- and off-CPU)
• 2. Coalesce
• 3. Study stacks bottom-up
slide 120:
Stack Profile Method, cont.
• Profiling thread stacks:
• On-CPU: often profiled by sampling (low overhead)
• eg, perf, oprofile, dtrace
• Off-CPU (sleeping): not commonly profiled
• no PC or pinned thread stack for interrupt-profiling
• with static/dynamic tracing, you can trace stacks on scheduler off-/on-cpu events, and, stacks don’t change while off-cpu
• I’ve previously called this: Off-CPU Performance Analysis
• Examine both
slide 121:
Stack Profile Method, cont.
• Eg, using DTrace (easiest to demo both), for PID 191:
• On-CPU:
• dtrace -n 'profile-97 /pid == 191/ { @[ustack()] = count(); }'
• output has stacks with sample counts (97 Hertz)
• Off-CPU:
• dtrace -n 'sched:::off-cpu /pid == 191/ { self->ts = timestamp; } sched:::on-cpu /self->ts/ { @[ustack()] = sum(timestamp - self->ts); self->ts = 0; }'
• output has stacks with nanosecond times
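On Linux, the on-CPU half has a rough perf equivalent, and the coalesce step shown on the following slides can be done with the FlameGraph scripts (stackcollapse-perf.pl and flamegraph.pl come from the separate FlameGraph toolkit, not this talk; PID 191 and the 97 Hertz rate simply mirror the DTrace example above):

  # perf record -F 97 -p 191 -g -- sleep 60
  # perf report --stdio
  # perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > mysqld_oncpu.svg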
slide 122:
Stack Profile Method, cont.
• One stack:
libc.so.1`mutex_trylock_adaptive+0x112
libc.so.1`mutex_lock_impl+0x165
libc.so.1`mutex_lock+0xc
mysqld`key_cache_read+0x741
mysqld`_mi_fetch_keypage+0x48
mysqld`w_search+0x84
mysqld`_mi_ck_write_btree+0xa5
mysqld`mi_write+0x344
mysqld`_ZN9ha_myisam9write_rowEPh+0x43
mysqld`_ZN7handler12ha_write_rowEPh+0x8d
mysqld`_ZL9end_writeP4JOINP13st_join_tableb+0x1a3
mysqld`_ZL20evaluate_join_recordP4JOINP13st_join_tablei+0x11e
mysqld`_Z10sub_selectP4JOINP13st_join_tableb+0x86
mysqld`_ZL9do_selectP4JOINP4ListI4ItemEP5TABLEP9Procedure+0xd9
mysqld`_ZN4JOIN4execEv+0x482
mysqld`_Z12mysql_selectP3THDPPP4ItemP10TABLE_LISTjR4ListIS1_ES2_...
mysqld`_Z13handle_selectP3THDP3LEXP13select_resultm+0x17d
mysqld`_ZL21execute_sqlcom_selectP3THDP10TABLE_LIST+0xa6
mysqld`_Z21mysql_execute_commandP3THD+0x124b
mysqld`_Z11mysql_parseP3THDPcjP12Parser_state+0x3e1
mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x1619
mysqld`_Z24do_handle_one_connectionP3THD+0x1e5
mysqld`handle_one_connection+0x4c
libc.so.1`_thrp_setup+0xbc
libc.so.1`_lwp_start
slide 123:
Stack Profile Method, cont.
• Study, bottom-up:
libc.so.1`mutex_trylock_adaptive+0x112
libc.so.1`mutex_lock_impl+0x165
libc.so.1`mutex_lock+0xc
mysqld`key_cache_read+0x741
mysqld`_mi_fetch_keypage+0x48
mysqld`w_search+0x84
mysqld`_mi_ck_write_btree+0xa5
mysqld`mi_write+0x344
mysqld`_ZN9ha_myisam9write_rowEPh+0x43
mysqld`_ZN7handler12ha_write_rowEPh+0x8d
mysqld`_ZL9end_writeP4JOINP13st_join_tableb+0x1a3
mysqld`_ZL20evaluate_join_recordP4JOINP13st_join_tablei+0x11e
mysqld`_Z10sub_selectP4JOINP13st_join_tableb+0x86
mysqld`_ZL9do_selectP4JOINP4ListI4ItemEP5TABLEP9Procedure+0xd9
mysqld`_ZN4JOIN4execEv+0x482
mysqld`_Z12mysql_selectP3THDPPP4ItemP10TABLE_LISTjR4ListIS1_ES2_...
mysqld`_Z13handle_selectP3THDP3LEXP13select_resultm+0x17d
mysqld`_ZL21execute_sqlcom_selectP3THDP10TABLE_LIST+0xa6
mysqld`_Z21mysql_execute_commandP3THD+0x124b
mysqld`_Z11mysql_parseP3THDPcjP12Parser_state+0x3e1
mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x1619
mysqld`_Z24do_handle_one_connectionP3THD+0x1e5
mysqld`handle_one_connection+0x4c
libc.so.1`_thrp_setup+0xbc
libc.so.1`_lwp_start
slide 124:
Stack Profile Method, cont.
• Profiling, 27,053 unique stacks (already aggregated):
(screenshot: 60 seconds of on-CPU MySQL)
slide 125:
Stack Profile Method, cont.
• Profiling, 27,053 unique stacks (already aggregated):
(screenshot: 60 seconds of on-CPU MySQL; annotations: First Stack, Size of One Stack, Last Stack)
slide 126:
Stack Profile Method, cont.
• Coalesce: Flame Graphs for on-CPU (DTrace/perf/...)
(screenshot: same dataset)
slide 127:
Stack Profile Method, cont.
• Coalesce: perf events for on-CPU (also has interactive mode)
# perf report | cat
[...]
# Overhead Command Shared Object Symbol
# ........ ........... ................. ..............................
72.98% swapper [kernel.kallsyms] [k] native_safe_halt
--- native_safe_halt
default_idle
cpu_idle
rest_init
start_kernel
x86_64_start_reservations
x86_64_start_kernel
9.43% dd [kernel.kallsyms] [k] acpi_pm_read
--- acpi_pm_read
ktime_get_ts
|--87.75%-- __delayacct_blkio_start
io_schedule_timeout
balance_dirty_pages_ratelimited_nr
generic_file_buffered_write
[...]
slide 128:
Stack Profile Method, cont.: Example Toolset
• 1. Profile thread stack traces
• DTrace on-CPU sampling, off-CPU tracing
• 2. Coalesce
• Flame Graphs
• 3. Study stacks bottom-up
slide 129:
Stack Profile Method, cont.
• Pros:
• Can identify a wide range of issues, both on- and off-CPU
• Cons:
• Doesn’t identify issues with dependencies – eg, when blocked on a mutex or CV
• If stacks aren’t obvious, can be time consuming to browse code (assuming you have source access!)
slide 130:
Methodology Ordering
• A suggested order for applying previous methodologies:
• 1. Problem Statement Method
• 2. USE Method
• 3. Stack Profile Method
• 4. Workload Characterization Method
• 5. Drill-Down Analysis Method
• 6.
Latency Analysis Method Thursday, December 13, 12slide 131:Final Remarks • Methodologies should: • solve real issues quickly • not mislead or confuse • be easily learned by others • You may incorporate elements from multiple methodologies while working an issue • methodologies don’t need to be followed strictly – they are a means to an end Thursday, December 13, 12slide 132:Final Remarks, cont. • Be easily learned by others: • Try tutoring/teaching – if students don’t learn it and solve issues quickly, it isn’t working • This was the inspiration for the USE Method – I was teaching performance classes several years ago • I’ve been teaching again recently, which inspired me to document the Stack Profile Method (more classes in 2013) Thursday, December 13, 12slide 133:References • Gregg, B. 2013. Thinking Methodically About Performance. ACMQ http://queue.acm.org/detail.cfm?id=2413037 • Streetlight Effect; http://en.wikipedia.org/wiki/Streetlight_effect • Cockcroft, A. 1995. Sun Performance and Tuning. Prentice Hall. • Hargreaves, A. 2011. I have a performance problem; http:// alanhargreaves.wordpress.com/2011/06/27/i-have-a-performance-problem • McDougall, R., Mauro, J., Gregg, B. 2006. Solaris Performance and Tools. Prentice Hall. • Gregg, B., Mauro, J., 2011. DTrace: Dynamic Tracing in Oracle Solaris, Mac OS X and FreeBSD, Prentice Hall • Millsap, C., Holt, J. 2003. Optimizing Oracle Performance. O’Reilly. • Pacheco, D. 2011. Welcome to Cloud Analytics. http://dtrace.org/blogs/ dap/2011/03/01/welcome-to-cloud-analytics/ • Gregg, B. 2013. Systems Performance, Prentice Hall (upcoming!) – includes more methodologies Thursday, December 13, 12slide 134:Methodology Origins • Anti-Methodologies – Bryan Cantrill encouraged me to write these up, and named them, while I was documenting other methodologies • Problem Statement Method – these have been used by support teams for a while; Alan Hargreaves documented it for performance • Scientific Method – science! • Latency Analysis Method – Cary Millsap has popularized latency analysis recently with Method R • USE Method – myself; inspired to write about methodology from Cary Millsap’s work, while armed with the capability to explore methodology due to team DTrace’s work • Stack Profile Method (incl. flame graphs & off-CPU analysis) – myself • Ad Hoc Checklist, Workload Characterization, and Drill-Down Analysis have been around in some form for a while, as far as I can tell Thursday, December 13, 12slide 135:Thank you! • email: brendan@joyent.com • twitter: @brendangregg • blog: http://dtrace.org/blogs/brendan • blog resources: • http://dtrace.org/blogs/brendan/2012/02/29/the-use-method/ • http://dtrace.org/blogs/brendan/2012/03/01/the-use-method-solarisperformance-checklist/ • http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performancechecklist/ • http://dtrace.org/blogs/brendan/2011/05/18/file-system-latency-part-3/ • http://dtrace.org/blogs/brendan/2011/07/08/off-cpu-performance-analysis/ • http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/ Thursday, December 13, 12