OSCON 2007: Observability Matters, How DTrace Helped Twitter
Talk given at OSCON 2007 by Adam Leventhal and Brendan Gregg. It summarizes the findings of a team of Sun Microsystems engineers who visited the headquarters of a small new startup that was running Solaris and suffering crippling performance problems. That startup was Twitter.
PDF: OSCON2007_Observability_DTrace_Twitter.pdf
Slide text (extracted with pdftotext):
slide 1:
Observability Matters: How DTrace Helped Twitter
Adam Leventhal, Brendan Gregg
FishWorks, Sun Microsystems
slide 2:
This Talk
● Twitter performance meltdown
● Introduction to DTrace
● DTrace on the case
● Solutions and results
● DTrace for your application
● Q&A
slide 3:
What Is Twitter?
● Social networking/RSS/SMS
● Ruby on Rails application
● Horizontally scaled
● Centralized MySQL backend
slide 4:
Twitter's Problem
● Started with a few users
● Didn't scale with their success
● High latencies could make it unusable
● Many possible suspects:
  – The OS: kernel, libraries, etc.
  – Ruby, MySQL, Apache
  – The Twitter application itself
slide 5:
Enter DTrace
● Luckily: Twitter had DTrace
  Solaris 10, Mac OS X 10.5, FreeBSD*
● Systemic observability
  Ruby, Java, JavaScript, C/C++, kernel ...
● Concise answers to arbitrary questions
● Designed for production systems
  – Architected to always be safe
  – No overhead when not in use
● Also great for developers
slide 6:
DEMO
Trace all system calls:
# dtrace -n syscall:::entry
Aggregate by executable name:
# dtrace -n 'syscall:::entry{ @[execname] = count(); }'
Examine a user-land process and generate a power-of-two histogram of allocation sizes:
# dtrace -c date -n 'pid$target::malloc:entry{ @ = quantize(arg0); }'
Trace all I/O on the system:
# dtrace -n io:::start
In summary: DTrace has systemic scope.
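(Note: one-liners like these can also be saved as standalone D scripts. A minimal sketch, assuming a file name of countsyscalls.d, which is not from the talk:)

#!/usr/sbin/dtrace -s
/* Count system calls by executable name; the aggregation prints when the script exits (Ctrl-C). */
syscall:::entry
{
    @calls[execname] = count();
}

Run it with "dtrace -s countsyscalls.d", or chmod +x and execute it directly.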
slide 7:
The Investigation
● High latencies under load
● What could cause these latencies?
  – on-CPU time: slow/numerous functions, ...
  – off-CPU time: I/O, lock contention, ...
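(Aside, not from the talk: the off-CPU side can be checked quickly with the sched provider. A minimal sketch that histograms, in nanoseconds, how long ruby threads spend blocked off-CPU:)

#!/usr/sbin/dtrace -s
/* Time from a ruby thread leaving a CPU to running again, as a power-of-two histogram (ns). */
sched:::off-cpu
/execname == "ruby"/
{
    self->off = timestamp;
}

sched:::on-cpu
/self->off/
{
    @blocked["ns off-CPU"] = quantize(timestamp - self->off);
    self->off = 0;
}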
slide 8:
DEMO
Run our load generator program that attempts to mimic the salient problem that we saw when examining the Twitter application in production.
$ ./deepstack.rb
Use a familiar tool to start with; in this case we used mpstat(1):
# mpstat 1
[mpstat output: the usr column showed nearly all CPU time spent in user-land]
We're spending all our time in user-land (the usr column), so let's see which application is taking up our time by using a profile probe that fires at 1234 Hz and recording what is running:
# dtrace -n 'profile-1234{ @[execname] = count(); }'
Now that we've seen that it's Ruby, let's look at the user stack traces in Ruby:
# dtrace -n 'profile-1234/execname == "ruby"/{ @[ustack()] = count(); }'
slide 9:
Investigation: Steps
1. Started with a broad look at the system using familiar tools: mpstat, ...
2. DTrace profiling to sample processes
3. DTrace profiling to sample user stacks
slide 10:
Investigation: So Far
● What we knew:
  – High application latency (from somewhere)
● What we learned:
  – Latency may be due to high CPU load
  – CPU load is due to some Ruby component:
    ● Twitter application
    ● Standard Ruby libraries
    ● Interpreter itself
slide 11:
DEMO
We'll see what functions in the Ruby process are being called most frequently:
# dtrace -p `pgrep ruby` -n 'pid$target:::entry{ @[probefunc] = count(); }'
Now that we see that it's memcpy(3C), let's see the stack trace:
# dtrace -p `pgrep ruby` -n 'pid$target::memcpy:entry{ @[ustack()] = count(); }'
Looking at the stack trace a bit, we noticed backtrace(), which seemed strange – why would the Twitter application spend so much time taking stack backtraces? We wanted to measure the amount of time spent in backtrace() to confirm our findings (be sure to look at the backtrace.d script):
# ./backtrace.d -p `pgrep ruby`
[output showed per-interval totals such as "backtrace TOTAL: 665 ms" and "1005 ms"]
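(The backtrace.d script itself isn't reproduced on the slide. A sketch of a script in the same spirit, assuming the interpreter function appears to the pid provider simply as "backtrace"; this is not the original script:)

#!/usr/sbin/dtrace -s
/* backtrace.d-style sketch: total on-CPU time in backtrace(), printed each second. */
pid$target::backtrace:entry
{
    self->ts = vtimestamp;
}

pid$target::backtrace:return
/self->ts/
{
    @total["backtrace total (ns)"] = sum(vtimestamp - self->ts);
    self->ts = 0;
}

profile:::tick-1sec
{
    printa(@total);
    trunc(@total);
}

Run it with -p `pgrep ruby`, as above.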
slide 12:
Investigation: Steps
4. Function call counts
5. ustack() for memcpy()
6. Time spent in backtrace()
● What we learned:
  – 40% of CPU time spent in backtrace()
slide 13:
DEMO
We noticed that backtrace() was making a bunch of calls to snprintf(3C), which is used to format strings. Our hope was that the snprintf(3C) calls might be used to format the Ruby stack backtrace for use in an exception object, so we wrote a rather complicated script to test this theory (be sure to look at printstack.d):
# ./printstack.d -p `pgrep ruby`
slide 14:
> ./deepstack.rb:20:in `parserfunc2'
  ./deepstack.rb:6:in `parserfunc1'
  ./deepstack.rb:18:in `parserfunc2'
  ./deepstack.rb:6:in `parserfunc1'
  ./deepstack.rb:6:in `parserfunc1'
  ./deepstack.rb:33
We saw something quite similar at Twitter: a very deep Ruby stack, albeit with a more complex texture.
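(printstack.d isn't reproduced here either; the real script was more involved. A much-simplified sketch of the idea: print whatever snprintf(3C) formats while the process is inside backtrace(). The function name "backtrace" and the libc.so.1 module are assumptions:)

#!/usr/sbin/dtrace -qs
/* Simplified printstack.d-style sketch, not the original from the talk. */
pid$target::backtrace:entry  { self->in = 1; }
pid$target::backtrace:return { self->in = 0; }

pid$target:libc.so.1:snprintf:entry
/self->in/
{
    self->buf = arg0;    /* destination buffer about to be formatted */
}

pid$target:libc.so.1:snprintf:return
/self->buf/
{
    printf("%s\n", copyinstr(self->buf));
    self->buf = 0;
}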
Investigation: Steps
7. Located source of backtrace() calls in Ruby code
● What we learned:
  – Several instances of code like this:
    @string = (str.string rescue str)
slide 15:
Results: No Exceptions
● Rather than blithely calling the method:
  @string = (str.string rescue str)
● ... check first:
  @string = str.respond_to?(:string) ? str.string : str
● Result: 30% drop in CPU utilization
slide 16:
Ruby Provider
● Joyent built DTrace-enabled Ruby
● Trace function entry and return
● Probes for line execution, memory allocation, etc.
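(Sketch, not from the talk, of timing a single Ruby method with these probes. arg1 is the method name, as used in the demo that follows; "parse" is just a hypothetical method of interest:)

#!/usr/sbin/dtrace -s
/* Total wall-clock time spent in a hypothetical Ruby method named "parse". */
ruby$target:::function-entry
/copyinstr(arg1) == "parse"/
{
    self->ts = timestamp;
}

ruby$target:::function-return
/self->ts && copyinstr(arg1) == "parse"/
{
    @["total ns in parse"] = sum(timestamp - self->ts);
    self->ts = 0;
}

Run it with -p `pgrep ruby` against the DTrace-enabled Ruby.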
slide 17:
DEMO
You'll need the DTrace-enabled Ruby, which you can find here: http://dtrace.joyent.com
Then run our other load generator program with that version of Ruby:
$ .../ruby hotfuncs.rb
List all Ruby probes in the DTrace framework:
# dtrace -l -n 'ruby*:::'
Use the function-entry probe to see what Ruby functions are being called most frequently:
# dtrace -n 'ruby*:::function-entry{ @[copyinstr(arg1)] = count(); }'
func_a ...
slide 18:
Results: CRC32 in C
● Precision optimization
● Rewrite the CRC32 computation in C
● Estimated result: 15% drop in CPU utilization
slide 19:
DTrace Your Application
● Get your application on an OS with DTrace
● Start with the tools you know and dive deeper with DTrace
● If you can do it today or tomorrow, find Adam and Brendan: we'll help you investigate
● DTrace sees all
slide 20:
Q&A (and links)
– DTrace home page
  http://www.opensolaris.org/os/community/dtrace
– DTrace-enabled Ruby
  http://dtrace.joyent.com
– Getting started with DTrace
  http://blogs.sun.com/ahl/entry/dtrace_boot_camp
– Exception problem:
  http://dev.rubyonrails.org/ticket/8159
  http://dev.rubyonrails.org/changeset/6571
– Joyent's blog post on this subject
  http://joyeur.com/2007/04/24/solaris-dtrace-and-rails
slide 21:
Observability Matters: How DTrace Helped Twitter
Adam Leventhal  http://blogs.sun.com/ahl
Brendan Gregg  http://blogs.sun.com/brendan