I originally posted this at http://dtrace.org/blogs/brendan/2011/06/03/file-system-latency-part-5.
This is part 5 of File System Latency, a series on storage I/O performance from the application perspective (see parts 1, 2, 3 and 4). Previously I explained why disk I/O metrics may not reflect application performance, and how some file system issues may be invisible at the disk I/O level. I then showed how to resolve this by measuring file system latency at the application level using MySQL as an example, and measured latency from other levels of the operating system stack to pinpoint the origin.
The main tool I've used so far is DTrace, which is great for prototyping new metrics in production. In this post, I'll show what this can mean beyond DTrace, establishing file system latency as a primary metric for system administrators and application developers. I'll start by discussing the history for this metric on the Solaris operating system, how it's being used at the moment, and what is coming up next.
This is not just a pretty tool, but the culmination of years of experience (and pain) with file system and disk performance. I've explained this history a little below. To cut to the chase, see the Cloud Analytics and vfsstat sections.
A Little History
During the last five years, new observability tools have appeared on Solaris to measure file system latency and file system I/O. In particular these have targeted the Virtual File System (VFS) layer, which is the kernel (POSIX-based) interface for file systems (and was traced in part 4):
+---------------+ | Application | | mysqld | +---------------+ reads ^ | writes | | | | user-land -------------system calls--------------------------- | | kernel +------------------+ HERE --> | VFS | +------------------+ ^ | ^ | | V | V +--------+ +-------+ | ZFS | | UFS | ... +--------+ +-------+ [...]
Outside of kernel engineering, the closest most of us got to VFS was in operating systems books, seen in diagrams like that above. It's been an abstract notion - not really a practical and observable component. This attitude has been changing with the release of DTrace (2003), which allows us to measure VFS I/O and latency directly, and fsstat(1M), a tool for file system statistics (2006).
Richard McDougall, Jim Mauro and I presented some DTrace-based tools to measure file system I/O and latency at the VFS layer in Solaris Performance and Tools (Prentice Hall, 2006), including vopstat (p118) and fsrw.d (p116). These and other VFS tracing scripts (fspaging.d, rfileio.d, rfsio.d, rwbytype.d) are in the DTraceToolkit (see the FS subdirectory).
fsstat(1M) was also developed in 2006 by Rich Brown to provide kstat-based VFS statistics (PSARC 2006/34) and added to Solaris. This tool is great, and since it is kstat-based it provides historic values with negligible performance overhead. (Bryan Cantrill later added DTrace probes to the kstat instrumentation, to form the fsinfo provider: PSARC 2006/196.) However, fsstat(1M) only provides operation and byte counts, not file system latency - which we really need.
For the DTrace book (Prentice Hall, 2011), Jim and I produced many new VFS scripts, covering Solaris, Mac OS X and FreeBSD. These are in the File Systems chapter (which is available as a PDF). While many of these did not report latency statistics, it is not difficult to enhance the scripts to do so, tracing the time between the "entry" and "return" probes (as was demonstrated via a one-liner in part 4).
DTrace has been the most practical way to measure file system latency across arbitrary applications, especially scripts like those in part 3. I'll comment briefly on a few more sources of performance metrics on Solaris: kstats, truss(1M), LatencyTOP, and application instrumentation.
Kernel statistics (kstats) is a "registry of metrics" (thanks Ben for the term) on Solaris, which provide the raw numbers for traditional observability tools including iostat(1M). While there are many thousands of kstats available, file system latency was not among them.
Even if you were a vendor whose job it was to build monitoring tools on top of Solaris, you can only use what the operating system gives you, hence the focus on disk I/O statistics from iostat(1M) or kstat. More on this in a moment (vfsstat).
You could attach system call tracers such as truss(1) or strace(1) to applications one-by-one to get latency from the syscall level. This would involve examining reads and writes and associating them back to file system-based file descriptors, along with timing data. However, the overhead of these tools is often prohibitive in production due to the way these tools work.
Another tool that could measure file system latency is LatencyTOP. This was released in 2008 by Intel to identify sources of desktop latency on Linux, and implemented by the addition of static trace points throughout the kernel. To see if DTrace could fetch similar data without kernel changes, I quickly wrote latencytop.d. LatencyTOP itself was ported to Solaris in 2009 (PSARC 2009/339):
LatencyTOP uses the Solaris DTrace APIs, specifically the following DTrace providers: sched, proc and lockstat.
While it isn't a VFS or file system oriented tool, its latency statistics do include file system read and write latency, presenting them as a "Maximum" and "Average". With these, it may be possible to identify instances of high latency (outliers), and increased average latency. Which is handy, but about all that is possible. To confirm that file system latency is directly causing slow queries, you'll need to use more DTrace like I did in part 3 to sum file system latency during query latency.
Application developers can instrument their own code as it performs file I/O, collecting high resolution timestamps to calculate file system latency metrics. Because of this, they haven't needed DTrace, however they do need the foresight to have added these metrics before the application is in production. Too often I've been looking at a system where if we could restart the application with different options, then we could probably get the performance data we need. But restarting the application comes with serious cost (downtime), and can mean that the performance issue isn't visible again for hours or days (eg, memory growth/leak related). DTrace lets me get this data immediately.
That's not to say that DTrace is better than application-level metrics - it isn't. If the application already provides file system latency metrics, then use them. Running DTrace will add much more performance overhead than (well designed) application-level counters.
Check what the application provides before turning to DTrace.
I put this high up in the Strategy sections in the DTrace book, not only to avoid re-inventing the wheel, but since familiarization with application metrics is excellent context to build upon with DTrace.
 Well, almost. See the "CPU Latency" section in part 3, which is also true for application-level measurements. DTrace can inspect the kernel and differentiate CPU latency from file system latency, but as I said in part 3, you don't want to be running in a CPU latency state to start with.
I've been using MySQL as an example application to investigate with DTrace. Apart from illustrating the technique, this may serve a second reason: a solution for systems with DTrace but no MySQL support for measuring file system latency. Recent versions of MySQL have provided the performance schema which can do this. Mark Leith posted a detailed article: Monitoring MySQL IO Latency with performance_schema, writing:
filesystem latency can be monitored from the current MySQL 5.5 GA version, with performance schema, on all platforms.
In part 3 I thought it would need version 5.6 to do this, but Mark has demonstrated this on 5.5 GA. This is good news if you are on MySQL 5.5 GA or later, and are running with the performance-schema option.
The DTrace story doesn't quite end here. I'll save this for a separate post as it's more about the MySQL performance schema than the file system latency metric.
What's Happening Now
I'm regularly using DTrace to identify issues of high file system latency, and to quantify how much it is affecting application performance. This is for cloud computing environments, where multiple tenants are sharing a pool of disks, and where disk I/O statistics from iostat(1M) look more alarming than reality (reasons why this can happen are explained in part 1).
The most useful scripts I'm using include those in part 3 to measure latency from within MySQL, and also because these scripts can be run by the customer. Most of the time they show that the file system is performing great, and returning out of DRAM cache (thanks to our environment). This leads the investigation to other areas, narrowing the scope to where the issue really is and saving time skipping where it isn't.
I've identified real disk based issues as well (which are fortunately rare) which when traced as file system latency show that the application really is affected. Again, this saves time, as knowing that there is a file system issue to investigate is a much better position than knowing that there might be one.
Tracing file system latency has worked best from two locations:
Application layer: as demonstrated in part 3, which provides application context to identify if the issue is real (synchronous to workload) and what is affected (workload attributes). The key example from that post was mysqld_pid_fslatency_slowlog.d, which printed the total file system latency along with the query latency - so that slow queries could be immediately identified as file system-based or not.
VFS layer: as demonstrated in part 4, which allows all applications to be traced simultaneously regardless of their I/O code-path. Since these trace inside the kernel, the scripts are unable to be run by customers in the cloud computing environment (zones), as the fbt provider is not available.
For the rare times that there is high file system latency, I'll dig down deeper into the kernel stack to pinpoint the location: tracing the specific file system type (ZFS, UFS, ...), and disk device drivers, as shown in part 4 and chapters 4 and 5 of the DTrace book. This includes using several custom fbt provider-based DTrace scripts, which are fairly brittle as they trace a specific kernel version.
File system latency has become so important that examining it interactively via DTrace scripts is not enough. Many people do use remote monitoring tools (eg, munin) to fetch statistics for graphing, and it's not straight-forward to run these DTrace scripts 24x7 to feed remote monitoring. And those tools would be better to show the full distribution of latency, such as with a heat map, and not just a line graph of the average latency per-second.
At Joyent we've been developing solutions to these on the latest version of our operating system, SmartOS (which is Illumos based). These solutions include vfsstat(1M) and the file system latency instrumentation in Cloud Analytics.
For a disk I/O summary, iostat(1M) does do a good job (using -x for the extended columns). The main issue is that from an application perspective we'd like the statistics to be measured closer to the app, such as in the VFS level.
vfsstat(1M) is a new tool developed by Bill Pijewski of Joyent to do this. You can think of it as an iostat(1M)-like tool for the VFS level, breaking down by smartmachine (zone) instead of by-disk. He showed it in a recent blog post here on dtrace.org. Sample output:
$ vfsstat 1 r/s w/s kr/s kw/s ractv wactv read_t writ_t %r %w d/s del_t zone 2.5 0.1 1.5 0.0 0.0 0.0 0.0 2.6 0 0 0.0 8.0 06da2f3a (437) 1540.4 0.0 195014.9 0.0 0.0 0.0 0.0 0.0 3 0 0.0 0.0 06da2f3a (437) 1991.7 0.0 254931.5 0.0 0.0 0.0 0.0 0.0 4 0 0.0 0.0 06da2f3a (437) 1989.8 0.0 254697.0 0.0 0.0 0.0 0.0 0.0 4 0 0.0 0.0 06da2f3a (437) 1913.0 0.0 244862.7 0.0 0.0 0.0 0.0 0.0 4 0 0.0 0.0 06da2f3a (437) ^C
Rather than the VFS operation counts shown by fsstat(1M), vfsstat(1M) shows resulting VFS performance, including the average read I/O time ("read_t"). And unlike iostat(1M), if vfsstat(1M) is showing an increase in average latency then you know that applications have suffered.
If vfsstat(1M) does identify high latency, the next question is to check if sensitive code-paths have suffered (the "synchronous component of the workload" requirement), which can be identified using the pid provider. An example of this was the mysqld_pid_fslatency_slowlog.d script in part 3, which expressed total file system I/O latency next to query time.
vfsstat(1M) can be a handy tool to run before reaching for DTrace, as it is using kernel statistics (kstats) that are essentially free to use (already maintained and active), and can be run as a non-root user.
A new class of kstats were added for vfsstat(1M), called "zone_vfs". Listing them:
$ kstat zone_vfs module: zone_vfs instance: 437 name: 06da2f3a-752c-11e0-9f4b-07732c class: zone_vfs 100ms_ops 107 10ms_ops 315 1s_ops 19 crtime 960767.771679531 delay_cnt 2160 delay_time 16925 nread 4626152694 nwritten 78949099 reads 7492345 rlentime 27105336415 rtime 21384034819 snaptime 4006844.70048824 wlentime 655500914122 writes 277012 wtime 576119455347
Apart from the data behind the vfsstat(1M) columns, there are also counters for file system I/O with latency greater than 10 ms (10ms_ops), 100 ms (100ms_ops), and 1 second (1s_ops). While these counters have coarse latency resolution, they do provide a historic summary of high file system latency since boot. This may be invaluable for diagnosing a file system issue after the fact, if it wasn't still happening for DTrace to see live, and if remote monitoring of vfsstat(1M) wasn't active.
vfsstat(1M) can be used as an addition to remote monitoring tools (like munin) which currently graph disk I/O statistics from iostat(1M). This not just provides sysadmins with historical graphs, but it also can provide others without root and DTrace access to observe VFS performance, including application developers and database administrators.
Modifying tools that already process iostat(1M) to also process vfsstat(1M) should be a trivial exercise. vfsstat(1M) also supports the -I option to prints absolute values, so that this could be executed every few minutes by the remote monitoring tool and averages calculated after the fact (without needing to leave it running):
$ vfsstat -Ir r/i,w/i,kr/i,kw/i,ractv,wactv,read_t,writ_t,%r,%w,d/i,del_t,zone 6761806.0,257396.0,4074450.0,74476.6,0.0,0.0,0.0,2.5,0,0,0.0,7.9,06da2f3a,437
I used -r as well to print the output in comma-separated format, to make it easier to parse by monitoring software.
Here are some exerpts, showing the other available switches and column definitions:
SYNOPSIS vfsstat [-hIMrzZ] [interval [count]] DESCRIPTION The vfsstat utility reports a summary of VFS read and write activity per zone. It first prints all activity since boot, then reports activity over a specified interval. When run from a non-global zone (NGZ), only activity from that NGZ can be observed. When run from a the global zone (GZ), activity from the GZ and all other NGZs can be observed. [...] OUTPUT The vfsstat utility reports the following information: r/s reads per second w/s writes per second kr/s kilobytes read per second kw/s kilobytes written per second ractv average number of read operations actively being serviced by the VFS layer wactv average number of write operations actively being serviced by the VFS layer read_t average VFS read latency writ_t average VFS write latency %r percent of time there is a VFS read operation pending %w percent of time there is a VFS write operation pending d/s VFS operations per second delayed by the ZFS I/O throttle del_t average ZFS I/O throttle delay, in microseconds OPTIONS The following options are supported: -h Show help message and exit -I Print results per interval, rather than per second (where applicable) -M Print results in MB/s instead of KB/s -r Show results in a comma-separated format -z Hide zones with no VFS activity -Z Print results for all zones, not just the current zone [...]
Similar to how iostat(1M)'s %b (percent busy) metric works, the vfsstat(1M) %r and %w columns show the percentages of time that read or write operations were active. Once they hit 100% this only means that 100% of the time something was active - not that there is no more headroom to accept more I/O. It's the same for iostat(1M)'s %b - disk devices may accept additional concurrent requests even though they are already running at 100% busy.
The default vfsstat(1M) output includes columns (d/s, del_t) to show the performance effect of I/O throttling - another feature by Bill that manages disk I/O in a multi-tenant environment. I/O throttle latency will be invisible at the disk level, like those other types described in part 2. And it is very important to observe, as file system latency issues could now simply be I/O throttling preventing a tenant from hurting others. Since it's a new latency type, I'll illustrate it here:
+-------------------------------+ | Application | | mysqld | +-------------------------------+ I/O | I/O latency ^ I/O request | <--------------> | completion | | ---------------------------------------------------- | | <-- vfsstat(1M) | | +-------------------------------------------+ | File System | | | | | I/O throttling | | | X-------------->. | | | |I/O | | | |wait| | | V | | +-------------------------------------------+ | | | | <-- iostat(1M) .------. `------' | disk | `------'
As pictured here, the latency for file system I/O could be dominated by I/O throttling wait time, and not the time spent waiting on the actual disk I/O.
vfsstat(1M) is handy for some roles, such as an addition to remote monitoring tools that already handle iostat(1M)-like output. However, as a summary of average latency, it may not identify issues with the distribution of I/O. For example, if most I/O were fast with a few very slow I/O (outliers), the average may hide the presence of the few slow I/O. This is an issue we've encountered before, and solved using a 3rd dimension to show the entire distribution over time as a heat map. We've done this for file system latency in Cloud Analytics.
As I discussed in the MySQL Query Latency post, there can be interesting patterns seen over time that are difficult to visualize at the text-based command line. This is something that we are addressing with Cloud Analytics (videos), using heat maps to display file system latency. This is done for all applications and file sytems, by tracing in the kernel at the VFS level. Here is an example:
Time is the x-axis, file system I/O latency is the y-axis, and the number of I/O at each pixel is represented by color saturation (z-axis). The pixels are deliberately drawn large so that their x and y ranges will sometimes span multiple I/O, allowing various shades to be picked and patterns revealed.
This heatmap is showing a cloud node running MySQL 5.1.57, which was under a steady workload from sysbench. Most of the file system I/O is returning so fast it has been grouped in the first pixel at the bottom of the heat map, which represents the lowest latency.
I/O with particularly high latency will be shown at the top of the heat map. The example above shows a single outlier at the top. This was clicked, revealing details below the heat map. It was a single ZFS I/O with latency between 278 and 286 ms.
This visualization makes identifying outliers trivial - outliers which can cause problems and can be missed when considering latency as an average. Finding outliers was also possible using the mysqld_pid_fslatency.d script from part 3; this is what an outlier with a similar latency range looks like from that script:
read value ------------- Distribution ------------- count 1024 | 0 2048 |@@@@@@@@@@@ 2926 4096 |@@@@@@@@@@@@@@@@ 4224 8192 |@@@@@@@@ 2186 16384 |@@@@@ 1318 32768 | 96 65536 | 3 131072 | 5 262144 | 0 524288 | 0 1048576 | 0 2097152 | 1 4194304 | 3 8388608 | 2 16777216 | 1 33554432 | 1 67108864 | 0 134217728 | 0 268435456 | 1 536870912 | 0
Consider taking this distribution and plotting it as a single column in the heat map. Do this every second, displaying them across the x-axis. This is what Cloud Analytics does, which also uses DTrace to efficiently collect and aggregate the data in-kernel as the distribution before passing the summary to user-land. Cloud Analytics also uses a higher resolution distribution than the power-of-2 shown here: it uses Log/linear quantization which Bryan added to DTrace for this very reason.
The mysqld_pid_fslatency.d script did show a separate distribution for reads and writes; this is also possible in Cloud Analytics, which can measure details as an extra dimension.
The dataset can include another attribute to breakdown the data, which in the example above was the file system type. This is like having a 4th dimension in the dataset - potentially very useful, but tricky to visualize. We've used different color hues for the elements in that dimension, which can be selected by clicking the list on the left:
This shows writes in blue and fsyncs in green. These were also shown in isolation from the other operation types to focus on their details, and the y-axis has been zoomed to show finer details: latency from 0 to 0.4 ms.
This heat map also shows that that the distribution has detail across the latency axis: the fsyncs mostly grouped into two bands, and the writes were usually grouped into one lower latency band, but sometimes split into two. This split happened twice, for about 10 to 20 seconds each time, appearing as blue "arches". The higher band of latency is only about 100 microseconds slower - so it's interesting, but probably isn't an issue.
The x-axis, time, reveals patterns when the write latency changes second-by-second: seen above by the blue "arches". These would be difficult to spot at the command line; I could adjust the mysqld_pid_latency.d script to output the distribution every second - but I would need to read hundreds of pages of text-based distributions to comprehend the pattern seen in the above heat map.
This behavior was something I just discovered (I happen to be benchmarking MySQL for a different reason), and can't offer a explanation for this yet with certainty. Now that I know this behavior exists and the latency cost, I can decide if it is worth investigating further (with DTrace), or if there are larger latency issues to work on instead.
Here is another time-based pattern to consider, this time it's not MySQL:
It's iozone's auto mode.
iozone is performing numerous write and read tests, stepping up the file size and the record size of the I/O. The resulting I/O latency can be seen to creep up as the record size is increased for the runs, and reset again for the next run.
The screenshots I've used showed file system latency by file system, and by operation type. Here is the current list of possible breakdowns:
Since this is tracing at the VFS level, it still has application context allowing the application name and arguments to be examined.
So far this is just showing I/O latency at the VFS level. If you've read my previous posts in this series, you'll know that this solves many problems but not all problems. It is especially good at identifying outliers, and for illustrating the full distribution, not just the average. However, having application (eg, MySQL) context lets us take it a step further, expressing file system latency as a portion of the application request time. This was demonstrated by the mysqld_pid_fslatency_slowlog.d script in part 3, which provides a metric that can measure and prove that file system latency was hurting the application's workload, and exactly by how much.
Cloud Analytics can do (and will do) a lot more than just file system latency. See Dave Pacheco's blog (who is the lead engineer) for future posts. We'll also both be giving a talk at OSCONdata 2011 about it titled Design and Implementation of a Real-Time Cloud Analytics Platform.
For more background on these type of visualizations, see my ACMQ/CACM article on visualizing system latency, which focused on NFS latency (still a file system) and included other interesting patterns.
Interesting patterns like those above sometimes happen, but the distributions are more often mundane (at first glance). I'll finish by demonstrating this with a simple workload (which was also shown in the first screenshot in this post); A 1 Gbyte file is read randomly until it has been entirely cached:
About midway (60%) across the heat map (x-axis, time) the file became fully cached in DRAM. The left side shows three characteristics:
- A line at the bottom of the heat map, showing very fast file system I/O. These are likely to be DRAM cache hits (more DTrace could confirm if needed).
- A "cloud" of latency from about 3 to 10 ms. This is likely to be random disk I/O.
- Vertical spikes of latency, about every 5 seconds. This is likely evidence of some I/O queueing (serializing) behind an event, such as a file system flush. (More DTrace, like that used in part 4, can be used to identify the exact reason.)
This is great. Consider again what would happen if this was a line graph instead that showed average latency per second. All of these interesting details would be squashed into a single line, averaging both DRAM cache latency and disk I/O latency together.
Zooming in vertically on the right hand side reveals the latency of the DRAM hits:
This shows the distribution of the file system DRAM cache hits. The y-axis scale is now 100 microseconds (0.1 ms) - this extrordinary resolution made possible by both DTrace and Bryan's recent llquantize() addition.
Most of the latency is at the bottom of the distribution. In this case, the default rank-based palette (aka "false color palette") has emphasised the pattern in the higher latency ranges. It does this in an unusual but effective way: it applies the palatte evenly across a list of heat map elements sorted (ranked) by their I/O count, so that the full spectrum is used to emphasize details. There the I/O count affects the pixel rank, and the saturation is based on that rank. But the saturation isn't proportional: a pixel that is a little bit darker may span 10x the I/O.
Basing the saturation on the I/O count directly is how the linear-based palette works, which may not use every possible shade, but the shades will be correctly proportional. The "COLOR BY" control in Cloud Analytics allows this palette to be selected:
While the linear palette washes out finer detals, it's better at showing where the bulk of the I/O were - here the darker orange line of lower latency.
File system latency has become an essential metric for understanding application performance, with increased functionality and caching of file systems leaving disk-level metrics confusing and incomplete. Various DTrace-based tools have been created to measure file system latency, with the most effective expressing it as a synchronous component of the application workload: demonstrated here by the sum of file system latency during a MySQL query. I've been using these tools for several months, solving real world performance issues in a cloud computing environment.
To allow file system latency to be more easily observed, a new command line tool has been developed, vfsstat(1M), which reports iostat(1M)-like metrics for the VFS level and can be consumed by remote monitoring tools. To allow it to be examined in greater detail, it has also been added to Joyent's Cloud Analytics as a heat map, with various breakdowns possible. If needed, DTrace at the command-line can be used to inspect latency event-by-event, and at any layer of the operating system stack from the application interface to the kernel device drivers - to pinpoint the origin.