File System Latency: part 3

I originally posted this at http://dtrace.org/blogs/brendan/2011/05/18/file-system-latency-part-3.

This is part 3 on file system latency, a series on storage I/O performance from the application perspective (see part 1 and part 2). Here I'll show how to actually measure file system I/O latency - the time spent waiting on the file system to complete I/O. Examining this can save a lot of time when considering disk I/O as a source of performance issues.

As an example of an application to study I picked a busy MySQL server, and I'll focus on using the DTrace pid provider to examine storage I/O. For an introduction to MySQL analysis with DTrace, see my earlier post on MySQL Query Latency. Here I'll take that topic further, measuring the file system component of query latency.

File System Latency Distribution

I'll start by showing some results of measuring this, and then tool used to do it.

This is file system latency within a busy MySQL database, for a 10 second interval:

# ./mysqld_pid_fslatency.d -n 'tick-10s { exit(0); }' -p 7357
Tracing PID 7357... Hit Ctrl-C to end.
MySQL filesystem I/O: 55824; latency (ns):

  read                                              
           value  ------------- Distribution ------------- count    
            1024 |                                         0        
            2048 |@@@@@@@@@@                               9053     
            4096 |@@@@@@@@@@@@@@@@@                        15490    
            8192 |@@@@@@@@@@@                              9525     
           16384 |@@                                       1982     
           32768 |                                         121      
           65536 |                                         28       
          131072 |                                         6        
          262144 |                                         0        

  write                                             
           value  ------------- Distribution ------------- count    
            2048 |                                         0        
            4096 |                                         1        
            8192 |@@@@@@                                   3003     
           16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             13532    
           32768 |@@@@@                                    2590     
           65536 |@                                        370      
          131072 |                                         58       
          262144 |                                         27       
          524288 |                                         12       
         1048576 |                                         1        
         2097152 |                                         0        
         4194304 |                                         10       
         8388608 |                                         14       
        16777216 |                                         1        
        33554432 |                                         0

This shows the distribution of file system I/O latency in nanoseconds in the left column ("value"), with the number of I/O in that latency range shown in the right column ("count"). Most of the I/O (where the ASCII distribution plot has its spikes) was between 2 and 16 microseconds for the reads, and 8 and 65 microseconds for the writes. That's fast - and is a strong indication that these reads and writes were to the DRAM cache and not to disk.

The slower time for writes vs reads is probably due to the time to acquire write locks and the buffers to write data to, and to manage the new file system metadata to reference it. I can answer this with more DTrace if needed.

A small handful of the writes, 25 in total, fell in the 4 to 33 millisecond range - the expected time for disk I/O, including a degree of queueing. (If it's not clear in the above output - 4194304 nanoseconds == 4 milliseconds.) Compared to all the other I/O during tracing shows that the file system cache was running with a hit rate of over 99.9%.

It's neat to be able to see these system components from the latency distribution, with annotations:

  write                                             
           value  ------------- Distribution ------------- count    
            2048 |                                         0        
            4096 |                                         1        
            8192 |@@@@@@                                   3003     
           16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@            13532   <--- DRAM cache I/O
           32768 |@@@@@                                    2590     
           65536 |@                                        370      
          131072 |                                         58       
          262144 |                                         27       
          524288 |                                         12       
         1048576 |                                         1        
         2097152 |                                         0        
         4194304 |                                         10     <--- disk I/O
         8388608 |                                         14       
        16777216 |                                         1        
        33554432 |                                         0

This is just based on the times; to be certain that the 4+ ms I/O was disk based, more DTrace can confirm.

In summary, this shows that file system I/O is usually lightening fast here, hitting out of DRAM. For an application, this looks like a GREAT system to be running on.

Comparing to iostat(1M)

This is actually the same system examined in part 1 using iostat(1M), and was traced at the same time. I had two shells running, and collected the iostat(1M) output at the exact same time as this DTrace output. As a reminder, the disks were doing this according to iostat(1M):

# iostat -xnz 1 10
                    extended device statistics              
[...]
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  208.0    0.0 26619.9    0.0  0.0  1.9    0.0    9.2   0  99 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  208.0    0.0 26624.4    0.0  0.0  1.7    0.0    8.2   0  95 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  106.0  368.9 13566.1 26881.3  0.0  2.7    0.0    5.7   0  93 c0t1d0
[...]

Which looks awful. But at the file system level, it is great.

During the 10 seconds that both tools were running, this MySQL database experienced multiple slow queries (queries longer than one second). Based on the iostat(1M) output, one may spend a while investigating disk I/O issues, but this would be starting out in the wrong direction. The issue isn't slow disk I/O. The file system latency distribution show only a trickle reaching disk, and the vast majority of I/O returning at microsecond speeds.

iostat(1M) pointed in the wrong direction for this application issue.

So what are the disks doing? In this example, the reads are mostly from other applications that are running on the same system as this MySQL database. The bursts of writes seen are ZFS transaction group flushes, which are batching writes from MySQL and the other applications for sending to disk later as a group. Some of the disk I/O are other file system types as described in part 1 as well. All of these details were confirmed using more DTrace (which could make for another blog post).

What it isn't

So what are causing the slow queries? It may seem like we haven't learned anything about the slow queries yet, but we have - it's not the disks. We can move onto searching other areas - the database itself, as well as where the time is spent during the slow query (on-CPU or off-CPU, as measured by mysqld_pid_slow.d). The time could be spent waiting on database locks, for example.

Quickly identifying what an issue isn't helps narrow the search to what it is.

Before you say...

There was disk I/O shown in the distribution - couldn't they all combine to cause a slow query? ... Not in this example: the sum of those disk I/O is between 169 and 338 milliseconds, which is a long way from causing a single slow query (over 1 second). If it were a closer call, I'd rewrite the DTrace script to print the sum of file system latency per query (more on this later).
Could the cache hits shown in the distribution combine to cause a slow query? ... Not in this example, but their sum does get closer. While fast, there were a lot of them. Again, file system latency can be expressed as a sum per query instead of a distribution to identify this with certainty.

Presentation

The above latency distributions were a neat way of presenting the data, but not the only way. As just mentioned, a different presentation of this data would be needed to really confirm that slow queries were caused by the file system: specifically, a sum of file system latency per query.

It was getting the latency data in the first place that was previously difficult - which DTrace now lets us do. The presentation of that data can be what we need to effectively answer questions, and DTrace lets us present it as totals, averages, min, max, and event-by-event data as well, if needed.

Distribution Script

The previous example traced I/O inside the application using the DTrace pid provider, and printed out a distribution plot of the file system latency. The tool that was executed, mysqld_pid_fslatency.d, is a DTrace script that is about 50 lines long (including comments). I've included it here, enumerated:

mysqld_pid_fslatency.d

     1  #!/usr/sbin/dtrace -s
     2  /*
     3   * mysqld_pid_fslatency.d  Print file system latency distribution.
     4   *
     5   * USAGE: ./mysqld_pid_fslatency.d -p mysqld_PID
     6   *
     7   * TESTED: these pid-provider probes may only work on some mysqld versions.
     8   *  5.0.51a: ok
     9   *
    10   * 27-Mar-2011  brendan.gregg@joyent.com
    11   */
    12  
    13  #pragma D option quiet
    14  
    15  dtrace:::BEGIN
    16  {
    17      printf("Tracing PID %d... Hit Ctrl-C to end.\n", $target);
    18  }
    19  
    20  pid$target::os_file_read:entry,
    21  pid$target::os_file_write:entry,
    22  pid$target::my_read:entry,
    23  pid$target::my_write:entry
    24  {
    25      self->start = timestamp;
    26  }
    27  
    28  pid$target::os_file_read:return  { this->dir = "read"; }
    29  pid$target::os_file_write:return { this->dir = "write"; }
    30  pid$target::my_read:return       { this->dir = "read"; }
    31  pid$target::my_write:return      { this->dir = "write"; }
    32  
    33  pid$target::os_file_read:return,
    34  pid$target::os_file_write:return,
    35  pid$target::my_read:return,
    36  pid$target::my_write:return
    37  /self->start/
    38  {
    39      @time[this->dir] = quantize(timestamp - self->start);
    40      @num = count();
    41      self->start = 0;
    42  }
    43  
    44  dtrace:::END
    45  {
    46      printa("MySQL filesystem I/O: %@d; latency (ns):\n", @num);
    47      printa(@time);
    48      clear(@time); clear(@num);
    49  }

This script traces functions in the mysql and innodb source that perform reads and writes to the file system: os_file_read(), os_file_write(), my_read() and my_write(). A little poking around this version of MySQL (5.0.51a) showed that these were causing all the file system I/O, which along with their function names and use in the code made them attractive locations to trace.

Script Caveats

Since this is pid provider-based, these functions are not considered a stable interface - and this script may not work on other versions of MySQL. This would happen if the function names were changed, removed, or other file system functions added. It shouldn't be too hard to locate the new locations if that has happened, and fix the script.

Besides the stability of this script, other caveats are:

Overhead: there will be additional overhead (extra CPU cycles) for DTrace to instrument MySQL and collect this data. This should be minimal, especially considering the code-path instrumented - file system I/O. See my post on pid provider overhead for more discussion on this type of overhead.
CPU latency: the times measured by this script include CPU dispatcher queue latency.

I'll explain that last note in more detail:

CPU Latency

These MySQL functions use system calls to perform the I/O, which will block and voluntarily context switch the thread off-CPU and put it to sleep until the I/O completes. When the I/O completes, the thread is woken up but it may need to wait its turn on-CPU if there are threads with a higher priority already running (it will probably have had a priority boost due to the scheduler to help its chances at preemption). Waiting its turn is the "CPU dispatcher queue latency", and if the CPUs are heavily saturated with work that could be milliseconds. This is included in the time that mysqld_pid_fslatency.d prints out.

Showing the time components with CPUs at saturation:

                +--------------------------------+ 
                |           Application          | 
                |             mysqld             | 
                +--------------------------------+ 
                     |     FS I/O latency     ^
                     | <--------------------> |      
                     |                        |   user-land
     -----------------------system calls ----------------------
                     |                        |   kernel
        +-------------------------------------------+
        | File       |                        |     |
        | System     |  I/O wait     CPU wait |     |
        |            V <--------->   <------> |     |
        |            X.............+----------+     |
        |            |             ^                |
        | thread is: | off-CPU     | runnable       |
        |            |             |                |
        +-------------------------------------------+
            disk I/O |             | disk I/O       
             request |             | completion
                     V             |           
                    .---------------.       
                    `---------------'       
                    |     disk      |          
                    `---------------'

While this could make interpreting the measured file system I/O latency confusing, you don't want to be running the system in this state to start with. If the CPUs are at saturation, then the application could be affected at random times with involuntary context switches, apart from the additional dispatcher queue latency at the end of I/O.

Identifying CPU saturation is usually straightforward with standard operating system tools (or at least, the lack of CPU %idle); the best tool on Solaris-based systems would be "prstat -mL" to examine the percent of time threads spent waiting on the CPU dispatcher queues ("LAT"). This is a much better measurement, as it also catches other cases that can't be seen via the lack of %idle (eg, dispatcher queue latency due to processes reaching their CPU caps).

Slow Query Logger

Apart from examining file system latency as a distribution, it may be desirable to express it as a portion of the query time, so that "slow" queries can be clearly attributed to file system latency. The following script does this, by tracing both the query latency and summing file system latency, for any query where the total file system latency is over a threshold:

mysqld_pid_fslatency_slowlog.d

     1  #!/usr/sbin/dtrace -s
     2  /*
     3   * mysqld_pid_fslatency_slowlog.d  Print slow filesystem I/O events.
     4   *
     5   * USAGE: ./mysql_pid_fslatency_slowlog.d mysqld_PID
     6   *
     7   * This traces mysqld filesystem I/O during queries, and prints output when
     8   * the total I/O time during a query was longer than the MIN_FS_LATENCY_MS
     9   * tunable.  This requires tracing every query, whether it performs FS I/O
    10   * or not, which may add a noticable overhead.
    11   *
    12   * TESTED: these pid-provider probes may only work on some mysqld versions.
    13   *  5.0.51a: ok
    14   *
    15   * 27-Mar-2011  brendan.gregg@joyent.com
    16   */
    17  
    18  #pragma D option quiet
    19  
    20  inline int MIN_FS_LATENCY_MS = 1000;
    21  
    22  dtrace:::BEGIN
    23  {
    24      min_ns = MIN_FS_LATENCY_MS * 1000000;
    25  }
    26  
    27  pid$1::*dispatch_command*:entry
    28  {
    29      self->q_start = timestamp;
    30      self->io_count = 0;
    31      self->total_ns = 0;
    32  }
    33  
    34  pid$1::os_file_read:entry,
    35  pid$1::os_file_write:entry,
    36  pid$1::my_read:entry,
    37  pid$1::my_write:entry
    38  /self->q_start/
    39  {
    40      self->fs_start = timestamp;
    41  }
    42  
    43  pid$1::os_file_read:return,
    44  pid$1::os_file_write:return,
    45  pid$1::my_read:return,
    46  pid$1::my_write:return
    47  /self->fs_start/
    48  {
    49      self->total_ns += timestamp - self->fs_start;
    50      self->io_count++;
    51      self->fs_start = 0;
    52  }
    53  
    54  pid$1::*dispatch_command*:return
    55  /self->q_start && (self->total_ns > min_ns)/
    56  {
    57      this->query = timestamp - self->q_start;
    58      printf("%Y filesystem I/O during query > %d ms: ", walltimestamp,
    59          MIN_FS_LATENCY_MS);
    60      printf("query %d ms, fs %d ms, %d I/O\n", this->query / 1000000,
    61          self->total_ns / 1000000, self->io_count);
    62  }
    63  
    64  pid$1::*dispatch_command*:return
    65  /self->q_start/
    66  {
    67      self->q_start = 0;
    68      self->io_count = 0;
    69      self->total_ns = 0;
    70  }

A key difference with this script is that it only examines file system I/O if they are called during a query, as checked on line 38 (which checks that a thread-local variable q_start was set, which is only true during a query). The previous script, mysqld_pid_fslatency.d, showed all file system I/O latency - whether it was during a query or for another task in the database.

To capture some sample output, I modified line 20 to reduce the threshold to 100 milliseconds:

# ./mysqld_pid_fslatency_slowlog.d 29952
2011 May 16 23:34:00 filesystem I/O during query > 100 ms: query 538 ms, fs 509 ms, 83 I/O
2011 May 16 23:34:11 filesystem I/O during query > 100 ms: query 342 ms, fs 303 ms, 75 I/O
2011 May 16 23:34:38 filesystem I/O during query > 100 ms: query 479 ms, fs 471 ms, 44 I/O
2011 May 16 23:34:58 filesystem I/O during query > 100 ms: query 153 ms, fs 152 ms, 1 I/O
2011 May 16 23:35:09 filesystem I/O during query > 100 ms: query 383 ms, fs 372 ms, 72 I/O
2011 May 16 23:36:09 filesystem I/O during query > 100 ms: query 406 ms, fs 344 ms, 109 I/O
2011 May 16 23:36:44 filesystem I/O during query > 100 ms: query 343 ms, fs 319 ms, 75 I/O
2011 May 16 23:36:54 filesystem I/O during query > 100 ms: query 196 ms, fs 185 ms, 59 I/O
2011 May 16 23:37:10 filesystem I/O during query > 100 ms: query 254 ms, fs 209 ms, 83 I/O

In the few minutes this was running, there were 9 queries longer than 100 milliseconds due to file system I/O. With this output, we can immediately identify the reason for those slow queries - they spent most of their time waiting on the file system. Reaching this conclusion with other tools is much more difficult and time consuming - if possible (and practical) at all.

DTrace can be used to positively identify slow queries caused by file system latency.

But this is about more than DTrace - it's about the metric itself - file system latency. Since this has been of tremendous use so far, it may make sense to add file system latency to the slow query log (requiring a MySQL source code change). If you are on MySQL 5.6 and later, it looks like you may also be able to get this information from the wait/io events from the new performance schema additions (I haven't tried). If that isn't viable, or you are on older MySQL, or a different application entirerly (MySQL was just my example application), you can keep using DTrace to dynamically fetch this information from the running source.

Interpreting Totals

For the queries in the above output, most of the query latency is due to file system latency (eg, 509 ms out of 538 ms = 95%). The last column printed out the I/O count, which helps answer the next question: is the file system latency due to many I/O, or a few slow I/O? For example:

The first line of output showed 509 milliseconds of file system latency from 83 I/O, which works out to be about 6 milliseconds on average. Just based on the average, this could mean that most of these were cache misses causing random disk I/O. The next step may be to investigate doing fewer file system I/O in the first place, by caching more in MySQL.
The fourth line of output shows 152 milliseconds of file system latency from a single I/O. This line of output is more alarming than any of the others, as it shows the file system returning with very high latency (this system is not at CPU saturation). Fortunately, this may be an isolated event.

If those descriptions sound a little vague, that's because we've lost so much data when summarizing as an I/O count and a total latency. The script has achieved its goal - of identifying the issue - but to investigate the issue further I need to return to the distribution plots like those used by mysqld_pid_fslatency.d. By examining the full distribution, I could confirm if the 152 ms I/O was as isolated as it looks, or if in fact slow I/O are responsible for all of the latency seen above (eg, for the first line, was it 3 slow I/O + 80 fast I/O = 83 I/O?).

Script Caveats

Caveats and notes for this script are:

This script has a higher level of overhead than mysqld_pid_fslatency.d, since it is tracing all queries via the dispatch_command() function, and not just file system functions.
CPU latency: the file system time measured by this script include CPU dispatcher queue latency (as explained earlier).
The dispatch_command() function is matched as "*dispatch_command*", since the full function name is the C++ signature (eg, "_Z16dispatch_command19enum_server_commandP3THDPcj", for this build of MySQL).
Instead of using "-p PID" at the command line, this script fed in the PID as $1 (macro variable). This was done as the script is intended to be left running for a while (hours), during which it may be desirable to continue investigating MySQL with other DTrace programs. Only one "-p PID" script can be run at a time ("-p" locks the process, and other instances get the error "process is traced"). By using $1 for the long running script, shorter -p based ones can be run in the meantime.

I'll finish with one final note about these scripts: since they are pid provider-based, they can work from inside Solaris Zones or Joyent SmartMachines. A minor exception is with how I executed the first script: I added "-n 'tick-10s { exit(0); }'" at the command line to exit after 10 seconds, which currently doesn't work reliably in those environments due to a bug (tick only sometimes fires). Until that bug is fixed, drop that statement from the command line and it will still work fine, and require a Ctrl-C to end tracing.

Considering File System Latency

By examining latency at the file system, application issues can be immediately identified as file system (and probably disk) based, or not. This can set the investigation down the right path immediately, and not the wrong direction as was being suggested by iostat(1M)'s view of busy disks.

Two pid provider-based DTrace scripts were introduced in this post to do this: mysqld_pid_fslatency.d for summarizing the distribution of file system latency, and mysqld_pid_fslatency_slowlog.d for printing slow queries due to the file system.

The pid provider isn't the only way to measure file system latency: it's also possible from the syscall layer and from the file system code in the kernel. I'll demonstrate those methods in the next post, and discuss how they differ to the pid provider method.

Brendan Gregg's Blog