I originally posted this at http://dtrace.org/blogs/brendan/2011/06/27/viewing-the-invisible.
Some of my posts are long and carefully explained, like my file system latency series, which ran to five parts. This post is different: here I'll share an advanced topic for an advanced audience, and let the screenshots do most of the talking.
This assumes you know OS internals, x86 assembly, mdb and DTrace. Do not try any of the following without understanding and accepting the risk of using these tools.
The problem
I found some unexplained behavior while chasing down a ~10% performance loss in MySQL (after enabling the performance schema). Profiling showed that my_timer_cycles() was on-CPU more often due to the config change, so I traced its execution time:
# dtrace -n 'pid$target::my_timer_cycles:entry { self->vs = vtimestamp; }
    pid$target::my_timer_cycles:return /self->vs/ {
        @["ns"] = quantize(vtimestamp - self->vs); self->vs = 0; }
    tick-10s { exit(0); }' -p 1711
dtrace: description 'pid$target::my_timer_cycles:entry ' matched 3 probes
CPU     ID                    FUNCTION:NAME
  3  75844                        :tick-10s

  ns
           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |                                         3
             128 |                                         5
             256 |@@@@@@@@@@@@@@@@@@                       1783997
             512 |@@@@@@@@@@@@@@@@@@@@@@                   2183562
            1024 |                                         38028
            2048 |                                         5706
            4096 |                                         7538
            8192 |                                         7103
           16384 |                                         4551
           32768 |                                         2058
           65536 |                                         508
          131072 |                                         213
          262144 |                                         39
          524288 |                                         10
         1048576 |                                         1
         2097152 |                                         0
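(As an aside: the profiling that identified my_timer_cycles() in the first place can be done with a one-liner along these lines. This is a sketch, not the exact command I ran - it samples user-level PCs at 997 Hertz and prints the ten hottest functions; the tick-10s and trunc() are my embellishments here:)

# dtrace -n 'profile-997 /pid == $target && arg1/ { @[ufunc(arg1)] = count(); }
    tick-10s { trunc(@, 10); exit(0); }' -p 1711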
Ok, so it's on-CPU because it's called a boat-load of times.
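(To put a number on "boat-load", a quick sketch that counts the call rate per second:)

# dtrace -n 'pid$target::my_timer_cycles:entry { @ = count(); }
    tick-1s { printa("calls/sec: %@d\n", @); trunc(@); }' -p 1711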
But why does it sometimes reach 1 ms and beyond?
# echo my_timer_cycles::dis | mdb -p 1711
my_timer_cycles:                pushq  %rbp
my_timer_cycles+1:              movq   %rsp,%rbp
my_timer_cycles+4:              xorl   %eax,%eax
my_timer_cycles+6:              rdtsc
my_timer_cycles+8:              shlq   $0x20,%rdx
my_timer_cycles+0xc:            orq    %rdx,%rax
my_timer_cycles+0xf:            leave
my_timer_cycles+0x10:           ret
Hmmm. That's about as register-bound as they come (rdtsc reads the CPU's time stamp counter into edx:eax; the shlq/orq combine them into one 64-bit result).
Now it's really interesting.
I'll focus here on investigating why this function can take 1 ms. It performs no loads or stores - it's entirely register-bound - so the usual suspects, such as CPU memory and MMU stall cycles, are unlikely to be at play (even if it did take an instruction cache miss, it's hard to explain how a function called this often would end up taking 1 ms).
Whatever could be happening isn't visible in the function above.
Hypothesis
There could be kernel-level events interrupting the execution of this function: hardware device interrupts, cross calls, and involuntary context switches (or simply, "interruptions").
# mpstat 1
[...]
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0 18849 16847 14813  922 2118 1458    0 35294   29  55   0  16
  1    0   0    0  1503   159 14085  637 1680  483    0 40124   44  18   0  39
  2    0   0 1416  2192   211 21684 1398 2830  836    0 49985   44  20   0  36
  3    0   0    0  1486   144 13019  511 1394  470    0 38281   44  16   0  39
  4    0   0    0  2079   149 20340 1189 2426  797    0 47753   40  19   0  41
  5    0   0    0  1305   147 12126  467 1199  454    0 33649   37  15   0  47
  6    0   0    0  1868   197 18549  927 1974  760    0 43298   45  16   0  39
  7    0   0    0  1275   159  9103  317  892  301    0 26289   43  12   0  44
  8    0   0    0  1644   165 15883  845 1768  520    0 39550   41  19   0  41
  9    0   0    0  1357   183 11293  371 1076  414    0 32063   30  15   0  55
 10    0   0    0  1588   158 16086  710 1556  633    0 38003   37  17   0  46
 11    2   0    0   922   109  7679  213  717  322    0 23071   41  11   0  48
 12    0   0    0  1759   240 15800  775 1560  623    0 37871   37  15   0  48
 13    0   0   15  1070   259  7573  268  799  274    0 22658   34  13   0  53
 14    0   0 7784  1326   450 15993  681 1497  607    0 37083   32  16   0  52
 15    0   0    0 17247 15953 10697  372  979  397    0 33010   38  21   0  40
and:
# dtrace -n 'pid$target::my_timer_cycles:entry { self->vs = vtimestamp; }
    pid$target::my_timer_cycles:return /self->vs/ {
        @["CPU:", cpu] = quantize(vtimestamp - self->vs); self->vs = 0; }
    tick-10s { exit(0); }' -p 1711
dtrace: description 'pid$target::my_timer_cycles:entry ' matched 3 probes
CPU     ID                    FUNCTION:NAME
  4  75844                        :tick-10s

  CPU:                                                     13
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |@@@@@@@@@@@@@@@@@                        39591
             512 |@@@@@@@@@@@@@@@@@@@@@@@                  53690
            1024 |                                         1147
            2048 |                                         85
            4096 |                                         155
            8192 |                                         87
           16384 |                                         2
           32768 |                                         0
[...]
  CPU:                                                     15
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |@@@@@@@@@@@@@@@@@@@                      96785
             512 |@@@@@@@@@@@@@@@@@@@                      97022
            1024 |@                                        7105
            2048 |@                                        3272
            4096 |@                                        2955
            8192 |                                         1077
           16384 |                                         214
           32768 |                                         46
           65536 |                                         0
[...]
  CPU:                                                      2
           value  ------------- Distribution ------------- count
              64 |                                         0
             128 |                                         1
             256 |@@@@@@@@@@@@@@@@@@@@                     168860
             512 |@@@@@@@@@@@@@@@@@@@                      164438
            1024 |@                                        5473
            2048 |                                         3071
            4096 |                                         275
            8192 |                                         264
           16384 |                                         22
           32768 |                                         4
           65536 |                                         3
          131072 |                                         0

  CPU:                                                      0
           value  ------------- Distribution ------------- count
              64 |                                         0
             128 |                                         1
             256 |@@@@@@@@@@@@@@@@@@@                      129491
             512 |@@@@@@@@@@@@@@@@@@@                      130355
            1024 |@                                        5547
            2048 |                                         2007
            4096 |                                         630
            8192 |                                         2998
           16384 |@                                        3547
           32768 |                                         1656
           65536 |                                         415
          131072 |                                         153
          262144 |                                         46
          524288 |                                         13
         1048576 |                                         1
         2097152 |                                         0
Ah-hah!
What's on CPU 0?
# intrstat 1
      device |      cpu0 %tim      cpu1 %tim      cpu2 %tim      cpu3 %tim
-------------+------------------------------------------------------------
       bnx#0 |         0  0.0        16  0.0         0  0.0         0  0.0
      ehci#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      ehci#1 |         0  0.0         0  0.0         0  0.0         0  0.0
       igb#0 |         0  0.0         0  0.0         0  0.0         0  0.0
       igb#2 |     16889 37.7         0  0.0         0  0.0         0  0.0
  mega_sas#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#1 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#2 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#3 |         0  0.0         0  0.0         0  0.0         0  0.0
[...]
      device |     cpu12 %tim     cpu13 %tim     cpu14 %tim     cpu15 %tim
-------------+------------------------------------------------------------
       bnx#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      ehci#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      ehci#1 |         0  0.0         0  0.0         0  0.0         0  0.0
       igb#0 |         0  0.0        11  0.0       155  0.1         0  0.0
       igb#2 |         0  0.0         0  0.0         0  0.0     16453  7.6
  mega_sas#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#1 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#2 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#3 |         0  0.0         0  0.0         0  0.0         0  0.0
[...]
Context
While mpstat(1M) and intrstat(1M) show interrupts system-wide, I'd like to see them in application context, to measure their effect on this user-level function. However, application debuggers and performance analysis tools typically only show what is happening in user-land: their visibility ends at the syscall boundary.
Kernel-level events are typically invisible to application performance tools.
Fortunately DTrace can do it; I'll start by trying the sdt:::interrupt-start probe, which is what intrstat(1M) uses:
# dtrace -n 'sdt:::interrupt-start { @[execname] = count(); }'
dtrace: description 'sdt:::interrupt-start ' matched 1 probe
^C

  sched                                                         55608
No - every interrupt is attributed to "sched" (the kernel), not the process that was interrupted. Where is this probe firing from?
# dtrace -n 'sdt:::interrupt-start { @[stack()] = count(); }'
[...]
              unix`dispatch_hilevel+0x1f
              unix`switch_sp_and_call+0x13
              unix`do_interrupt+0xde
              unix`_interrupt+0xb8
            13030

              unix`dispatch_hardint+0x33
              unix`switch_sp_and_call+0x13
            93038
I need a probe that fires before switch_sp_and_call(), while the interrupted thread is still the current context. do_interrupt()?
# dtrace -n 'fbt::do_interrupt:entry { @[execname] = count(); }'
dtrace: description 'fbt::do_interrupt:entry ' matched 1 probe
^C

  poold                                                             8
  dtrace                                                           60
  rotatelogs                                                       66
  zeus.zxtm                                                       225
  zabbix_agentd                                                   682
  ruby                                                           2555
  memcached                                                      4970
  sysbench                                                      13966
  httpd                                                         28736
  sched                                                         44916
  mysqld                                                        92089
That's better.
do_interrupt() runs with interrupts disabled -- it feels a little strange to be able to trace this at all.
Viewing the Invisible
Counting interrupts during function execution:
# dtrace -n 'pid$target::my_timer_cycles:entry { self->vs = vtimestamp; }
    fbt::do_interrupt:entry /self->vs/ { self->i++; }
    pid$target::my_timer_cycles:return /self->vs/ {
        @[self->i] = quantize(vtimestamp - self->vs);
        self->vs = 0; self->i = 0; }' -p 1711
dtrace: description 'pid$target::my_timer_cycles:entry ' matched 3 probes
^C
        7
           value  ------------- Distribution ------------- count
          131072 |                                         0
          262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           3
          524288 |@@@@@@@@@@                               1
         1048576 |                                         0

        8
           value  ------------- Distribution ------------- count
          131072 |                                         0
          262144 |@@@@@@@@@@@@@                            1
          524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              2
         1048576 |                                         0

        6
           value  ------------- Distribution ------------- count
           65536 |                                         0
          131072 |@@@@@@@                                  2
          262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              8
          524288 |@@@@@@@                                  2
         1048576 |                                         0

        5
           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |@@                                       1
          131072 |@@@@@@@@@@@@@@@@@@@@                     9
          262144 |@@@@@@@@@@@@@@@@                         7
          524288 |                                         0
         1048576 |@@                                       1
         2097152 |                                         0

        4
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@@                                       3
           16384 |@@                                       2
           32768 |                                         0
           65536 |@@                                       2
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@               32
          262144 |@@@@@@@@                                 10
          524288 |                                         0

        3
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@                                        2
            8192 |@@                                       5
           16384 |@@                                       6
           32768 |@                                        3
           65536 |@@@@@@@@@@@@@@@@@@@@                     59
          131072 |@@@@@@@@@@@@@@                           40
          262144 |                                         1
          524288 |                                         0

        2
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |                                         4
            4096 |@                                        14
            8192 |@@@@@@@@                                 94
           16384 |@@@@@                                    57
           32768 |@@@@                                     43
           65536 |@@@@@@@@@@@@@@@@@@@@@                    240
          131072 |@                                        9
          262144 |                                         0
          524288 |                                         1
         1048576 |                                         0

        1
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@                                       641
            2048 |@@@@@@@@                                 3332
            4096 |@@@@@@@@@@@                              4230
            8192 |@@@@@@@@@@                               4070
           16384 |@@@@@@@                                  2611
           32768 |@@@                                      1082
           65536 |                                         22
          131072 |                                         9
          262144 |                                         2
          524288 |                                         0

        0
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |@@@@@@@@@@@@@@@@@@                       921978
             512 |@@@@@@@@@@@@@@@@@@@@@                    1083485
            1024 |                                         17032
            2048 |                                         489
            4096 |                                         498
            8192 |                                         124
           16384 |                                         0
!!!
# dtrace -n 'pid$target::my_timer_cycles:entry { self->vs = vtimestamp; }
    fbt::do_interrupt:entry /self->vs/ { self->i++; }
    pid$target::my_timer_cycles:return /self->vs/ {
        @[self->i] = avg(vtimestamp - self->vs);
        self->vs = 0; self->i = 0; }' -p 1711
dtrace: description 'pid$target::my_timer_cycles:entry ' matched 3 probes
^C
        0              545
        1            11748
        2            67950
        3           146795
        4           190061
        5           271303
        6           295016
        7           463835
        8           823128
       10           878304
!
At some level this is expected: a network interface on CPU 0 (igb#2) has a high rate of interrupts, which sometimes slows this function. The more interrupts that occur during a single execution of the function, the slower it is - the averages above climb from 545 ns with no interrupts to over 0.8 ms with eight or more.
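To take this one step further, the time the interrupts themselves consume could be summed while the function is running, to show how much of the latency they explain directly. A sketch, assuming this kernel provides an fbt return probe for do_interrupt() (fbt return probes can be elided by tail-call optimization; check first with dtrace -ln 'fbt::do_interrupt:return'):

# dtrace -n '
    pid$target::my_timer_cycles:entry { self->vs = vtimestamp; self->intr = 0; }
    fbt::do_interrupt:entry /self->vs/ { self->ds = timestamp; }
    fbt::do_interrupt:return /self->ds/ {
        self->intr += timestamp - self->ds; self->ds = 0; }
    pid$target::my_timer_cycles:return /self->vs/ {
        @["interrupt ns"] = quantize(self->intr);
        self->vs = 0; self->intr = 0; }' -p 1711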
What's amazing is that it's possible to see this so clearly.
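If you'd like to try this yourself, here is the counting one-liner rewritten as a standalone script with comments - a sketch; I've called it intrcount.d (my name, not a shipped tool), run as dtrace -s intrcount.d -p PID:

#!/usr/sbin/dtrace -s
/*
 * intrcount.d - distribution of my_timer_cycles() on-CPU latency,
 * keyed by the number of kernel interrupts that arrived during each call.
 */

pid$target::my_timer_cycles:entry
{
	self->vs = vtimestamp;	/* on-CPU start time, excluding DTrace overhead */
}

fbt::do_interrupt:entry
/self->vs/
{
	self->i++;		/* an interrupt landed mid-function */
}

pid$target::my_timer_cycles:return
/self->vs/
{
	@[self->i] = quantize(vtimestamp - self->vs);
	self->vs = 0;
	self->i = 0;
}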