I originally posted this at http://dtrace.org/blogs/brendan/2011/06/27/viewing-the-invisible.
Some of my posts are long and carefully explained, like my file system latency series, which ran to five parts. This post is different: here I'll share an advanced topic for an advanced audience, and let the screenshots do most of the talking.
This assumes you know OS internals, x86 assembly, mdb and DTrace. Do not try any of the following without understanding and accepting the risk of using these tools.
The problem
I found some unexplained behavior while chasing down a ~10% performance loss in MySQL (after enabling the performance schema). Profiling showed that my_timer_cycles() was on-CPU more often due to the config change, so I traced its execution time:
# dtrace -n 'pid$target::my_timer_cycles:entry { self->vs = vtimestamp; }
    pid$target::my_timer_cycles:return /self->vs/ {
        @["ns"] = quantize(vtimestamp - self->vs); self->vs = 0; }
    tick-10s { exit(0); }' -p 1711
dtrace: description 'pid$target::my_timer_cycles:entry ' matched 3 probes
CPU     ID                    FUNCTION:NAME
  3  75844                        :tick-10s

  ns
           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |                                         3
             128 |                                         5
             256 |@@@@@@@@@@@@@@@@@@                       1783997
             512 |@@@@@@@@@@@@@@@@@@@@@@                   2183562
            1024 |                                         38028
            2048 |                                         5706
            4096 |                                         7538
            8192 |                                         7103
           16384 |                                         4551
           32768 |                                         2058
           65536 |                                         508
          131072 |                                         213
          262144 |                                         39
          524288 |                                         10
         1048576 |                                         1
         2097152 |                                         0
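(As an aside: the profiling that identified my_timer_cycles() in the first place can be done with a one-liner along these lines. This is a sketch, not the exact command I ran - it samples user-level PCs at 997 Hertz and prints the ten hottest functions; the tick-10s and trunc() are my embellishments here:)

# dtrace -n 'profile-997 /pid == $target && arg1/ { @[ufunc(arg1)] = count(); }
    tick-10s { trunc(@, 10); exit(0); }' -p 1711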
Ok, so it's on-CPU because it's called a boat-load of times.
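(To put a number on "boat-load", a quick sketch that counts the call rate per second:)

# dtrace -n 'pid$target::my_timer_cycles:entry { @ = count(); }
    tick-1s { printa("calls/sec: %@d\n", @); trunc(@); }' -p 1711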
But why does it sometimes reach 1 ms and beyond?
# echo my_timer_cycles::dis | mdb -p 1711
my_timer_cycles:                pushq  %rbp
my_timer_cycles+1:              movq   %rsp,%rbp
my_timer_cycles+4:              xorl   %eax,%eax
my_timer_cycles+6:              rdtsc
my_timer_cycles+8:              shlq   $0x20,%rdx
my_timer_cycles+0xc:            orq    %rdx,%rax
my_timer_cycles+0xf:            leave
my_timer_cycles+0x10:           ret
Hmmm. That's about as register-bound as they come (rdtsc reads the CPU's time stamp counter into edx:eax; the shlq/orq combine them into one 64-bit result).
Now it's really interesting.
I'll focus here on investigating why this function can take 1 ms. It performs no loads or stores - it's entirely register-bound - so the usual suspects, such as CPU memory and MMU stall cycles, are unlikely to be at play (even if it did take an instruction cache miss, it's hard to explain how a function called this often would end up taking 1 ms).
Whatever could be happening isn't visible in the function above.
Hypothesis
There could be kernel-level events interrupting the execution of this function: hardware device interrupts, cross calls, and involuntary context switches (or simply, "interruptions").
# mpstat 1
[...]
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0 18849 16847 14813  922 2118 1458    0 35294   29  55   0  16
  1    0   0    0  1503   159 14085  637 1680  483    0 40124   44  18   0  39
  2    0   0 1416  2192   211 21684 1398 2830  836    0 49985   44  20   0  36
  3    0   0    0  1486   144 13019  511 1394  470    0 38281   44  16   0  39
  4    0   0    0  2079   149 20340 1189 2426  797    0 47753   40  19   0  41
  5    0   0    0  1305   147 12126  467 1199  454    0 33649   37  15   0  47
  6    0   0    0  1868   197 18549  927 1974  760    0 43298   45  16   0  39
  7    0   0    0  1275   159  9103  317  892  301    0 26289   43  12   0  44
  8    0   0    0  1644   165 15883  845 1768  520    0 39550   41  19   0  41
  9    0   0    0  1357   183 11293  371 1076  414    0 32063   30  15   0  55
 10    0   0    0  1588   158 16086  710 1556  633    0 38003   37  17   0  46
 11    2   0    0   922   109  7679  213  717  322    0 23071   41  11   0  48
 12    0   0    0  1759   240 15800  775 1560  623    0 37871   37  15   0  48
 13    0   0   15  1070   259  7573  268  799  274    0 22658   34  13   0  53
 14    0   0 7784  1326   450 15993  681 1497  607    0 37083   32  16   0  52
 15    0   0    0 17247 15953 10697  372  979  397    0 33010   38  21   0  40
and:
# dtrace -n 'pid$target::my_timer_cycles:entry { self->vs = vtimestamp; }
    pid$target::my_timer_cycles:return /self->vs/ {
        @["CPU:", cpu] = quantize(vtimestamp - self->vs); self->vs = 0; }
    tick-10s { exit(0); }' -p 1711
dtrace: description 'pid$target::my_timer_cycles:entry ' matched 3 probes
CPU     ID                    FUNCTION:NAME
  4  75844                        :tick-10s

  CPU:                                                     13
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |@@@@@@@@@@@@@@@@@                        39591
             512 |@@@@@@@@@@@@@@@@@@@@@@@                  53690
            1024 |                                         1147
            2048 |                                         85
            4096 |                                         155
            8192 |                                         87
           16384 |                                         2
           32768 |                                         0
[...]
  CPU:                                                     15
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |@@@@@@@@@@@@@@@@@@@                      96785
             512 |@@@@@@@@@@@@@@@@@@@                      97022
            1024 |@                                        7105
            2048 |@                                        3272
            4096 |@                                        2955
            8192 |                                         1077
           16384 |                                         214
           32768 |                                         46
           65536 |                                         0
[...]
  CPU:                                                      2
           value  ------------- Distribution ------------- count
              64 |                                         0
             128 |                                         1
             256 |@@@@@@@@@@@@@@@@@@@@                     168860
             512 |@@@@@@@@@@@@@@@@@@@                      164438
            1024 |@                                        5473
            2048 |                                         3071
            4096 |                                         275
            8192 |                                         264
           16384 |                                         22
           32768 |                                         4
           65536 |                                         3
          131072 |                                         0

  CPU:                                                      0
           value  ------------- Distribution ------------- count
              64 |                                         0
             128 |                                         1
             256 |@@@@@@@@@@@@@@@@@@@                      129491
             512 |@@@@@@@@@@@@@@@@@@@                      130355
            1024 |@                                        5547
            2048 |                                         2007
            4096 |                                         630
            8192 |                                         2998
           16384 |@                                        3547
           32768 |                                         1656
           65536 |                                         415
          131072 |                                         153
          262144 |                                         46
          524288 |                                         13
         1048576 |                                         1
         2097152 |                                         0
Ah-hah!
What's on CPU 0?
# intrstat 1
      device |      cpu0 %tim      cpu1 %tim      cpu2 %tim      cpu3 %tim
-------------+------------------------------------------------------------
       bnx#0 |         0  0.0        16  0.0         0  0.0         0  0.0
      ehci#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      ehci#1 |         0  0.0         0  0.0         0  0.0         0  0.0
       igb#0 |         0  0.0         0  0.0         0  0.0         0  0.0
       igb#2 |     16889 37.7         0  0.0         0  0.0         0  0.0
  mega_sas#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#1 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#2 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#3 |         0  0.0         0  0.0         0  0.0         0  0.0
[...]
      device |     cpu12 %tim     cpu13 %tim     cpu14 %tim     cpu15 %tim
-------------+------------------------------------------------------------
       bnx#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      ehci#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      ehci#1 |         0  0.0         0  0.0         0  0.0         0  0.0
       igb#0 |         0  0.0        11  0.0       155  0.1         0  0.0
       igb#2 |         0  0.0         0  0.0         0  0.0     16453  7.6
  mega_sas#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#1 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#2 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#3 |         0  0.0         0  0.0         0  0.0         0  0.0
[...]
Context
While mpstat(1M) and intrstat(1M) show interrupts system-wide, I'd like to see them in application context, to measure their effect on this user-level function. However, application debuggers and performance analysis tools typically only show what is happening in user-land: their visibility ends at the syscall boundary.
Kernel-level events are typically invisible to application performance tools.
Fortunately DTrace can do it; I'll start by trying the sdt:::interrupt-start probe, which is what intrstat(1M) uses:
# dtrace -n 'sdt:::interrupt-start { @[execname] = count(); }'
dtrace: description 'sdt:::interrupt-start ' matched 1 probe
^C

  sched                                                         55608
No - every interrupt is attributed to "sched" (the kernel), not the process that was interrupted. Where is this probe firing from?
# dtrace -n 'sdt:::interrupt-start { @[stack()] = count(); }'
[...]
              unix`dispatch_hilevel+0x1f
              unix`switch_sp_and_call+0x13
              unix`do_interrupt+0xde
              unix`_interrupt+0xb8
            13030

              unix`dispatch_hardint+0x33
              unix`switch_sp_and_call+0x13
            93038
I need a probe that fires before switch_sp_and_call(), while the interrupted thread is still the current context. do_interrupt()?
# dtrace -n 'fbt::do_interrupt:entry { @[execname] = count(); }'
dtrace: description 'fbt::do_interrupt:entry ' matched 1 probe
^C

  poold                                                             8
  dtrace                                                           60
  rotatelogs                                                       66
  zeus.zxtm                                                       225
  zabbix_agentd                                                   682
  ruby                                                           2555
  memcached                                                      4970
  sysbench                                                      13966
  httpd                                                         28736
  sched                                                         44916
  mysqld                                                        92089
That's better.
do_interrupt() runs with interrupts disabled -- it feels a little strange to be able to trace this at all.
Viewing the Invisible
Counting interrupts during function execution:
# dtrace -n 'pid$target::my_timer_cycles:entry { self->vs = vtimestamp; }
    fbt::do_interrupt:entry /self->vs/ { self->i++; }
    pid$target::my_timer_cycles:return /self->vs/ {
        @[self->i] = quantize(vtimestamp - self->vs);
        self->vs = 0; self->i = 0; }' -p 1711
dtrace: description 'pid$target::my_timer_cycles:entry ' matched 3 probes
^C
        7
           value  ------------- Distribution ------------- count
          131072 |                                         0
          262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           3
          524288 |@@@@@@@@@@                               1
         1048576 |                                         0

        8
           value  ------------- Distribution ------------- count
          131072 |                                         0
          262144 |@@@@@@@@@@@@@                            1
          524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              2
         1048576 |                                         0

        6
           value  ------------- Distribution ------------- count
           65536 |                                         0
          131072 |@@@@@@@                                  2
          262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              8
          524288 |@@@@@@@                                  2
         1048576 |                                         0

        5
           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |@@                                       1
          131072 |@@@@@@@@@@@@@@@@@@@@                     9
          262144 |@@@@@@@@@@@@@@@@                         7
          524288 |                                         0
         1048576 |@@                                       1
         2097152 |                                         0

        4
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@@                                       3
           16384 |@@                                       2
           32768 |                                         0
           65536 |@@                                       2
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@               32
          262144 |@@@@@@@@                                 10
          524288 |                                         0

        3
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@                                        2
            8192 |@@                                       5
           16384 |@@                                       6
           32768 |@                                        3
           65536 |@@@@@@@@@@@@@@@@@@@@                     59
          131072 |@@@@@@@@@@@@@@                           40
          262144 |                                         1
          524288 |                                         0

        2
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |                                         4
            4096 |@                                        14
            8192 |@@@@@@@@                                 94
           16384 |@@@@@                                    57
           32768 |@@@@                                     43
           65536 |@@@@@@@@@@@@@@@@@@@@@                    240
          131072 |@                                        9
          262144 |                                         0
          524288 |                                         1
         1048576 |                                         0

        1
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@                                       641
            2048 |@@@@@@@@                                 3332
            4096 |@@@@@@@@@@@                              4230
            8192 |@@@@@@@@@@                               4070
           16384 |@@@@@@@                                  2611
           32768 |@@@                                      1082
           65536 |                                         22
          131072 |                                         9
          262144 |                                         2
          524288 |                                         0

        0
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |@@@@@@@@@@@@@@@@@@                       921978
             512 |@@@@@@@@@@@@@@@@@@@@@                    1083485
            1024 |                                         17032
            2048 |                                         489
            4096 |                                         498
            8192 |                                         124
           16384 |                                         0
!!!
# dtrace -n 'pid$target::my_timer_cycles:entry { self->vs = vtimestamp; }
    fbt::do_interrupt:entry /self->vs/ { self->i++; }
    pid$target::my_timer_cycles:return /self->vs/ {
        @[self->i] = avg(vtimestamp - self->vs);
        self->vs = 0; self->i = 0; }' -p 1711
dtrace: description 'pid$target::my_timer_cycles:entry ' matched 3 probes
^C
        0              545
        1            11748
        2            67950
        3           146795
        4           190061
        5           271303
        6           295016
        7           463835
        8           823128
       10           878304
!
At some level this is expected: a network interface on CPU 0 (igb#2) has a high rate of interrupts, which sometimes slows this function. The more interrupts that occur during a single execution of the function, the slower it is - the averages above climb from 545 ns with no interrupts to over 0.8 ms with eight or more.
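To take this one step further, the time the interrupts themselves consume could be summed while the function is running, to show how much of the latency they explain directly. A sketch, assuming this kernel provides an fbt return probe for do_interrupt() (fbt return probes can be elided by tail-call optimization; check first with dtrace -ln 'fbt::do_interrupt:return'):

# dtrace -n '
    pid$target::my_timer_cycles:entry { self->vs = vtimestamp; self->intr = 0; }
    fbt::do_interrupt:entry /self->vs/ { self->ds = timestamp; }
    fbt::do_interrupt:return /self->ds/ {
        self->intr += timestamp - self->ds; self->ds = 0; }
    pid$target::my_timer_cycles:return /self->vs/ {
        @["interrupt ns"] = quantize(self->intr);
        self->vs = 0; self->intr = 0; }' -p 1711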
What's amazing is that it's possible to see this so clearly.
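If you'd like to try this yourself, here is the counting one-liner rewritten as a standalone script with comments - a sketch; I've called it intrcount.d (my name, not a shipped tool), run as dtrace -s intrcount.d -p PID:

#!/usr/sbin/dtrace -s
/*
 * intrcount.d - distribution of my_timer_cycles() on-CPU latency,
 * keyed by the number of kernel interrupts that arrived during each call.
 */

pid$target::my_timer_cycles:entry
{
	self->vs = vtimestamp;	/* on-CPU start time, excluding DTrace overhead */
}

fbt::do_interrupt:entry
/self->vs/
{
	self->i++;		/* an interrupt landed mid-function */
}

pid$target::my_timer_cycles:return
/self->vs/
{
	@[self->i] = quantize(vtimestamp - self->vs);
	self->vs = 0;
	self->i = 0;
}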