Hist triggers is a new tracing feature that recently landed in Linux 4.7-rc1. It allows the creation of custom, efficient, in-kernel histograms. Cue some screenshots!
syscalls by process name and PID
# echo 'hist:key=common_pid.execname' > \
/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
[...wait some seconds...]
# cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
# event histogram
#
# trigger info: hist:keys=common_pid.execname:vals=hitcount:sort=hitcount:size=2048 [active]
#
{ common_pid: sshd [ 28089] } hitcount: 5
{ common_pid: jps [ 27941] } hitcount: 15
[...]
{ common_pid: bash [ 32754] } hitcount: 833
{ common_pid: supervise [ 2060] } hitcount: 1824
{ common_pid: supervise [ 2062] } hitcount: 1824
{ common_pid: supervise [ 2064] } hitcount: 1824
{ common_pid: dumpsystemstats [ 27909] } hitcount: 2691
{ common_pid: sshd [ 32745] } hitcount: 3761
{ common_pid: jps [ 27914] } hitcount: 3957
{ common_pid: snmpd [ 1617] } hitcount: 4854
{ common_pid: dumpsystemstats [ 27940] } hitcount: 9671
{ common_pid: readproctitle [ 2054] } hitcount: 19296
{ common_pid: dumpsystemstats [ 27926] } hitcount: 51386
Totals:
Hits: 219519
Entries: 764
Dropped: 0
# echo '!hist:key=common_pid.execname' > \
/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
The output shows that during tracing, the dumpsystemstats process (PID 27926) did 51,386 syscalls.
The three commands used were:
- echo 'hist:config' > probe/trigger: set the histogram config for the given probe, and enable tracing
- cat probe/hist: prints the current histogram counts
- echo '!hist:config' > probe/trigger: disable tracing
If this interface seems new and bizarre to you, then you might also be new to the little-known Linux ftrace tracer (now just called "trace"), which has existed since Linux 2.6.27. This is how trace operates: echoing funny strings to /sys locations. Hist triggers is an addition to trace.
I use trace frequently, and still find the interface a bit bizarre, but it's not really a problem. I'm almost always using it via a front-end wrapper, like my perf-tools or trace-cmd.
For perf-tools I wrote syscount, which can do the same summary as above: syscalls by process. But it used older Linux capabilities, where I had to dump all syscall events to a perf.data file and post-process in user space, costing much overhead. Post Linux 4.7, I can use hist triggers, making it cheap to instrument.
syscall read() returned size and process name and PID
# echo 'hist:key=common_pid.execname,ret' > \
/sys/kernel/debug/tracing/events/syscalls/sys_exit_read/trigger
# cat /sys/kernel/debug/tracing/events/syscalls/sys_exit_read/hist
[...]
{ common_pid: snmpd [ 1617], ret: 5 } hitcount: 8
{ common_pid: sshd [ 32745], ret: 1 } hitcount: 8
{ common_pid: irqbalance [ 1189], ret: 1024 } hitcount: 10
{ common_pid: supervise [ 2062], ret: 18446744073709551605 } hitcount: 14
{ common_pid: supervise [ 2064], ret: 18446744073709551605 } hitcount: 14
{ common_pid: supervise [ 2060], ret: 18446744073709551605 } hitcount: 14
{ common_pid: sshd [ 32745], ret: 36 } hitcount: 18
{ common_pid: bash [ 32754], ret: 1 } hitcount: 18
{ common_pid: readproctitle [ 2054], ret: 1 } hitcount: 1407
[...]
# echo '!hist:key=common_pid.execname,ret' > \
/sys/kernel/debug/tracing/events/syscalls/sys_exit_read/trigger
So readproctitle is doing a lot of 1 byte reads. Note the large ret value, 18446744073709551605, which is likely -1's (errors). (Hist triggers needs a way to print these as signed decimal, not just unsigned.)
syscall total read() returned bytes by process name and PID
# echo 'hist:key=common_pid.execname:values=ret:sort=ret if ret >= 0' \
> /sys/kernel/debug/tracing/events/syscalls/sys_exit_read/trigger
# cat /sys/kernel/debug/tracing/events/syscalls/sys_exit_read/hist
[...]
{ common_pid: bash [ 16608] } hitcount: 4 ret: 11722
{ common_pid: bash [ 16616] } hitcount: 4 ret: 12386
{ common_pid: bash [ 16617] } hitcount: 4 ret: 12469
{ common_pid: irqbalance [ 1189] } hitcount: 36 ret: 21702
{ common_pid: snmpd [ 1617] } hitcount: 75 ret: 22078
{ common_pid: sshd [ 32745] } hitcount: 329 ret: 165710
[...]
# echo '!hist:key=common_pid.execname:values=ret:sort=ret if ret >= 0' \
> /sys/kernel/debug/tracing/events/syscalls/sys_exit_read/trigger
I'm now doing much more than just counting a custom key. Here I'm summing the ret value (return value), and also filtering to only sum positive ret values (successful reads).
kernel stacks issuing disk I/O
# echo 'hist:key=stacktrace' > \
/sys/kernel/debug/tracing/events/block/block_rq_insert/trigger
# cat /sys/kernel/debug/tracing/events/block/block_rq_insert/hist
# event histogram
#
# trigger info: hist:keys=stacktrace:vals=hitcount:sort=hitcount:size=2048 [active]
#
{ stacktrace:
blk_mq_insert_requests+0x142/0x1b0
blk_mq_flush_plug_list+0x127/0x140
blk_flush_plug_list+0xc7/0x220
blk_finish_plug+0x2c/0x40
wb_writeback+0x18b/0x2f0
wb_workfn+0xfd/0x3c0
process_one_work+0x153/0x3f0
worker_thread+0x12b/0x4b0
kthread+0xc9/0xe0
ret_from_fork+0x1f/0x40
} hitcount: 1
{ stacktrace:
__blk_mq_insert_request+0x9e/0xd0
blk_mq_insert_request+0x88/0xc0
blk_mq_requeue_work+0xd0/0x120
process_one_work+0x153/0x3f0
worker_thread+0x12b/0x4b0
kthread+0xc9/0xe0
ret_from_fork+0x1f/0x40
} hitcount: 1
{ stacktrace:
blk_mq_insert_requests+0x142/0x1b0
blk_mq_flush_plug_list+0x127/0x140
blk_flush_plug_list+0xc7/0x220
blk_finish_plug+0x2c/0x40
__do_page_cache_readahead+0x182/0x220
ondemand_readahead+0x135/0x260
page_cache_sync_readahead+0x31/0x50
generic_file_read_iter+0x4c8/0x780
__vfs_read+0xbd/0x110
vfs_read+0x8e/0x140
kernel_read+0x41/0x60
prepare_binprm+0xe6/0x200
do_execveat_common.isra.34+0x457/0x6e0
SyS_execve+0x3a/0x50
do_syscall_64+0x69/0x110
return_from_SYSCALL_64+0x0/0x6a
} hitcount: 2
{ stacktrace:
blk_mq_insert_requests+0x142/0x1b0
blk_mq_flush_plug_list+0x127/0x140
blk_flush_plug_list+0xc7/0x220
blk_finish_plug+0x2c/0x40
__do_page_cache_readahead+0x182/0x220
filemap_fault+0x406/0x500
ext4_filemap_fault+0x36/0x50
__do_fault+0x5e/0xd0
handle_mm_fault+0xb98/0x1d20
__do_page_fault+0x1e0/0x4d0
do_page_fault+0x30/0x80
page_fault+0x28/0x30
clear_user+0x2b/0x40
padzero+0x24/0x40
load_elf_binary+0x8da/0x1650
search_binary_handler+0x9e/0x1e0
} hitcount: 2
{ stacktrace:
__blk_mq_insert_request+0x9e/0xd0
blk_sq_make_request+0x2af/0x3d0
generic_make_request+0xe1/0x1a0
submit_bio+0x68/0x130
submit_bh_wbc+0x12f/0x160
submit_bh+0x12/0x20
journal_submit_commit_record+0x1c7/0x1f0
jbd2_journal_commit_transaction+0xfce/0x1860
kjournald2+0xbb/0x230
kthread+0xc9/0xe0
ret_from_fork+0x1f/0x40
} hitcount: 3
{ stacktrace:
__blk_mq_insert_request+0x9e/0xd0
blk_mq_insert_request+0x88/0xc0
blk_mq_requeue_work+0xf6/0x120
process_one_work+0x153/0x3f0
worker_thread+0x12b/0x4b0
kthread+0xc9/0xe0
ret_from_fork+0x1f/0x40
} hitcount: 3
{ stacktrace:
blk_mq_insert_requests+0x142/0x1b0
blk_mq_flush_plug_list+0x127/0x140
blk_flush_plug_list+0xc7/0x220
blk_finish_plug+0x2c/0x40
jbd2_journal_commit_transaction+0xbf5/0x1860
kjournald2+0xbb/0x230
kthread+0xc9/0xe0
ret_from_fork+0x1f/0x40
} hitcount: 3
{ stacktrace:
blk_mq_insert_requests+0x142/0x1b0
blk_mq_flush_plug_list+0x127/0x140
blk_flush_plug_list+0xc7/0x220
blk_finish_plug+0x2c/0x40
generic_writepages+0x4d/0x60
blkdev_writepages+0xe/0x10
do_writepages+0x1e/0x30
__filemap_fdatawrite_range+0xaa/0xf0
filemap_fdatawrite+0x1f/0x30
fdatawrite_one_bdev+0x16/0x20
iterate_bdevs+0xe9/0x130
sys_sync+0x63/0x90
entry_SYSCALL_64_fastpath+0x1e/0xa8
} hitcount: 14
{ stacktrace:
blk_mq_insert_requests+0x142/0x1b0
blk_mq_flush_plug_list+0x127/0x140
blk_flush_plug_list+0xc7/0x220
blk_finish_plug+0x2c/0x40
ext4_writepages+0x4db/0xce0
do_writepages+0x1e/0x30
__filemap_fdatawrite_range+0xaa/0xf0
filemap_flush+0x1c/0x20
ext4_alloc_da_blocks+0x2c/0x70
ext4_rename+0x647/0x8a0
ext4_rename2+0x1d/0x30
vfs_rename+0x4aa/0x7f0
SyS_rename+0x345/0x3a0
entry_SYSCALL_64_fastpath+0x1e/0xa8
} hitcount: 72
Totals:
Hits: 101
Entries: 9
Dropped: 0
# echo '!hist:key=stacktrace' > \
/sys/kernel/debug/tracing/events/block/block_rq_insert/trigger
This is pretty useful for investigating disk I/O without an obvious reason, which can originate from an asynchronous kernel path. This is not only identifying the different paths, but doing so efficiently: using the entire stack trace as a key for the histogram, and counting it in kernel context.
Other fields can be added to that key, so one could count not just code paths to a function, but also other tracepoint fields.
user-level malloc()s by process name and PID
Now for a user-level example. I'll count which processes and PIDs are calling the libc malloc() routine:
# perf probe -x /lib/x86_64-linux-gnu/libc.so.6 malloc
Added new event:
probe_libc:malloc (on malloc in /lib/x86_64-linux-gnu/libc-2.19.so)
You can now use it in all perf tools, such as:
perf record -e probe_libc:malloc -aR sleep 1
# echo 'hist:key=common_pid.execname' > \
/sys/kernel/debug/tracing/events/probe_libc/malloc/trigger
# cat /sys/kernel/debug/tracing/events/probe_libc/malloc/hist
[...]
{ common_pid: chown [ 4663] } hitcount: 86
{ common_pid: chown [ 4633] } hitcount: 86
{ common_pid: chown [ 4567] } hitcount: 86
{ common_pid: chown [ 4575] } hitcount: 86
{ common_pid: chown [ 4553] } hitcount: 86
{ common_pid: chown [ 4605] } hitcount: 86
{ common_pid: chown [ 4562] } hitcount: 86
{ common_pid: chown [ 4541] } hitcount: 86
{ common_pid: ls [ 4649] } hitcount: 160
{ common_pid: snmpd [ 1617] } hitcount: 166
{ common_pid: tar [ 4563] } hitcount: 536291
[...]
# echo '!hist:key=common_pid.execname' >
/sys/kernel/debug/tracing/events/probe_libc/malloc/trigger
# perf probe --del probe_libc:malloc
Removed event: probe_libc:malloc
While tracing, the tar command called malloc() 536,291 times.
For this example, I borrowed the perf command to create the dynamic tracing probe of malloc(), and then hist triggers to instrument that probe. I didn't need to use perf: I could have set this all up via echo and /sys, but perf has better error checking.
Kernel-level dynamic tracing works too. And I can pull in arguments to functions, and their return values, and use them as keys in the histogram or sum them as values.
More info & warnings
This post demonstrates maybe one tenth of what hist triggers can do. To see more functionality, browse the official trace/events.txt documentation and search for "hist:".
This is all coming in Linux 4.7, although it is not currently enabled by default: you need to enable CONFIG_HIST_TRIGGERS.
WARNING: since hist triggers is all new code, so it would be wise to stress test it in the lab before any real use.
There's also the usual warnings about tracing: there will be overhead relative to the event rate multiplied by the event cost. Trace is a fast framework, and for some rough testing I was seeing 0.25 us overhead for counting kernel tracepoints, which is pretty fast. But, if you're doing millions of events per second, then 0.25 us can start to add up. User-level will be more costly.
Finally, the very last line of hist output has "Dropped: 0". If that's non-zero, then you've overflowed the default number of key slots (2048), and events will be dropped. It can be tuned by setting :size=4096 or whatever in the histogram config.
What about BPF?
Can't enhanced BPF do this too?
Yes, if you program it to. I feel like we've been waiting for advanced tracing in Linux for years, and now two busses have arrived at the same time. This overlap concern was raised and discussed on lkml, and it was eventually deemed that hist triggers was different enough to be included.
BPF can do a lot more than hist triggers, although it also requires a lot more effort. Hist triggers is a simple enhancement to trace, for some common system-wide observability. Hist triggers is also lightweight to enable in custom ways: you just need a shell, whereas BPF typically needs one (or more) compilers. I suspect there'll also be a little give and take between them for a while; for example, doing function execution counts I'm still using ftrace, since it's currently faster to initialize than BPF.
Most people won't need to know BPF or hist triggers in much detail, as I suspect most people will be using these capabilities via front end tools (either CLI or GUI). You do need to know the kinds of things that are possible (browse this post), so you can reach for the tools – or learn the interfaces – when you need them.
Thanks Tom Zanussi (Intel) for hist triggers!


Click here for Disqus comments (ad supported).