Stack trace support by Linux eBPF will make many new and awesome things possible, however, it didn't make it into the just-released Linux 4.4, which added other eBPF features. Envisaging some time on older kernels that have eBPF but not stack tracing, I've developed a hacky workaround for doing awesome things now.
I'll show my new bcc tools (eBPF front-end) that do this, then explain how it works.
stackcount: Frequency Counting Kernel Stack Traces
The stackcount tool frequency counts kernel stacks for a given function. This is performed in kernel for efficiency using an eBPF map. Only unique stacks and their counts are copied to user-level for printing.
For example, frequency counting kernel stack traces that led to submit_bio():
# ./stackcount submit_bio
Tracing 1 functions for "submit_bio"... Hit Ctrl-C to end.
^C
submit_bio
submit_bh
journal_submit_commit_record.isra.13
jbd2_journal_commit_transaction
kjournald2
kthread
ret_from_fork
mb_cache_list
1
[...truncated...]
submit_bio
submit_bh
jbd2_journal_commit_transaction
kjournald2
kthread
ret_from_fork
mb_cache_list
38
submit_bio
ext4_writepages
do_writepages
__filemap_fdatawrite_range
filemap_flush
ext4_alloc_da_blocks
ext4_rename
ext4_rename2
vfs_rename
sys_rename
entry_SYSCALL_64_fastpath
79
The order of printed stack traces is from least to most frequent. The most frequent in this example, printed last, was taken 79 times during tracing.
The last stack trace shows syscall handling, ext4_rename(), and filemap_flush(): looks like an application issued file rename has caused back end disk I/O due to ext4 block allocation and a filemap_flush().
This tool should be very useful for exploring and studying kernel behavior, quickly answering how a given function is being called.
stacksnoop: Printing Kernel Stack Traces
The stacksnoop tool prints kernel stack traces for each event. For example, for ext4_sync_fs():
# ./stacksnoop -v ext4_sync_fs TIME(s) COMM PID CPU STACK 42005557.056332998 sync 22352 1 42005557.056336999 sync 22352 1 ip: ffffffff81280461 ext4_sync_fs 42005557.056339003 sync 22352 1 r0: ffffffff811ed7f9 iterate_supers 42005557.056340002 sync 22352 1 r1: ffffffff8121ba25 sys_sync 42005557.056340002 sync 22352 1 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath 42005557.056358002 sync 22352 1 42005557.056358002 sync 22352 1 ip: ffffffff81280461 ext4_sync_fs 42005557.056359001 sync 22352 1 r0: ffffffff811ed7f9 iterate_supers 42005557.056359999 sync 22352 1 r1: ffffffff8121ba35 sys_sync 42005557.056359999 sync 22352 1 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
Since the output is verbose, this isn't suitable for high frequency calls (eg, over 1,000 per second). You can use funccount from bcc tools to measure the rate of a function call, and if it is high, try stackcount instead.
How It Works: Crazy Stuff
eBPF is an in-kernel virtual machine that can do all sorts of things, including "crazy stuff". So I wrote a user-defined stack walker in eBPF, which the kernel can run. Here is the relevant code from stackcount (you are not expected to understand this):
#define MAXDEPTH 10
struct key_t {
u64 ip;
u64 ret[MAXDEPTH];
};
BPF_HASH(counts, struct key_t);
static u64 get_frame(u64 *bp) {
if (*bp) {
// The following stack walker is x86_64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
*bp = 0;
if (ret < __START_KERNEL_map)
return 0;
return ret;
}
return 0;
}
int trace_count(struct pt_regs *ctx) {
FILTER
struct key_t key = {};
u64 zero = 0, *val, bp = 0;
int depth = 0;
key.ip = ctx->ip;
bp = ctx->bp;
// unrolled loop, 10 (MAXDEPTH) frames deep:
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
out:
val = counts.lookup_or_init(&key, &zero);
(*val)++;
return 0;
}
Once eBPF supports this properly, much of the above code will become a single function call.
If you are curious: I've used an unrolled loop to walk each frame (eBPF doesn't do backwards jumps), with a maximum of ten frames in this case. It walks the RBP register (base pointer) and saves the return instruction pointer for each frame into an array. I've had to use explicit bpf_probe_read()s to dereference pointers (bcc can automatically do this in some cases). I've also left the unrolled loop in the code (Python could have generated it) to keep it simple, and to help illustrate overhead.
This hack (so far) only works for x86_64, kernel-mode, and to a limited stack depth. If I (or you) really need more, keep hacking, although bear in mind that this is just a workaround until proper stack walking exists.
Other Solutions
stackcount implements an important new capability for the core Linux kernel: frequency counting stack traces. Just printing stack traces, like what stacksnoop does, has been possible for a long time: ftrace can do this, which I use in my kprobe tool from perf-tools. perf_events can also dump stack traces and has a reporting mode that will print unique paths and percentages (although it is performed less efficiently in user mode).
SystemTap has long had the capability to frequency count kernel- and user-mode stack traces, also in kernel for efficiency, although it is an add-on and not part of the mainline kernel.
Future Readers
If you're on Linux 4.5 or later, then eBPF may officially support stack walking. To check, look for something like a BPF_FUNC_get_stack in bpf_func_id. Or check the latest source code to tools like stackcount – the tool should still exist, but the above stack walker hack may be replaced with a simple call.
Thanks to Brenden Blanco (PLUMgrid) for help with this hack. If you're at SCaLE14x you can catch his IO Visor eBPF talk on Saturday, and my Broken Linux Performance Tools talk on Sunday!


Click here for Disqus comments (ad supported).