USENIX LISA 2021: BPF Internals, Tracing Examples
Talk by Brendan Gregg for USENIX LISA 2021.
Video: https://www.youtube.com/watch?v=_5Z2AU7QTH4
Description: "This talk is a deep dive that describes how BPF (eBPF) works internally on Linux, and dissects some modern performance observability tools. Details covered include the kernel BPF implementation: the verifier, JIT compilation, and the BPF execution environment; the BPF instruction set; different event sources; and how BPF is used by user space, using bpftrace programs as an example. This includes showing how bpftrace is compiled to LLVM IR and then BPF bytecode, and how per-event data and aggregated map data are fetched from the kernel."
PDF: LISA2021_BPF_Internals.pdf
Keywords (from pdftotext):
slide 1:
BPF Internals (eBPF)
Tracing Examples
Brendan Gregg
USENIX
Jun, 2021
slide 2:
BPF Internals (Brendan Gregg)
(it’s actually quite easy)
slide 3:
Agenda
Intro & Tracing By Example
  1. Dynamic tracing and per-event output
  2. Static tracing and map summaries
For BPF internals by reference, see the References slide
Slides are online
  Slides: http://www.brendangregg.com/Slides/LISA2021_BPF_Internals.pdf
  Video: https://www.usenix.org/conference/lisa21/presentation/gregg-bpf
Learning objectives
- Gain a working knowledge of some BPF internals
- Evaluate ideas for BPF suitability
slide 4:
BPF Intro
slide 5:
BPF 1992: Berkeley Packet Filter
    # tcpdump -d host 127.0.0.1 and port 80
    (000) ldh      [12]
    (001) jeq      #0x800           jt 2    jf 18
    (002) ld       [26]
    (003) jeq      #0x7f000001      jt 6    jf 4
    (004) ld       [30]
    (005) jeq      #0x7f000001      jt 6    jf 18
    (006) ldb      [23]
    (007) jeq      #0x84            jt 10   jf 8
    (008) jeq      #0x6             jt 10   jf 9
    (009) jeq      #0x11            jt 10   jf 18
    (010) ldh      [20]
    (011) jset     #0x1fff          jt 18   jf 12
    (012) ldxb     4*([14]&0xf)
    (013) ldh      [x + 14]
    (014) jeq      #0x50            jt 17   jf 15
    (015) ldh      [x + 16]
    (016) jeq      #0x50            jt 17   jf 18
    (017) ret      #262144
    (018) ret      #0
A runtime for efficient packet filters
Also a narrow and arcane kernel technology that few knew existed
slide 6:
eBPF 2013+
Extended BPF (eBPF) modernized BPF

             Classic BPF   Extended BPF
Word size    32-bit        64-bit
Registers    2             10+1
Storage      16 slots      512 byte stack + infinite map storage
Events       packets       many event sources

Maintainers/creators: Alexei Starovoitov & Daniel Borkmann
Old BPF is now “Classic BPF,” and eBPF is usually just “BPF”
slide 7:
BPF 2021
BPF is now a technology name (like LLVM)
Some still call it eBPF
A generic in-kernel execution environment
- User-defined programs
- Limited & secure kernel access
A new type of software
slide 8:
A New Type of Software

        Execution  User     Compilation  Security       Failure        Resource
        model      defined                              mode           access
User    task       yes      any          user based     abort          syscall, fault
Kernel  task       no       static       none           panic          direct
BPF     event      yes      JIT, CO-RE   verified, JIT  error message  restricted helpers
slide 9:
BPF program state model
[Figure: state diagram. States: Loaded, Enabled, Off-CPU, On-CPU, Sleeping (restricted state), Spinning. Labels: BPF, Kernel, attach, event fires, program ended, preempt, page fault, helpers, spin lock]
slide 10:
BPF Tracing
slide 11:
BPF tracing (observability) tools
slide 12:
Recommended BPF tracing front-ends
I want to run some tools: bcc, bpftrace (Unix analogies: /usr/bin/*)
I want to hack up some new tools: bpftrace (Unix analogies: bash, awk)
I want to spend weeks developing a BPF product: bcc libbpf C, bcc Python (maybe), gobpf, libbpf-rs (Unix analogies: C, C++)
Notes:
- bcc libbpf C: new, lightweight, CO-RE & BTF based
- bcc Python: requires LLVM; becoming obsolete / special-use only
slide 13:
BPF Internals
(developing BPF was hard; understanding it is easy)
slide 14:
BPF tracing/observability high-level
From: BPF Performance Tools, Figure 2-1
slide 15:
Terminology
AST: Abstract Syntax Tree
LLVM: A compiler
IR: Intermediate Representation
JIT: Just-in-time compilation
kprobes: Kernel dynamic instrumentation
uprobes: User-level dynamic instrumentation
tracepoints: Kernel static instrumentation
slide 16:
1. Dynamic tracing and per-event output
slide 17:
1. Dynamic tracing and per-event output
    bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'
slide 18:
Example output
    # bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'
    Attaching 1 probe...
    PID 10287 sleeping...
    PID 10297 sleeping...
    PID 10287 sleeping...
    PID 10297 sleeping...
    PID 10287 sleeping...
    PID 2218 sleeping...
    PID 10297 sleeping...
    [...]
slide 19:
Objective
We have this:
    bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'
We want:
- BPF bytecode
- Kernel events mapped to the bytecode
- User space printing events
This is learning internals by example, including bpftrace internals. For complete internals, see the References slide at the end.
slide 20:
bpftrace mid-level internals
slide 21:
Program transformations
bpftrace program → AST → LLVM IR → BPF machine code
slide 22:
bpftrace mid-level internals 1/13
slide 23:
bpftrace program
    kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }
slide 24:
bpftrace mid-level internals 2/13
slide 25:
Converting to AST
    kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }

    probe(kprobe:do_nanosleep)
      call(printf)
        string("PID %d sleeping...\n")
        builtin(pid)
slide 26:
Parsing
bpftrace program text → lex: bpftrace src/lexer.l (regular expressions) → yacc: bpftrace src/parser.yy (grammar) → AST
(very easy)
slide 27:
Lexer
pid
Regular expressions
bpftrace src/lexer.l
    ident    [_a-zA-Z][_a-zA-Z0-9]*
    map      @{ident}|@
    var      ${ident}
    int      [0-9]+|0[xX][0-9a-fA-F]+
    cint     :{int}:
    hex      (x|X)[0-9a-fA-F]{1,2}
    [...]
    builtin  arg[0-9]|args|cgroup|comm|cpid|cpu|ctx|curtask|elapsed|func|gid|nsecs|pid|probe|rand|retval|sarg[0-9]|tid|uid|username
    call     avg|cat|cgroupid|clear|count|delete|exit|hist|join|kaddr|ksym|lhist|max|min|ntop|override_return|print|printf|reg|signal|stats|str|strncmp|sum|sym|system|time|uaddr|usym|zero
slide 28:
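The lexer rules above can be tried out with a rough Python sketch. This is not bpftrace's lexer: the regular expressions follow the src/lexer.l excerpt, but the keyword sets are shortened, the punctuation and string rules are simplified, and the names `tokenize`, `BUILTINS` and `CALLS` are made up for illustration. Lex picks the longest match, so the sketch first matches a whole identifier and only then checks whether it is a builtin or call keyword.

```python
import re

# Shortened keyword sets (assumption: a subset of the lists in src/lexer.l).
BUILTINS = {"args", "comm", "cpu", "nsecs", "pid", "retval", "tid", "uid"}
CALLS = {"count", "hist", "printf", "str", "sum", "time"}

TOKEN_RE = re.compile(r"""
    (?P<STRING>"(?:\\.|[^"\\])*")
  | (?P<INT>0[xX][0-9a-fA-F]+|[0-9]+)
  | (?P<MAP>@(?:[_a-zA-Z][_a-zA-Z0-9]*)?)
  | (?P<IDENT>[_a-zA-Z][_a-zA-Z0-9]*)
  | (?P<PUNCT>[{}():;,])
  | (?P<WS>\s+)
""", re.VERBOSE)


def tokenize(text):
    """Split bpftrace-like text into (kind, value) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"invalid character at offset {pos}: {text[pos]!r}")
        kind, value = match.lastgroup, match.group()
        pos = match.end()
        if kind == "WS":
            continue
        # Keywords are whole identifiers, so "pidd" stays a plain IDENT.
        if kind == "IDENT" and value in BUILTINS:
            kind = "BUILTIN"
        elif kind == "IDENT" and value in CALLS:
            kind = "CALL"
        tokens.append((kind, value))
    return tokens


for token in tokenize('kprobe:do_nanosleep { printf("PID %d sleeping...\\n", pid); }'):
    print(token)
```

With this input, `printf` comes out as a CALL token and `pid` as a BUILTIN token, which is what the grammar rules on the following slides consume.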
Yacc
builtin(pid)
Grammar rules
bpftrace src/parser.yy
    %token <std::string> BUILTIN "builtin"
    %token <std::string> CALL "call"
    [...]
    expr : int           { $$ = $1; }
         | STRING        { $$ = new ast::String($1, @$); }
         | BUILTIN       { $$ = new ast::Builtin($1, @$); }
         | CALL_BUILTIN  { $$ = new ast::Builtin($1, @$); }
         | IDENT         { $$ = new ast::Identifier($1, @$); }
         | STACK_MODE    { $$ = new ast::StackMode($1, @$); }
         | ternary       { $$ = $1; }
         | param         { $$ = $1; }
         | map_or_var    { $$ = $1; }
         | call          { $$ = $1; }
    [...]
    call : CALL "(" ")"        { $$ = new ast::Call($1, @$); }
         | CALL "(" vargs ")"  { $$ = new ast::Call($1, $3, @$); }
slide 29:
Lexer (2)
printf("PID %d sleeping...\n", pid);
bpftrace src/lexer.l
    ident    [_a-zA-Z][_a-zA-Z0-9]*
    map      @{ident}|@
    var      ${ident}
    int      [0-9]+|0[xX][0-9a-fA-F]+
    cint     :{int}:
    hex      (x|X)[0-9a-fA-F]{1,2}
    [...]
    builtin  arg[0-9]|args|cgroup|comm|cpid|cpu|ctx|curtask|elapsed|func|gid|nsecs|pid|probe|rand|retval|sarg[0-9]|tid|uid|username
    call     avg|cat|cgroupid|clear|count|delete|exit|hist|join|kaddr|ksym|lhist|max|min|ntop|override_return|print|printf|reg|signal|stats|str|strncmp|sum|sym|system|time|uaddr|usym|zero
slide 30:
Yacc (2)
call(printf(...));
bpftrace src/parser.yy
    %token <std::string> BUILTIN "builtin"
    %token <std::string> CALL "call"
    [...]
    expr : int           { $$ = $1; }
         | STRING        { $$ = new ast::String($1, @$); }
         | BUILTIN       { $$ = new ast::Builtin($1, @$); }
         | CALL_BUILTIN  { $$ = new ast::Builtin($1, @$); }
         | IDENT         { $$ = new ast::Identifier($1, @$); }
         | STACK_MODE    { $$ = new ast::StackMode($1, @$); }
         | ternary       { $$ = $1; }
         | param         { $$ = $1; }
         | map_or_var    { $$ = $1; }
         | call          { $$ = $1; }
    [...]
    call : CALL "(" ")"        { $$ = new ast::Call($1, @$); }
         | CALL "(" vargs ")"  { $$ = new ast::Call($1, $3, @$); }
slide 31:
Lexer (3)
"PID %d sleeping...\n"
bpftrace src/lexer.l
    \"              { yy_push_state(STR, yyscanner); buffer.clear(); }
    <STR>{
      \"            { yy_pop_state(yyscanner); return Parser::make_STRING(buffer, loc); }
      [^\\\n\"]+    buffer += yytext;
      \\n           buffer += '\n';
      \\t           buffer += '\t';
      \\r           buffer += '\r';
      \\\"          buffer += '\"';
      \\\\          buffer += '\\';
      \\{oct}       {
                      long value = strtol(yytext+1, NULL, 8);
                      if (value > UCHAR_MAX)
                        driver.error(loc, std::string("octal escape sequence out of range '") + yytext + "'");
                      buffer += value;
                    }
      \\{hex}       buffer += strtol(yytext+2, NULL, 16);
      \n            driver.error(loc, "unterminated string"); yy_pop_state(yyscanner); loc.lines(1); loc.step();
      <<EOF>>       driver.error(loc, "unterminated string"); yy_pop_state(yyscanner);
      \\.           { driver.error(loc, std::string("invalid escape character '") + yytext + "'"); }
      .             driver.error(loc, "invalid character"); yy_pop_state(yyscanner);
    }
slide 32:
Yacc (3)
string("PID %d sleeping...\n")
bpftrace src/parser.yy
    %token <std::string> STRING "string"
    [...]
    expr : int           { $$ = $1; }
         | STRING        { $$ = new ast::String($1, @$); }
         | BUILTIN       { $$ = new ast::Builtin($1, @$); }
         | CALL_BUILTIN  { $$ = new ast::Builtin($1, @$); }
         | IDENT         { $$ = new ast::Identifier($1, @$); }
         | STACK_MODE    { $$ = new ast::StackMode($1, @$); }
         | ternary       { $$ = $1; }
         | param         { $$ = $1; }
         | map_or_var    { $$ = $1; }
         | call          { $$ = $1; }
    [...]
slide 33:
Lexer & Yacc (4)
kprobe:do_nanosleep
bpftrace src/lexer.l
    ident    [_a-zA-Z][_a-zA-Z0-9]*
    map      @{ident}|@
    [...]
bpftrace src/parser.yy
    attach_point : ident                        { $$ = new ast::AttachPoint($1, @$); }
                 | ident ":" wildcard           { $$ = new ast::AttachPoint($1, $3, @$); }
                 | ident ":" wildcard PLUS INT  { $$ = new ast::AttachPoint($1, $3, $5, @$); }
                 | ident PATH STRING            { $$ = new ast::AttachPoint($1, $2.substr(1, $2.si...
    [...]
    wildcard : wildcard ident     { $$ = $1 + $2; }
             | wildcard MUL       { $$ = $1 + "*"; }
             | wildcard LBRACKET  { $$ = $1 + "["; }
             | wildcard RBRACKET  { $$ = $1 + "]"; }
             | wildcard DOT       { $$ = $1 + "."; }
             |                    { $$ = ""; }
slide 35:
Yacc (5)
{ statements; ... }
Plus more grammar for program structure...
bpftrace src/parser.yy
    probe : attach_points pred block  { $$ = new ast::Probe($1, $2, $3); }
    attach_points : attach_points "," attach_point  { $$ = $1; $1->push_back($3); }
                  | attach_point                    { $$ = new ast::AttachPointList; $$->push_back($1); }
    [...]
    block : "{" stmts "}"  { $$ = $2; }
    semicolon_ended_stmt: stmt ";"  { $$ = $1; }
    stmts : semicolon_ended_stmt stmts  { $$ = $2; $2->insert($2->begin(), $1); }
          | block_stmt stmts            { $$ = $2; $2->insert($2->begin(), $1); }
          | stmt                        { $$ = new ast::StatementList; $$->push_back($1); }
          |                             { $$ = new ast::StatementList; }
slide 36:
Now you have AST nodes!
    # bpftrace -d -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'
    AST
    ------------------
    Program
     kprobe:do_nanosleep
      call: printf
       string: PID %d sleeping...\n
       builtin: pid
    [...]
slide 37:
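To make the tree shape in the AST dump above concrete, here is a small Python sketch that builds the same four nodes and prints them with one extra space of indentation per level. The class names mirror the node labels in the dump, but the classes and the `dump` helper are illustrative assumptions, not bpftrace's `ast::` C++ classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class String:
    value: str


@dataclass
class Builtin:
    ident: str


@dataclass
class Call:
    func: str
    args: List[object] = field(default_factory=list)


@dataclass
class Probe:
    attach_point: str
    stmts: List[object] = field(default_factory=list)


def dump(node, depth=0):
    """Return the tree as indented lines, one extra space per nesting level."""
    pad = " " * depth
    if isinstance(node, Probe):
        lines = [pad + node.attach_point]
        children = node.stmts
    elif isinstance(node, Call):
        lines = [pad + "call: " + node.func]
        children = node.args
    elif isinstance(node, String):
        return [pad + "string: " + node.value]
    elif isinstance(node, Builtin):
        return [pad + "builtin: " + node.ident]
    else:
        raise TypeError(f"unknown node type: {type(node).__name__}")
    for child in children:
        lines.extend(dump(child, depth + 1))
    return lines


# The one-liner from the slides, built by hand instead of by a parser.
program = Probe(
    "kprobe:do_nanosleep",
    [Call("printf", [String(r"PID %d sleeping...\n"), Builtin("pid")])],
)
print("\n".join(dump(program)))
```

In bpftrace the parser actions (`new ast::Call(...)`, `new ast::Builtin(...)`) create these nodes; here they are constructed directly to keep the sketch short.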
bpftrace mid-level internals 3/13
slide 38:
Tracepoint & Clang struct parsers
    kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }
Not needed for this example (no struct member dereferencing)
slide 39:
bpftrace mid-level internals 4/13
slide 40:
Semantic analyzer
Catches many program errors; E.g.:
    # bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pidd); }'
    stdin:2:36-38: ERROR: Unknown identifier: 'pidd'
    printf("PID %d sleeping...\n", pidd);
bpftrace src/ast/semantic_analyser.cpp
    void SemanticAnalyser::visit(Identifier &identifier)
    {
      if (bpftrace_.enums_.count(identifier.ident) != 0) {
        identifier.type = SizedType(Type::integer, 8);
      }
      else {
        identifier.type = SizedType(Type::none, 0);
        error("Unknown identifier: '" + identifier.ident + "'", identifier.loc);
      }
    }
>2000 lines of code
slide 41:
bpftrace mid-level internals 5/13
slide 42:
AST → LLVM IR
bpftrace src/ast/codegen_llvm.cpp
    void CodegenLLVM::visit(Builtin &builtin)
    […]
      else if (builtin.ident == "pid" || builtin.ident == "tid")
        Value *pidtgid = b_.CreateGetPidTgid();
        if (builtin.ident == "pid")
          expr_ = b_.CreateLShr(pidtgid, 32);
    […]
CreateLShr(): BPF logical shift right instruction
slide 43:
AST → LLVM IR (2)
bpftrace src/ast/irbuilderbpf.cpp
    CallInst *IRBuilderBPF::CreateGetPidTgid()
    {
      // u64 bpf_get_current_pid_tgid(void)
      // Return: current->tgid << 32 | current->pid
      FunctionType *getpidtgid_func_type = FunctionType::get(getInt64Ty(), false);
      PointerType *getpidtgid_func_ptr_type = PointerType::get(getpidtgid_func_type, 0);
      Constant *getpidtgid_func = ConstantExpr::getCast(
          Instruction::IntToPtr,
          getInt64(libbpf::BPF_FUNC_get_current_pid_tgid),
          getpidtgid_func_ptr_type);
      return CreateCall(getpidtgid_func, {}, "get_pid_tgid");
    }
libbpf::BPF_FUNC_get_current_pid_tgid: BPF helper call number
slide 44:
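The comment in the code above says the helper returns `current->tgid << 32 | current->pid`, and the codegen for the `pid` builtin shifts that value right by 32. A short Python sketch of the same packing arithmetic follows. The function names are illustrative, and treating the low 32 bits as the `tid` builtin is an assumption based on that return layout, since the slide elides the `tid` branch.

```python
MASK32 = 0xFFFFFFFF


def pack_pid_tgid(tgid, pid):
    """Pack the kernel tgid (upper 32 bits) and kernel pid (lower 32 bits)."""
    return ((tgid & MASK32) << 32) | (pid & MASK32)


def builtin_pid(pid_tgid):
    """What the 'pid' builtin computes: a logical shift right by 32."""
    return pid_tgid >> 32


def builtin_tid(pid_tgid):
    """Assumed counterpart for 'tid': keep only the lower 32 bits."""
    return pid_tgid & MASK32


# Hypothetical thread 10290 inside process 10287.
value = pack_pid_tgid(10287, 10290)
print(hex(value), builtin_pid(value), builtin_tid(value))
```

The user-visible process ID is the kernel's tgid, which is why the interesting half for `pid` is the upper one.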
BPF helper calls
Linux include/uapi/linux/bpf.h
    #define __BPF_FUNC_MAPPER(FN)       \
        FN(unspec),                     \
        FN(map_lookup_elem),            \
        FN(map_update_elem),            \
        FN(map_delete_elem),            \
        FN(probe_read),                 \
        FN(ktime_get_ns),               \
        FN(trace_printk),               \
        FN(get_prandom_u32),            \
        FN(get_smp_processor_id),       \
        FN(skb_store_bytes),            \
        FN(l3_csum_replace),            \
        FN(l4_csum_replace),            \
        FN(tail_call),                  \
        FN(clone_redirect),             \
        FN(get_current_pid_tgid),       \
        FN(get_current_uid_gid),        \
        [...]
get_current_pid_tgid is #14
slide 45:
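The helper number is simply the position of the entry in the `__BPF_FUNC_MAPPER` list above, counting from zero. The following Python sketch reproduces that numbering for the entries shown on the slide only; the list is cut off where the slide shows `[...]`, so later helpers are deliberately absent.

```python
# Entries in the order shown on the slide (the real list continues).
HELPERS = [
    "unspec",
    "map_lookup_elem",
    "map_update_elem",
    "map_delete_elem",
    "probe_read",
    "ktime_get_ns",
    "trace_printk",
    "get_prandom_u32",
    "get_smp_processor_id",
    "skb_store_bytes",
    "l3_csum_replace",
    "l4_csum_replace",
    "tail_call",
    "clone_redirect",
    "get_current_pid_tgid",
    "get_current_uid_gid",
]

# The helper number is the zero-based index in the list.
HELPER_ID = {name: number for number, name in enumerate(HELPERS)}

print(HELPER_ID["get_current_pid_tgid"], HELPER_ID["get_smp_processor_id"])
```

These are the same numbers that appear as call targets in the LLVM IR on the next slides (`i64 14` and `i64 8`).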
Now you have LLVM IR!
    # bpftrace -d -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'
    […]
    define i64 @"kprobe:do_nanosleep"(i8*) local_unnamed_addr section "s_kprobe:do_nanosleep_1" {
    entry:
      %printf_args = alloca %printf_t, align 8
      %1 = bitcast %printf_t* %printf_args to i8*
      call void @llvm.lifetime.start.p0i8(i64 -1, i8* nonnull %1)
      %2 = getelementptr inbounds %printf_t, %printf_t* %printf_args, i64 0, i32 0
      store i64 0, i64* %2, align 8
      %get_pid_tgid = tail call i64 inttoptr (i64 14 to i64 ()*)()
      %3 = lshr i64 %get_pid_tgid, 32
      %4 = getelementptr inbounds %printf_t, %printf_t* %printf_args, i64 0, i32 1
      store i64 %3, i64* %4, align 8
      %pseudo = tail call i64 @llvm.bpf.pseudo(i64 1, i64 1)
      %get_cpu_id = tail call i64 inttoptr (i64 8 to i64 ()*)()
      %perf_event_output = call i64 inttoptr (i64 25 to i64 (i8*, i64, i64, %printf_t*, i64)*)(i8* %0, i64 %pseudo, i64 %get_cpu_id, %printf_t* nonnull %printf_args, i64 16)
      call void @llvm.lifetime.end.p0i8(i64 -1, i8* nonnull %1)
      ret i64 0
    }
slide 47:
Now you have LLVM IR! (3)
    # bpftrace -d -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'
    […]
    define i64 @"kprobe:do_nanosleep"(i8*) local_unnamed_addr section "s_kprobe:do_nanosleep_1" {
    entry:
      %printf_args = alloca %printf_t, align 8
      %1 = bitcast %printf_t* %printf_args to i8*
      call void @llvm.lifetime.start.p0i8(i64 -1, i8* nonnull %1)
      %2 = getelementptr inbounds %printf_t, %printf_t* %printf_args, i64 0, i32 0
      store i64 0, i64* %2, align 8
    [...]
This is all generated from LLVM IR calls in bpftrace src/ast/*
Lots of CreateAllocaBPF(), CreateGEP(), etc. (this gets verbose)
slide 48:
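The IR above stores two 64-bit values into `%printf_t` (a constant 0, then the shifted PID) and hands 16 bytes to the perf_event_output helper. A hedged Python sketch of how a user-space reader could turn such a 16-byte record back into text is shown below. Interpreting the first field as an index into a table of format strings is an assumption made for illustration, as are the names `FORMATS` and `render`; the slides shown so far do not spell out bpftrace's actual decoding.

```python
import struct

# Hypothetical table: first record field -> format string.
FORMATS = {0: "PID %d sleeping...\n"}


def render(record):
    """Decode a 16-byte record of two little-endian u64 values and format it."""
    if len(record) != 16:
        raise ValueError("expected a 16-byte record")
    printf_id, arg0 = struct.unpack("<QQ", record)
    return FORMATS[printf_id] % arg0


# Build a sample record the way the IR fills %printf_t: field 0 = 0, field 1 = PID.
sample = struct.pack("<QQ", 0, 10287)
print(render(sample), end="")
```

The point of the sketch is only the data layout: the kernel side sends compact binary fields per event, and formatting is left to user space.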
bpftrace mid-level internals 6/13
slide 49:
Extended BPF instruction (bytecode) format
← 64-bit →
    opcode   dest reg   src reg   signed offset   signed immediate constant
    8-bit    4-bit      4-bit     16-bit          32-bit
E.g., for ALU & JMP classes, the 8-bit opcode is:
    opcode   src     inst. class
    4-bit    1-bit   3-bit
E.g., call get_current_pid_tgid
    opcode   dest reg   src reg   signed offset   signed immediate constant
    0x85     0x0        0x0       0x00 0x00       0x0e 0x00 0x00 0x00
    opcode 0x85 = BPF_CALL (opcode) | BPF_JMP (inst. class)
Linux include/uapi/linux/bpf.h
    #define BPF_CALL 0x80
Linux include/uapi/linux/bpf_common.h
    #define BPF_JMP 0x05
slide 53:
Extended BPF instruction (bytecode) format (5)
E.g., call get_current_pid_tgid
(hex) 85 00 00 00 0e 00 00 00
As per the BPF specification (currently Linux headers)
slide 54:
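The field layout above (8-bit opcode, two 4-bit registers, 16-bit offset, 32-bit immediate) can be checked with a few lines of Python. This is a sketch that assumes little-endian byte order, as in the x86 examples in this deck; `Insn` and `decode` are illustrative names, not kernel or libbpf APIs.

```python
import struct
from collections import namedtuple

Insn = namedtuple("Insn", "opcode dst_reg src_reg off imm")

BPF_JMP = 0x05   # instruction class, low 3 bits of the opcode byte
BPF_CALL = 0x80  # operation, high 4 bits of the opcode byte


def decode(raw):
    """Decode one 8-byte BPF instruction slot (little-endian layout assumed)."""
    if len(raw) != 8:
        raise ValueError("a BPF instruction slot is exactly 8 bytes")
    opcode, regs, off, imm = struct.unpack("<BBhi", raw)
    # In the register byte, the low nibble is the dest reg, the high nibble the src reg.
    return Insn(opcode, regs & 0x0F, regs >> 4, off, imm)


call = decode(bytes.fromhex("85 00 00 00 0e 00 00 00"))
print(hex(call.opcode), call.opcode == (BPF_JMP | BPF_CALL), call.imm)
# prints: 0x85 True 14
```

The same function decodes `bf 16 00 00 00 00 00 00` to dest reg 6 and src reg 1, which matches the `r6 = r1` shown by bpftool later in the deck.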
LLVM/Clang has a BPF target
--target bpf
[Diagram: LLVM → BPF bytecode]
Future: bpftrace may include its own lightweight bpftrace compiler (BC) as an option (pros: no dependencies; cons: less optimal code)
slide 55:
LLVM/Clang has a BPF target (2)
BPF specification (#defines):
    Linux include/uapi/linux/bpf_common.h
    Linux include/uapi/linux/bpf.h
    Linux include/uapi/linux/filter.h
[Diagram: BPF specification → LLVM BPF target; LLVM → BPF bytecode]
slide 56:
LLVM IR → BPF
E.g., tail call i64 inttoptr (i64 14 to i64 ()*)()
LLVM llvm/lib/Target/BPF/BPFInstrInfo.td
    class CALL<...> : TYPE_ALU_JMP<...> {
      bits<32> BrDst;
      let Inst{31-0} = BrDst;
      let BPFClass = BPF_JMP;
    }
Plus more LLVM boilerplate & BPF headers shown earlier
→ 85 00 00 00 0e 00 00 00
slide 57:
Now you have BPF bytecode!
    bf 16 00 00 00 00 00 00
    b7 01 00 00 00 00 00 00
    7b 1a f0 ff 00 00 00 00
    85 00 00 00 0e 00 00 00
    77 00 00 00 20 00 00 00
    7b 0a f8 ff 00 00 00 00
    18 17 00 00 30 00 00 00
    00 00 00 00 00 00 00 00
    85 00 00 00 08 00 00 00
    bf a4 00 00 00 00 00 00
    07 04 00 00 f0 ff ff ff
    bf 61 00 00 00 00 00 00
    bf 72 00 00 00 00 00 00
    bf 03 00 00 00 00 00 00
    b7 05 00 00 10 00 00 00
    85 00 00 00 19 00 00 00
    b7 00 00 00 00 00 00 00
    95 00 00 00 00 00 00 00
In 85 00 00 00 0e 00 00 00:
- 85: 0x05 (BPF_JMP) | 0x80 (BPF_CALL)
- 0e: 14 (BPF_FUNC_get_current_pid_tgid)
slide 59:
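As a cross-check of the dump above, the following Python sketch walks the bytes in 8-byte slots, counts the slots, and collects the immediate of every helper call. It assumes little-endian encoding and treats a call with src reg 0 as a helper call; `helper_calls` is an illustrative name.

```python
import struct

# Pre-verifier bytecode copied from the slide, one 8-byte slot per line.
BYTECODE = bytes.fromhex("""
    bf 16 00 00 00 00 00 00
    b7 01 00 00 00 00 00 00
    7b 1a f0 ff 00 00 00 00
    85 00 00 00 0e 00 00 00
    77 00 00 00 20 00 00 00
    7b 0a f8 ff 00 00 00 00
    18 17 00 00 30 00 00 00
    00 00 00 00 00 00 00 00
    85 00 00 00 08 00 00 00
    bf a4 00 00 00 00 00 00
    07 04 00 00 f0 ff ff ff
    bf 61 00 00 00 00 00 00
    bf 72 00 00 00 00 00 00
    bf 03 00 00 00 00 00 00
    b7 05 00 00 10 00 00 00
    85 00 00 00 19 00 00 00
    b7 00 00 00 00 00 00 00
    95 00 00 00 00 00 00 00
""")

BPF_JMP, BPF_CALL = 0x05, 0x80


def helper_calls(code):
    """Return the immediate of every helper call, in program order."""
    ids = []
    for offset in range(0, len(code), 8):
        opcode, regs, _off, imm = struct.unpack_from("<BBhi", code, offset)
        src_reg = regs >> 4
        # Only calls with src reg 0 are treated as helper calls in this sketch.
        if opcode == (BPF_JMP | BPF_CALL) and src_reg == 0:
            ids.append(imm)
    return ids


print(len(BYTECODE) // 8, helper_calls(BYTECODE))
# prints: 18 [14, 8, 25]
```

The 18 slots agree with `insn_cnt=18` in the `BPF_PROG_LOAD` call on the next slides, and 14, 8 and 25 are the three helper numbers seen in the LLVM IR. The 0x18 instruction occupies two slots; its second slot has opcode 0, so the scan does not mistake it for a call.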
bpftrace mid-level internals 7/13
slide 60:
Sending BPF bytecode to the kernel
    # strace -fe bpf bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'
    [...]
    bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_KPROBE, insn_cnt=18,
        insns=0x7fdde5305000, license="GPL", log_level=0, log_size=0,
        log_buf=NULL, kern_version=KERNEL_VERSION(5, 8, 18), prog_flags=0,
        prog_name="do_nanosleep", prog_ifindex=0,
        expected_attach_type=BPF_CGROUP_INET_INGRESS, prog_btf_fd=0,
        func_info_rec_size=0, func_info=NULL, func_info_cnt=0,
        line_info_rec_size=0, line_info=NULL, line_info_cnt=0,
        attach_btf_id=0, attach_prog_fd=0}, 120) = 14
Success! Passed the verifier...
slide 61:
bpftrace mid-level internals 8/13
slide 62:
BPF mid-level internals
From: BPF Performance Tools, Figure 2-3
slide 63:
Verifying BPF instructions
    85 00 00 00 12 34 56 78
Imagine we call a bogus function...
Linux kernel/bpf/verifier.c
    static int do_check(struct bpf_verifier_env *env)
    {
    [...]
            } else if (class == BPF_JMP || class == BPF_JMP32) {
                u8 opcode = BPF_OP(insn->code);

                env->jmps_processed++;
                if (opcode == BPF_CALL) {
    [...]
                        err = check_helper_call(env, insn->imm, env->insn_idx);
    [...]
    static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn_idx)
    {
        const struct bpf_func_proto *fn = NULL;
        struct bpf_reg_state *regs;
        struct bpf_call_arg_meta meta;
        bool changes_data;
        int i, err;

        /* find function prototype */
        if (func_id < 0 || func_id >= __BPF_FUNC_MAX_ID) {
            verbose(env, "invalid func %s#%d\n", func_id_name(func_id), func_id);
            return -EINVAL;
        }
>9000 lines of code
slide 64:
BPF verifier
>9000 lines of code
>260 error returns
Checks every instruction
Checks every code path
Rewrites some bytecode
Verifier functions:
    check_subprogs
    check_reg_arg
    check_stack_write
    check_stack_read
    check_stack_access
    check_map_access_type
    check_mem_region_access
    check_map_access
    check_packet_access
    check_ctx_access
    check_flow_keys_access
    check_sock_access
    check_pkt_ptr_alignment
    check_generic_ptr_alignment
    check_ptr_alignment
    check_max_stack_depth
    check_tp_buffer_access
    check_ptr_to_btf_access
    check_mem_access
    check_xadd
    check_stack_boundary
    check_helper_mem_access
    check_func_arg
    check_map_func_compatibility
    check_func_proto
    check_func_call
    check_reference_leak
    check_helper_call
    check_alu_op
    check_cond_jmp_op
    check_ld_imm
    check_ld_abs
    check_return_code
    check_cfg
    check_btf_func
    check_btf_line
    check_btf_info
    check_map_prealloc
    check_map_prog_compatibility
    check_struct_ops_btf_id
    check_attach_modify_return
    check_attach_btf_id
slide 65:
Verifying Instructions
- Memory access
  - Direct access extremely restricted
  - Can only read initialized memory
  - Other kernel memory must pass through the bpf_probe_read() helper and its checks
- Arguments are the correct type
- Register usage allowed
  - E.g., no frame pointer writes
- No write overflows
- No addr leaks
- Etc.
[Diagram: Memory. LOAD/STORE to STACK, CTX, MAP_VALUE, SOCKET; random addr via bpf_probe_read]
slide 66:
Verifying Code Paths
- All instructions must lead to exit
- No unreachable instructions
- No backwards branches (loops) except BPF bounded loops
biolatency as GraphViz dot:
[Figure: control flow graph of biolatency]
slide 68:
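The three code-path rules above can be illustrated with a toy checker. This is a deliberately simplified sketch, not the kernel's `check_cfg()`: the instruction model (`op`, `jmp`, `jcond`, `exit` tuples with a jump offset relative to the next instruction) and the function `check_cfg_toy` are made up for illustration, and it rejects every backward jump rather than modelling bounded loops.

```python
def check_cfg_toy(prog):
    """Return None if the toy program passes, else a short error string.

    prog is a list of tuples: ("op",), ("jmp", off), ("jcond", off), ("exit",).
    Jump offsets are relative to the following instruction.
    """
    n = len(prog)
    seen = set()
    stack = [0]
    while stack:
        i = stack.pop()
        if i in seen:
            continue
        if not 0 <= i < n:
            return f"control flow leaves the program at index {i}"
        seen.add(i)
        kind = prog[i][0]
        if kind == "exit":
            continue
        if kind in ("jmp", "jcond"):
            off = prog[i][1]
            if off < 0:
                return f"back-edge at insn {i}"
            stack.append(i + 1 + off)
        if kind != "jmp":
            # Plain ops and conditional jumps can fall through.
            stack.append(i + 1)
    if len(seen) != n:
        return f"unreachable insn {min(set(range(n)) - seen)}"
    return None


ok = [("op",), ("jcond", 1), ("op",), ("exit",)]
loop = [("op",), ("jmp", -2), ("exit",)]
dead = [("jmp", 1), ("op",), ("exit",)]
print(check_cfg_toy(ok), "|", check_cfg_toy(loop), "|", check_cfg_toy(dead))
```

A program that simply runs off its end without an `exit` is caught by the same range check that catches out-of-range jumps.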
Pre-verifier BPF bytecode
    bf 16 00 00 00 00 00 00
    b7 01 00 00 00 00 00 00
    7b 1a f0 ff 00 00 00 00
    85 00 00 00 0e 00 00 00
    77 00 00 00 20 00 00 00
    7b 0a f8 ff 00 00 00 00
    18 17 00 00 30 00 00 00
    00 00 00 00 00 00 00 00
    85 00 00 00 08 00 00 00
    bf a4 00 00 00 00 00 00
    07 04 00 00 f0 ff ff ff
    bf 61 00 00 00 00 00 00
    bf 72 00 00 00 00 00 00
    bf 03 00 00 00 00 00 00
    b7 05 00 00 10 00 00 00
    85 00 00 00 19 00 00 00
    b7 00 00 00 00 00 00 00
    95 00 00 00 00 00 00 00
slide 69:
Post-verifier BPF bytecode
    bf 16 00 00 00 00 00 00
    b7 01 00 00 00 00 00 00
    7b 1a f0 ff 00 00 00 00
    85 00 00 00 d0 81 01 00
    77 00 00 00 20 00 00 00
    7b 0a f8 ff 00 00 00 00
    18 17 00 00 18 00 00 00
    00 00 00 00 00 00 00 00
    85 00 00 00 f0 80 01 00
    bf a4 00 00 00 00 00 00
    07 04 00 00 f0 ff ff ff
    bf 61 00 00 00 00 00 00
    bf 72 00 00 00 00 00 00
    bf 03 00 00 00 00 00 00
    b7 05 00 00 10 00 00 00
    85 00 00 00 30 2c ff ff
    b7 00 00 00 00 00 00 00
    95 00 00 00 00 00 00 00
E.g., call get_current_pid_tgid (85 00 00 00 d0 81 01 00): the helper index value has become an instruction offset address from __bpf_call_base
slide 71:
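The rewritten immediates above are ordinary signed 32-bit little-endian integers, which a few lines of Python can confirm. The helper `call_imm` is an illustrative name; the sketch only reads the last four bytes of a call instruction and makes no claim about how the kernel computes the offsets.

```python
import struct


def call_imm(insn_hex):
    """Return the signed 32-bit immediate of one 8-byte instruction given as hex."""
    raw = bytes.fromhex(insn_hex)
    if len(raw) != 8:
        raise ValueError("expected one 8-byte instruction")
    return struct.unpack_from("<i", raw, 4)[0]


print(call_imm("85 00 00 00 0e 00 00 00"))  # pre-verifier helper number
print(call_imm("85 00 00 00 d0 81 01 00"))  # post-verifier offset
print(call_imm("85 00 00 00 f0 80 01 00"))
print(call_imm("85 00 00 00 30 2c ff ff"))  # negative offset
```

The values 98768, 98544 and -54224 are the same numbers bpftool prints after the `#` in `call bpf_get_current_pid_tgid#98768` and the other two calls on the following slides.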
BPF bytecode with human words
    # bpftool prog show
    […]
    70: kprobe  name do_nanosleep  tag 8dc93a3b6a21ef3b  gpl
            loaded_at 2021-05-02T00:44:26+0000  uid 0
            xlated 144B  jited 96B  memlock 4096B  map_ids 24
    # bpftool prog dump xlated id 70 opcodes
       0: (bf) r6 = r1
           bf 16 00 00 00 00 00 00
       1: (b7) r1 = 0
           b7 01 00 00 00 00 00 00
       2: (7b) *(u64 *)(r10 -16) = r1
           7b 1a f0 ff 00 00 00 00
       3: (85) call bpf_get_current_pid_tgid#98768
           85 00 00 00 d0 81 01 00
       4: (77) r0 >>= 32
           77 00 00 00 20 00 00 00
       5: (7b) *(u64 *)(r10 -8) = r0
           7b 0a f8 ff 00 00 00 00
       6: (18) r7 = map[id:24]
           18 17 00 00 18 00 00 00 00 00 00 00 00 00 00 00
    [...]
Using bpftool on a running instance of the program
slide 73:
BPF bytecode, opcodes only
    # bpftool prog dump xlated id 70
       0: (bf) r6 = r1
       1: (b7) r1 = 0
       2: (7b) *(u64 *)(r10 -16) = r1
       3: (85) call bpf_get_current_pid_tgid#98768
       4: (77) r0 >>= 32
       5: (7b) *(u64 *)(r10 -8) = r0
       6: (18) r7 = map[id:24]
       8: (85) call bpf_get_smp_processor_id#98544
       9: (bf) r4 = r10
      10: (07) r4 += -16
      11: (bf) r1 = r6
      12: (bf) r2 = r7
      13: (bf) r3 = r0
      14: (b7) r5 = 16
      15: (85) call bpf_perf_event_output#-54224
      16: (b7) r0 = 0
      17: (95) exit
The value in parentheses is just the 8-bit opcode
slide 74:
bpftrace mid-level internals 9/13
slide 75:
BPF bytecode → native machine code
BPF bytecode → arch/x86/net/bpf_jit_comp.c → x86 machine code
BPF bytecode → arch/arm64/net/bpf_jit_comp.c → arm machine code
BPF bytecode → arch/sparc/net/bpf_jit_* → sparc machine code
slide 76:
BPF bytecode → x86 machine code
Linux arch/x86/net/bpf_jit_comp.c
    static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
                      int oldproglen, struct jit_context *ctx)
    {
    […]
        for (i = 1; i <= insn_cnt; i++, insn++) {
    […]
            switch (insn->code) {
    […]
            case BPF_JMP | BPF_CALL:
                func = (u8 *) __bpf_call_base + imm32;
                if (!imm32 || emit_call(&prog, func, image + addrs[i - 1]))
                    return -EINVAL;
                break;
    […]
    static int emit_call(u8 **pprog, void *func, void *ip)
    {
        return emit_patch(pprog, func, ip, 0xE8);
    }
0xe8 is x86 CALL
slide 78:
Now you have x86 machine code!
    # bpftool prog dump jited id 80 opcodes | grep -v :
    48 89 e5
    48 81 ec 10 00 00 00
    41 55
    41 56
    41 57
    6a 00
    48 89 fb
    31 ff
    48 89 7d f0
    e8 a0 8b 44 c2        CALL get_current_pid_tgid
    48 c1 e8 20
    48 89 45 f8
    49 bd 00 b6 7b b8 3a 9d ff ff
    e8 a9 8a 44 c2
    48 89 e9
    48 83 c1 f0
    48 89 df
    [...]
31 instructions
slide 80:
… or you have ARM machine code!
    # bpftool prog dump jited id 71 opcodes | grep -v :
    fd 7b bf a9
    fd 03 00 91
    f3 53 bf a9
    f5 5b bf a9
    f9 6b bf a9
    f9 03 00 91
    1a 00 80 d2
    ff 43 00 d1
    13 00 00 91
    00 00 80 d2
    ea 01 80 92
    20 6b 2a f8
    ea 48 9e 92
    0a 05 a2 f2
    0a 00 d0 f2
    40 01 3f d6
    07 00 00 91
    e7 fc 60 d3
    ea 00 80 92
    [...]
48 instructions
slide 81:
x86 instruction disassembly
    # bpftool prog dump jited id 80
    0xffffffffc0192b2e:
    BPF prologue:
       0: push   %rbp
       1: mov    %rsp,%rbp
       4: sub    $0x10,%rsp
       b: push   %rbx
       c: push   %r13
       e: push   %r14
      10: push   %r15
      12: pushq  $0x0
    BPF program:
      14: mov    %rdi,%rbx
      17: xor    %edi,%edi
      19: mov    %rdi,-0x10(%rbp)
      1d: callq  0xffffffffc2448bc2        (get_current_pid_tgid)
      22: shr    $0x20,%rax
      26: mov    %rax,-0x8(%rbp)
      2a: movabs $0xffff9d3ab87bb600,%r13
      34: callq  0xffffffffc2448ae2
      39: mov    %rbp,%rcx
      3c: add    $0xfffffffffffffff0,%rcx
    [...]
slide 83:
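The `callq` targets in the disassembly above can be reproduced from the raw bytes on the machine-code slide. An x86 `e8` call carries a signed 32-bit displacement relative to the address of the next instruction. The Python sketch below applies that rule using the instruction offsets shown by bpftool; that bpftool computes the displayed targets relative to offset 0 of the dump is an inference from the numbers matching, not something the slides state. `call_target` is an illustrative name.

```python
def call_target(insn_offset, insn_hex):
    """Compute the 64-bit target of a 5-byte x86 'e8 rel32' call instruction."""
    raw = bytes.fromhex(insn_hex)
    if len(raw) != 5 or raw[0] != 0xE8:
        raise ValueError("expected a 5-byte e8 call instruction")
    rel32 = int.from_bytes(raw[1:], "little", signed=True)
    # The displacement is relative to the end of the call instruction.
    return (insn_offset + len(raw) + rel32) & 0xFFFFFFFFFFFFFFFF


print(hex(call_target(0x1D, "e8 a0 8b 44 c2")))
print(hex(call_target(0x34, "e8 a9 8a 44 c2")))
```

Both results agree with the disassembly (`0xffffffffc2448bc2` and `0xffffffffc2448ae2`), which ties the `0xE8` byte emitted by `emit_call()` to the calls seen in the JIT output.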
Plus you have BPF helper code
Linux kernel/bpf/helpers.c
    BPF_CALL_0(bpf_get_current_pid_tgid)
    {
        struct task_struct *task = current;

        if (unlikely(!task))
            return -EINVAL;

        return (u64) task->tgid << 32 | task->pid;
    }
    [...]
slide 84:
bpftrace mid-level internals 10/13
slide 85:
Attaching BPF to a kprobe
# strace -fe perf_event_open,bpf,ioctl bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'
[...]
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_KPROBE, insn_cnt=18, insns=0x7f6e826cf000, license="GPL", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(5, 8, 18), prog_flags=0, prog_name="do_nanosleep", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS, prog_btf_fd=0, func_info_rec_size=0, func_info=NULL, func_info_cnt=0, line_info_rec_size=0, line_info=NULL, line_info_cnt=0, attach_btf_id=0, attach_prog_fd=0}, 120) = 14
perf_event_open({type=0x6 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER5, config=0, ...}, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 13
ioctl(13, PERF_EVENT_IOC_SET_BPF, 14) = 0
ioctl(13, PERF_EVENT_IOC_ENABLE, 0) = 0

Annotations:
  bpf() returns the BPF program (#14)
  perf_event_open() creates the kprobe (#13)
  ioctl() connects the BPF program (#14) to the kprobe (#13)
  strace is lacking some translation
slide 87:
bpftrace mid-level internals 11/13
slide 88:
kprobes
How do we instrument this?
Linux kernel/time/hrtimer.c

static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode)
{
        struct restart_block *restart;

        do {
                set_current_state(TASK_INTERRUPTIBLE);
                hrtimer_sleeper_start_expires(t, mode);

                if (likely(t->task))
[...]

(it’s actually quite easy)
slide 89:
Instrumenting live kernel functions
(gdb) disas/r do_nanosleep
Dump of assembler code for function do_nanosleep:
   0xffffffff81b7d810 <+0>:     e8 1b 00 4f ff  callq  0xffffffff8106d830     this is usually nop’d out
   0xffffffff81b7d815 <+5>:     55              push   %rbp
   0xffffffff81b7d816 <+6>:     89 f1           mov    %esi,%ecx
   0xffffffff81b7d818 <+8>:     48 89 e5        mov    %rsp,%rbp
   0xffffffff81b7d81b <+11>:    41 55           push   %r13
   0xffffffff81b7d81d <+13>:    41 54           push   %r12
   0xffffffff81b7d81f <+15>:    53              push   %rbx
[...]

A) Ftrace is already there. Kprobes can add a handler.
B) Or a breakpoint is written (e.g., int3).
C) Or a jmp is written.
May need to stop_machine() to ensure other cores don’t execute changing instruction text.
slide 91:
bpftrace mid-level internals 12/13
slide 92:
Perf output buffers
# strace -fe perf_event_open bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'
strace: Process 3229968 attached
[pid 3229968] +++ exited with 0 +++
perf_event_open({type=PERF_TYPE_SOFTWARE, ..., config=PERF_COUNT_SW_BPF_OUTPUT, ...}, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 5
perf_event_open({type=PERF_TYPE_SOFTWARE, ..., config=PERF_COUNT_SW_BPF_OUTPUT, ...}, -1, 1, -1, PERF_FLAG_FD_CLOEXEC) = 6
perf_event_open({type=PERF_TYPE_SOFTWARE, ..., config=PERF_COUNT_SW_BPF_OUTPUT, ...}, -1, 2, -1, PERF_FLAG_FD_CLOEXEC) = 7
[...]

The third argument to perf_event_open() is the CPU ID. There is one output buffer per CPU:
  CPU 0 -> output buffer 5 -> fd 5 -> bpftrace
  CPU 1 -> output buffer 6 -> fd 6 -> bpftrace
  CPU 2 -> output buffer 7 -> fd 7 -> bpftrace
bpftrace waits for events using epoll_wait(2)
slide 94:
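The "one fd per CPU, wait on all of them" pattern from the perf output buffers slide can be sketched as follows. This is a rough stand-in only: ordinary pipes play the role of the per-CPU buffers (the real ones are memory-mapped perf ring buffers, which this does not model), the CPU count of 3 is an assumption taken from the three fds in the strace output, and `poll_once` is a hypothetical name. Python's `selectors` module is used as the readiness API; on Linux it is typically backed by epoll.

```python
import os
import selectors

NUM_CPUS = 3  # assumption: matches the three buffers in the strace output

sel = selectors.DefaultSelector()
pipes = {}
for cpu in range(NUM_CPUS):
    read_fd, write_fd = os.pipe()        # stand-in for one per-CPU output buffer
    pipes[cpu] = (read_fd, write_fd)
    sel.register(read_fd, selectors.EVENT_READ, data=cpu)


def poll_once(timeout=1.0):
    """Wait until at least one buffer is readable, then drain the ready ones."""
    events = []
    for key, _mask in sel.select(timeout):
        events.append((key.data, os.read(key.fd, 4096)))
    return sorted(events)


# Simulate an event arriving on CPU 1's buffer.
os.write(pipes[1][1], b"PID 10287 sleeping...\n")
result = poll_once()
print(result)

for read_fd, write_fd in pipes.values():
    sel.unregister(read_fd)
    os.close(read_fd)
    os.close(write_fd)
sel.close()
```

A single blocking wait covers every CPU, so the consumer sleeps until some buffer has data rather than polling each one in turn.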
bpftrace mid-level internals 13/13
slide 95:
bpftrace printf & async actions
bpftrace perf output message format:

  printf_id (64-bit) | arguments

E.g., printf_id 0: "PID %d sleeping...\n"
High numbers are used for other async actions:

bpftrace src/types.h

enum class AsyncAction
{
  // clang-format off
  printf  = 0,     // printf reserves 0-9999 for printf_ids
  syscall = 10000, // system reserves 10000-19999 for printf_ids
  cat     = 20000, // cat reserves 20000-29999 for printf_ids
  exit    = 30000,
  print,
  clear,
[...]
slide 96:
bpftrace printf & async actions (2)
bpftrace src/bpftrace.cpp

void perf_event_printer(void *cb_cookie, void *data, int size)
{
  […]
  auto printf_id = *reinterpret_cast<uint64_t *>(arg_data);
  […]
  // async actions
  if (printf_id == asyncactionint(AsyncAction::exit))
  {
    bpftrace->request_finalize();
    return;
  }
  […]
  // printf
  auto fmt = std::get<0>(bpftrace->printf_args_[printf_id]);
  auto args = std::get<1>(bpftrace->printf_args_[printf_id]);
  auto arg_values = bpftrace->get_arg_values(args, arg_data);
  bpftrace->out_->message(MessageType::printf, format(fmt, arg_values), false);

message() just prints it out
slide 97:
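The message handling in the two printf slides can be modelled as a small decoder: read the 64-bit id, treat the exit id as a control message, and otherwise look up the format string and apply it to the arguments. This is a sketch under assumptions, not bpftrace's code: the little-endian layout, the "every argument is a signed 64-bit integer" rule, and the names `PRINTF_FORMATS` and `handle_message` are all illustrative. The id ranges come from the AsyncAction enum on the slide.

```python
import struct

# printf_id -> format string, as registered when the program was compiled.
PRINTF_FORMATS = {0: "PID %d sleeping...\n"}

PRINTF_ID_LIMIT = 10000   # ids 0-9999 are printf ids (from the enum comments)
ASYNC_EXIT = 30000        # AsyncAction::exit on the slide


def handle_message(data: bytes):
    """Decode one perf output message: a 64-bit id followed by arguments."""
    (printf_id,) = struct.unpack_from("<Q", data, 0)
    if printf_id == ASYNC_EXIT:
        return ("exit", None)
    if printf_id < PRINTF_ID_LIMIT:
        fmt = PRINTF_FORMATS[printf_id]
        # Assumption: each argument is a signed 64-bit little-endian integer.
        nargs = (len(data) - 8) // 8
        args = struct.unpack_from("<%dq" % nargs, data, 8)
        return ("printf", fmt % args)
    return ("other", printf_id)


message = struct.pack("<Qq", 0, 10287)
print(handle_message(message))
print(handle_message(struct.pack("<Q", ASYNC_EXIT)))
```

Keeping the format strings in user space means the kernel side only ships an id and raw argument values, so each event record stays small.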
Final output
# bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'
Attaching 1 probe...
PID 10287 sleeping...
PID 10297 sleeping...
PID 10287 sleeping...
PID 10297 sleeping...
PID 10287 sleeping...
PID 2218 sleeping...
PID 10297 sleeping...
[...]
slide 98:
2. Static tracing and map summaries
slide 99:
2. Static tracing and map summaries
bpftrace -e 'tracepoint:block:block_rq_issue { @[comm] = count(); }'
slide 100:
Example output
# bpftrace -e 'tracepoint:block:block_rq_issue { @[comm] = count(); }'
Attaching 1 probe...
@[kworker/2:2H]: 131
@[chrome]: 135
@[kworker/7:1H]: 185
@[Xorg]: 245
@[tar]: 1204
@[dmcrypt_write/2]: 1993
slide 101:
bpftrace mid-level internals 1/4
slide 102:
Lexer & Yacc
bpftrace src/lexer.l

map      @{ident}|@
[...]

bpftrace src/parser.yy

%token <std::string> MAP "map"
[...]
stmt : expr                 { $$ = new ast::ExprStatement($1, @1); }
     | compound_assignment  { $$ = $1; }
     | jump_stmt            { $$ = $1; }
     | map "=" expr         { $$ = new ast::AssignMapStatement($1, $3, false, @2); }
[...]
map : MAP                   { $$ = new ast::Map($1, @$); }
    | MAP "[" vargs "]"     { $$ = new ast::Map($1, $3, @$); }
slide 103:
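To see what the lexer rule `@{ident}|@` accepts, the same pattern can be tried as a regular expression. This is a sketch only: the definition of `ident` is not shown on the slide, so the C-like identifier pattern below is an assumption, and `find_maps` is a hypothetical helper rather than anything in bpftrace.

```python
import re

# Assumption: identifiers follow the usual C-like pattern; the slide does not
# show how `ident` is defined in src/lexer.l.
IDENT = r"[_a-zA-Z][_a-zA-Z0-9]*"

# `@{ident}|@` : an at-sign followed by a name, or a bare at-sign.
MAP_TOKEN = re.compile(r"@(?:%s)?" % IDENT)


def find_maps(program: str):
    """Return every map token found in a program string."""
    return MAP_TOKEN.findall(program)


print(find_maps("tracepoint:block:block_rq_issue { @[comm] = count(); }"))
print(find_maps("kprobe:vfs_read { @reads[pid] = count(); }"))
```

In the one-liner traced in this section the map is the anonymous `@`, and the `[comm]` that follows is handled by the `MAP "[" vargs "]"` grammar rule rather than by the lexer.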
bpftrace mid-level internals 2/4
slide 104:
BPF maps
Custom data storage
Can be a key/value store (hash)
Also used for histogram summaries (BPF code calculates a bucket index as the key)

Diagram: keys (tar, chrome, Xorg) mapped to values
slide 105:
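The histogram remark on the BPF maps slide means the map key is a small integer bucket number rather than the raw value. A minimal sketch of one such scheme, power-of-two buckets, is below. The exact index calculation is an assumption for illustration and may differ from what bpftrace emits; `log2_bucket` and the sample values are made up.

```python
from collections import Counter


def log2_bucket(value: int) -> int:
    """Bucket 0 holds values <= 0; bucket k holds values in [2^(k-1), 2^k - 1]."""
    if value <= 0:
        return 0
    return value.bit_length()


# The Counter plays the role of the map: key = bucket index, value = count.
hist = Counter()
for sample in (1, 3, 4, 7, 8, 900, 1000):
    hist[log2_bucket(sample)] += 1

for index in sorted(hist):
    low = 0 if index == 0 else 1 << (index - 1)
    high = 0 if index == 0 else (1 << index) - 1
    print(f"[{low}, {high}]: {hist[index]}")
```

Because only the bucket index is stored, the map stays tiny no matter how many events are recorded, and user space only has to read a handful of entries to draw the histogram.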
BPF map operations

User-space (bpf(2) syscall)      BPF-space, kernel (BPF helpers)
BPF_MAP_CREATE
BPF_MAP_LOOKUP_ELEM              bpf_map_lookup_elem()
BPF_MAP_UPDATE_ELEM              bpf_map_update_elem()
BPF_MAP_DELETE_ELEM              bpf_map_delete_elem()
BPF_MAP_GET_NEXT_KEY             [...]
[...]
slide 106:
Creating BPF maps
# strace -febpf bpftrace -e 'block:block_rq_issue { @[comm] = count(); }'
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_ARRAY, key_size=4, value_size=4, max_entries=1, map_flags=0, inner_map_fd=0, map_name="", map_ifindex=0, btf_fd=0, btf_key_type_id=0, btf_value_type_id=0}, 120) = 3
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERCPU_HASH, key_size=16, value_size=8, max_entries=4096, map_flags=0, inner_map_fd=0, map_name="@", map_ifindex=0, btf_fd=0, btf_key_type_id=0, btf_value_type_id=0}, 120) = -1 EINVAL (Invalid argument)
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERCPU_HASH, key_size=16, value_size=8, max_entries=4096, map_flags=0, inner_map_fd=0, map_name="", map_ifindex=0, btf_fd=0, btf_key_type_id=0, btf_value_type_id=0}, 120) = 3
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERF_EVENT_ARRAY, key_size=4, value_size=4, max_entries=8, map_flags=0, inner_map_fd=0, map_name="printf", map_ifindex=0, btf_fd=0, btf_key_type_id=0, btf_value_type_id=0}, 120) = 4
[...]
slide 108:
bpftrace mid-level internals 3/4
slide 109:
Tracepoint defined
Linux include/trace/events/block.h

DECLARE_EVENT_CLASS(block_rq,

        TP_PROTO(struct request_queue *q, struct request *rq),

        TP_ARGS(q, rq),

        TP_STRUCT__entry(
                __field(  dev_t,         dev                     )
                __field(  sector_t,      sector                  )
                __field(  unsigned int,  nr_sector               )
                __field(  unsigned int,  bytes                   )
                __array(  char,          rwbs,   RWBS_LEN        )
                __array(  char,          comm,   TASK_COMM_LEN   )
                __dynamic_array( char,   cmd,    1               )
        ),

        TP_fast_assign(
                __entry->dev    = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
                __entry->sector = blk_rq_trace_sector(rq);
[...]

DEFINE_EVENT(block_rq, block_rq_issue,

        TP_PROTO(struct request_queue *q, struct request *rq),

        TP_ARGS(q, rq)
);
slide 110:
Tracepoints in code
Linux block/blk-mq.c

void blk_mq_start_request(struct request *rq)
{
        struct request_queue *q = rq->q;

        trace_block_rq_issue(q, rq);

        if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
                rq->io_start_time_ns = ktime_get_ns();
                rq->stats_sectors = blk_rq_sectors(rq);
[...]

This is a (best effort) stable interface
Use tracepoints instead of kprobes when possible!
slide 111:
Instrumenting tracepoints
How do we include the tracepoint without adding overhead?

(gdb) disas/r blk_mq_start_request
Dump of assembler code for function blk_mq_start_request:
   0xffffffff815118e0 <+0>:     e8 4b bf b5 ff  callq  0xffffffff8106d830
   0xffffffff815118e5 <+5>:     55              push   %rbp
   0xffffffff815118e6 <+6>:     48 89 e5        mov    %rsp,%rbp
   0xffffffff815118e9 <+9>:     41 55           push   %r13
   0xffffffff815118eb <+11>:    41 54           push   %r12
   0xffffffff815118ed <+13>:    49 89 fc        mov    %rdi,%r12
   0xffffffff815118f0 <+16>:    53              push   %rbx
   0xffffffff815118f1 <+17>:    4c 8b 2f        mov    (%rdi),%r13
   0xffffffff815118f4 <+20>:    0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
   0xffffffff815118f9 <+25>:    49 8b 45 60     mov    0x60(%r13),%rax
[...]

(this is actually quite easy)
The 5-byte nop at <+20> is a placeholder. It does nothing, quickly.
When the tracepoint is enabled, the nop becomes a jmp to the tracepoint trampoline.
slide 113:
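The nop-to-jmp rewrite can be worked through on a byte buffer. This is a sketch of the encoding arithmetic only, not of how the kernel safely patches live text: the trampoline address is hypothetical, and `encode_rel32` and `patch` are made-up helper names. The instruction bytes and addresses are taken from the gdb listing, and the same arithmetic reproduces the `callq` bytes at the top of that listing.

```python
import struct

NOP5 = bytes.fromhex("0f1f440000")   # the 5-byte nopl from the listing


def encode_rel32(opcode: int, site_addr: int, target_addr: int) -> bytes:
    """Encode a 5-byte x86-64 instruction: opcode + signed 32-bit displacement.

    The displacement is relative to the address of the next instruction.
    """
    rel = target_addr - (site_addr + 5)
    return bytes([opcode]) + struct.pack("<i", rel)


def patch(text: bytearray, offset: int, new: bytes) -> None:
    """Overwrite the nop placeholder at `offset` with a 5-byte instruction."""
    if bytes(text[offset:offset + 5]) != NOP5:
        raise ValueError("expected a 5-byte nop placeholder")
    text[offset:offset + 5] = new


SITE = 0xFFFFFFFF815118F4         # address of the nopl in the listing
TRAMPOLINE = 0xFFFFFFFF81511A00   # hypothetical trampoline address

# Sanity check against the listing: the callq at <+0> uses opcode 0xe8.
call_bytes = encode_rel32(0xE8, 0xFFFFFFFF815118E0, 0xFFFFFFFF8106D830)
print(call_bytes.hex())

# The nop followed by the next instruction's bytes (49 8b 45 60).
text = bytearray(NOP5 + bytes.fromhex("498b4560"))
patch(text, 0, encode_rel32(0xE9, SITE, TRAMPOLINE))   # 0xe9 = jmp rel32
print(text.hex())
```

The placeholder is five bytes precisely so that a `jmp rel32` fits in its place without moving any of the surrounding instructions.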
bpftrace mid-level internals 4/4
slide 114:
User-space map iteration
bpftrace src/bpftrace.cpp

int BPFtrace::print_map(IMap &map, uint32_t top, uint32_t div)
{
  […]
  while (bpf_get_next_key(map.mapfd_, old_key.data(), key.data()) == 0)
  {
    int value_size = map.type_.GetSize();
    value_size *= nvalues;
    auto value = std::vector<uint8_t>(value_size);
    int err = bpf_lookup_elem(map.mapfd_, key.data(), value.data());
    if (err == -1)
    {
      // key was removed by the eBPF program during bpf_get_next_key() and bpf_lookup_elem(),
      // let's skip this key
      continue;
    }
    else if (err)
    {
      LOG(ERROR) [...]
slide 115:
Reading entire BPF maps
# strace -febpf bpftrace -e 'block:block_rq_issue { @[comm] = count(); }'
[...]
bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=3, key=0x557e422133e0, value=0x557e4221eab0, flags=BPF_ANY}, 120) = -1 ENOENT (No such file or directory)
bpf(BPF_MAP_GET_NEXT_KEY, {map_fd=3, key=0x557e422133e0, next_key=0x557e4224d3f0}, 120) = 0
bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=3, key=0x557e4224d3f0, value=0x557e4221eab0, flags=BPF_ANY}, 120) = 0
bpf(BPF_MAP_GET_NEXT_KEY, {map_fd=3, key=0x557e422133e0, next_key=0x557e4224d3f0}, 120) = 0
bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=3, key=0x557e4224d3f0, value=0x557e4221eab0, flags=BPF_ANY}, 120) = 0
bpf(BPF_MAP_GET_NEXT_KEY, {map_fd=3, key=0x557e422133e0, next_key=0x557e4224d3f0}, 120) = 0
bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=3, key=0x557e4224d3f0, value=0x557e4221eab0, flags=BPF_ANY}, 120) = 0
bpf(BPF_MAP_GET_NEXT_KEY, {map_fd=3, key=0x557e422133e0, next_key=0x557e4224d3f0}, 120) = 0
bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=3, key=0x557e4224d3f0, value=0x557e4221eab0, flags=BPF_ANY}, 120) = 0
bpf(BPF_MAP_GET_NEXT_KEY, {map_fd=3, key=0x557e422133e0, next_key=0x557e4224d3f0}, 120) = 0
bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=3, key=0x557e4224d3f0, value=0x557e4221eab0, flags=BPF_ANY}, 120) = 0
bpf(BPF_MAP_GET_NEXT_KEY, {map_fd=3, key=0x557e422133e0, next_key=0x557e4224d3f0}, 120) = 0
bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=3, key=0x557e4224d3f0, value=0x557e4221eab0, flags=BPF_ANY}, 120) = 0
[...]
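The get-next-key / lookup loop seen in both the C++ excerpt and the strace output can be modelled with a dictionary standing in for the kernel map. This is a simulation under assumptions, not the bpf(2) interface: `FakeMap` and `read_all` are hypothetical names, the per-CPU values are made-up numbers, and summing the per-CPU slots in user space is an assumption about how a per-CPU count is turned into one total.

```python
class FakeMap:
    """Dict-backed stand-in for a per-CPU hash map (not a real BPF map)."""

    def __init__(self, entries):
        self._entries = dict(entries)   # key -> list of per-CPU values

    def get_next_key(self, key):
        """Key after `key`; the first key if `key` is unknown; None at the end."""
        keys = list(self._entries)
        if key not in self._entries:
            return keys[0] if keys else None
        index = keys.index(key) + 1
        return keys[index] if index < len(keys) else None

    def lookup(self, key):
        """Per-CPU values for `key`, or None (models ENOENT)."""
        return self._entries.get(key)


def read_all(bpf_map):
    """Walk the whole map: one get_next_key and one lookup per entry."""
    totals = {}
    key = bpf_map.get_next_key(None)
    while key is not None:
        values = bpf_map.lookup(key)
        if values is not None:          # the key may have been removed meanwhile
            totals[key] = sum(values)   # assumption: per-CPU slots are summed
        key = bpf_map.get_next_key(key)
    return totals


counts = read_all(FakeMap({"tar": [1000, 204], "Xorg": [200, 45], "chrome": [135, 0]}))
for name, total in sorted(counts.items(), key=lambda item: item[1]):
    print(f"@[{name}]: {total}")
```

Each entry costs two calls, which is why the strace output alternates between BPF_MAP_GET_NEXT_KEY and BPF_MAP_LOOKUP_ELEM.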
This is an infrequent activity (this program only does this once)
slide 116:
Final output
# bpftrace -e 'tracepoint:block:block_rq_issue { @[comm] = count(); }'
Attaching 1 probe...
@[kworker/2:2H]: 131
@[chrome]: 135
@[kworker/7:1H]: 185
@[Xorg]: 245
@[tar]: 1204
@[dmcrypt_write/2]: 1993
slide 117:
Discussion of other internals
Stack walking
BTF
CO-RE
Tests
raw_tracepoints & fentry
Networking & XDP
Security & Cgroups
slide 118:
BPF tracing/observability high-level recap
From: BPF Performance Tools, Figure 2-1
slide 119:
BPF mid-level internals recap
From: BPF Performance Tools, Figure 2-3
slide 120:
PSA
CONFIG_DEBUG_INFO_BTF=y
E.g., Ubuntu 20.10, Fedora 30, and RHEL 8.2 have it
slide 121:
References
This is also where I recommend you go to learn more:
https://events.static.linuxfound.org/sites/events/files/slides/bpf_collabsummit_2015feb20.pdf
Linux include/uapi/linux/bpf_common.h
Linux include/uapi/linux/bpf.h
Linux include/uapi/linux/filter.h
https://docs.cilium.io/en/v1.9/bpf/#bpf-guide
BPF Performance Tools, Addison-Wesley 2020
https://ebpf.io/what-is-ebpf
http://www.brendangregg.com/ebpf.html
https://github.com/iovisor/bcc
https://github.com/iovisor/bpftrace
slide 122:
Thanks
BPF: Alexei Starovoitov (Facebook), Daniel Borkmann (Isovalent), David S. Miller (Red Hat), Jakub Kicinski (Facebook), Yonghong Song (Facebook), Martin KaFai Lau (Facebook), John Fastabend (Isovalent), Quentin Monnet (Isovalent), Jesper Dangaard Brouer (Red Hat), Andrey Ignatov (Facebook), Stanislav Fomichev (Google), Linus Torvalds, and many more in the BPF community
LLVM BPF: Alexei Starovoitov, Chandler Carruth (Google), Yonghong Song, and more
bpftrace: Alastair Robertson (Yellowbrick Data), Dan Xu (Facebook), Bas Smit, Mary Marchini (Netflix), Masanori Misono, Jiri Olsa, Viktor Malík, Dale Hamel, Willian Gaspar, Augusto Mecking Caringi, and many more in the bpftrace community
USENIX
https://ebpf.io
Jun, 2021