YOW! 2020: Linux Systems Performance
Talk by Brendan Gregg for YOW! 2020. Video: https://www.youtube.com/watch?v=_Lf-h5TDTN0
Description: "Systems performance studies the performance of computing systems, including all physical components and the full software stack to help you find performance wins for your application and kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes the topic for everyone, touring six important areas: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events) and tracing (ftrace, bcc/BPF, and bpftrace/BPF), advice about what is and isn't important to learn, and case studies to see how it is applied. This talk is aimed at everyone: developers, operations, sysadmins, etc, and in any environment running Linux, bare metal or the cloud."
PDF: YOW2020_Linux_Systems_Performance.pdf
Keywords (from pdftotext):
slide 1:
Sep, 2020
Linux Systems Performance
Brendan Gregg, Senior Performance Engineer
Auckland, Perth, Singapore, Hong Kong
slide 2:
What I’m currently working on
slide 3:
Application Request Time: On-CPU Time + Off-CPU Time
slide 4:
Application Request Time: On-CPU Time (CPU Flame Graph) + Off-CPU Time (Off-CPU Flame Graph)
slide 5:
Off-CPU stacks often end abruptly in libc:
7fffc102ca03 nf_conntrack_in ([kernel.kallsyms])
7fffc10d341c ipv4_conntrack_local ([kernel.kallsyms])
7fff9deb09d8 nf_hook_slow ([kernel.kallsyms])
7fff9debe6c7 __ip_local_out ([kernel.kallsyms])
7fff9debe74c ip_local_out ([kernel.kallsyms])
7fff9debeab0 ip_queue_xmit ([kernel.kallsyms])
7fff9ded8e02 __tcp_transmit_skb ([kernel.kallsyms])
7fff9deda3f4 tcp_write_xmit ([kernel.kallsyms])
7fff9dedb215 __tcp_push_pending_frames ([kernel.kallsyms])
7fff9dec684b tcp_push ([kernel.kallsyms])
7fff9deca337 tcp_sendmsg_locked ([kernel.kallsyms])
7fff9decae5c tcp_sendmsg ([kernel.kallsyms])
7fff9defa8ee inet_sendmsg ([kernel.kallsyms])
7fff9de4a41e sock_sendmsg ([kernel.kallsyms])
7fff9de4a9af SYSC_sendto ([kernel.kallsyms])
7fff9de4b49e sys_sendto ([kernel.kallsyms])
7fff9d605bb3 do_syscall_64 ([kernel.kallsyms])
7fff9e002081 entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
119ae __libc_send (/lib/x86_64-linux-gnu/libpthread-2.27.so)
slide 6:
More than one way to walk a stack
• Frame pointers
• Last branch record (LBR)
• Branch trace store (BTS)
• DWARF
• ORC
• Application exception handler
We’re currently rolling out our own build of libc with frame pointers (-fno-omit-frame-pointer)
slide 7:
Systems Performance in 45 mins
• This is slides + discussion
• For more detail and stand-alone texts:
slide 8:
Agenda
1. Observability
2. Methodologies
3. Benchmarking
4. Profiling
5. Tracing
6. Tuning
slide 9:
slide 10:
1. Observability
slide 11:
How do you measure these?
slide 12:
Linux Observability Tools
slide 13:
Why Learn Tools?
• Most analysis at Netflix is via GUIs
• Benefits of command-line tools:
– Helps you understand GUIs: they show the same metrics
– Often documented, unlike GUI metrics
– Often have useful options not exposed in GUIs
• Installing essential tools (something like):
$ sudo apt-get install sysstat bcc-tools bpftrace linux-tools-common \
    linux-tools-$(uname -r) iproute2 msr-tools
$ git clone https://github.com/brendangregg/msr-cloud-tools
$ git clone https://github.com/brendangregg/bpf-perf-tools-book
These are crisis tools and should be installed by default. In a performance meltdown you may be unable to install them.
slide 14:
uptime
• One way to print load averages:
$ uptime
07:42:06 up 8:16, 1 user, load average: 2.27, 2.84, 2.91
• A measure of resource demand: CPUs + disks
– Includes TASK_UNINTERRUPTIBLE state to show all demand types
– You can use BPF & off-CPU flame graphs to explain this state: http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
– PSI in Linux 4.20 shows CPU, I/O, and memory loads
• Exponentially-damped moving averages
– With time constants of 1, 5, and 15 minutes. See historic trend.
• Load > # of CPUs may mean CPU saturation
Don’t spend more than 5 seconds studying these
slide 15:
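On the uptime slide's "exponentially-damped moving averages": the sketch below shows how such an average responds to a constant demand, which is why the 1, 5, and 15 minute figures give a historic trend. It assumes a 5-second sampling interval and a load that starts at zero; the function names are hypothetical and this is an illustration, not the kernel's fixed-point code.

```python
import math

# Assumption: the load is sampled roughly every 5 seconds.
SAMPLE_INTERVAL_S = 5


def damped_load(prev_load, active_tasks, window_s, interval_s=SAMPLE_INTERVAL_S):
    """One update step of an exponentially-damped moving average."""
    decay = math.exp(-interval_s / window_s)
    return prev_load * decay + active_tasks * (1.0 - decay)


def simulate(active_tasks, seconds, window_s):
    """Start from an idle system and apply a constant demand for `seconds`."""
    load = 0.0
    for _ in range(seconds // SAMPLE_INTERVAL_S):
        load = damped_load(load, active_tasks, window_s)
    return load


if __name__ == "__main__":
    # One always-runnable task for 60 seconds, seen through each time constant.
    for window_s in (60, 300, 900):
        print(f"{window_s // 60:>2} min average: {simulate(1, 60, window_s):.2f}")
```

After one minute of constant demand the 1-minute average has only reached about 63% of the true value, so the three numbers are best read together as a rising or falling trend.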
top
• System and per-process interval summary:
$ top - 18:50:26 up 7:43, 1 user, load average: 4.11, 4.91, 5.22
Tasks: 209 total, 1 running, 206 sleeping, 0 stopped, 2 zombie
Cpu(s): 47.1%us, 4.0%sy, 0.0%ni, 48.4%id, 0.0%wa, 0.0%hi, 0.3%si, 0.2%st
Mem: 70197156k total, 44831072k used, 25366084k free, 36360k buffers
Swap: 0k total, 0k used, 0k free, 11873356k cached
PID USER VIRT RES SHR S %CPU %MEM 5738 apiprod 1386 apiprod 1 root 2 root […] 0 62.6g 29g 352m S 0 17452 1388 964 R 0 24340 2272 1340 S 0 S 417 44.2 0 0.0 0 0.0 0 0.0 TIME+ COMMAND 2144:15 java 0:00.02 top 0:01.51 init 0:00.00 kthreadd
• %CPU is summed across all CPUs
• Can miss short-lived processes (atop won’t)
slide 16:
htop
$ htop
1 [||||||||||70.0%] 13 [||||||||||70.6%]
2 [||||||||||68.7%] 14 [||||||||||69.4%]
3 [||||||||||68.2%] 15 [||||||||||68.5%]
4 [||||||||||69.3%] 16 [||||||||||69.2%]
5 [||||||||||68.0%] 17 [||||||||||67.6%]
[…]
25 [||||||||||69.7%] 37 [||||||||||66.6%]
26 [||||||||||67.7%] 38 [||||||||||66.0%]
27 [||||||||||68.8%] 39 [||||||||||73.3%]
28 [||||||||||67.6%] 40 [||||||||||67.0%]
29 [||||||||||70.1%] 41 [||||||||||66.5%]
Mem[||||||||||||||||||||||||||||||176G/187G]
Swp[ 0K/0K]
Tasks: 80, 3206 thr; 43 running
Load average: 36.95 37.19 38.29
Uptime: 01:39:36
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
4067 www-data 0 202G 173G 55392 S 3359 93.0 48h51:30 /apps/java/bin/java -Dnop -Djdk.map
6817 www-data 0 202G 173G 55392 R 56.9 93.0 48:37.89 /apps/java/bin/java -Dnop -Djdk.map
6826 www-data 0 202G 173G 55392 R 25.7 93.0 22:26.90 /apps/java/bin/java -Dnop -Djdk.map
6721 www-data 0 202G 173G 55392 S 25.0 93.0 22:05.51 /apps/java/bin/java -Dnop -Djdk.map
6616 www-data 0 202G 173G 55392 S 13.6 93.0 11:15.51 /apps/java/bin/java -Dnop -Djdk.map
[…]
F1Help F2Setup F3Search F4Filter F5Tree F6SortBy F7Nice - F8Nice + F9Kill F10Quit
Pros: configurable. Cons: misleading colors.
dstat is similar, and now dead (May 2019); see pcp-dstat
slide 17:
vmstat
• Virtual memory statistics and more:
$ vmstat -Sm 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---
r b swpd free buff cache cs us sy id wa
8 0 12 25 34 0 0 7 0 0 205 186 46 13 0 0 8 0 8 210 435 39 21 0 0 8 0 0 218 219 42 17 0 0
[…]
• USAGE: vmstat [interval [count]]
• The first output line has some summary-since-boot values
• High level CPU summary
– “r” is runnable tasks
slide 18:
iostat
• Block I/O (disk) stats. 1st output is since boot.
$ iostat -xz 1
Linux 5.0.21 (c099.xxxx) 06/24/19 _x86_64_ (32 CPU)
[...]
Device r/s w/s rkB/s wkB/s rrqm/s sda nvme3n1 20.39 293152.56 14758.05 nvme1n1 17.83 286402.15 13089.56 nvme0n1 19.70 258184.52 14218.55
wrqm/s %rrqm %wrqm \... 0.00 /... 0.00 18.81 \... 0.00 18.52 /... 0.00 19.51 \...
...\ r_await w_await aqu-sz rareq-sz wareq-sz svctm %util .../ ...\ .../ ...\
Slide annotations: Workload; Resulting Performance; Very useful set of stats
slide 19:
free
• Main memory usage:
$ free -m
total used free shared buff/cache available
Mem:
Swap:
• Recently added “available” column
– buff/cache: block device I/O cache + virtual page cache
– available: memory likely available to apps
– free: completely unused memory
slide 20:
strace
• System call tracer:
$ strace -tttT -p 313
1408393285.779746 getgroups(0, NULL) = 1
1408393285.779873 getgroups(1, [0]) = 1
1408393285.780797 close(3) = 0
1408393285.781338 write(1, "wow much syscall\n", 17wow much syscall
) = 17
• Translates syscall arguments
• Not all kernel requests (e.g., page faults)
• Currently has massive overhead (ptrace based)
– Can slow the target by > 100x. Skews measured time (-ttt, -T). http://www.brendangregg.com/blog/2014-05-11/strace-wow-much-syscall.html
• perf trace will replace it: uses a ring buffer & BPF
slide 21:
tcpdump
• Sniff network packets for post analysis:
$ tcpdump -i eth0 -w /tmp/out.tcpdump
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
^C7985 packets captured
8996 packets received by filter
1010 packets dropped by kernel
# tcpdump -nr /tmp/out.tcpdump | head
reading from file /tmp/out.tcpdump, link-type EN10MB (Ethernet)
20:41:05.038437 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 18...
20:41:05.038533 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 48...
20:41:05.038584 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 96...
[…]
• Study packet sequences with timestamps (us)
• CPU overhead optimized (socket ring buffers), but can still be significant. Use BPF in-kernel summaries instead.
slide 22:
nstat
• Replacement for netstat from iproute2
• Various network protocol statistics:
– -s won’t reset counters, otherwise intervals can be examined
– -d for daemon mode
• Linux keeps adding more counters
$ nstat -s
#kernel
IpInReceives IpInDelivers IpOutRequests
[...]
TcpActiveOpens TcpPassiveOpens TcpAttemptFails TcpEstabResets TcpInSegs TcpOutSegs TcpRetransSegs TcpOutRsts
[...]
slide 23:
slabtop
• Kernel slab allocator memory usage:
$ slabtop
Active / Total Objects (% used) : 4692768 / 4751161 (98.8%)
Active / Total Slabs (% used) : 129083 / 129083 (100.0%)
Active / Total Caches (% used) : 71 / 109 (65.1%)
Active / Total Size (% used) : 729966.22K / 738277.47K (98.9%)
Minimum / Average / Maximum Object : 0.01K / 0.16K / 8.00K
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
3565575 3565575 100% 0.10K 91425 365700K buffer_head
314916 314066 99% 0.19K 14996 59984K dentry
184192 183751 99% 0.06K 11512K kmalloc-64
138618 138618 100% 0.94K 130464K xfs_inode
138602 138602 100% 0.21K 29968K xfs_ili
102116 99012 96% 0.55K 58352K radix_tree_node
97482 49093 50% 0.09K 9284K kmalloc-96
22695 20777 91% 0.05K 1068K shared_policy_node
21312 21312 100% 0.86K 18432K ext4_inode_cache
16288 14601 89% 0.25K 4072K kmalloc-256
[…]
slide 24:
pcstat
• Show page cache residency by file:
# ./pcstat data0*
|----------+----------------+------------+-----------+---------|
| Name     | Size           | Pages      | Cached    | Percent |
|----------+----------------+------------+-----------+---------|
| data00   | 104857600      | 25600      | 25600     | 100.000 |
| data01   | 104857600      | 25600      | 25600     | 100.000 |
| data02   | 104857600      | 25600      | 4080      | 015.938 |
| data03   | 104857600      | 25600      | 25600     | 100.000 |
| data04   | 104857600      | 25600      | 16010     | 062.539 |
| data05   | 104857600      | 25600      | 0         | 000.000 |
|----------+----------------+------------+-----------+---------|
• Uses mincore(2) syscall. Used for database perf analysis.
slide 25:
docker stats
• Soft limits (cgroups) by container:
# docker stats
CONTAINER     CPU %    MEM USAGE / LIMIT      MEM %   NET I/O    BLOCK I/O        PIDS
353426a09db1  526.81%  4.061 GiB / 8.5 GiB    47.78%  0 B / 0 B  2.818 MB / 0 B
6bf166a66e08  303.82%  3.448 GiB / 8.5 GiB    40.57%  0 B / 0 B  2.032 MB / 0 B
58dcf8aed0a7  41.01%   1.322 GiB / 2.5 GiB    52.89%  0 B / 0 B  0 B / 0 B
61061566ffe5  85.92%   220.9 MiB / 3.023 GiB  7.14%   0 B / 0 B  43.4 MB / 0 B
bdc721460293  2.69%    1.204 GiB / 3.906 GiB  30.82%  0 B / 0 B  4.35 MB / 0 B
6c80ed61ae63  477.45%  557.7 MiB / 8 GiB      6.81%   0 B / 0 B  9.257 MB / 0 B
337292fb5b64  89.05%   766.2 MiB / 8 GiB      9.35%   0 B / 0 B  5.493 MB / 0 B
b652ede9a605  173.50%  689.2 MiB / 8 GiB      8.41%   0 B / 0 B  6.48 MB / 0 B
d7cd2599291f  504.28%  673.2 MiB / 8 GiB      8.22%   0 B / 0 B  12.58 MB / 0 B
05bf9f3e0d13  314.46%  711.6 MiB / 8 GiB      8.69%   0 B / 0 B  7.942 MB / 0 B
09082f005755  142.04%  693.9 MiB / 8 GiB      8.47%   0 B / 0 B  8.081 MB / 0 B
[...]
• Stats are in /sys/fs/cgroup
• CPU shares and bursting break monitoring assumptions
slide 26:
showboost
• Determine current CPU clock rate
# showboost
Base CPU MHz : 2500
Set CPU MHz : 2500
Turbo MHz(s) : 3100 3200 3300 3500
Turbo Ratios : 124% 128% 132% 140%
CPU 0 summary every 1 seconds...
TIME      C0_MCYC  C0_ACYC  UTIL  RATIO  MHz
23:39:07  [...]    [...]    64%   [...]  [...]
23:39:08  [...]    [...]    70%   [...]  [...]
23:39:09  [...]    [...]    99%   [...]  [...]
• Uses MSRs. Can also use PMCs for this.
• Also see turbostat.
https://github.com/brendangregg/msr-cloud-tools
slide 27:
pmcarch
serverA# ./pmcarch -p 4093 10
K_CYCLES   K_INSTR    IPC  BR_RETIRED   BR_MISPRED BMR% LLCREF      LLCMISS     LLC%
982412660  575706336  0.59 126424862460 2416880487 1.91 15724006692 10872315070 30.86
999621309  555043627  0.56 120449284756 2317302514 1.92 15378257714 11121882510 27.68
991146940  558145849  0.56 126350181501 2530383860 2.00 15965082710 11464682655 28.19
996314688  562276830  0.56 122215605985 2348638980 1.92 15558286345 10835594199 30.35
979890037  560268707  0.57 125609807909 2386085660 1.90 15828820588 11038597030 30.26
serverB# ./pmcarch -p 1928219 10
K_CYCLES   K_INSTR    IPC  BR_RETIRED   BR_MISPRED BMR% LLCREF      LLCMISS     LLC%
147523816  222396364  1.51 46053921119  [...]      1.39 8880477235  968809014   89.09
156634810  229801807  1.47 48236123575  [...]      1.35 9186609260  1183858023  87.11
152783226  237001219  1.55 49344315621  [...]      1.40 9314992450  879494418   90.56
140787179  213570329  1.52 44518363978  [...]      1.42 8675999448  712318917   91.79
136822760  219706637  1.61 45129020910  [...]      1.44 8689831639  617678747   92.89
• Measures instructions-per-cycle (IPC) and other metrics
https://github.com/brendangregg/pmc-cloud-tools
slide 28:
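On the pmcarch slide: the derived columns can be reproduced from the raw counters. The sketch below is a worked check using the first serverA row, under the assumption (consistent with the printed numbers) that BMR% is mispredicted over retired branches and LLC% is the last-level cache hit ratio. The function names are hypothetical and are not part of pmcarch.

```python
def ipc(k_instr, k_cycles):
    """Instructions per cycle; both inputs share the same unit (thousands)."""
    return k_instr / k_cycles


def branch_mispredict_pct(br_mispred, br_retired):
    """Assumed meaning of BMR%: mispredicted branches as a share of retired."""
    return 100.0 * br_mispred / br_retired


def llc_hit_pct(llc_ref, llc_miss):
    """Assumed meaning of LLC%: share of last-level cache references that hit."""
    return 100.0 * (1.0 - llc_miss / llc_ref)


if __name__ == "__main__":
    # First serverA row from the slide.
    print(f"IPC  {ipc(575706336, 982412660):.2f}")
    print(f"BMR% {branch_mispredict_pct(2416880487, 126424862460):.2f}")
    print(f"LLC% {llc_hit_pct(15724006692, 10872315070):.2f}")
```

Read this way, serverA retires about 0.6 instructions per cycle with roughly 30% LLC hits, while serverB reaches about 1.5 IPC with roughly 90% LLC hits.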
cpuhot
# dmesg
[…]
[1914678.201791] CPU6: Package temperature above threshold, cpu clock throttled ...
[1914678.206747] CPU5: Core temperature/speed normal
[1914678.206748] CPU6: Package temperature/speed normal
[…]
# ./cpuhot
-       CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11 CPU12 CPU13 CPU14 CPU15 CPU16
PROCHOT 95   95   95   95   95   95   95   95   95   95    95    95    95    95    95    95
Celsius 77   75   76   73   76   77   75   72   76   72    75    77    76    72    76    75
Flags   0x0  0x0  0x0  0x0  0x0  0x0  0x0  0x0  0x0  0x0   0x0   0x0   0x0   0x0   0x0   0x0
• Various thermal info is available in MSRs
https://github.com/brendangregg/msr-cloud-tools
slide 29:
Also: Static Performance Tuning Tools
slide 30:
Where do you start... and stop?
Diagram labels: Workload Observability, Static Configuration
slide 31:
2. Methodologies
slide 32:
Anti-Methodologies
• The lack of a deliberate methodology…
• Street Light Anti-Method
• Drunk Man Anti-Method
slide 33:
Linux Perf Analysis in 60s
1. uptime                load averages
2. dmesg -T | tail       kernel errors
3. vmstat 1              overall stats by time
4. mpstat -P ALL 1       CPU balance
5. pidstat 1             process usage
6. iostat -xz 1          disk I/O
7. free -m               memory usage
8. sar -n DEV 1          network I/O
9. sar -n TCP,ETCP 1     TCP stats
10. top                  check overview
http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
slide 34:
USE Method
For every resource, check:
1. Utilization
2. Saturation
3. Errors
Diagram labels: Resource, Utilization (%), Saturation, Errors
For example, CPUs:
– Utilization: time busy
– Saturation: run queue length or latency
– Errors: ECC errors, etc.
Start with the questions, then find the tools
Can be applied to hardware and software (cgroups)
slide 35:
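On the USE Method slide: the checklist can be written down as a small data structure so that every resource gets the same three questions. This is a minimal sketch; the `ResourceSample` type, the field units, and the 90% utilization threshold are assumptions made for illustration and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    name: str
    utilization_pct: float  # time busy, 0-100
    saturation: float       # queued work, e.g. run queue length beyond capacity
    errors: int             # error events, e.g. ECC errors


def use_findings(sample, util_threshold_pct=90.0):
    """Return one finding per USE question that looks worth investigating.

    The threshold is a hypothetical starting point, not a recommendation.
    """
    findings = []
    if sample.utilization_pct >= util_threshold_pct:
        findings.append(f"{sample.name}: utilization {sample.utilization_pct:.0f}%")
    if sample.saturation > 0:
        findings.append(f"{sample.name}: saturation {sample.saturation:g}")
    if sample.errors > 0:
        findings.append(f"{sample.name}: errors {sample.errors}")
    return findings


if __name__ == "__main__":
    for sample in (ResourceSample("cpu", 95.0, 3, 0), ResourceSample("disk", 20.0, 0, 0)):
        print(sample.name, use_findings(sample) or "nothing flagged")
```

The value of the method is in the list of resources and questions; which tool supplies each number is decided afterwards.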
Workload Characterization
Analyze workload characteristics, not resulting performance
For example, CPUs:
1. Who: which PIDs, programs, users
2. Why: code paths, context
3. What: CPU instructions, cycles
4. How: changing over time
Diagram labels: Workload, Target
slide 36:
Other Methodologies
• Resource analysis
• Workload analysis
• Drill-down analysis
• Off-CPU analysis
• Static performance tuning
• Performance mantras
• Scientific method
• 5 whys
All methodologies summarized: http://www.brendangregg.com/methodology.html
slide 37:
3. Benchmarking
slide 38:
Benchmarking
• An experimental analysis activity
– Try observational analysis first; benchmarks can perturb
• My favorite tools:
– fio, lmbench, sysperf, iperf, netperf
• Benchmarking is error prone
– ~100% of benchmarks are wrong
– You benchmark A, but actually measure B, and conclude you measured C
caution: benchmarking
slide 39:
Solution: Active Benchmarking
• Root cause analysis while the benchmark runs
• For any given benchmark, ask: why not 10x?
• This takes time, but uncovers most mistakes
Accurate benchmarking takes serious effort
slide 40:
4. Profiling
slide 41:
Profiling
Can you do this?
“As an experiment to investigate the performance of the resulting TCP/IP implementation ... the 11/750 is CPU saturated, but the 11/780 has about 30% idle time. The time spent in the system processing the data is spread out among handling for the Ethernet (20%), IP packet processing (10%), TCP processing (30%), checksumming (25%), and user system call handling (15%), with no single part of the handling dominating the time in the system.”
– Bill Joy, 1981, TCP-IP Digest, Vol 1 #6
https://www.rfc-editor.org/rfc/museum/tcp-ip-digest/tcp-ip-digest.v1n6.1
slide 42:
perf: CPU profiling
• Sampling full stack traces at 99 Hertz, for 30 secs:
# perf record -F 99 -ag -- sleep 30
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]
# perf report -n --stdio
1.40% java [kernel.kallsyms] [k] _raw_spin_lock
--- _raw_spin_lock
|--63.21%-- try_to_wake_up
|--63.91%-- default_wake_function
|--56.11%-- __wake_up_common
__wake_up_locked
ep_poll_callback
__wake_up_common
__wake_up_sync_key
|--59.19%-- sock_def_readable
[…78,000 lines truncated…]
slide 43:
Full "perf report" Output
slide 44:
… as a Flame Graph
slide 45:
Flame Graphs
• Visualizes a collection of stack traces
– x-axis: alphabetical stack sort, to maximize merging
– y-axis: stack depth
– color: random (default), or a dimension
• Perl + SVG + JavaScript
– https://github.com/brendangregg/FlameGraph
– Takes input from many different profilers
– Multiple d3 versions are being developed
• References:
– http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
– http://queue.acm.org/detail.cfm?id=2927301
– "The Flame Graph" CACM, June 2016
Instructions
slide 46:
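On the Flame Graphs slide: the "alphabetical stack sort, to maximize merging" step works on folded stacks, where each unique stack becomes one semicolon-joined line followed by its sample count. The sketch below shows that folding in a few lines; it illustrates the idea and is not the stackcollapse scripts from the FlameGraph repository. The sample stacks and function names are made up.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse stack samples (root frame first) into sorted (stack, count) pairs."""
    folded = Counter(";".join(frames) for frames in samples)
    # Sorting alphabetically places stacks with a common prefix next to each
    # other, which is what lets identical frames merge into one wide box.
    return sorted(folded.items())


def to_folded_lines(samples):
    """Render the folded form as 'frame;frame;frame count' lines."""
    return [f"{stack} {count}" for stack, count in fold_stacks(samples)]


if __name__ == "__main__":
    # Hypothetical samples: three profiler hits, two sharing the same stack.
    samples = [
        ["main", "handle_request", "parse"],
        ["main", "handle_request", "parse"],
        ["main", "handle_request", "send"],
    ]
    print("\n".join(to_folded_lines(samples)))
```

In the rendered graph, the width of each box is proportional to the count of samples whose stacks pass through that frame.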
Linux CPU Flame Graphs
Linux 2.6+, via perf:
git clone --depth 1 https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 49 -a -g -- sleep 30
perf script --header > out.perf01
./stackcollapse-perf.pl < out.perf01 | ./flamegraph.pl > perf.svg
– These files can be read using FlameScope
Linux 4.9+, via BPF:
git clone --depth 1 https://github.com/brendangregg/FlameGraph
git clone --depth 1 https://github.com/iovisor/bcc
./bcc/tools/profile.py -dF 49 30 | ./FlameGraph/flamegraph.pl > perf.svg
– Most efficient: no perf.data file, summarizes in-kernel
slide 47:
Mixed-Mode Flame Graphs
Diagram labels: Kernel, Java, JVM
slide 48:
FlameScope
Analyze variance, perturbations
Diagram labels: Flame graph, Subsecond-offset heat map
https://github.com/Netflix/flamescope
slide 49:
perf: Counters
• Performance Monitoring Counters (PMCs):
$ perf list | grep -i hardware
cpu-cycles OR cycles [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
instructions [Hardware event]
[…]
L1-dcache-loads [Hardware cache event]
L1-dcache-load-misses [Hardware cache event]
[…]
rNNN (see 'perf list --help' on how to encode it) [Raw hardware event …
mem:<addr>[:access] [Hardware breakpoint]
• Measure CPU operations, cycles, including stall cycles
• PMCs only enabled for some cloud instance types
My front-ends, incl. pmcarch: https://github.com/brendangregg/pmc-cloud-tools
slide 50:
5. Tracing
slide 51:
Linux Tracing Events
slide 52:
Tracing Stack
add-on tools: trace-cmd, perf-tools, bcc, bpftrace
front-end tools: perf
tracing frameworks: Ftrace, perf_events, BPF
back-end instrumentation: tracepoints, kprobes, uprobes
Diagram label: Linux
BPF enables a new class of custom, efficient, and production safe performance analysis tools
slide 53:
Ftrace: perf-tools funccount
• Built-in kernel tracing capabilities, added by Steven Rostedt and others since Linux 2.6.27
# ./funccount -i 1 'bio_*'
Tracing "bio_*"... Ctrl-C to end.
FUNC COUNT
[...]
bio_alloc_bioset
bio_endio
bio_free
bio_fs_destructor
bio_init
bio_integrity_enabled
bio_put
bio_add_page
• Also see trace-cmd
slide 54:
perf: Tracing Tracepoints
perf was introduced earlier; it is also a powerful tracer
In-kernel counts (efficient):
# perf stat -e block:block_rq_complete -a sleep 10
Performance counter stats for 'system wide':
block:block_rq_complete
Dump & post-process:
# perf record -e block:block_rq_complete -a sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.428 MB perf.data (~18687 samples) ]
# perf script
run 30339 [000] 2083345.722857: block:block_rq_complete: 202,1 W () 12986336 + 8 [0]
run 30339 [000] 2083345.723180: block:block_rq_complete: 202,1 W () 12986528 + 8 [0]
swapper 0 [000] 2083345.723489: block:block_rq_complete: 202,1 W () 12986496 + 8 [0]
swapper 0 [000] 2083346.745840: block:block_rq_complete: 202,1 WS () 1052984 + 144 [0]
supervise 30342 [000] 2083346.746571: block:block_rq_complete: 202,1 WS () 1053128 + 8 [0]
[...]
http://www.brendangregg.com/perf.html
https://perf.wiki.kernel.org/index.php/Main_Page
slide 55:
BCC/BPF: ext4slower
• ext4 operations slower than the threshold:
# ./ext4slower 1
Tracing ext4 operations slower than 1 ms
TIME     COMM  PID   T BYTES OFF_KB LAT(ms) FILENAME
06:49:17 bash  [...] R 128   [...]  7.75    cksum
06:49:17 cksum [...] R 39552 [...]  1.34    [
06:49:17 cksum [...] R 96    [...]  5.36    2to3-2.7
06:49:17 cksum [...] R 96    [...]  14.94   2to3-3.4
06:49:17 cksum [...] R 10320 [...]  6.82    411toppm
06:49:17 cksum [...] R 65536 [...]  4.01    a2p
06:49:17 cksum [...] R 55400 [...]  8.77    ab
06:49:17 cksum [...] R 36792 [...]  16.34   aclocal-1.14
[…]
• Better indicator of application pain than disk I/O
• Measures & filters in-kernel for efficiency using BPF
https://github.com/iovisor/bcc
slide 56:
bpftrace: one-liners
• Block I/O (disk) events by type; by size & comm:
# bpftrace -e 't:block:block_rq_issue { @[args->rwbs] = count(); }'
Attaching 1 probe...
@[WS]: 2
@[RM]: 12
@[RA]: 1609
@[R]: 86421
# bpftrace -e 't:block:block_rq_issue { @bytes[comm] = hist(args->bytes); }'
Attaching 1 probe...
@bytes[dmcrypt_write]:
[4K, 8K) 68 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K) 35 |@@@@@@@@@@@@@@@@@@@@@@@@@@
[16K, 32K) 4 |@@@
[32K, 64K) 1 |
[64K, 128K) 2 |@
[...]
https://github.com/iovisor/bpftrace
slide 57:
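On the bpftrace one-liners slide: the hist() output groups values into power-of-two ranges such as [4K, 8K). The sketch below reproduces that bucketing in user space to show how a byte count maps to a row. It assumes positive integer inputs and does not reproduce bpftrace's exact labels for the smallest values; the function names and sample sizes are hypothetical.

```python
from collections import Counter


def _fmt(n):
    """Format a power of two the way the slide's labels read (4K, 128K, ...)."""
    for unit, size in (("G", 1 << 30), ("M", 1 << 20), ("K", 1 << 10)):
        if n >= size and n % size == 0:
            return f"{n // size}{unit}"
    return str(n)


def bucket_label(value):
    """Return the half-open power-of-two range containing a positive integer."""
    if value < 1:
        raise ValueError("this sketch only handles positive integers")
    low = 1 << (value.bit_length() - 1)  # largest power of two <= value
    return f"[{_fmt(low)}, {_fmt(low * 2)})"


def log2_hist(values):
    """Count values per power-of-two bucket."""
    return Counter(bucket_label(v) for v in values)


if __name__ == "__main__":
    # Hypothetical I/O sizes in bytes.
    for label, count in log2_hist([4096, 4096, 6144, 8192, 65536]).items():
        print(f"{label:<12} {count}")
```

Doing this summary in the kernel, as bpftrace does, means only the bucket counts cross to user space instead of every event.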
BPF Perf Tools (2019)
BCC & bpftrace repos contain many of these. The book has them all.
slide 58:
Off-CPU Analysis
• Explain all blocking events. High-overhead: needs BPF.
Flame graph labels: directory read from disk, file read from disk, fstat from disk, path read from disk, pipe write
slide 59:
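On the Off-CPU Analysis slide, and the earlier split of request time into on-CPU and off-CPU time: the sketch below measures that split for a single call from inside a process, using wall-clock time minus CPU time. It is a coarse illustration under stated assumptions (a single-threaded caller, and no information about why the time was spent blocked); it is not the BPF-based stack tracing the slide describes, and the `measure` helper is hypothetical.

```python
import time


def measure(func):
    """Split the elapsed time of func() into on-CPU and off-CPU seconds.

    time.process_time() counts CPU time for the whole process, so the result
    is only meaningful when no other threads are busy.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    on_cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return {"wall_s": wall, "on_cpu_s": on_cpu, "off_cpu_s": max(wall - on_cpu, 0.0)}


if __name__ == "__main__":
    blocked = measure(lambda: time.sleep(0.2))         # mostly off-CPU
    busy = measure(lambda: sum(range(2_000_000)))      # mostly on-CPU
    print("sleep:", {k: round(v, 3) for k, v in blocked.items()})
    print("loop: ", {k: round(v, 3) for k, v in busy.items()})
```

A CPU profile alone would show almost nothing for the sleeping call, which is the gap that off-CPU analysis is meant to fill.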
6. Tuning
slide 60:
Ubuntu Bionic Tuning: Sep 2020 (1/2)
CPU
schedtool -B PID
disable Ubuntu apport (crash reporter)
upgrade to Bionic (scheduling improvements)
Virtual Memory
vm.swappiness = 0 # from 60
Memory
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
kernel.numa_balancing = 0
File System
vm.dirty_ratio = 80 # from 40
vm.dirty_background_ratio = 5 # from 10
vm.dirty_expire_centisecs = 12000 # from 3000
mount -o defaults,noatime,discard,nobarrier …
Storage I/O
/sys/block/*/queue/rq_affinity # or 2
/sys/block/*/queue/scheduler kyber
/sys/block/*/queue/nr_requests
/sys/block/*/queue/read_ahead_kb 128
mdadm --chunk=64 …
slide 61:
Ubuntu Bionic Tuning: Sep 2020 (2/2)
Networking
net.core.default_qdisc = fq
net.core.netdev_max_backlog = 5000 # may update to 1000
net.core.rmem_max = 16777216
net.core.somaxconn = 1024 # may update to 4096
net.core.wmem_max = 16777216
net.ipv4.ip_local_port_range = 10240 65535
net.ipv4.tcp_abort_on_overflow = 1 # maybe
net.ipv4.tcp_congestion_control = bbr
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_rmem = 4096 12582912 16777216 # or 8388608 ...
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_wmem = 4096 12582912 16777216 # or 8388608 ...
Hypervisor
echo tsc > /sys/devices/…/current_clocksource
Plus use AWS Nitro
Other
net.core.bpf_jit_enable = 1
sysctl -w kernel.perf_event_max_stack=1000
slide 62:
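On the tuning slides: before changing any of these sysctl values it helps to see which ones already differ on a given host. The sketch below reads current values from /proc/sys and reports the differences. The helper names are hypothetical, and the simple dot-to-slash mapping is an assumption that breaks for keys whose components contain dots (for example some interface names); the values in the usage example are taken from the slide and are not a recommendation for other workloads.

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a sysctl key such as net.core.somaxconn to its /proc/sys file."""
    return Path(root) / name.replace(".", "/")


def read_sysctl(name, root="/proc/sys"):
    """Return the current value with whitespace normalized, or None if unreadable."""
    try:
        return " ".join(sysctl_path(name, root).read_text().split())
    except OSError:
        return None


def diff_sysctls(wanted, root="/proc/sys"):
    """Return {name: (current, wanted)} for every key whose value differs."""
    diffs = {}
    for name, value in wanted.items():
        current = read_sysctl(name, root)
        if current != " ".join(str(value).split()):
            diffs[name] = (current, value)
    return diffs


if __name__ == "__main__":
    wanted = {
        "net.core.somaxconn": 1024,
        "net.ipv4.tcp_slow_start_after_idle": 0,
        "net.ipv4.tcp_rmem": "4096 12582912 16777216",
    }
    for name, (current, target) in diff_sysctls(wanted).items():
        print(f"{name}: current={current} wanted={target}")
```

This only reports differences; applying a change is still a deliberate step with sysctl -w or a configuration file.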
Takeaways
Systems Performance is: Observability, Methodologies, Benchmarking, Profiling, Tracing, Tuning
Print out for your office wall:
1. uptime
2. dmesg -T | tail
3. vmstat 1
4. mpstat -P ALL 1
5. pidstat 1
6. iostat -xz 1
7. free -m
8. sar -n DEV 1
9. sar -n TCP,ETCP 1
10. top
slide 63:
Links
Netflix Tech Blog on Linux:
– http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
– http://techblog.netflix.com/2015/08/netflix-at-velocity-2015-linux.html
Linux Performance:
– http://www.brendangregg.com/linuxperf.html
Linux perf:
– https://perf.wiki.kernel.org/index.php/Main_Page
– http://www.brendangregg.com/perf.html
Linux ftrace:
– https://www.kernel.org/doc/Documentation/trace/ftrace.txt
– https://github.com/brendangregg/perf-tools
Linux BPF:
– http://www.brendangregg.com/ebpf.html
– http://www.brendangregg.com/bpf-performance-tools-book.html
– https://github.com/iovisor/bcc
– https://github.com/iovisor/bpftrace
Methodologies:
– http://www.brendangregg.com/USEmethod/use-linux.html
– http://www.brendangregg.com/activebenchmarking.html
Flame Graphs & FlameScope:
– http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
– http://queue.acm.org/detail.cfm?id=2927301
– https://github.com/Netflix/flamescope
MSRs and PMCs:
– https://github.com/brendangregg/msr-cloud-tools
– https://github.com/brendangregg/pmc-cloud-tools
slide 64:
Thanks
Q&A in #qa-brendan
http://slideshare.net/brendangregg
http://www.brendangregg.com
bgregg@netflix.com
@brendangregg
Nov/Dec 2020: Auckland, Perth, Singapore, Hong Kong