YOW! 2020: Linux Systems Performance

Talk by Brendan Gregg for YOW! 2020.

Video: https://www.youtube.com/watch?v=_Lf-h5TDTN0

Description: "Systems performance studies the performance of computing systems, including all physical components and the full software stack to help you find performance wins for your application and kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes the topic for everyone, touring six important areas: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events) and tracing (ftrace, bcc/BPF, and bpftrace/BPF), advice about what is and isn't important to learn, and case studies to see how it is applied. This talk is aimed at everyone: developers, operations, sysadmins, etc, and in any environment running Linux, bare metal or the cloud."

PDF: YOW2020_Linux_Systems_Performance.pdf

Keywords (from pdftotext):

slide 1:
    Sep, 2020
    Linux Systems
    Performance
    Brendan Gregg
    Senior Performance Engineer
    Auckland, Perth, Singapore, Hong Kong
    
slide 2:
    What I’m currently working on
    
slide 3:
    Application Request Time
    On-CPU Time
    Off-CPU Time
    
slide 4:
    Application Request Time
    On-CPU Time
    Off-CPU Time
    CPU Flame Graph
    Off-CPU Flame Graph
    
slide 5:
    7fffc102ca03 nf_conntrack_in ([kernel.kallsyms])
    7fffc10d341c ipv4_conntrack_local ([kernel.kallsyms])
    7fff9deb09d8 nf_hook_slow ([kernel.kallsyms])
    7fff9debe6c7 __ip_local_out ([kernel.kallsyms])
    7fff9debe74c ip_local_out ([kernel.kallsyms])
    7fff9debeab0 ip_queue_xmit ([kernel.kallsyms])
    7fff9ded8e02 __tcp_transmit_skb ([kernel.kallsyms])
    7fff9deda3f4 tcp_write_xmit ([kernel.kallsyms])
    7fff9dedb215 __tcp_push_pending_frames ([kernel.kallsyms])
    7fff9dec684b tcp_push ([kernel.kallsyms])
    7fff9deca337 tcp_sendmsg_locked ([kernel.kallsyms])
    7fff9decae5c tcp_sendmsg ([kernel.kallsyms])
    7fff9defa8ee inet_sendmsg ([kernel.kallsyms])
    7fff9de4a41e sock_sendmsg ([kernel.kallsyms])
    7fff9de4a9af SYSC_sendto ([kernel.kallsyms])
    7fff9de4b49e sys_sendto ([kernel.kallsyms])
    7fff9d605bb3 do_syscall_64 ([kernel.kallsyms])
    7fff9e002081 entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
    119ae __libc_send (/lib/x86_64-linux-gnu/libpthread-2.27.so)
    Off-CPU stacks often end abruptly in libc
    
slide 6:
    More than one way to walk a stack
    Frame pointers
    Last branch record (LBR)
    Branch trace store (BTS)
    DWARF
    ORC
    Application exception handler
    We’re currently rolling out
    our own build of libc
    with frame pointers
    (-fno-omit-frame-pointer)
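    For example, a build with frame pointers preserved (a generic sketch; the source file name is illustrative):
    $ gcc -O2 -fno-omit-frame-pointer -o myapp myapp.c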
    
slide 7:
    Systems Performance in 45 mins
    • This is slides + discussion
    • For more detail and stand-alone texts:
    
slide 8:
    Agenda
    1. Observability
    2. Methodologies
    3. Benchmarking
    4. Profiling
    5. Tracing
    6. Tuning
    
slide 9:
slide 10:
    1. Observability
    
slide 11:
    How do you measure these?
    
slide 12:
    Linux Observability Tools
    
slide 13:
    Why Learn Tools?
    • Most analysis at Netflix is via GUIs
    • Benefits of command-line tools:
    Helps you understand GUIs: they show the same metrics
    Often documented, unlike GUI metrics
    Often have useful options not exposed in GUIs
    • Installing essential tools (something like):
    $ sudo apt-get install sysstat bcc-tools bpftrace linux-tools-common \
    linux-tools-$(uname -r) iproute2 msr-tools
    $ git clone https://github.com/brendangregg/msr-cloud-tools
    $ git clone https://github.com/brendangregg/bpf-perf-tools-book
    These are crisis tools and should be installed by default
    In a performance meltdown you may be unable to install them
    
slide 14:
    uptime
    • One way to print load averages:
    $ uptime
    07:42:06 up 8:16, 1 user, load average: 2.27, 2.84, 2.91
    • A measure of resource demand: CPUs + disks
    – Includes TASK_UNINTERRUPTIBLE state to show all demand types
    – You can use BPF & off-CPU flame graphs to explain this state:
    http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
    – PSI in Linux 4.20 shows CPU, I/O, and memory loads
    • Exponentially-damped moving averages
    – With time constants of 1, 5, and 15 minutes. See historic trend.
    • Load > # of CPUs may mean CPU saturation
    Don’t spend more than 5 seconds studying these
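    For example, PSI can be read directly on 4.20+ kernels (a quick check; values shown are illustrative):
    $ cat /proc/pressure/cpu
    some avg10=0.00 avg60=0.00 avg300=0.00 total=0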
    
slide 15:
    top
    • System and per-process interval summary:
    $ top - 18:50:26 up 7:43, 1 user, load average: 4.11, 4.91, 5.22
    Tasks: 209 total, 1 running, 206 sleeping, 0 stopped, 2 zombie
    Cpu(s): 47.1%us, 4.0%sy, 0.0%ni, 48.4%id, 0.0%wa, 0.0%hi, 0.3%si, 0.2%st
    Mem: 70197156k total, 44831072k used, 25366084k free, 36360k buffers
    Swap: 0k total, 0k used, 0k free, 11873356k cached
     PID USER      VIRT   RES   SHR  S %CPU %MEM    TIME+  COMMAND
    5738 apiprod  62.6g   29g  352m  S  417 44.2  2144:15  java
    1386 apiprod  17452  1388   964  R    0  0.0  0:00.02  top
       1 root     24340  2272  1340  S    0  0.0  0:01.51  init
       2 root                        S    0  0.0  0:00.00  kthreadd
    […]
    • %CPU is summed across all CPUs
    • Can miss short-lived processes (atop won’t)
    
slide 16:
    htop
    $ htop
     1 [||||||||||70.0%]    13 [||||||||||70.6%]    25 [||||||||||69.7%]    37 [||||||||||66.6%]
     2 [||||||||||68.7%]    14 [||||||||||69.4%]    26 [||||||||||67.7%]    38 [||||||||||66.0%]
     3 [||||||||||68.2%]    15 [||||||||||68.5%]    27 [||||||||||68.8%]    39 [||||||||||73.3%]
     4 [||||||||||69.3%]    16 [||||||||||69.2%]    28 [||||||||||67.6%]    40 [||||||||||67.0%]
     5 [||||||||||68.0%]    17 [||||||||||67.6%]    29 [||||||||||70.1%]    41 [||||||||||66.5%]
    […]
    Mem[||||||||||||||||||||||||||||||176G/187G]    Tasks: 80, 3206 thr; 43 running
    Swp[                                  0K/0K]    Load average: 36.95 37.19 38.29
                                                    Uptime: 01:39:36
     PID USER      PRI NI VIRT  RES   SHR   S CPU% MEM%    TIME+  Command
    4067 www-data    0    202G  173G  55392 S 3359 93.0 48h51:30  /apps/java/bin/java -Dnop -Djdk.map
    6817 www-data    0    202G  173G  55392 R 56.9 93.0 48:37.89  /apps/java/bin/java -Dnop -Djdk.map
    6826 www-data    0    202G  173G  55392 R 25.7 93.0 22:26.90  /apps/java/bin/java -Dnop -Djdk.map
    6721 www-data    0    202G  173G  55392 S 25.0 93.0 22:05.51  /apps/java/bin/java -Dnop -Djdk.map
    6616 www-data    0    202G  173G  55392 S 13.6 93.0 11:15.51  /apps/java/bin/java -Dnop -Djdk.map
    […]
    F1Help F2Setup F3Search F4Filter F5Tree F6SortBy F7Nice- F8Nice+ F9Kill F10Quit
    Pros: configurable. Cons: misleading colors.
    dstat is similar, and now dead (May 2019); see pcp-dstat
    
slide 17:
    vmstat
    • Virtual memory statistics and more:
    $ vmstat -Sm 1
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
     8  0  …  12  25  34  0  0
     7  0  …  0  205  186  46  13  0  0
     8  0  …  8  210  435  39  21  0  0
     8  0  …  0  218  219  42  17  0  0
    […]
    • USAGE: vmstat [interval [count]]
    • First output line has some summary since boot values
    • High level CPU summary
    – “r” is runnable tasks
    
slide 18:
    iostat
    • Block I/O (disk) stats. 1st output is since boot.
    $ iostat -xz 1
    Linux 5.0.21 (c099.xxxx)    06/24/19    _x86_64_    (32 CPU)
    [...]
    Device     r/s  w/s  rkB/s  wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm ...
    sda        0.00  …
    nvme3n1    20.39  293152.56  14758.05  …  0.00  18.81  …
    nvme1n1    17.83  286402.15  13089.56  …  0.00  18.52  …
    nvme0n1    19.70  258184.52  14218.55  …  0.00  19.51  …
    ...  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
    The left columns characterize the workload; the await/%util columns show the resulting performance. A very useful set of stats.
    
slide 19:
    free
    • Main memory usage:
    $ free -m
                  total    used    free    shared    buff/cache    available
    Mem:
    Swap:
    • Recently added “available” column
    buff/cache: block device I/O cache + virtual page cache
    available: memory likely available to apps
    free: completely unused memory
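    The “available” estimate can also be read directly from the kernel (a quick sketch):
    $ grep -i memavailable /proc/meminfo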
    
slide 20:
    strace
    • System call tracer:
    $ strace -tttT -p 313
    1408393285.779746 getgroups(0, NULL)                  = 1 <...>
    1408393285.779873 getgroups(1, [0])                   = 1 <...>
    1408393285.780797 close(3)                            = 0 <...>
    1408393285.781338 write(1, "wow much syscall\n", 17)  = 17 <...>
    • Translates syscall arguments
    • Not all kernel requests (e.g., page faults)
    • Currently has massive overhead (ptrace based)
    – Can slow the target by > 100x. Skews measured time (-ttt, -T).
    http://www.brendangregg.com/blog/2014-05-11/strace-wow-much-syscall.html
    • perf trace will replace it: uses a ring buffer & BPF
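    For example, a lower-overhead look at the same syscalls once perf trace is available (a sketch; the PID is illustrative):
    # perf trace -p 313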
    
slide 21:
    tcpdump
    • Sniff network packets for post analysis:
    $ tcpdump -i eth0 -w /tmp/out.tcpdump
    tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
    ^C7985 packets captured
    8996 packets received by filter
    1010 packets dropped by kernel
    # tcpdump -nr /tmp/out.tcpdump | head
    reading from file /tmp/out.tcpdump, link-type EN10MB (Ethernet)
    20:41:05.038437 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 18...
    20:41:05.038533 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 48...
    20:41:05.038584 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 96...
    […]
    • Study packet sequences with timestamps (us)
    • CPU overhead optimized (socket ring buffers), but can
    still be significant. Use BPF in-kernel summaries
    instead.
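    For example, an in-kernel summary of TCP sessions instead of per-packet capture (a sketch using bcc's tcplife; the install path varies by distro):
    # /usr/share/bcc/tools/tcplife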
    
slide 22:
    nstat
    • Replacement for netstat from iproute2
    • Various network protocol statistics:
    -s won’t reset counters, otherwise intervals can be examined
    -d for daemon mode
    • Linux keeps adding more counters
    $ nstat -s
    #kernel
    IpInReceives
    IpInDelivers
    IpOutRequests
    [...]
    TcpActiveOpens
    TcpPassiveOpens
    TcpAttemptFails
    TcpEstabResets
    TcpInSegs
    TcpOutSegs
    TcpRetransSegs
    TcpOutRsts
    [...]
    
slide 23:
    slabtop
    • Kernel slab allocator memory usage:
    $ slabtop
    Active / Total Objects (% used)    : 4692768 / 4751161 (98.8%)
    Active / Total Slabs (% used)      : 129083 / 129083 (100.0%)
    Active / Total Caches (% used)     : 71 / 109 (65.1%)
    Active / Total Size (% used)       : 729966.22K / 738277.47K (98.9%)
    Minimum / Average / Maximum Object : 0.01K / 0.16K / 8.00K
       OBJS   ACTIVE   USE  OBJ SIZE  SLABS  OBJ/SLAB  CACHE SIZE  NAME
    3565575  3565575  100%     0.10K  91425               365700K  buffer_head
     314916   314066   99%     0.19K  14996                59984K  dentry
     184192   183751   99%     0.06K                       11512K  kmalloc-64
     138618   138618  100%     0.94K                      130464K  xfs_inode
     138602   138602  100%     0.21K                       29968K  xfs_ili
     102116    99012   96%     0.55K                       58352K  radix_tree_node
      97482    49093   50%     0.09K                        9284K  kmalloc-96
      22695    20777   91%     0.05K                        1068K  shared_policy_node
      21312    21312  100%     0.86K                       18432K  ext4_inode_cache
      16288    14601   89%     0.25K                        4072K  kmalloc-256
    […]
    
slide 24:
    pcstat
    • Show page cache residency by file:
    # ./pcstat data0*
    |----------+-----------+--------+--------+---------|
    | Name     | Size      | Pages  | Cached | Percent |
    |----------+-----------+--------+--------+---------|
    | data00   | 104857600 | 25600  | 25600  | 100.000 |
    | data01   | 104857600 | 25600  | 25600  | 100.000 |
    | data02   | 104857600 | 25600  | 4080   | 015.938 |
    | data03   | 104857600 | 25600  | 25600  | 100.000 |
    | data04   | 104857600 | 25600  | 16010  | 062.539 |
    | data05   | 104857600 | 25600  | 0      | 000.000 |
    |----------+-----------+--------+--------+---------|
    • Uses mincore(2) syscall. Used for database perf analysis.
    
slide 25:
    docker stats
    • Soft limits (cgroups) by container:
    # docker stats
    CONTAINER     CPU %    MEM USAGE / LIMIT      MEM %   NET I/O    BLOCK I/O       PIDS
    353426a09db1  526.81%  4.061 GiB / 8.5 GiB    47.78%  0 B / 0 B  2.818 MB / 0 B
    6bf166a66e08  303.82%  3.448 GiB / 8.5 GiB    40.57%  0 B / 0 B  2.032 MB / 0 B
    58dcf8aed0a7  41.01%   1.322 GiB / 2.5 GiB    52.89%  0 B / 0 B  0 B / 0 B
    61061566ffe5  85.92%   220.9 MiB / 3.023 GiB  7.14%   0 B / 0 B  43.4 MB / 0 B
    bdc721460293  2.69%    1.204 GiB / 3.906 GiB  30.82%  0 B / 0 B  4.35 MB / 0 B
    6c80ed61ae63  477.45%  557.7 MiB / 8 GiB      6.81%   0 B / 0 B  9.257 MB / 0 B
    337292fb5b64  89.05%   766.2 MiB / 8 GiB      9.35%   0 B / 0 B  5.493 MB / 0 B
    b652ede9a605  173.50%  689.2 MiB / 8 GiB      8.41%   0 B / 0 B  6.48 MB / 0 B
    d7cd2599291f  504.28%  673.2 MiB / 8 GiB      8.22%   0 B / 0 B  12.58 MB / 0 B
    05bf9f3e0d13  314.46%  711.6 MiB / 8 GiB      8.69%   0 B / 0 B  7.942 MB / 0 B
    09082f005755  142.04%  693.9 MiB / 8 GiB      8.47%   0 B / 0 B  8.081 MB / 0 B
    [...]
    • Stats are in /sys/fs/cgroups
    • CPU shares and bursting breaks monitoring assumptions
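    For example, reading a container's CPU usage from cgroups directly (a sketch assuming the cgroup v1 cpuacct layout; paths differ under cgroup v2):
    # cat /sys/fs/cgroup/cpuacct/docker/<container-id>/cpuacct.usage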
    
slide 26:
    showboost
    • Determine current CPU clock rate
    # showboost
    Base CPU MHz : 2500
    Set CPU MHz : 2500
    Turbo MHz(s) : 3100 3200 3300 3500
    Turbo Ratios : 124% 128% 132% 140%
    CPU 0 summary every 1 seconds...
    TIME      C0_MCYC  C0_ACYC  UTIL  RATIO  MHz
    23:39:07  …        …        64%   …      …
    23:39:08  …        …        70%   …      …
    23:39:09  …        …        99%   …      …
    • Uses MSRs. Can also use PMCs for this.
    • Also see turbostat.
    https://github.com/brendangregg/msr-cloud-tools
    
slide 27:
    pmcarch
    serverA# ./pmcarch -p 4093 10
    K_CYCLES   K_INSTR    IPC  BR_RETIRED    BR_MISPRED  BMR%  LLCREF       LLCMISS      LLC%
    982412660  575706336  0.59 126424862460  2416880487  1.91  15724006692  10872315070  30.86
    999621309  555043627  0.56 120449284756  2317302514  1.92  15378257714  11121882510  27.68
    991146940  558145849  0.56 126350181501  2530383860  2.00  15965082710  11464682655  28.19
    996314688  562276830  0.56 122215605985  2348638980  1.92  15558286345  10835594199  30.35
    979890037  560268707  0.57 125609807909  2386085660  1.90  15828820588  11038597030  30.26
    serverB# ./pmcarch -p 1928219 10
    K_CYCLES   K_INSTR    IPC  BR_RETIRED   BR_MISPRED  BMR%  LLCREF      LLCMISS     LLC%
    147523816  222396364  1.51 46053921119  …           1.39  8880477235  968809014   89.09
    156634810  229801807  1.47 48236123575  …           1.35  9186609260  1183858023  87.11
    152783226  237001219  1.55 49344315621  …           1.40  9314992450  879494418   90.56
    140787179  213570329  1.52 44518363978  …           1.42  8675999448  712318917   91.79
    136822760  219706637  1.61 45129020910  …           1.44  8689831639  617678747   92.89
    • Measures instructions-per-cycle (IPC) and other metrics
    https://github.com/brendangregg/pmc-cloud-tools
    
slide 28:
    cpuhot
    # dmesg
    […]
    [1914678.201791] CPU6: Package temperature above threshold, cpu clock throttled ...
    [1914678.206747] CPU5: Core temperature/speed normal
    [1914678.206748] CPU6: Package temperature/speed normal
    […]
    # ./cpuhot
    - CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11 CPU12 CPU13 CPU14 CPU15 CPU16
    PROCHOT 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95
    Celsius 77 75 76 73 76 77 75 72 76 72 75 77 76 72 76 75
    Flags 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
    • Various thermal info is available in MSRs
    https://github.com/brendangregg/msr-cloud-tools
    
slide 29:
    Also: Static Performance Tuning Tools
    
slide 30:
    Where do you start...and stop?
    Workload Observability
    Static Configuration
    
slide 31:
    2. Methodologies
    
slide 32:
    Anti-Methodologies
    • The lack of a deliberate methodology…
    • Street Light Anti-Method
    • Drunk Man Anti-Method
    
slide 33:
    Linux Perf Analysis in 60s
    uptime                 load averages
    dmesg -T | tail        kernel errors
    vmstat 1               overall stats by time
    mpstat -P ALL 1        CPU balance
    pidstat 1              process usage
    iostat -xz 1           disk I/O
    free -m                memory usage
    sar -n DEV 1           network I/O
    sar -n TCP,ETCP 1      TCP stats
    top                    check overview
    http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
    
slide 34:
    USE Method
    For every resource, check:
    1. Utilization
    2. Saturation
    3. Errors
    Saturation
    Errors
    Resource
    Utilization
    (%)
    For example, CPUs:
    Utilization: time busy
    Saturation: run queue length or latency
    Errors: ECC errors, etc.
    Start with the questions,
    then find the tools
    Can be applied to hardware and software (cgroups)
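    One possible CPU mapping to the earlier tools (the choices are illustrative, not prescriptive):
    mpstat -P ALL 1     # utilization: per-CPU busy time
    vmstat 1            # saturation: "r" run-queue length vs CPU count
    dmesg -T | tail     # errors: e.g., thermal throttling messages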
    
slide 35:
    Workload Characterization
    Analyze workload characteristics, not resulting performance
    For example, CPUs:
    1. Who: which PIDs, programs, users
    2. Why: code paths, context
    3. What: CPU instructions, cycles
    4. How: changing over time
    Workload
    Target
    
slide 36:
    Other Methodologies
    Resource analysis
    Workload analysis
    Drill-down analysis
    Off-CPU analysis
    Static performance tuning
    Performance mantras
    Scientific method
    5 whys
    All methodologies summarized:
    http://www.brendangregg.com/methodology.html
    
slide 37:
    3. Benchmarking
    
slide 38:
    Benchmarking
    • An experimental analysis activity
    – Try observational analysis first; benchmarks can perturb
    • My favorite tools:
    – fio, lmbench, sysperf, iperf, netperf
    • Benchmarking is error prone
    – ~100% of benchmarks are wrong
    – You benchmark A, but actually measure B,
    and conclude you measured C
    caution: benchmarking
    
slide 39:
    Solution: Active Benchmarking
    • Root cause analysis while the benchmark runs
    • For any given benchmark, ask: why not 10x?
    • This takes time, but uncovers most mistakes
    accurate benchmarking
    takes serious effort
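    For example, while an fio or iperf run is underway, analyze it from a second terminal (an illustrative checklist, not a fixed recipe):
    iostat -xz 1        # is the disk really the limiter?
    mpstat -P ALL 1     # or a single hot CPU?
    nstat               # or TCP retransmits?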
    
slide 40:
    4. Profiling
    
slide 41:
    Profiling
    Can you do this?
    “As an experiment to investigate the performance of the resulting TCP/IP
    implementation ... the 11/750 is CPU saturated, but the 11/780 has about
    30% idle time. The time spent in the system processing the data is spread
    out among handling for the Ethernet (20%), IP packet processing (10%),
    TCP processing (30%), checksumming (25%), and user system call
    handling (15%), with no single part of the handling dominating the time in
    the system.”
    – Bill Joy, 1981, TCP-IP Digest, Vol 1 #6
    https://www.rfc-editor.org/rfc/museum/tcp-ip-digest/tcp-ip-digest.v1n6.1
    
slide 42:
    perf: CPU profiling
    • Sampling full stack traces at 99 Hertz, for 30 secs:
    # perf record -F 99 -ag -- sleep 30
    [ perf record: Woken up 9 times to write data ]
    [ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]
    # perf report -n --stdio
    1.40%  java  [kernel.kallsyms]  [k] _raw_spin_lock
    --- _raw_spin_lock
    |--63.21%-- try_to_wake_up
    |--63.91%-- default_wake_function
    |--56.11%-- __wake_up_common
    __wake_up_locked
    ep_poll_callback
    __wake_up_common
    __wake_up_sync_key
    |--59.19%-- sock_def_readable
    […78,000 lines truncated…]
    
slide 43:
    Full "perf report" Output
    
slide 44:
    … as a Flame Graph
    
slide 45:
    Flame Graphs
    • Visualizes a collection of stack traces
    – x-axis: alphabetical stack sort, to maximize merging
    – y-axis: stack depth
    – color: random (default), or a dimension
    • Perl + SVG + JavaScript
    – https://github.com/brendangregg/FlameGraph
    – Takes input from many different profilers
    – Multiple d3 versions are being developed
    • References:
    – http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
    – http://queue.acm.org/detail.cfm?id=2927301
    – "The Flame Graph" CACM, June 2016
    Instructions
    
slide 46:
    Linux CPU Flame Graphs
    Linux 2.6+, via perf:
    git clone --depth 1 https://github.com/brendangregg/FlameGraph
    cd FlameGraph
    perf record -F 49 -a -g -- sleep 30
    perf script --header > out.perf01
    ./stackcollapse-perf.pl < out.perf01 | ./flamegraph.pl > perf.svg
    These files can be read using FlameScope
    Linux 4.9+, via BPF:
    git clone --depth 1 https://github.com/brendangregg/FlameGraph
    git clone --depth 1 https://github.com/iovisor/bcc
    ./bcc/tools/profile.py -dF 49 30 | ./FlameGraph/flamegraph.pl > perf.svg
    – Most efficient: no perf.data file, summarizes in-kernel
    
slide 47:
    Mixed-Mode Flame Graphs
    Kernel
    Java
    JVM
    
slide 48:
    FlameScope
    Analyze variance, perturbations
    Flame graph
    https://github.com/
    Netflix/flamescope
    Subsecond-offset heat map
    
slide 49:
    perf: Counters
    • Performance Monitoring Counters (PMCs):
    $ perf list | grep -i hardware
    cpu-cycles OR cycles                               [Hardware event]
    stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
    stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
    instructions                                       [Hardware event]
    […]
    L1-dcache-loads                                    [Hardware cache event]
    L1-dcache-load-misses                              [Hardware cache event]
    […]
    rNNN (see 'perf list --help' on how to encode it)  [Raw hardware event …
    mem:<addr>[:access]                                [Hardware breakpoint]
    • Measure CPU operations, cycles, including stall cycles
    • PMCs only enabled for some cloud instance types
    My front-ends, incl. pmcarch:
    https://github.com/brendangregg/pmc-cloud-tools
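    For example, a basic IPC measurement (a sketch; it only works where PMCs are exposed, e.g., bare metal or larger instance types):
    # perf stat -e cycles,instructions -a -- sleep 10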
    
slide 50:
    5. Tracing
    
slide 51:
    Linux Tracing Events
    
slide 52:
    Tracing Stack
    add-on tools:
    trace-cmd, perf-tools, bcc, bpftrace
    front-end tools:
    perf
    tracing frameworks:
    back-end instrumentation:
    Ftrace, perf_events, BPF
    tracepoints, kprobes, uprobes
    BPF enables a new class of
    custom, efficient, and production safe
    performance analysis tools
    Linux
    
slide 53:
    Ftrace: perf-tools funccount
    • Built-in kernel tracing capabilities, added by Steven
    Rostedt and others since Linux 2.6.27
    # ./funccount -i 1 'bio_*'
    Tracing "bio_*"... Ctrl-C to end.
    FUNC                          COUNT
    [...]
    bio_alloc_bioset
    bio_endio
    bio_free
    bio_fs_destructor
    bio_init
    bio_integrity_enabled
    bio_put
    bio_add_page
    • Also see trace-cmd
    
slide 54:
    perf: Tracing Tracepoints
    perf was introduced earlier; it is also a powerful tracer
    # perf stat -e block:block_rq_complete -a sleep 10
    Performance counter stats for 'system wide':
    In-kernel counts (efficient)
    block:block_rq_complete
    # perf record -e block:block_rq_complete -a sleep 10
    Dump & post-process
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.428 MB perf.data (~18687 samples) ]
    # perf script
    run 30339 [000] 2083345.722857: block:block_rq_complete: 202,1 W () 12986336 + 8 [0]
    run 30339 [000] 2083345.723180: block:block_rq_complete: 202,1 W () 12986528 + 8 [0]
    swapper
    0 [000] 2083345.723489: block:block_rq_complete: 202,1 W () 12986496 + 8 [0]
    swapper
    0 [000] 2083346.745840: block:block_rq_complete: 202,1 WS () 1052984 + 144 [0]
    supervise 30342 [000] 2083346.746571: block:block_rq_complete: 202,1 WS () 1053128 + 8 [0]
    [...]
    http://www.brendangregg.com/perf.html
    https://perf.wiki.kernel.org/index.php/Main_Page
    
slide 55:
    BCC/BPF: ext4slower
    • ext4 operations slower than the threshold:
    # ./ext4slower 1
    Tracing ext4 operations slower than 1 ms
    TIME      COMM   PID  T  BYTES   OFF_KB  LAT(ms)  FILENAME
    06:49:17  bash        R  128             7.75     cksum
    06:49:17  cksum       R  39552           1.34     [
    06:49:17  cksum       R  96              5.36     2to3-2.7
    06:49:17  cksum       R  96              14.94    2to3-3.4
    06:49:17  cksum       R  10320           6.82     411toppm
    06:49:17  cksum       R  65536           4.01     a2p
    06:49:17  cksum       R  55400           8.77     ab
    06:49:17  cksum       R  36792           16.34    aclocal-1.14
    […]
    • Better indicator of application pain than disk I/O
    • Measures & filters in-kernel for efficiency using BPF
    https://github.com/iovisor/bcc
    
slide 56:
    bpftrace: one-liners
    • Block I/O (disk) events by type; by size & comm:
    # bpftrace -e 't:block:block_rq_issue { @[args->rwbs] = count(); }'
    Attaching 1 probe...
    @[WS]: 2
    @[RM]: 12
    @[RA]: 1609
    @[R]: 86421
    # bpftrace -e 't:block:block_rq_issue { @bytes[comm] = hist(args->bytes); }'
    Attaching 1 probe...
    @bytes[dmcrypt_write]:
    [4K, 8K)
    68 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [8K, 16K)
    35 |@@@@@@@@@@@@@@@@@@@@@@@@@@
    [16K, 32K)
    4 |@@@
    [32K, 64K)
    1 |
    [64K, 128K)
    2 |@
    [...]
    https://github.com/iovisor/bpftrace
    
slide 57:
    BPF Perf
    Tools
    (2019)
    BCC & bpftrace repos
    contain many of these.
    The book has them all.
    
slide 58:
    Off-CPU Analysis
    • Explain all blocking events. High-overhead: needs BPF.
    directory read from disk
    file read from disk
    fstat from disk
    path read from disk
    pipe write
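    For example, off-CPU time flame graphs via bcc (a sketch; the tool path and options assume a default bcc install):
    # /usr/share/bcc/tools/offcputime -df 30 > out.stacks
    # ./FlameGraph/flamegraph.pl --color=io --countname=us < out.stacks > offcpu.svg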
    
slide 59:
    6. Tuning
    
slide 60:
    Ubuntu Bionic Tuning: Sep 2020 (1/2)
    CPU
    schedtool -B PID
    disable Ubuntu apport (crash reporter)
    upgrade to Bionic (scheduling improvements)
    Virtual Memory
    vm.swappiness = 0
    # from 60
    Memory
    echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
    kernel.numa_balancing = 0
    File System
    vm.dirty_ratio = 80
    # from 40
    vm.dirty_background_ratio = 5
    # from 10
    vm.dirty_expire_centisecs = 12000
    # from 3000
    mount -o defaults,noatime,discard,nobarrier …
    Storage I/O
    /sys/block/*/queue/rq_affinity
    # or 2
    /sys/block/*/queue/scheduler
    kyber
    /sys/block/*/queue/nr_requests
    /sys/block/*/queue/read_ahead_kb 128
    mdadm --chunk=64 …
    
slide 61:
    Ubuntu Bionic Tuning: Sep 2020 (2/2)
    Networking
    net.core.default_qdisc = fq
    net.core.netdev_max_backlog = 5000
    # may update to 1000
    net.core.rmem_max = 16777216
    net.core.somaxconn = 1024
    # may update to 4096
    net.core.wmem_max = 16777216
    net.ipv4.ip_local_port_range = 10240 65535
    net.ipv4.tcp_abort_on_overflow = 1
    # maybe
    net.ipv4.tcp_congestion_control = bbr
    net.ipv4.tcp_max_syn_backlog = 8192
    net.ipv4.tcp_rmem = 4096 12582912 16777216
    # or 8388608 ...
    net.ipv4.tcp_slow_start_after_idle = 0
    net.ipv4.tcp_syn_retries = 2
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_wmem = 4096 12582912 16777216
    # or 8388608 ...
    Hypervisor
    echo tsc > /sys/devices/…/current_clocksource
    Plus use AWS Nitro
    Other
    net.core.bpf_jit_enable = 1
    sysctl -w kernel.perf_event_max_stack=1000
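    For example, one way to apply and persist such settings (the file name is arbitrary):
    # echo 'net.ipv4.tcp_congestion_control = bbr' >> /etc/sysctl.d/99-tuning.conf
    # sysctl -p /etc/sysctl.d/99-tuning.conf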
    
slide 62:
    Takeaways
    Systems Performance is:
    Observability, Methodologies, Benchmarking, Profiling, Tracing, Tuning
    Print out for your office wall:
    uptime
    dmesg -T | tail
    vmstat 1
    mpstat -P ALL 1
    pidstat 1
    iostat -xz 1
    free -m
    sar -n DEV 1
    sar -n TCP,ETCP 1
    top
    
slide 63:
    Links
    Netflix Tech Blog on Linux:
    http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
    http://techblog.netflix.com/2015/08/netflix-at-velocity-2015-linux.html
    Linux Performance:
    http://www.brendangregg.com/linuxperf.html
    Linux perf:
    https://perf.wiki.kernel.org/index.php/Main_Page
    http://www.brendangregg.com/perf.html
    Linux ftrace:
    https://www.kernel.org/doc/Documentation/trace/ftrace.txt
    https://github.com/brendangregg/perf-tools
    Linux BPF:
    http://www.brendangregg.com/ebpf.html
    http://www.brendangregg.com/bpf-performance-tools-book.html
    https://github.com/iovisor/bcc
    https://github.com/iovisor/bpftrace
    Methodologies:
    http://www.brendangregg.com/USEmethod/use-linux.html
    http://www.brendangregg.com/activebenchmarking.html
    Flame Graphs & FlameScope:
    http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
    http://queue.acm.org/detail.cfm?id=2927301
    https://github.com/Netflix/flamescope
    MSRs and PMCs
    https://github.com/brendangregg/msr-cloud-tools
    https://github.com/brendangregg/pmc-cloud-tools
    
slide 64:
    Thanks
    Q&A in #qa-brendan
    http://slideshare.net/brendangregg
    http://www.brendangregg.com
    bgregg@netflix.com
    @brendangregg
    Nov/Dec 2020
    Auckland, Perth, Singapore, Hong Kong