USE Method: Linux Performance Checklist

The USE Method provides a strategy for performing a complete check of system health, identifying common bottlenecks and errors. For each system resource, metrics for utilization, saturation and errors are identified and checked. Any issues discovered are then investigated using further strategies.

This is an example USE-based metric list for Linux operating systems (eg, Ubuntu, CentOS, Fedora). It is primarily intended for system administrators of physical systems who are using command line tools. Some of these metrics can be found in remote monitoring tools.
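
To show what a first pass might look like in practice, here is a minimal sketch that runs the main system-wide checks from the tables below. It assumes the sysstat package (sar, iostat, mpstat) is installed; exact column names vary with the tool version.

  vmstat 1 5          # CPU: "us"+"sy"+"st" (utilization), "r" > CPU count (saturation);
                      #   memory: "free", plus "si"/"so" swapping as saturation
  mpstat -P ALL 1 5   # per-CPU utilization; one hot CPU can hide behind a low average
  iostat -xz 1 5      # storage devices: "%util" (utilization), "avgqu-sz"/"await" (saturation)
  sar -n DEV 1 5      # network interfaces: "rxkB/s"/"txkB/s" vs the known line rate
  sar -n EDEV 1 5     # network interface errors and drops
  dmesg | tail -n 50  # recent kernel errors (OOM killer, I/O errors, link resets)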

Physical Resources

component | type | metric
CPU | utilization | system-wide: vmstat 1, "us" + "sy" + "st"; sar -u, sum fields except "%idle" and "%iowait"; dstat -c, sum fields except "idl" and "wai"; per-cpu: mpstat -P ALL 1, sum fields except "%idle" and "%iowait"; sar -P ALL, same as mpstat; per-process: top, "%CPU"; htop, "CPU%"; ps -o pcpu; pidstat 1, "%CPU"; per-kernel-thread: top/htop ("K" to toggle), where VIRT == 0 (heuristic). [1]
CPU | saturation | system-wide: vmstat 1, "r" > CPU count [2]; sar -q, "runq-sz" > CPU count; dstat -p, "run" > CPU count; per-process: /proc/PID/schedstat 2nd field (sched_info.run_delay; see the sketch after this table); perf sched latency (shows "Average" and "Maximum" delay per-schedule); dynamic tracing, eg, SystemTap schedtimes.stp "queued(us)" [3]
CPU | errors | perf (LPE) if processor-specific error events (CPC) are available; eg, AMD64's "04Ah Single-bit ECC Errors Recorded by Scrubber" [4]
Memory capacity | utilization | system-wide: free -m, "Mem:" (main memory), "Swap:" (virtual memory); vmstat 1, "free" (main memory), "swap" (virtual memory); sar -r, "%memused"; dstat -m, "free"; slabtop -s c for kmem slab usage; per-process: top/htop, "RES" (resident main memory), "VIRT" (virtual memory), "Mem" for system-wide summary
Memory capacity | saturation | system-wide: vmstat 1, "si"/"so" (swapping); sar -B, "pgscank" + "pgscand" (scanning); sar -W; per-process: 10th field (min_flt) from /proc/PID/stat for minor-fault rate (see the sketch after this table), or dynamic tracing [5]; OOM killer: dmesg | grep killed
Memory capacity | errors | dmesg for physical failures; dynamic tracing, eg, SystemTap uprobes for failed malloc()s
Network Interfaces | utilization | sar -n DEV 1, "rxKB/s"/max, "txKB/s"/max; ip -s link, RX/TX tput / max bandwidth; /proc/net/dev, "bytes" RX/TX tput/max; nicstat "%Util" [6]
Network Interfaces | saturation | ifconfig, "overruns", "dropped"; netstat -s, "segments retransmited"; sar -n EDEV, *drop and *fifo metrics; /proc/net/dev, RX/TX "drop"; nicstat "Sat" [6]; dynamic tracing for other TCP/IP stack queueing [7]
Network Interfaces | errors | ifconfig, "errors", "dropped"; netstat -i, "RX-ERR"/"TX-ERR"; ip -s link, "errors"; sar -n EDEV, "rxerr/s", "txerr/s"; /proc/net/dev, "errs", "drop"; extra counters may be under /sys/class/net/...; dynamic tracing of driver function returns
Storage device I/O | utilization | system-wide: iostat -xz 1, "%util"; sar -d, "%util"; per-process: iotop; pidstat -d; /proc/PID/sched "se.statistics.iowait_sum"
Storage device I/O | saturation | iostat -xnz 1, "avgqu-sz" > 1, or high "await"; sar -d same; LPE block probes for queue length/latency; dynamic/static tracing of I/O subsystem (incl. LPE block probes)
Storage device I/O | errors | /sys/devices/.../ioerr_cnt; smartctl; dynamic/static tracing of I/O subsystem response codes [8]
Storage capacity | utilization | swap: swapon -s; free; /proc/meminfo "SwapFree"/"SwapTotal"; file systems: "df -h"
Storage capacity | saturation | not sure this one makes sense - once it's full, ENOSPC
Storage capacity | errors | strace for ENOSPC; dynamic tracing for ENOSPC; /var/log/messages errs, depending on FS
Storage controller | utilization | iostat -xz 1, sum devices and compare to known IOPS/tput limits per-card
Storage controller | saturation | see storage device saturation, ...
Storage controller | errors | see storage device errors, ...
Network controller | utilization | infer from ip -s link (or /proc/net/dev) and known controller max tput for its interfaces
Network controller | saturation | see network interface saturation, ...
Network controller | errors | see network interface errors, ...
CPU interconnect | utilization | LPE (CPC) for CPU interconnect ports, tput / max
CPU interconnect | saturation | LPE (CPC) for stall cycles
CPU interconnect | errors | LPE (CPC) for whatever is available
Memory interconnect | utilization | LPE (CPC) for memory busses, tput / max; or CPI greater than, say, 5; CPC may also have local vs remote counters
Memory interconnect | saturation | LPE (CPC) for stall cycles
Memory interconnect | errors | LPE (CPC) for whatever is available
I/O interconnect | utilization | LPE (CPC) for tput / max if available; inference via known tput from iostat/ip/...
I/O interconnect | saturation | LPE (CPC) for stall cycles
I/O interconnect | errors | LPE (CPC) for whatever is available
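
A couple of the per-process checks above can be spelled out as one-liners; this is only a sketch, where 1234 is a placeholder PID, and reading another user's /proc entries may require root:

  pid=1234    # placeholder: substitute a process of interest
  # CPU saturation: /proc/PID/schedstat holds time-on-CPU, run_delay, and timeslice
  # counts; a steadily growing 2nd field (nanoseconds) means time waiting on a run queue.
  awk '{ print "run_delay (ns):", $2 }' /proc/$pid/schedstat
  # Memory capacity saturation: minor faults are the 10th field of /proc/PID/stat
  # (field counting assumes the comm field, in parentheses, contains no spaces).
  awk '{ print "min_flt:", $10 }' /proc/$pid/stat
  # Memory capacity errors: check whether the OOM killer has fired recently.
  dmesg | grep -i "killed process"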

Software Resources

component | type | metric
Kernel mutex | utilization | With CONFIG_LOCK_STAT=y, /proc/lock_stat "holdtime-totat" / "acquisitions" (also see "holdtime-min", "holdtime-max") [8]; dynamic tracing of lock functions or instructions (maybe)
Kernel mutex | saturation | With CONFIG_LOCK_STAT=y, /proc/lock_stat "waittime-total" / "contentions" (also see "waittime-min", "waittime-max"); dynamic tracing of lock functions or instructions (maybe); spinning shows up with profiling (perf record -a -g -F 997 ..., oprofile, dynamic tracing)
Kernel mutex | errors | dynamic tracing (eg, recursive mutex enter); other errors can cause kernel lockup/panic, debug with kdump/crash
User mutex | utilization | valgrind --tool=drd --exclusive-threshold=... (held time); dynamic tracing of lock to unlock function time
User mutex | saturation | valgrind --tool=drd to infer contention from held time; dynamic tracing of synchronization functions for wait time; profiling (oprofile, LPE, ...) user stacks for spins
User mutex | errors | valgrind --tool=drd various errors; dynamic tracing of pthread_mutex_lock() for EAGAIN, EINVAL, EPERM, EDEADLK, ENOMEM, EOWNERDEAD, ...
Task capacity | utilization | top/htop, "Tasks" (current); sysctl kernel.threads-max, /proc/sys/kernel/threads-max (max)
Task capacity | saturation | threads blocking on memory allocation; at this point the page scanner should be running (sar -B "pgscan*"), else examine using dynamic tracing
Task capacity | errors | "can't fork()" errors; user-level threads: pthread_create() failures with EAGAIN, EINVAL, ...; kernel: dynamic tracing of kernel_thread() ENOMEM
File descriptors | utilization | system-wide: sar -v, "file-nr" vs /proc/sys/fs/file-max; dstat --fs, "files"; or just /proc/sys/fs/file-nr; per-process: ls /proc/PID/fd | wc -l vs ulimit -n (see the sketch after this table)
File descriptors | saturation | does this make sense? I don't think there is any queueing or blocking, other than on memory allocation.
File descriptors | errors | strace errno == EMFILE on syscalls returning fds (eg, open(), accept(), ...).
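
Again as a sketch, the file descriptor and task capacity checks can be run as follows; 1234 is a placeholder PID, and /proc/lock_stat only exists on kernels built with CONFIG_LOCK_STAT=y:

  pid=1234    # placeholder: substitute a process of interest
  # File descriptors, system-wide: allocated, unused, and maximum.
  cat /proc/sys/fs/file-nr
  # File descriptors, per-process: open fds vs that process's own limit
  # (ulimit -n reports the current shell's limit; /proc/PID/limits shows the target's).
  ls /proc/$pid/fd | wc -l
  grep "Max open files" /proc/$pid/limits
  # Task capacity: current thread count (includes one header line) vs the maximum.
  ps -eLf | wc -l
  cat /proc/sys/kernel/threads-max
  # Kernel mutex contention, if lock statistics are enabled.
  head -n 30 /proc/lock_stat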

What's Next

See the USE Method for the follow-up strategies after identifying a possible bottleneck. If you complete this checklist but still have a performance issue, move on to other strategies: drill-down analysis and latency analysis.

Acknowledgements

Filling in this checklist has required a lot of research, testing, and experimentation. Please reference back to this post if it helps you develop related material.

It's quite possible I've missed something or included the wrong metric somewhere (sorry); I'll update the post to fix these up as they are understood.


Last updated: 29-Sep-2013