USE Method: Solaris Performance Checklist

The USE Method provides a strategy for performing a complete check of system health, identifying common bottlenecks and errors. For each system resource, metrics for utilization, saturation and errors are identified and checked. Any issues discovered are then investigated using further strategies.

This is an example USE-based metric list for the Solaris family of operating systems. I'm writing this for later Solaris 10 releases, Oracle Solaris 11, and illumos-based systems such as SmartOS and OmniOS. It is primarily intended for system administrators of the physical systems (not tenants of cloud or zone instances; for those users, see my SmartOS performance checklist).

Physical Resources

component | type | metric
CPU | utilization | per-cpu: mpstat 1, "usr" + "sys"; system-wide: vmstat 1, "us" + "sy"; per-process: prstat -c 1 ("CPU" == recent), prstat -mLc 1 ("USR" + "SYS"); per-kernel-thread: lockstat -Ii rate, DTrace profile stack()
CPU | saturation | system-wide: uptime, load averages; vmstat 1, "r"; DTrace dispqlen.d (DTT) for a better "vmstat r"; per-process: prstat -mLc 1, "LAT"
CPU | errors | fmadm faulty; cpustat (CPC) for whatever error counters are supported (e.g., thermal throttling)
Memory capacity | utilization | system-wide: vmstat 1, "free" (main memory), "swap" (virtual memory); per-process: prstat -c, "RSS" (main memory), "SIZE" (virtual memory)
Memory capacity | saturation | system-wide: vmstat 1, "sr" (bad now), "w" (was very bad); vmstat -p 1, "api" (anon page-ins == pain), "apo"; per-process: prstat -mLc 1, "DFL"; DTrace anonpgpid.d (DTT), vminfo:::anonpgin on execname
Memory capacity | errors | fmadm faulty and prtdiag for physical failures; fmstat -s -m cpumem-retire (ECC events); DTrace failed malloc()s
Network Interfaces | utilization | nicstat (latest version here); kstat; dladm show-link -s -i 1 interface
Network Interfaces | saturation | nicstat; kstat for whatever custom statistics are available (e.g., "nocanputs", "defer", "norcvbuf", "noxmtbuf"); netstat -s, retransmits
Network Interfaces | errors | netstat -i, error counters; dladm show-phys; kstat for extended errors, look in the interface and "link" statistics (there are often custom counters for the card); DTrace for driver internals
Storage device I/O | utilization | system-wide: iostat -xnz 1, "%b"; per-process: DTrace iotop
Storage device I/O | saturation | iostat -xnz 1, "wait"; DTrace iopending (DTT), sdqueue.d (DTB)
Storage device I/O | errors | iostat -En; DTrace the I/O subsystem, e.g., ideerr.d (DTB), satareasons.d (DTB), scsireasons.d (DTB), sdretry.d (DTB)
Storage capacity | utilization | swap: swap -s; file systems: df -h; plus other commands depending on FS type
Storage capacity | saturation | not sure this one makes sense; once it's full, writes return ENOSPC
Storage capacity | errors | DTrace; /var/adm/messages file-system-full messages
Storage controller | utilization | iostat -Cxnz 1, compare to known IOPS/throughput limits per card
Storage controller | saturation | look for kernel queueing: sd (iostat "wait" again), ZFS zio pipeline
Storage controller | errors | DTrace the driver, e.g., mptevents.d (DTB); /var/adm/messages
Network controller | utilization | infer from nicstat and known controller max throughput
Network controller | saturation | see network interface saturation
Network controller | errors | kstat for whatever is there / DTrace
CPU interconnect | utilization | cpustat (CPC) for CPU interconnect ports, throughput / max (e.g., see the amd64htcpu script)
CPU interconnect | saturation | cpustat (CPC) for stall cycles
CPU interconnect | errors | cpustat (CPC) for whatever is available
Memory interconnect | utilization | cpustat (CPC) for memory busses, throughput / max; or CPI greater than, say, 5; CPC may also have local vs. remote counters
Memory interconnect | saturation | cpustat (CPC) for stall cycles
Memory interconnect | errors | cpustat (CPC) for whatever is available
I/O interconnect | utilization | busstat (SPARC only); cpustat for throughput / max if available; inference via known throughput from iostat/nicstat/...
I/O interconnect | saturation | cpustat (CPC) for stall cycles
I/O interconnect | errors | cpustat (CPC) for whatever is available
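
As a quick first pass over the physical resources, a handful of the commands from the table above can be run in sequence. This is only a minimal sketch, not a substitute for the table: the intervals and counts are arbitrary, nicstat is a separate install, and fmadm usually needs root privileges.

    # CPU: utilization ("usr" + "sys", "us" + "sy") and saturation ("r", "LAT")
    mpstat 1 5
    vmstat 1 5
    prstat -mLc 1 5
    # Memory capacity: "free", "swap", scan rate "sr", anonymous page-ins "api"
    vmstat -p 1 5
    # Network interfaces: throughput and error counters
    nicstat 1 5
    netstat -i
    # Storage device I/O: "%b", "wait", and per-device error counters
    iostat -xnz 1 5
    iostat -En
    # Hardware faults (CPU, memory, and other components)
    fmadm faulty

Anything that stands out can then be chased into the controller and interconnect rows, which mostly rely on cpustat (CPC) and inference from known hardware limits.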

Software Resources

component | type | metric
Kernel mutex | utilization | lockstat -H (held time); DTrace lockstat provider
Kernel mutex | saturation | lockstat -C (contention); DTrace lockstat provider; spinning shows up with dtrace -n 'profile-997 { @[stack()] = count(); }'
Kernel mutex | errors | lockstat -E, e.g., recursive mutex enter (other errors can cause kernel lockup/panic; debug with mdb -k)
User mutex | utilization | plockstat -H (held time); DTrace plockstat provider
User mutex | saturation | plockstat -C (contention); prstat -mLc 1, "LCK"; DTrace plockstat provider
User mutex | errors | DTrace plockstat and pid providers, for EAGAIN, EINVAL, EPERM, EDEADLK, ENOMEM, EOWNERDEAD, ...; see pthread_mutex_lock(3C)
Process capacity | utilization | sar -v, "proc-sz"; kstat, "unix:0:var:v_proc" for the max, "unix:0:system_misc:nproc" for current; DTrace (`nproc vs `max_nprocs)
Process capacity | saturation | not sure this makes sense; you might get queueing on pidlinklock in pid_allocate(), as it scans for available slots once the table gets full
Process capacity | errors | "can't fork()" messages
Thread capacity | utilization | user-level: kstat, "unix:0:lwp_cache:buf_inuse" for current, prctl -n zone.max-lwps -i zone ZONE for the max; kernel: mdb -k or DTrace, "nthread" for current, limited by memory
Thread capacity | saturation | threads blocking on memory allocation; at this point the page scanner should be running (vmstat "sr"), else examine using DTrace/mdb
Thread capacity | errors | user-level: pthread_create() failures with EAGAIN, EINVAL, ...; kernel: thread_create() blocks for memory but won't fail
File descriptors | utilization | system-wide: no limit other than RAM; per-process: pfiles vs ulimit or prctl -t basic -n process.max-file-descriptor PID; a quicker check than pfiles is "ls /proc/PID/fd | wc -l"
File descriptors | saturation | does this make sense? I don't think there is any queueing or blocking, other than on memory allocation
File descriptors | errors | truss or DTrace (better) to look for errno == EMFILE on syscalls returning fds (e.g., open(), accept(), ...)
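
A similar quick pass over the software resources might look like the following. This is a sketch under assumptions: PID stands in for a process of interest, the sampling durations are arbitrary, and lockstat/plockstat need root or the relevant privileges.

    # Kernel mutex contention, sampled while a 5 second sleep runs
    lockstat -C sleep 5
    # User-level mutex contention for one process, sampled for 5 seconds
    plockstat -C -e 5 -p PID
    # Per-thread lock wait ("LCK") and dispatcher queue latency ("LAT")
    prstat -mLc 1 5
    # Process table: current count vs configured maximum
    kstat -p unix:0:system_misc:nproc
    kstat -p unix:0:var:v_proc
    # Open file descriptors for one process vs its limit
    ls /proc/PID/fd | wc -l
    prctl -t basic -n process.max-file-descriptor PID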

What's Next

See the USE Method for the follow-up strategies after identifying a possible bottleneck. If you complete this checklist but still have a performance issue, move on to other strategies: drill-down analysis and latency analysis.


Last updated: 29-Sep-2013