The USE Method


Boeing 707 Emergency Checklist (1969)

The Utilization Saturation and Errors (USE) Method is a methodology for analyzing the performance of any system. It directs the construction of a checklist, which for server analysis can be used for quickly identifying resource bottlenecks or errors.

The resulting USE Method-derived checklists for different operating systems are listed on the left navigation panel (Linux, Solaris, etc). You can customize these for your environment, adding the tools that your site uses. There is also the Rosetta Stone of Performance Checklists, automatically generated from some of these.

Intro

A serious performance issue arises, and you suspect it's caused by the server. What do you check first?

Getting started can be the hardest part. I developed the USE Method to help students in my systems performance classes get started and identify resource bottlenecks quickly, without overlooking important areas. Like an emergency checklist in a flight manual, it was meant to be simple, straightforward, complete and fast. Others and I have used this methodology successfully many times in enterprise environments, and more recently in cloud computing environments.

The USE Method is based on three metric types and a strategy for approaching a complex system. I find it solves about 80% of server issues with 5% of the effort, and, as I will demonstrate, it can be applied to systems other than servers. It should be thought of as a tool, one that is part of a larger toolbox. There are many problem types it doesn't solve, which will require other methods and longer time spans.

Summary

The USE Method can be summarized as:

For every resource, check utilization, saturation, and errors.

It's intended to be used early in a performance investigation, to identify systemic bottlenecks.

Terminology definitions:

[1] It can be useful to consider some software resources as well, and see which metrics are possible.
[2] There is another definition where utilization describes the proportion of a resource that is used, and so 100% utilization means no more work can be accepted, unlike with the "busy" definition above.

The metrics are usually expressed in the following terms:

Errors should be investigated because they can degrade performance, and may not be immediately noticed when the failure mode is recoverable. This includes operations that fail and are retried, and devices from a pool of redundant devices that fail.

Does Low Utilization Mean No Saturation?

A burst of high utilization can cause saturation and performance issues, even though utilization is low over a long interval. This may be counter-intuitive!

I had a recent example of this where a customer had problems with CPU saturation (latency) even though their monitoring tools showed CPU utilization was never higher than 80%. The monitoring tool was reporting five minute averages, during which CPU utilization hit 100% for seconds at a time.
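To see how an average can hide bursts, here is a minimal sketch in Python. The per-second samples are invented for illustration (they are not the customer's data): each minute has 20 seconds at 100% and 40 seconds at 70%.

```python
def summarize(per_second_util):
    """Return (average, peak, seconds at 100%) for one monitoring interval."""
    average = sum(per_second_util) / len(per_second_util)
    peak = max(per_second_util)
    seconds_at_100 = sum(1 for u in per_second_util if u >= 100.0)
    return average, peak, seconds_at_100


# Hypothetical workload over one five minute (300 second) interval:
# in every minute, 20 seconds at 100% and 40 seconds at 70%.
samples = [100.0 if second % 60 < 20 else 70.0 for second in range(300)]

average, peak, seconds_at_100 = summarize(samples)
print(f"5-minute average: {average:.1f}%")
print(f"peak 1-second value: {peak:.1f}%")
print(f"seconds at 100%: {seconds_at_100}")
```

With these made-up numbers the five minute average is 80%, yet the CPUs were fully busy for 100 of the 300 seconds, which is when run-queue latency would build up.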

Resource List

To begin with, you need a list of resources to iterate through. Here is a generic list for servers:

Some components are two types of resources: storage devices are a service request resource (I/O) and also a capacity resource (population). Both types can become a system bottleneck. Request resources can be defined as queueing systems, which can queue and then service requests.

Some physical components have been left out, such as hardware caches (eg, MMU TLB/TSB, CPU). The USE Method is most effective for resources that suffer performance degradation under high utilization or saturation, leading to a bottleneck. Caches improve performance under high utilization.

Cache hit rates and other performance attributes can be checked after the USE Method - after systemic bottlenecks have been ruled out. If you are unsure whether to include a resource, include it, then see how well the metrics work.

Functional Block Diagram

Another way to iterate over resources is to find or draw a Functional Block Diagram for the system. These also show relationships, which can be very useful when looking for bottlenecks in the flow of data. Here is an example from the Sun Fire V480 Guide (page 82):

I love these diagrams, although they can be hard to come by. Hardware engineers can be the best resource – the people who actually build the things. Or you can try drawing your own.

While determining utilization for the various busses, annotate each bus on the functional diagram with its maximum bandwidth. This results in a diagram where systemic bottlenecks may be identified before a single measurement has been taken. (This is a useful exercise during hardware product design, when physical components can be changed.)
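The annotation exercise can be sketched in a few lines of Python. The bus names and bandwidth figures below are hypothetical placeholders, not values from the Sun Fire V480 Guide; substitute the numbers from your own diagram.

```python
# Hypothetical maximum bandwidths in GB/s, as they might be annotated on a
# functional block diagram. These are illustrative values only.
MAX_BANDWIDTH_GBS = {
    "memory bus": 12.8,
    "CPU interconnect": 6.4,
    "I/O bridge": 4.0,
    "I/O bus": 1.0,
    "storage controller": 2.4,
}


def narrowest_link(path, max_bandwidth):
    """Return the (bus, bandwidth) pair with the lowest maximum on a data path."""
    return min(((bus, max_bandwidth[bus]) for bus in path),
               key=lambda link: link[1])


# A data path read off the diagram: from a disk to main memory.
disk_to_memory = ["storage controller", "I/O bus", "I/O bridge", "memory bus"]

bus, limit = narrowest_link(disk_to_memory, MAX_BANDWIDTH_GBS)
print(f"narrowest link on this path: {bus} ({limit} GB/s)")
```

The lowest annotated maximum on a path is a candidate bottleneck before anything has been measured. Whether it matters in practice still depends on the workload.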

Interconnects

CPU, memory and I/O interconnects are often overlooked. Fortunately, they aren't commonly the system bottleneck. Unfortunately, if they are, it can be difficult to do much about it (maybe you can upgrade the main board, or reduce load: eg, "zero copy" projects lighten memory bus load). With the USE Method, at least you become aware of what you weren't considering: interconnect performance. See Analyzing the HyperTransport for an example of an interconnect issue which I identified with the USE Method.

Metrics

Given the list of resources, consider the metric types: utilization, saturation and errors.

Here are some examples. In the table below, think about each resource and metric type before reading the possible answers, which are described in generic Unix/Linux terms (you can be more specific):

| resource | type | metric |
| --- | --- | --- |
| CPU | utilization | CPU utilization (either per-CPU or a system-wide average) |
| CPU | saturation | run-queue length or scheduler latency |
| Memory capacity | utilization | available free memory (system-wide) |
| Memory capacity | saturation | anonymous paging or thread swapping (maybe "page scanning" too) |
| Network interface | utilization | RX/TX throughput / max bandwidth |
| Storage device I/O | utilization | device busy percent |
| Storage device I/O | saturation | wait queue length |
| Storage device I/O | errors | device errors ("soft", "hard", ...) |

I've left off timing: these metrics are either averages per interval or counts. I've also left off how to fetch them: for your custom checklist, include which OS tool or monitoring software to use, and which statistic to read. For those metrics that aren't available, write "?". You will end up with a checklist that is easy and quick to follow, and is as complete as possible for your system.
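As a worked example of turning counts into an average per interval, the following Python sketch computes "RX/TX throughput / max bandwidth" for one direction of a network interface. It assumes you can sample a cumulative byte counter twice and that you know the link's maximum bandwidth; the function name and the numbers are illustrative.

```python
def interface_utilization(bytes_before, bytes_after, interval_s, max_bits_per_s):
    """Percent utilization for one direction (RX or TX) over the interval.

    bytes_before and bytes_after are two samples of a cumulative byte counter,
    taken interval_s seconds apart (assumed not to have wrapped or reset).
    """
    if interval_s <= 0 or max_bits_per_s <= 0:
        raise ValueError("interval and max bandwidth must be positive")
    bits_transferred = (bytes_after - bytes_before) * 8
    return 100.0 * bits_transferred / (interval_s * max_bits_per_s)


# Hypothetical samples: 500 MB received in 10 seconds on a 1 Gbit/s link.
rx_util = interface_utilization(0, 500_000_000, 10, 1_000_000_000)
print(f"RX utilization: {rx_util:.1f}%")
```

This is an average over the sampling interval, so the interval length should be recorded in the checklist next to the metric.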

Harder Metrics

Now for some harder combinations (again, try to think about these first!):

| resource | type | metric |
| --- | --- | --- |
| CPU | errors | eg, correctable CPU cache ECC events or faulted CPUs (if the OS+HW supports that) |
| Memory capacity | errors | eg, failed malloc()s (although this is usually due to virtual memory exhaustion, not physical) |
| Network | saturation | saturation related NIC or OS events; eg "dropped", "overruns" |
| Storage controller | utilization | depends on the controller; it may have a max IOPS or throughput that can be checked vs current activity |
| CPU interconnect | utilization | per port throughput / max bandwidth (CPU performance counters) |
| Memory interconnect | saturation | memory stall cycles, high CPI (CPU performance counters) |
| I/O interconnect | utilization | bus throughput / max bandwidth (performance counters may exist on your HW; eg, Intel "uncore" events) |

These typically get harder to measure, depending on the OS, and I often have to write my own software to do them (eg, the "amd64htcpu" script from Analyzing the HyperTransport).

Repeat for all combinations, and include instructions for fetching each metric. You'll end up with a list of about thirty metrics, some of which can't be measured, and some of which are tricky to measure. Fortunately, the most common issues are usually found with the easy ones (eg, CPU saturation, memory capacity saturation, network interface utilization, disk utilization), which can be checked first.
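Building the checklist is mechanical, which makes it easy to sketch in Python. The resource names come from the tables above; the `KNOWN_SOURCES` entries are hypothetical placeholders for the tool and statistic you would record for your own OS, and any combination without an entry is written as "?".

```python
from itertools import product

RESOURCES = ["CPU", "Memory capacity", "Network interface", "Storage device I/O"]
METRIC_TYPES = ["utilization", "saturation", "errors"]

# Hypothetical notes on how each metric is fetched on your system.
KNOWN_SOURCES = {
    ("CPU", "utilization"): "per-CPU busy percent",
    ("CPU", "saturation"): "run-queue length",
    ("Memory capacity", "saturation"): "anonymous paging rate",
    ("Storage device I/O", "utilization"): "device busy percent",
}


def build_checklist(resources, metric_types, known_sources):
    """Pair every resource with every metric type; unknown sources become '?'."""
    return [(resource, metric, known_sources.get((resource, metric), "?"))
            for resource, metric in product(resources, metric_types)]


checklist = build_checklist(RESOURCES, METRIC_TYPES, KNOWN_SOURCES)
for resource, metric, source in checklist:
    print(f"{resource:20} {metric:12} {source}")
```

The "?" rows are as useful as the filled ones: they record what has not been checked.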

See the top of this page for the example checklists for Linux, Solaris, Mac OS X, FreeBSD, etc.

In Practice

Reading metrics for every combination on your OS can be very time consuming, especially once you start working through bus and interconnect metrics. You may only have time to check a subset: CPUs, memory capacity, storage capacity, storage device I/O, network interfaces. This is better than it sounds! The USE Method has made you aware of what you didn't check: what were once unknown-unknowns are now known-unknowns. And for that time when it's vital for your company to root cause a performance issue, you already have a to-do list of known extra work that can be performed for more thorough analysis, completing the USE Method for when it's really needed.

It's hoped that the subset of metrics that are easy to check grows over time, as more metrics are added to OSes to make the USE Method easier. Performance monitoring software can also help, adding USE method wizards to do the work for you.

Software Resources

Some software resources can be considered in a similar way. This usually applies to smaller components of software, not entire applications. For example:

Don't sweat this type. If the metrics work well, use them, otherwise software can be left to other methodologies (eg, latency).

Suggested Interpretations

The USE Method helps you identify which metrics to use. After learning how to read them from the operating system, your next task is to interpret their current values. For some, interpretation may be obvious (and well documented). Others are not so obvious, and may depend on workload requirements or expectations.

The following are some general suggestions for interpreting metric types:

It's easy to interpret the negative case: low utilization, no saturation, no errors. This is more useful than it sounds - narrowing down the scope of an investigation can quickly bring focus to the problem area.

Cloud Computing

In a cloud computing environment, software resource controls may be in place to limit or throttle tenants who are sharing one system. At Joyent we primarily use OS virtualization (SmartOS), which imposes memory limits, CPU limits and storage I/O throttling. Each of these resource limits can be examined with the USE Method, similar to examining the physical resources.

For example, in our environment "memory capacity utilization" can be the tenant's memory usage vs its memory cap. "memory capacity saturation" can be seen by anonymous paging activity, even though the traditional Unix page scanner may be idle.
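A resource control can be treated like any other resource. The Python sketch below shows the arithmetic for a memory cap; the function name, its inputs, and the sample values are assumptions for illustration, and how you obtain the usage, cap, and paging rate depends on your platform.

```python
def tenant_memory_use(usage_bytes, cap_bytes, anon_pageins_per_s):
    """Utilization and saturation of a tenant's memory cap.

    Utilization is usage vs the cap; saturation is indicated here by any
    anonymous paging activity (an assumed rule of thumb).
    """
    if cap_bytes <= 0:
        raise ValueError("cap must be positive")
    return {
        "utilization_pct": 100.0 * usage_bytes / cap_bytes,
        "saturated": anon_pageins_per_s > 0,
    }


GIB = 2 ** 30
# Hypothetical tenant: 3 GiB used of a 4 GiB cap, with some anonymous page-ins.
result = tenant_memory_use(3 * GIB, 4 * GIB, anon_pageins_per_s=12)
print(result)
```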

Strategy

The USE Method is pictured as a flowchart below. Note that errors can be checked before utilization and saturation, as a minor optimization (they are usually quicker and easier to interpret).

The USE Method identifies problems which are likely to be system bottlenecks. Unfortunately, systems can be suffering more than one performance problem, and so the first one you find may be a problem but not the problem. Each discovery can be investigated using further methodologies, before continuing the USE Method as needed to iterate over more resources.

Strategies for further analysis include workload characterization and drill-down analysis. After completing these (if needed), you should have evidence for whether the corrective action is to adjust the load applied or to tune the resource itself.
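The iteration described in this section can be sketched as a loop in Python. The 70% utilization threshold and the measurement values are assumptions for illustration, not rules from the method; pick thresholds that suit your workload.

```python
def use_scan(measurements, high_utilization_pct=70.0):
    """Walk every resource, checking errors first, then utilization, then saturation.

    Returns a list of (resource, metric type) findings. The scan does not stop at
    the first finding, because a system can have more than one problem.
    """
    findings = []
    for resource, metrics in measurements.items():
        if metrics.get("errors", 0) > 0:
            findings.append((resource, "errors"))
        elif metrics.get("utilization", 0.0) >= high_utilization_pct:
            findings.append((resource, "utilization"))
        elif metrics.get("saturation", 0) > 0:
            findings.append((resource, "saturation"))
    return findings


# Hypothetical measurements gathered from a checklist.
measurements = {
    "CPU": {"errors": 0, "utilization": 95.0, "saturation": 4},
    "Memory capacity": {"errors": 0, "utilization": 40.0, "saturation": 0},
    "Network interface": {"errors": 12, "utilization": 10.0, "saturation": 0},
    "Storage device I/O": {"errors": 0, "utilization": 30.0, "saturation": 2},
}

findings = use_scan(measurements)
for resource, metric in findings:
    print(f"investigate {resource}: {metric}")
```

Each finding is a starting point for further analysis rather than a diagnosis.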

Apollo

I said earlier that the USE Method could be applied beyond servers. Looking for a fun example, I thought of a system in which I have no expertise at all, and no idea where to start: the Apollo Lunar Module guidance system. The USE Method provides a simple procedure to try.

The first step is to find a list of resources, or better still, a functional block diagram. I found the following in the "Lunar Module - LM10 Through LM14 Familiarization Manual" (1969):

Some of these components may not exhibit utilization or saturation characteristics. After iterating through them, this can be redrawn to only include relevant components. (I'd also include more: the "erasable storage" section of memory, the "core set area" and "vac area" registers.)

I'll start with the Apollo guidance computer (AGC) itself. For each metric, I browsed various LM docs to see what might make sense:

Some of these details may be familiar to space enthusiasts: 1201 ("NO VAC AREAS") and 1202 alarms famously occurred during the Apollo 11 descent. ("VAC" is short for "vector accumulator", extra storage for jobs that process vector quantities; I think Wikipedia's description as "vacant" may be incorrect.)

Given Apollo 11's 1201 alarm, analysis can continue using other methodologies, such as workload characterization. The workload is mostly applied via interrupts, many of which can be seen in the functional diagram. This includes the rendezvous radar, used to track the Command Module, which was interrupting the AGC with work even though the LM was performing descent. This is an example of finding unnecessary work (or low priority work; some updates from the radar may have been desirable so that the LM AGC could immediately calculate an abort trajectory and CM rendezvous if needed).

As a harder example, I'll examine the rendezvous radar as a resource. Errors are the easiest to identify. There are three types: "DATA NO GOOD", "NO TRACK", and "SHAFT- AND TRUNNION-AXIS ERROR" signals. Utilization is harder: one type may be utilization of the drive motors, defined as the time they were busy responding to angle commands (seen in the functional diagram via the "COUPLING DATA UNIT"). I'll need to read the LM docs more to see if there are saturation characteristics either with the drive motors or with the returned radar data.

In a short amount of time, using this methodology, I've gone from having no idea where to start, to having specific metrics to look for and research.

Other Methodologies

While the USE Method may find 80% of server issues, latency-based methodologies (eg, Method R) can approach finding 100% of all issues. However, these can take much more time if you are unfamiliar with software internals. They may be more suited for database administrators or application developers, who already have this familiarity. The USE Method is more suited for junior or senior system administrators, whose responsibility and expertise include the operating system (OS) and hardware. It can also be employed by these other staff when a quick check of system health is desired.

Tools Method

For comparison with the USE Method, I'll describe a tools-based approach (I'll call this "Tools Method"):

  1. List available performance tools (optionally install or purchase more).
  2. For each tool, list useful metrics it provides.
  3. For each metric, list possible interpretation rules.

The result of this is a prescriptive checklist showing which tool to run, which metrics to read, and how to interpret them. While this can be fairly effective, one problem is that it relies exclusively on available (or known) tools, which can provide an incomplete view of the system. The user is also unaware that they have an incomplete view - and so the problem will remain.

The USE Method, instead, iterates over the system resources to create a complete list of questions to ask, then searches for tools to answer them. A more complete view is constructed, and unknown areas are documented and their existence known ("known unknowns"). Based on USE, a similar checklist can be developed showing which tool to run (where available), which metric to read, and how to interpret it.

Another problem can be when iterating through a large number of tools distracts from the goal - to find bottlenecks. The USE Method provides a strategy to find bottlenecks and errors efficiently, even with an unwieldy number of available tools and metrics.
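The difference in coverage between the two approaches can be illustrated with a set difference in Python. The tool names (`tool_a`, `tool_b`) and the metrics they cover are hypothetical.

```python
RESOURCES = ["CPU", "Memory capacity", "Network interface"]
METRIC_TYPES = ["utilization", "saturation", "errors"]

# Tools Method view: start from the tools and list what each one reports.
TOOL_METRICS = {
    "tool_a": {("CPU", "utilization"), ("CPU", "saturation"),
               ("Memory capacity", "utilization")},
    "tool_b": {("Memory capacity", "saturation"),
               ("Network interface", "utilization")},
}


def known_unknowns(resources, metric_types, tool_metrics):
    """USE view: list every resource/metric question that no tool answers."""
    wanted = {(r, m) for r in resources for m in metric_types}
    covered = set().union(*tool_metrics.values())
    return sorted(wanted - covered)


gaps = known_unknowns(RESOURCES, METRIC_TYPES, TOOL_METRICS)
for resource, metric in gaps:
    print(f"no tool for: {resource} {metric}")
```

Starting from the tools, the four gaps printed here would never appear on the checklist. Starting from the resources, they are documented as known unknowns.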

Conclusion

The USE Method is a simple strategy you can use to perform a complete check of system health, identifying common bottlenecks and errors. It can be deployed early in the investigation to quickly identify problem areas, which can then be studied in more detail using other methodologies, if need be. The strength of USE is its speed and visibility: by considering all resources, you are unlikely to overlook any issues. It will, however, only find certain types of issues (bottlenecks and errors) and should be considered as one tool in a larger toolbox.

I explained the USE Method on this page and provided generic examples of metrics. See the example checklists in the left navigation pane for specific operating systems, where tools and metrics to apply the USE Method are suggested.

Also see the complementary thread-based methodology, the TSA Method.

Acknowledgments

Updates

USE Method updates:

Updates 2: 05-Apr-2014


Last updated: 05-Apr-2014