Chapter 2. Methodology

Performance issues can arise from software, hardware, and any component along the data path. Methodologies help us approach complex systems by showing where to start and what steps to take to locate and analyze performance issues. [p15]

Terminology

The following are key terms for systems performance. Later chapters provide additional terms and describe some of these in different contexts.

Models

System under Test

The performance of a system under test (SUT) is shown below:

Figure 2.1 System under test

Perturbations (interference) can affect results, including those caused by:

The origin of the perturbations may not be clear, and determining it can be particularly difficult in some cloud environments, where other activity (by guest tenants) on the physical host system is not observable from within a guest SUT.

Another difficulty is that modern environments may be composed of several networked components needed to service the input workload, including load balancers, web servers, database servers, application servers, and storage systems. The mere act of mapping the environment may help to reveal previously overlooked sources of perturbations. The environment may also be modeled as a network of queueing systems, for analytical study.

Queueing System

Some components and resources can be modeled as a queueing system. The following figure shows a simple queueing system.

Figure 2.2 Simple queueing model

Concepts

Latency

Latency is the time spent waiting before an operation is performed. As an example of latency, the following figure shows a network transfer (e.g., an HTTP GET request):

Figure 2.3 Network connection latency

In this example, the operation is a network service request to transfer data. Before this operation can take place, the system must wait for a network connection to be established, which is latency for this operation. The response time spans this latency and the operation time.

Depending on the target, the latency can be measured differently. For example, the load time for a website may be composed of three different times:

At a higher level, the response time may be termed latency. [p19]

Time orders of magnitude and their abbreviations are listed in the following table:

Unit Abbreviation Fraction of 1 s
Minute m 60
Second s 1
Millisecond ms 0.001 or 1/1000 or 1 × 10^-3
Microsecond μs 0.000001 or 1/1000000 or 1 × 10^-6
Nanosecond ns 0.000000001 or 1/1000000000 or 1 × 10^-9
Picosecond ps 0.000000000001 or 1/1000000000000 or 1 × 10^-12
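As a small illustration of converting between these units (and of converting other metrics to time for comparison, as described below), here is a sketch of a conversion helper. The unit names and factors come from the table above; the helper itself is illustrative.

```python
# Time-unit conversion helper, using the fractions from the table above.
SECONDS_PER_UNIT = {
    "m": 60.0,      # minute
    "s": 1.0,       # second
    "ms": 1e-3,     # millisecond
    "us": 1e-6,     # microsecond
    "ns": 1e-9,     # nanosecond
    "ps": 1e-12,    # picosecond
}

def convert(value, from_unit, to_unit):
    """Convert a latency between the units listed in the table."""
    return value * SECONDS_PER_UNIT[from_unit] / SECONDS_PER_UNIT[to_unit]
```

For example, `convert(150, "us", "ms")` expresses a 150 μs SSD I/O as 0.15 ms, making it directly comparable with millisecond-scale disk latencies.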

When possible, other metric types can be converted to latency or time so that they can be compared. For example:

Time Scales

System components operate over vastly different time scales (orders of magnitude).

The following table is an example Time Scale of System Latencies (3.3 GHz processor):

Event Latency Scaled
1 CPU cycle 0.3 ns 1 s
Level 1 cache access 0.9 ns 3 s
Level 2 cache access 2.8 ns 9 s
Level 3 cache access 12.9 ns 43 s
Main memory access (DRAM, from CPU) 120 ns 6 min
Solid-state disk I/O (flash memory) 50–150 μs 2–6 days
Rotational disk I/O 1–10 ms 1–12 months
Internet: San Francisco to New York 40 ms 4 years
Internet: San Francisco to United Kingdom 81 ms 8 years
Internet: San Francisco to Australia 183 ms 19 years
TCP packet retransmit 1–3 s 105–317 years
OS virtualization system reboot 4 s 423 years
SCSI command time-out 30 s 3 millennia
Hardware (HW) virtualization system reboot 40 s 4 millennia
Physical system reboot 5 m 32 millennia
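The "scaled" column can be reproduced by dividing each latency by the duration of one CPU cycle (0.3 ns at ~3.3 GHz), then presenting that cycle as if it took 1 second. A minimal sketch:

```python
# Reproduce the "scaled" column: each latency divided by one CPU cycle
# (0.3 ns at ~3.3 GHz), with that cycle presented as 1 second.
CYCLE_S = 0.3e-9  # one CPU cycle, in seconds

def scaled_seconds(latency_s):
    """Latency expressed in 'scaled' seconds, where 1 cycle = 1 s."""
    return latency_s / CYCLE_S
```

For example, a 120 ns main memory access scales to 400 s, which rounds to the table's "6 min"; a 4 s reboot scales to about 1.3 × 10^10 s, roughly the table's 423 years.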

Trade-offs

Be aware of some common performance trade-offs. The figure below shows the good/fast/cheap "pick two" trade-off on the left alongside the terminology adjusted for IT projects on the right.

Figure 2.4 Trade-offs: pick two

A common trade-off in performance tuning is that between CPU and memory:

[p21]

Tunable parameters often come with trade-offs. For example:

Tuning Efforts

Performance tuning is most effective when done closest to where the work is performed (e.g., within the application itself). The following table shows an example software stack, with tuning possibilities.

Layer Tuning Targets
Application database queries performed
Database database table layout, indexes, buffering
System calls memory-mapped or read/write, sync or async I/O flags
File system record size, cache size, file system tunables
Storage RAID level, number and type of disks, storage tunables
Application Level *

Tuning at the application level may improve performance significantly due to the following reasons:

  1. It may be possible to eliminate or reduce database queries and improve performance by a large factor (e.g., 20x).
    • By contrast, tuning down at the storage device level may eliminate or speed up storage I/O, but higher-level OS stack code has already been executed by then, so this may improve resulting application performance by only percentages (e.g., 20%).
  2. Since many of today’s environments target rapid deployment for features and functionality, application development and testing tend to focus on correctness, leaving little or no time for performance measurement or optimization before production deployment. These activities are conducted later, when performance becomes a problem.

The application isn’t necessarily the most effective level from which to base observation. Slow queries may be best understood from their time spent on-CPU, or from the file system and disk I/O that they perform. These are observable from operating system tools.

In many environments (especially cloud computing), the application level is under constant development, pushing software changes into production weekly or daily. Large performance improvements (including fixes for regressions) are frequently found as the application code changes. In these environments, tuning for the operating system and observability from the operating system can be easy to overlook. Remember that operating system performance analysis can also identify application-level issues, not just OS-level issues, in some cases more easily than from the application alone.

Level of Appropriateness

Different organizations and environments have different requirements for performance [p22]. This doesn’t necessarily mean that some organizations are doing it right and some wrong. It depends on the return on investment (ROI) for performance expertise:

Point-in-Time Recommendations

The performance characteristics of environments change over time, due to the addition of more users, newer hardware, and updated software or firmware.

[p23]

Performance recommendations, especially the values of tunable parameters, are valid only at a specific point in time. What may have been the best advice from a performance expert one week may become invalid a week later after a software or hardware upgrade, or after adding more users.

Load versus Architecture

An application can perform badly due to an issue with the software configuration and hardware on which it is running: its architecture. However, an application can also perform badly simply due to too much load applied, resulting in queueing and long latencies. Load and architecture are pictured in the figure below:

Figure 2.5 Load versus architecture

If analysis of the architecture shows queueing of work but no problems with how the work is performed, the issue may be one of too much load applied. In a cloud computing environment, this is the point where more nodes can be introduced to handle the work.

Single-threaded and multithreaded application *

For example,

Scalability

The performance of the system under increasing load is its scalability. The following figure shows a typical throughput profile as a system’s load increases:

Figure 2.6 Throughput versus load

For some period, linear scalability is observed. A point is then reached, marked with a dotted line, where contention for a resource begins to affect performance. This point can be described as a knee point, as it is the boundary between two profiles. Beyond this point, the throughput profile departs from linear scalability, as contention for the resource increases. Eventually the overheads for increased contention and coherency cause less work to be completed and throughput to decrease.

This point may occur when a component reaches 100% utilization: the saturation point. It may also occur when a component approaches 100% utilization, and queueing begins to be frequent and significant.

An example system that may exhibit this profile is an application that performs heavy compute, with more load added as threads. As the CPUs approach 100% utilization, performance begins to degrade as CPU scheduler latency increases. After peak performance, at 100% utilization, throughput begins to decrease as more threads are added, causing more context switches, which consume CPU resources and cause less actual work to be completed.

The same curve can be seen if you replace "load" on the x-axis with a resource such as CPU cores (detailed in Modeling).
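One way to model this throughput profile, including the eventual retrograde region, is Gunther's Universal Scalability Law (covered under Modeling later in the book): contention is modeled by a parameter α and coherency cost by β. The parameter values here are illustrative, not taken from the text:

```python
# Universal Scalability Law sketch: relative throughput at load n,
# where alpha models contention and beta models coherency cost.
# Parameter values are illustrative assumptions.
def usl_throughput(n, alpha=0.05, beta=0.001):
    """Relative throughput (1.0 = one unit of load with no overheads)."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))
```

With β > 0, throughput rises, passes a knee, and eventually decreases as coherency overheads dominate, matching the shape of Figure 2.6.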

The degradation of performance for nonlinear scalability, in terms of average response time or latency, is graphed in the following figure:

Figure 2.7 Performance degradation

Linear scalability of response time could occur if the application begins to return errors when resources are unavailable, instead of queueing work. For example, a web server may return 503 "Service Unavailable" instead of adding requests to a queue, so that those requests that are served can be performed with a consistent response time.

Known-Unknowns

The following notions are important:

Performance is a field where "the more you know, the more you don’t know". It’s the same principle: the more you learn about systems, the more unknown-unknowns you become aware of, which then become known-unknowns that you can check on.

Metrics

Performance metrics are statistics generated by the system, applications, or additional tools that measure activity of interest. They are studied for performance analysis and monitoring, either numerically at the command line or graphically using visualizations.

Common types of systems performance metrics include:

Overhead

CPU cycles must be spent to gather and store metrics. This causes overhead, which can negatively affect the performance of the target of measurement. This is called the observer effect.

Issues

The temptation is to assume that the software vendor has provided metrics that are well chosen, are bug-free, and provide complete visibility. In reality, metrics can be confusing, complicated, unreliable, inaccurate, and even plain wrong (due to bugs). Sometimes a metric was correct on one software version but did not get updated to reflect the addition of new code and code paths.

Utilization

The term utilization is often used for operating systems to describe device usage, such as for the CPU and disk devices. Utilization can be time-based or capacity-based.

Time-Based

Time-based utilization is defined in queueing theory as the average amount of time the server or resource was busy, given by the ratio:

U = B/T

where:

  • U = utilization
  • B = total time the system was busy during T
  • T = observation period

This "utilization" is also available from operating system performance tools. The disk monitoring tool iostat(1) calls this metric %b for percent busy, a term that better conveys the underlying metric: B/T.

This utilization metric shows how busy a component is: when a component approaches 100% utilization, performance can seriously degrade if there is contention for the resource. Other metrics can be checked to confirm this, and to see if the component has therefore become a system bottleneck.
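The U = B/T ratio can be computed directly from measured busy intervals. A minimal sketch, assuming non-overlapping (start, end) busy intervals within the observation period:

```python
# Time-based utilization U = B/T: total busy time over the observation
# interval. Busy intervals are (start, end) pairs in seconds, assumed
# non-overlapping for this sketch.
def utilization(busy_intervals, period):
    busy = sum(end - start for start, end in busy_intervals)
    return busy / period
```

For example, a disk busy for a total of 0.6 s during a 1 s interval is 60% utilized, which iostat(1) would report as %b = 60.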

Some components can service multiple operations in parallel. Performance may not degrade much at 100% utilization, as they can accept more work. [p28]

Capacity-Based

The other definition of utilization in the context of capacity planning is:

A system or component (such as a disk drive) is able to deliver a certain amount of throughput. At any level of performance, the system or component is working at some proportion of its capacity. That proportion is called the utilization.

This defines utilization in terms of capacity instead of time. It implies that a disk at 100% utilization cannot accept any more work. With the time-based definition, 100% utilization only means it is busy 100% of the time. Therefore, 100% busy does not mean 100% capacity.

Time-Based vs. Capacity-Based *

Use elevator as an example:

In an ideal world, we would be able to measure both types of utilization for a device, but that usually isn't possible. [p29]

In this book, utilization usually refers to the time-based version. The capacity version is used for some volume-based metrics, such as memory usage.

Non-Idle Time

Non-idle time is a more accurate term to define utilization, but not yet in common usage. [p29]

Saturation

Saturation is the degree to which more work is requested of a resource than it can process. Saturation begins to occur at 100% utilization (capacity-based), as extra work cannot be processed and begins to queue. This is pictured in the following figure:

Figure 2.8 Utilization versus saturation

Any degree of saturation is a performance issue, as time is spent waiting (latency). For time-based utilization (percent busy), saturation may not begin at the 100% utilization mark, depending on the degree to which the resource can operate on work in parallel. [p30]
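The latency cost of operating near saturation can be illustrated with a simple queueing result (queueing theory is covered as its own methodology later in this chapter). For an M/M/1 queue, an assumption used here only for illustration, the mean residence time is R = S / (1 − U), where S is the service time:

```python
# How waiting grows with utilization: for an M/M/1 queue the mean
# residence time is R = S / (1 - U), where S is the service time.
# (M/M/1 is an illustrative assumption, not the only queueing model.)
def residence_time(service_time, utilization):
    if utilization >= 1.0:
        raise ValueError("saturated: the queue grows without bound")
    return service_time / (1.0 - utilization)
```

At 50% utilization, response time is already double the service time; at 90%, it is ten times the service time, showing why queueing delays dominate as utilization approaches 100%.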

Profiling

Profiling is typically performed by sampling the state of the system at timed intervals, and then studying the set of samples.

Unlike the previous metrics covered, including IOPS and throughput, the use of sampling provides a coarse view of the target’s activity, depending on the rate of sampling.

For example, CPU usage can be understood in reasonable detail by sampling the CPU program counter or stack backtrace at frequent intervals to gather statistics on the code paths that are consuming CPU resources, which is detailed in Chapter 6.
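The sampling idea can be sketched in a few lines: periodically capture where the target is executing, and tally the samples. Real profilers sample the program counter or full stack backtrace; this toy version samples only the current function name of the main thread, and the interval and workload are illustrative:

```python
# Minimal sampling "profiler" sketch: periodically sample the main
# thread's current function name and tally the results.
import collections
import sys
import threading
import time

def sampler(main_id, interval, samples, stop):
    # Sample the main thread's current frame at timed intervals.
    while not stop.is_set():
        frame = sys._current_frames().get(main_id)
        if frame is not None:
            samples[frame.f_code.co_name] += 1
        time.sleep(interval)

def busy_work(seconds):
    end = time.time() + seconds
    while time.time() < end:
        pass  # burn CPU so the sampler catches us here

samples = collections.Counter()
stop = threading.Event()
t = threading.Thread(target=sampler,
                     args=(threading.get_ident(), 0.01, samples, stop))
t.start()
busy_work(0.3)
stop.set()
t.join()
# most samples land in busy_work, where the CPU time was spent
```

As the text notes, the view is coarse: a 10 ms sampling interval cannot see anything shorter-lived than that, which is the trade-off for low overhead.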

Caching

Frequently used to improve performance, a cache stores results from a slower storage tier in a faster storage tier for reference. An example is caching disk blocks in main memory (RAM).

Caching is detailed in Section 3.2.11.

Cache metrics *

Hit ratio is a metric of cache performance. It represents the number of times the needed data was found in the cache (hits) versus the number of times it was not (misses): hit ratio = hits / (hits + misses). The higher, the better, as a higher ratio reflects more data successfully accessed from faster media. The following figure shows the expected performance improvement for increasing cache hit ratios.

Figure 2.9 Cache hit ratio and performance

This is a nonlinear profile because of the difference in speed between cache hits and misses (the two storage tiers). The performance difference between 98% and 99% is much greater than that between 10% and 11%. The greater the difference, the steeper the slope becomes.

Miss rate is another metric, in terms of misses per second. This is proportional (linear) to the performance penalty of each miss.

The total runtime for each workload can be calculated as:

runtime = (hit rate x hit latency) + (miss rate x miss latency)

This calculation uses the average hit and miss latencies and assumes the work is serialized.
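The formula above can be used to demonstrate the nonlinear profile of Figure 2.9: the speedup from improving the hit ratio from 98% to 99% is far greater than from 10% to 11%. The hit and miss latencies here are illustrative assumptions (1 μs vs. 10 ms, e.g., main memory vs. rotational disk):

```python
# The cache runtime formula applied to show the nonlinear payoff of
# higher hit ratios. Latencies are illustrative assumptions.
def runtime(total_ops, hit_ratio, hit_latency, miss_latency):
    hits = total_ops * hit_ratio
    misses = total_ops * (1.0 - hit_ratio)
    return hits * hit_latency + misses * miss_latency

HIT, MISS = 1e-6, 10e-3  # 1 us hits, 10 ms misses (assumed)

speedup_high = runtime(1000, 0.98, HIT, MISS) / runtime(1000, 0.99, HIT, MISS)
speedup_low = runtime(1000, 0.10, HIT, MISS) / runtime(1000, 0.11, HIT, MISS)
```

Here `speedup_high` is nearly 2x, while `speedup_low` is only about 1%: the same one-point hit-ratio improvement matters far more near the top of the curve.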

Algorithms

Cache management algorithms and policies determine what to store in the limited space available for a cache:

Hot, Cold, and Warm Caches

The following words describe the state of the cache:

When caches are first initialized, they begin cold and then warm up over time. When the cache is large or the next-level storage is slow (or both), the cache can take a long time to become populated and warm.

[p32]

Perspectives

There are two common perspectives for performance analysis: workload analysis and resource analysis, which can be thought of as either top-down or bottom-up analysis of the operating system software stack, as shown in the figure below:

Figure 2.10 Analysis perspectives

Resource Analysis

Resource analysis begins with analysis of the system resources: CPUs, memory, disks, network interfaces, busses, and interconnects. It is most likely performed by system administrators, who are responsible for the physical environment resources.

Activities *
Metrics with utilization as a focus *

Resource analysis focuses on utilization to identify when resources are at or approaching their limit.

Metrics best suited for resource analysis include:

These metrics measure the following:

Other types of metrics, including latency, are also of use to see how well the resource is responding for the given workload.

Documentation on "stat" tools *

Resource analysis is a common approach to performance analysis, in part because of the widely available documentation on the topic. Such documentation focuses on the operating system "stat" tools: vmstat(1), iostat(1), mpstat(1). Resource analysis is a perspective, but not the only perspective.

Workload Analysis

Workload analysis, as seen in the figure below, examines the performance of the applications, including the workload applied and how the application is responding. It is most commonly used by application developers and support staff, who are responsible for the application software and configuration.

Figure 2.11 Workload analysis

Targets for workload analysis *

Studying workload requests involves checking and summarizing their attributes: the process of workload characterization (detailed in Section 2.5). For databases, these attributes may include the client host, database name, tables, and query string. This data may help identify unnecessary work or unbalanced work. Examining these attributes may identify ways to reduce or eliminate the work applied. (The fastest query is the one you don’t do at all.)

Latency (response time) is the most important metric for expressing application performance. For instance: for a MySQL database, it’s query latency; for Apache, it’s HTTP request latency. In these contexts, the term latency is used to mean the same as response time (Section 2.3.1).

Identifying issues

The tasks of workload analysis are identifying and confirming issues. Taking latency as an example, this can be done by:

  1. Looking for latency beyond an acceptable threshold,
  2. Finding the source of the latency (drill-down analysis),
  3. Confirming that the latency is improved after applying a fix.
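Step 1 above can be sketched as a simple filter over request records, producing the candidates for drill-down analysis in step 2. The field names and the 100 ms threshold are assumptions for illustration:

```python
# Step 1 of workload analysis as code: flag requests whose latency
# exceeds an acceptable threshold. Field names and the threshold are
# illustrative assumptions.
THRESHOLD_MS = 100.0

def slow_requests(requests):
    """Return the requests needing drill-down analysis."""
    return [r for r in requests if r["latency_ms"] > THRESHOLD_MS]

reqs = [{"query": "SELECT 1", "latency_ms": 3.2},
        {"query": "SELECT *", "latency_ms": 412.0}]
suspects = slow_requests(reqs)
```

After a fix is applied (step 3), rerunning the same filter confirms whether the latency outliers are gone.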

Note that the starting point is the application. To investigate latency usually involves drilling down deeper into the application, libraries, and the operating system (kernel).

System issues may be identified by studying characteristics related to the completion of an event, including its error status. While a request may complete quickly, it may do so with an error status that causes the request to be retried, accumulating latency.

Metrics for workload analysis *

These measure the rate of requests and the resulting performance.

Methodology

This section describes methodologies and procedures for system performance analysis and tuning, and introduces some new methods, particularly the USE method. Some anti-methodologies have also been included.

These methodologies have been categorized as different types in the following table:

Methodology Type
Streetlight anti-method observational analysis
Random change anti-method experimental analysis
Blame-someone-else anti-method hypothetical analysis
Ad hoc checklist method observational and experimental analysis
Problem statement information gathering
Scientific method observational analysis
Diagnosis cycle analysis life cycle
Tools method observational analysis
USE method observational analysis
Workload characterization observational analysis, capacity planning
Drill-down analysis observational analysis
Latency analysis observational analysis
Method R observational analysis
Event tracing observational analysis
Baseline statistics observational analysis
Performance monitoring observational analysis, capacity planning
Queueing theory statistical analysis, capacity planning
Static performance tuning observational analysis, capacity planning
Cache tuning observational analysis, tuning
Micro-benchmarking experimental analysis
Capacity planning capacity planning, tuning

The following sections begin with commonly used but weaker methodologies for comparison, including the anti-methodologies. For the analysis of performance issues, the first methodology you should attempt is the problem statement method, before moving on to others.

Streetlight Anti-Method

This method is actually the absence of a deliberate methodology. The user analyzes performance by choosing observability tools that are familiar, found on the Internet, or at random to see if anything obvious shows up. This approach is hit or miss and can overlook many types of issues.

Tuning performance may be attempted in a similar trial-and-error fashion, setting whatever tunable parameters are known and familiar to different values to see if that helps.

Even when this method reveals an issue, it can be slow as tools or tunings unrelated to the issue are found and tried, just because they’re familiar. This methodology is therefore named after an observational bias called the streetlight effect, illustrated by this parable:

One night a police officer sees a drunk searching the ground beneath a streetlight and asks what he is looking for. The drunk says he has lost his keys. The police officer can’t find them either and asks: "Are you sure you lost them here, under the streetlight?" The drunk replies: "No, but this is where the light is best."

The performance equivalent would be looking at top(1), not because it makes sense, but because the user doesn’t know how to read other tools. What this methodology finds may be an issue, but not the issue. Other methodologies quantify findings, so that false positives can be ruled out more quickly.

Random Change Anti-Method

This is an experimental anti-methodology. The user randomly guesses where the problem may be and then changes things until it goes away. To determine whether performance has improved as a result of each change, a metric is studied, such as:

This approach is as follows:

  1. Pick a random item to change (e.g., a tunable parameter).
  2. Change it in one direction.
  3. Measure performance.
  4. Change it in the other direction.
  5. Measure performance.
  6. Check whether the results in step 3 or step 5 are better than the baseline. If so, keep the change and go back to step 1.

Cons of the Random Change Anti-Method *

Although this process may eventually unearth tuning that works for the tested workload, it has the following disadvantages:

  1. It is very time-consuming and can also leave behind tuning that doesn’t make sense in the long term.
    • For example, an application change may improve performance because it works around a database or operating system bug, which is later fixed. But the application will still have that tuning that no longer makes sense, and that no one understood properly in the first place.
  2. A change that isn’t properly understood may cause a worse problem during peak production load, and a need to back out the change during this time.

Blame-Someone-Else Anti-Method

This anti-methodology follows these steps:

  1. Find a system or environment component for which you are not responsible.
  2. Hypothesize that the issue is with that component.
  3. Redirect the issue to the team responsible for that component.
  4. When proven wrong, go back to step 1.

Maybe it’s the network. Can you check with the network team if they have had dropped packets or something?

[p38]

This anti-methodology can be identified by a lack of data leading to the hypothesis. To avoid becoming a victim of blame-someone-else, ask the accuser for screen shots showing which tools were run and how the output was interpreted. You can take these screen shots and interpretations to someone else for a second opinion.

Ad Hoc Checklist Method

The ad hoc checklists used by support professionals are built from recent experience and issues for that system type. A typical scenario involves the deployment of a new server or application in production and a support professional checking for common issues when the system is under real load.

The following is an example checklist entry:

Run iostat -x 1 and check the await column. If this is consistently over 10 (ms) during load, then the disks are either slow or overloaded.

Cons of ad hoc checklists *

While these checklists can provide the most value in the shortest time frame, they have the following disadvantages:

Using ad hoc checklists correctly *

An ad hoc checklist can be an effective way to ensure that everyone knows how to check for the worst of the issues, and that all the obvious culprits have been checked. A checklist can be written to be clear and prescriptive, showing how to identify each issue and what the fix is. However, this list must be constantly updated.

Problem Statement

Defining the problem statement is a routine task for support staff when first responding to issues. It’s done by asking the customer the following questions:

  1. What makes you think there is a performance problem?
  2. Has this system ever performed well?
  3. What changed recently? Software? Hardware? Load?
  4. Can the problem be expressed in terms of latency or runtime?
  5. Does the problem affect other people or applications (or is it just you)?
  6. What is the environment? What software and hardware are used? Versions? Configuration?

Asking and answering these questions often points to an immediate cause and solution. The problem statement has therefore been included here as its own methodology and should be the first approach you use when tackling a new issue.

Scientific Method

The scientific method studies the unknown by making hypotheses and then testing them. It can be summarized in the following steps:

  1. Question: begin with the performance problem statement.
  2. Hypothesis: hypothesize what the cause of poor performance may be.
  3. Prediction: make a prediction based on the hypothesis.
  4. Test: construct a test, which may be observational or experimental, that tests the prediction.
  5. Analysis: finish with analysis of the test data collected.

For example, you may find that application performance is degraded after migrating to a system with less main memory, and you hypothesize that the cause of poor performance is a smaller file system cache. You might use the following tests:

The following are some more examples.

Example (Observational)
  1. Question: What is causing slow database queries?
  2. Hypothesis: Noisy neighbors (other cloud computing tenants) are performing disk I/O, contending with database disk I/O (via the file system).
  3. Prediction: If file system I/O latency is measured during a query, it will show that the file system is responsible for the slow queries.
  4. Test: Tracing of database file system latency as a ratio of query latency shows that less than 5% of the time is spent waiting for the file system.
  5. Analysis: The file system and disks are not responsible for slow queries.

Although the issue is still unsolved, some large components of the environment have been ruled out. The person conducting this investigation can return to step 2 and develop a new hypothesis.

Example (Experimental)
  1. Question: Why do HTTP requests take longer from host A to host C than from host B to host C?
  2. Hypothesis: Host A and host B are in different data centers.
  3. Prediction: Moving host A to the same data center as host B will fix the problem.
  4. Test: Move host A and measure performance.
  5. Analysis: Performance has been fixed—consistent with the hypothesis.

If the problem wasn’t fixed, reverse the experimental change (move host A back, in this case) before beginning a new hypothesis!

Example (Experimental)
  1. Question: Why did file system performance degrade as the file system cache grew in size?
  2. Hypothesis: A larger cache stores more records, and more compute is required to manage a larger cache than a smaller one.
  3. Prediction: Making the record size progressively smaller, and therefore causing more records to be used to store the same amount of data, will make performance progressively worse.
  4. Test: Test the same workload with progressively smaller record sizes.
  5. Analysis: Results are graphed and are consistent with the prediction. Drilldown analysis is now performed on the cache management routines.

This is an example of a negative test: deliberately hurting performance to learn more about the target system.

Diagnosis Cycle

The diagnosis cycle is this:

hypothesis → instrumentation → data → hypothesis

Like the scientific method, this method also deliberately tests a hypothesis through the collection of data. The cycle emphasizes that the data can lead quickly to a new hypothesis, which is tested and refined, and so on. This is similar to a doctor making a series of small tests to diagnose a patient and refining the hypothesis based on the result of each test.

Both of these approaches have a good balance of theory and data: try to move from hypothesis to data quickly, so that bad theories can be identified early and discarded, and better ones developed.

Tools Method

A tools-oriented approach is as follows:

  1. List available performance tools.
  2. For each tool, list useful metrics it provides.
  3. For each metric, list possible rules for interpretation.

The result is a prescriptive checklist of tools, metrics, and interpretations. While effective, this method relies exclusively on available (or known) tools, which can provide an incomplete view of the system, similar to the streetlight anti-method; it is worse if users are unaware of this. Issues that require custom tooling (e.g., dynamic tracing) may never be identified and solved.
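The output of the tools method is essentially a data structure: for each tool, its useful metrics, and for each metric, a rule for interpretation. A sketch, with entries drawn from examples elsewhere in this chapter (the iostat await rule) plus one assumed vmstat rule of thumb:

```python
# The tools-method result as a data structure: tool -> metric -> rule.
# Entries are illustrative examples, not a complete checklist.
checklist = [
    {"tool": "iostat -x", "metric": "await",
     "rule": "consistently > 10 ms under load: disks slow or overloaded"},
    {"tool": "vmstat", "metric": "r (run queue)",
     "rule": "greater than CPU count: CPU saturation likely"},
]

for entry in checklist:
    print(f'{entry["tool"]}: check {entry["metric"]} ({entry["rule"]})')
```

The weakness described above follows directly from this structure: anything not represented by a known tool and metric is simply never checked.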

Major problems of the tools method *

In practice, the tools method does identify certain resource bottlenecks, errors, and other types of problems, though often not efficiently.

When a large number of tools and metrics are available, it can be time-consuming to iterate through them. The situation gets worse when multiple tools appear to have the same functionality, and you spend additional time trying to understand the pros and cons of each. In some cases, such as file system micro-benchmark tools, there are over a dozen tools to choose from, when you may need only one.

The USE Method

The utilization, saturation, and errors (USE) method should be used early in a performance investigation to identify systemic bottlenecks. It can be summarized this way:

For every resource, check utilization, saturation, and errors.

Definitions of terms *

As explained in the earlier section, for some resource types such as main memory, utilization is the capacity of the resource that is used. This is different from the time-based definition. Once a capacity resource reaches 100% utilization, more work cannot be accepted, and the resource either queues the work (saturation) or returns errors, which are also identified using the USE method.

Errors should be investigated because they can degrade performance and may not be immediately noticed when the failure mode is recoverable. This includes operations that fail and are retried, and devices that fail in a pool of redundant devices.

Comparing to the tools method *

In contrast with the tools method, the USE method involves iterating over system resources instead of tools.

The USE method also directs analysis to a limited number of key metrics, so that all system resources are checked as quickly as possible. After this, if no issues have been found, other methodologies can be used.

Procedure

The USE method is pictured as the flowchart in the following figure. Errors are checked first, before utilization and saturation: errors are usually quick and easy to interpret, and it can be time-efficient to rule them out before investigating the other metrics.

Figure 2.12 The USE method flow

This method identifies problems that are likely to be system bottlenecks. Unfortunately, a system may be suffering from more than one performance problem, so the first thing you find may be a problem but not the problem. Each discovery can be investigated using further methodologies, before returning to the USE method as needed to iterate over more resources.
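The per-resource checks in the flowchart can be sketched as a simple loop. This is a minimal illustration, not from the book: the resource list, thresholds, and metric values are all hypothetical stand-ins for real tools and counters.

```python
# A minimal sketch of the USE method procedure: for each resource,
# check errors first, then utilization, then saturation. The metric
# values and the 90% utilization threshold are hypothetical.

def use_check(resource, metrics):
    """Return a list of findings for one resource."""
    findings = []
    if metrics["errors"] > 0:            # errors: quick to rule out first
        findings.append(f"{resource}: {metrics['errors']} errors")
    if metrics["utilization"] >= 0.9:    # utilization as a fraction
        findings.append(f"{resource}: high utilization")
    if metrics["saturation"] > 0:        # queued work waiting on the resource
        findings.append(f"{resource}: saturated")
    return findings

system = {
    "CPU":  {"errors": 0, "utilization": 0.97, "saturation": 4},
    "Disk": {"errors": 2, "utilization": 0.30, "saturation": 0},
}

for resource, metrics in system.items():
    for finding in use_check(resource, metrics):
        print(finding)
```

Each discovery would then be investigated with further methodologies before iterating over the remaining resources.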

Expressing Metrics

The USE method metrics are usually expressed as follows:

Though it may seem counterintuitive, a short burst of high utilization can cause saturation and performance issues, even though the overall utilization is low over a long interval. Some monitoring tools report utilization over 5-minute averages. For example, CPU utilization can vary dramatically from second to second, so a 5-minute average may disguise short periods of 100% utilization and saturation.
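The averaging effect can be shown numerically. This sketch uses made-up per-second samples: a 15-second burst of 100% utilization within a 5-minute window still averages below 10%.

```python
# Demonstrates how a long averaging interval hides short bursts:
# 15 seconds of 100% CPU utilization within a 300-second (5-minute)
# window averages below 10%. Sample values are made up.

samples = [100.0] * 15 + [5.0] * 285   # per-second utilization (%)

average = sum(samples) / len(samples)
peak = max(samples)

print(f"5-minute average: {average:.2f}%")   # 9.75%
print(f"peak second:      {peak:.0f}%")      # 100%
```

A monitoring dashboard showing only the 9.75% average would give no hint of the saturation occurring during the burst.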

[p44]

Resource List

The first step in the USE method is to create a list of resources. Below is a generic list of server hardware resources, along with specific examples:

Each component typically acts as a single resource type. For example:

Some components can behave as multiple resource types. For example, a storage device is both an I/O resource and a capacity resource.

Consider all types that can lead to performance bottlenecks. I/O resources can be further studied as queueing systems, which queue and then service these requests.

Some physical components, such as hardware caches (e.g., CPU caches), can be left out of your checklist. The USE method is most effective for resources that suffer performance degradation under high utilization or saturation, leading to bottlenecks, while caches improve performance under high utilization. These can be checked using other methodologies. If you are unsure whether to include a resource, include it, then see how well the metrics work in practice.

Functional Block Diagram

A functional block diagram of the system, such as the one shown below, helps when iterating over resources.

Figure 2.13 Example two-processor functional block diagram

The diagram also shows relationships, which can be very useful when looking for bottlenecks in the flow of data.

CPU, memory, and I/O interconnects and busses are often overlooked. Fortunately, they are not common system bottlenecks, as they are typically designed to provide an excess of throughput. If one of them does become the bottleneck, the problem can be difficult to solve. Possible solutions can be:

For investigating interconnects, see CPU Performance Counters in Section 6.4.1 on hardware.

Metrics

With the list of resources in hand, consider the metric types: utilization, saturation, and errors. These metrics can be either averages per interval or counts. The following table shows some example resources and metric types:

| Resource | Type | Metric |
| --- | --- | --- |
| CPU | utilization | CPU utilization (either per CPU or a system-wide average) |
| CPU | saturation | dispatcher-queue length (aka run-queue length) |
| Memory | utilization | available free memory (system-wide) |
| Memory | saturation | anonymous paging or thread swapping (page scanning is another indicator), or out-of-memory events |
| Network interface | utilization | receive throughput/max bandwidth, transmit throughput/max bandwidth |
| Storage device I/O | utilization | device busy percent |
| Storage device I/O | saturation | wait-queue length |
| Storage device I/O | errors | device errors ("soft", "hard") |
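As one concrete example of deriving a metric in the table, network interface utilization can be computed from two readings of a byte counter and the link speed. This is a sketch; the counter values and link speed are made up.

```python
# Sketch: deriving network interface transmit utilization from two
# readings of the interface's transmit byte counter. The counter
# values and the 1 Gbit/s link speed are made up.

LINK_SPEED_BPS = 1_000_000_000   # 1 Gbit/s link

def tx_utilization(bytes_before, bytes_after, interval_s):
    """Fraction of link bandwidth used for transmit over the interval."""
    bits = (bytes_after - bytes_before) * 8
    return bits / (LINK_SPEED_BPS * interval_s)

# e.g. 62,500,000 bytes sent in one second = 500 Mbit/s = 50% of 1 Gbit/s
util = tx_utilization(0, 62_500_000, 1.0)
print(f"tx utilization: {util:.0%}")   # 50%
```

On Linux, the raw counters could come from /proc/net/dev or interface statistics tools; the calculation itself is the same.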

Some metrics are easy and some are difficult to check. [p47]

Some examples of harder combinations are provided in the following table:

| Resource | Type | Metric |
| --- | --- | --- |
| CPU | errors | for example, correctable CPU cache error-correcting code (ECC) events or faulted CPUs (if the OS + HW supports that) |
| Memory | errors | for example, failed malloc()s (although this is usually due to virtual memory exhaustion, not physical) |
| Network | saturation | saturation-related network interface or OS errors, e.g., Linux "overruns" or Solaris "nocanputs" |
| Storage controller | utilization | depends on the controller; it may have a maximum IOPS or throughput that can be checked against current activity |
| CPU interconnect | utilization | per-port throughput/maximum bandwidth (CPU performance counters) |
| Memory interconnect | saturation | memory stall cycles, high cycles per instruction (CPU performance counters) |
| I/O interconnect | utilization | bus throughput/maximum bandwidth (performance counters may exist on your HW, e.g., Intel "uncore" events) |

Some of these may not be available from standard operating system tools and may require the use of dynamic tracing or the CPU performance counter facility.

Software Resources

Software resources, usually smaller components of software, can be examined. For example:

If these metrics do not work well in your case, use alternatives, such as latency analysis.

Suggested Interpretations

Suggestions for interpreting the metric types:

[p48]

Cloud Computing

In a cloud computing environment, software resource controls may be in place to limit tenants who are sharing one system.

OS virtualization at Joyent (SmartOS Zones) imposes memory limits, CPU limits, and storage I/O throttling. Each of these resource limits can be examined with the USE method, similarly to examining the physical resources.

For example:

Workload Characterization

Workload characterization is a simple and effective method for identifying load issues, by focusing on the input to the system, rather than the resulting performance. A system without architectural or configuration issues can be under more load than it can reasonably handle.

Who, Why, What and How? *

Workloads can be characterized by answering the following questions:

Checking all of these can be useful. You may be surprised by the answers even when you have expectations about them. For example, a performance issue with a database whose clients are a pool of web servers may actually be caused by load originating from the Internet, i.e., the database being under a denial-of-service (DoS) attack.

Eliminating unnecessary work *

Eliminating unnecessary work is important, since systems sometimes perform work they do not need to. Unnecessary work may be caused by:

Characterizing the workload can identify these issues, and with maintenance or reconfiguration they may be eliminated.

Throttling workload *

If the identified workload cannot be eliminated, use system resource controls to throttle it. For example, a system backup task may be interfering with a production database by consuming CPU resources to compress the backup, and then network resources to transfer it. This CPU and network usage may be throttled using resource controls, so that the backup occurs more slowly without hurting the database.

Simulation benchmarks *

Apart from identifying issues, workload characterization can also be input for the design of simulation benchmarks. If the workload measurement is an average, ideally you will also collect details of the distribution and variation. This can be important for simulating the variety of workloads expected, rather than testing only an average workload. See Section 2.8, Statistics, for more about averages and variation (standard deviation), and Chapter 12, Benchmarking.
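The point about distribution versus average can be put in numbers. In this sketch the measured per-second request rates are synthetic: a benchmark replaying only the mean rate would never exercise the bursts that likely cause the real saturation.

```python
# Sketch: why benchmark inputs should model variation, not just the
# average. The measured per-second request rates below are synthetic.
import statistics

measured_rates = [120, 95, 110, 900, 105, 130, 88, 850, 115, 100]

mean = statistics.mean(measured_rates)
stdev = statistics.stdev(measured_rates)

print(f"mean rate: {mean:.0f} req/s")
print(f"std dev:   {stdev:.0f} req/s")
print(f"peak rate: {max(measured_rates)} req/s")

# A benchmark driven at a constant `mean` req/s would miss the
# 850-900 req/s bursts entirely.
```

Capturing the standard deviation, or better, the full distribution, lets the simulation reproduce this variety.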

Separating load from architecture *

Analysis of the workload also helps separate problems of load from problems of architecture, by identifying the former. Load versus architecture was introduced in Section 2.3, Concepts.

Tools and metrics *

The specific tools and metrics for performing workload characterization depend on the target. Some applications record detailed logs of client activity, which can be the source for statistical analysis. They may also already provide daily or monthly reports of client usage, which can be mined for details.

Drill-Down Analysis

Drill-down analysis involves the following steps in order:

  1. Examine an issue at a high level.
  2. Narrow the focus based on the previous findings.
  3. Discard uninteresting areas.
  4. Dig deeper into interesting areas.

The process can involve digging down through deeper layers of the software stack (to hardware if necessary) to find the root cause of the issue.

A drill-down analysis methodology has three stages:

  1. Monitoring: This is used for continually recording high-level statistics over time, and identifying or alerting if a problem may be present.
    • Traditionally, Simple Network Management Protocol (SNMP) can be used to monitor network-attached devices that support it. The resulting data may reveal long-term patterns that may be missed when using command-line tools over short durations. Many monitoring solutions provide alerts if a problem is suspected, prompting analysis to move to the next stage.
  2. Identification: Given a suspected problem, this narrows the investigation to particular resources or areas of interest, identifying possible bottlenecks. Identification is performed interactively on the server using standard observability tools (vmstat(1), iostat(1), and mpstat(1)) to check system components. [p50]
  3. Analysis: Further examination of particular system areas is done to attempt to root-cause and quantify the issue. Analysis tools include those based on tracing or profiling for deeper inspection of suspect areas, such as strace(1), truss(1), perf, and DTrace.

Five Whys

Ask yourself "why?" then answer the question, and repeat up to five times in total (or more). For example:

  1. A database has begun to perform poorly for many queries. Why?
  2. It is delayed by disk I/O due to memory paging. Why?
  3. Database memory usage has grown too large. Why?
  4. The allocator is consuming more memory than it should. Why?
  5. The allocator has a memory fragmentation issue.

This is a real-world example that unexpectedly led to a fix in a system memory allocation library. It was the persistent questioning and drilling down to the core issue that led to the fix.

Latency Analysis

Latency analysis examines the time taken to complete an operation, which is then broken down into smaller components. By subdividing the components with the highest latency, the root cause can be identified and quantified.

Similar to drill-down analysis, latency analysis may drill down through layers of the software stack to find the origin of latency issues:

  1. Examine how the workload was processed in the application.
  2. Drill down into the operating system libraries, system calls, the kernel, and device drivers.

For example, analysis of MySQL query latency could involve answering the following questions:

  1. Is there a query latency issue? (yes)
  2. Is the query time largely spent on-CPU or waiting off-CPU? (off-CPU)
  3. What is the off-CPU time spent waiting for? (file system I/O)
  4. Is the file system I/O time due to disk I/O or lock contention? (disk I/O)
  5. Is the disk I/O time likely due to random seeks or data transfer time? (transfer time)

Each step of the process posed a question that divided the latency into two parts and then analyzed the larger part, like a binary search. This process is shown in the figure below:

Figure 2.14 Latency analysis procedure
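The question-by-question halving can be sketched as a walk down a tree of latency components, descending into the larger part at each step. The component names and timings below are hypothetical, loosely modeled on the MySQL example.

```python
# Sketch of latency analysis as a binary search: at each level, split
# the latency into two components and descend into the larger one.
# The component tree and timings (ms) are hypothetical.

# Each node: (name, latency_ms, (child_a, child_b) or None)
tree = ("query", 120.0, (
    ("on-CPU", 10.0, None),
    ("off-CPU", 110.0, (
        ("lock contention", 15.0, None),
        ("file system I/O", 95.0, (
            ("disk seeks", 20.0, None),
            ("data transfer", 75.0, None),
        )),
    )),
))

def drill_down(node):
    """Follow the larger component at each level; return the path."""
    name, latency, children = node
    path = [(name, latency)]
    while children:
        a, b = children
        name, latency, children = a if a[1] >= b[1] else b
        path.append((name, latency))
    return path

for name, latency in drill_down(tree):
    print(f"{name}: {latency} ms")
```

The path taken mirrors the answers in the question list: off-CPU, then file system I/O, then data transfer time.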

Method R

Method R is a performance analysis methodology developed for Oracle databases that focuses on finding the origin of latency.

[p52]

Event Tracing

Systems operate by processing discrete events, including

Performance analysis usually studies summaries of these events, such as:

Network troubleshooting often requires packet-by-packet inspection, with tools such as tcpdump(1) (Chapter 10, Network). The following example summarizes packets as single lines of text:

# tcpdump -ni eth4 -ttt
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth4, link-type EN10MB (Ethernet), capture size 65535 bytes
00:00:00.000000 IP 10.2.203.2.22 > 10.2.0.2.33986: Flags [P.], seq
1182098726:1182098918, ack 4234203806, win 132, options [nop,nop,TS val 1751498743
ecr 1751639660], length 192
00:00:00.000392 IP 10.2.0.2.33986 > 10.2.203.2.22: Flags [.], ack 192, win 501,
options [nop,nop,TS val 1751639684 ecr 1751498743], length 0
00:00:00.009561 IP 10.2.203.2.22 > 10.2.0.2.33986: Flags [P.], seq 192:560, ack 1,
win 132, options [nop,nop,TS val 1751498744 ecr 1751639684], length 368
00:00:00.000351 IP 10.2.0.2.33986 > 10.2.203.2.22: Flags [.], ack 560, win 501,
options [nop,nop,TS val 1751639685 ecr 1751498744], length 0
00:00:00.010489 IP 10.2.203.2.22 > 10.2.0.2.33986: Flags [P.], seq 560:896, ack 1,
win 132, options [nop,nop,TS val 1751498745 ecr 1751639685], length 336
00:00:00.000369 IP 10.2.0.2.33986 > 10.2.203.2.22: Flags [.], ack 896, win 501,
options [nop,nop,TS val 1751639686 ecr 1751498745], length 0

Storage device I/O at the block device layer can be traced using iosnoop(1M) (DTrace-based, see Chapter 9, Disks). [p53-54]

The system call layer is another common location for tracing, with tools including:

When performing event tracing, look for the following information:

The following are useful for understanding performance issues:

The study of prior events provides more information. A particularly bad latency event, known as a latency outlier, may be caused by previous events rather than the event itself. For example, the event at the tail of a queue may have high latency but is caused by the previous queued events, not its own properties. This case can be identified from the traced events.
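The queueing effect behind such outliers can be sketched directly: in a FIFO queue, an event's completion depends on the events ahead of it. The arrival and service times below are made up.

```python
# Sketch: a latency outlier caused by prior queued events. In a FIFO
# queue, each event starts service only when it has arrived AND the
# previous event has completed. Times (ms) are made up.

def fifo_latencies(events):
    """events: list of (arrival_ms, service_ms). Returns latency per event."""
    latencies = []
    prev_done = 0.0
    for arrival, service in events:
        done = max(arrival, prev_done) + service
        latencies.append(done - arrival)
        prev_done = done
    return latencies

# A burst of five arrivals at t=0, each needing 10 ms of service:
burst = [(0.0, 10.0)] * 5
print(fifo_latencies(burst))   # [10.0, 20.0, 30.0, 40.0, 50.0]
```

The last event's 50 ms latency is an outlier caused entirely by the four events queued before it, which a trace of prior events would reveal.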

Baseline Statistics

Comparing current performance metrics with past values is often enlightening:

Some observability tools (based on kernel counters) can show the summary-since-boot for comparison with current activity, though such comparisons are coarse.

Collecting baseline statistics is another approach, which involves running a wide range of system observability tools and logging the output for future reference. Unlike the summary-since-boot, which can hide variation, the baseline can include per-second statistics so that variation can be seen.

A baseline may be collected before and after system or application changes, so that performance changes can be analyzed. It may also be collected irregularly and included with site documentation, so that administrators have a reference for what is "normal". Performing this task at regular intervals each day is served by performance monitoring (Section 2.9, Monitoring).
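Comparing current metrics against a saved baseline can be sketched as follows. The baseline values, current readings, and the 2-standard-deviation threshold are all illustrative choices, not from any particular tool.

```python
# Sketch: flagging metrics that deviate from a recorded baseline.
# Baseline means/standard deviations and the 2-sigma threshold are
# illustrative.

baseline = {
    "cpu_util":  {"mean": 0.35, "stdev": 0.08},
    "disk_iops": {"mean": 1200, "stdev": 150},
    "run_queue": {"mean": 1.0,  "stdev": 0.5},
}

def deviations(current, baseline, n_sigma=2.0):
    """Return metrics whose current value is beyond n_sigma of baseline."""
    flagged = []
    for name, value in current.items():
        b = baseline[name]
        if abs(value - b["mean"]) > n_sigma * b["stdev"]:
            flagged.append(name)
    return flagged

now = {"cpu_util": 0.85, "disk_iops": 1250, "run_queue": 4.0}
print(deviations(now, baseline))   # ['cpu_util', 'run_queue']
```

Because the baseline records variation (standard deviation), not just means, ordinary fluctuation such as the disk IOPS reading is not flagged.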

Static Performance Tuning

Static performance tuning focuses on issues of the configured architecture, in contrast to methodologies that focus on the performance of the applied load (dynamic performance). Static performance analysis can be performed when the system is at rest and no load is applied.

For static performance analysis and tuning, step through all the components of the system and check the following:

Some examples of issues that may be found using static performance tuning:

These types of issues are easy to check for; the difficulty is remembering to do so.

Cache Tuning

Applications and operating systems may employ multiple caches for improving I/O performance, from the application down to the disks (detailed in Section 3.2.11).

The general strategy for tuning each cache level:

  1. Cache as high in the stack as possible (closer to where the work is performed) to reduce the operational overhead of cache hits.
  2. Check that the cache is enabled and working.
  3. Check the cache hit/miss ratios and miss rate.
  4. If the cache size is dynamic, check its current size.
  5. Tune the cache for the workload. This task depends on available cache tunable parameters.
  6. Tune the workload for the cache. For example:
    • Reduce unnecessary consumers of the cache, which frees up more space for the target workload.
    • Look out for double caching, e.g. two different caches that consume main memory and cache the same data twice.

Also consider the overall performance gain of each level of cache tuning. Tuning the CPU Level 1 cache may save only nanoseconds, as its misses may then be served by Level 2. But improving the Level 3 cache may avoid much slower DRAM accesses and yield a greater overall performance gain. (CPU caches are described in Chapter 6, CPUs.)
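The relative gain can be put in numbers by looking at what a miss at each level costs, since a miss at one level is served by the next level down. The latencies below are rough, illustrative figures, not measurements of any particular CPU.

```python
# Sketch: cost of a miss at each cache level, using rough illustrative
# latencies (ns). A miss at one level is served by the next level down,
# so avoiding an L3 miss (served by DRAM) saves far more time than
# avoiding an L1 miss (served by L2).

latency_ns = {"L1": 1.0, "L2": 4.0, "L3": 12.0, "DRAM": 100.0}
next_level = {"L1": "L2", "L2": "L3", "L3": "DRAM"}

def miss_cost_ns(level):
    """Extra time paid when an access misses `level` and hits the next."""
    return latency_ns[next_level[level]] - latency_ns[level]

for level in ("L1", "L2", "L3"):
    print(f"one avoided {level} miss saves ~{miss_cost_ns(level):.0f} ns")
```

With these figures, an avoided L3 miss saves roughly 88 ns against about 3 ns for an avoided L1 miss, which is the book's point about weighing cache levels by their miss penalty.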

Micro-Benchmarking

Modeling

Capacity Planning

Statistics

Monitoring

Visualizations

Doubts and Solutions

Verbatim

p17 on System under Test:

The mere act of mapping the environment may help to reveal previously overlooked sources of perturbations. The environment may also be modeled as a network of queueing systems, for analytical study.

WTF?

p21 on Trade-offs:

File system record size and network buffer size: small vs large

Further reading may be required to understand these trade-offs.

p49 on Workload Characterization:

Apart from identifying issues, workload characterization can also be input for the design of simulation benchmarks.

Does this mean something like load testing? It seems awkward that this entire book never mentions "load testing".