Chapter 7. Monitoring

Introduction

This chapter focuses on software monitoring, which comprises myriad types of monitoring and the considerations that come with them. Activities as varied as collecting metrics at various levels (resource, OS, middleware, and application level), graphing and analyzing those metrics, logging, generating alerts concerning system health status, and measuring user interactions are all part of what is meant by monitoring.

The insights available from monitoring fall into five different categories:

  1. Identifying failures and the associated faults both at runtime and during postmortems held after a failure has occurred.
  2. Identifying performance problems of both individual systems and collections of interacting systems.
  3. Characterizing workload for both short-term and long-term capacity planning and billing purposes.
  4. Measuring user reactions to various types of interfaces or business offerings. A/B testing is discussed in Chapters 5 and 6.
  5. Detecting intruders who are attempting to break into the system.

The term monitoring refers to the process of observing and recording system state changes and data flows.

The software supporting such a process is called a monitoring system.

Monitoring a workload includes monitoring the tools and infrastructure associated with operations activities. All of the activities in an environment contribute to a datacenter’s workload, including both operations-centric and monitoring tools.

DevOps’ continuous delivery/ deployment practices and strong reliance on automation mean that changes to the system happen at a much higher frequency. Use of a microservice architecture also makes monitoring of data flows more challenging.

Some examples of the new challenges are:

Monitoring solutions must be tested and validated just as other portions of the infrastructure are. Testing a monitoring solution in your various environments is one portion of the testing, but the scale of your non-production environments may not approach the scale of production, which implies that your monitoring solution may be only partially tested prior to being placed into production.

What to Monitor

The following table lists the insights you might gain from the monitoring data and the portions of the stack where such data can be collected: [p129]

Goal of monitoring                    Source of data
------------------------------------  ------------------------------
Failure detection                     Application and infrastructure
Performance degradation detection     Application and infrastructure
Capacity planning                     Application and infrastructure
User reaction to business offerings   Application
Intruder detection                    Application and infrastructure

The fundamental items to be monitored consist of inputs, resources, and outcomes:

Failure Detection

Any element of the physical infrastructure can fail. Total failures are relatively easy to detect: no data is flowing where data used to flow. It is the partial failures that are difficult to detect; for instance, a cable that is not firmly seated degrades performance, or a machine experiences intermittent failures from overheating before it fails totally.

Detecting failure of the physical infrastructure is the datacenter provider’s problem. Instrumenting the operating system or its virtual equivalent will provide the data for the datacenter.

Software can also fail, either totally or partially. Total failure is relatively easy to detect. Partial software failures have myriad causes (similar to partial hardware failures):

Detecting software failures can be done in one of three fashions:

  1. The monitoring software performs health checks on the system from an external point.
  2. A special agent inside the system performs the monitoring.
  3. The system itself detects problems and reports them.
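The first approach, an external health check, can be sketched as follows. This is illustrative only; the endpoint URL and the convention that HTTP 200 means healthy are assumptions, not something prescribed by the chapter:

```python
import urllib.error
import urllib.request

def check_health(url: str, timeout: float = 2.0) -> bool:
    """External health check: True if the endpoint answers with HTTP 200.

    A non-200 status suggests a partial failure; no answer at all
    suggests a total failure (no data flowing where data used to flow).
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# A monitoring system would call this periodically from an external
# point and raise an alert after several consecutive failures.
```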

Partial failures may also manifest as performance problems (discussed in the following subsection).

Performance Degradation Detection

Detecting performance degradations is the most common use of monitoring data. Degraded performance can be observed by comparing current performance to historical data, or by complaints from clients or end users. Ideally, the monitoring system catches performance degradation before users are noticeably impacted.

Performance measures include latency, throughput, and utilization.

Latency

Latency is the time from the initiation of an activity to its completion, which can be measured at various levels of granularity:

Latency can also be measured at either the infrastructure or the application level. Measuring latency across different physical computers is more problematic because of the difficulty of synchronizing clocks.

Latency is cumulative in the sense that the latency of responding to a user request is the sum of the latency of all of the activities that occur until the request is satisfied, adjusted for parallelism. It is useful when diagnosing the cause of a latency problem to know the latency of the various subactivities performed in the satisfaction of the original request. [p131]
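As a sketch of these two points (the percentile choices and sample values are illustrative), latency samples can be summarized into the percentiles operators typically watch, and end-to-end latency composed from subactivity latencies adjusted for parallelism:

```python
import statistics

def summarize_latency(samples_ms):
    """Aggregate raw latency samples (milliseconds) into summary percentiles."""
    q = statistics.quantiles(samples_ms, n=100)  # q[i] is the (i+1)th percentile
    return {
        "p50": statistics.median(samples_ms),
        "p95": q[94],
        "p99": q[98],
        "max": max(samples_ms),
    }

def request_latency(sequential_ms, parallel_ms):
    """End-to-end latency: sequential subactivities add up, while a group of
    parallel subactivities contributes only its slowest member."""
    return sum(sequential_ms) + max(parallel_ms, default=0)
```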

Throughput

Throughput is the number of operations of a particular type per unit time. Although throughput could refer to infrastructure activities (e.g., the number of disk reads per minute), it is more commonly used at the application level. For example, the number of transactions per second is a common reporting measure.

Throughput provides a system-wide measure involving all of the users, whereas latency has a single-user or client focus. High throughput may or may not be related to low latency. The relation will depend on the number of users and their pattern of use.

A reduction in throughput is not, by itself, a problem. The reduction in throughput may be caused by a reduction in the number of users. Problems are indicated through the coupling of throughput and user numbers.

Utilization

Utilization is the relative amount of use of a resource and is typically measured by inserting probes on the resources of interest. For example, the CPU utilization may be 80%. High utilization can be used as either of the following:

The resources can either be at the infrastructure or application level:

Making sense of utilization frequently requires attributing usage to activities or applications. For example, app1 is using 20% of the CPU, disk compression is using 30%, and so on. Thus, connecting the measurements with applications or activities is an important portion of data collection.
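The attribution step above can be sketched as a simple normalization of per-application CPU time over a sampling interval (the application names and numbers are hypothetical):

```python
def attribute_cpu(cpu_seconds_by_app, interval_seconds, num_cpus=1):
    """Convert per-app CPU seconds within a sampling interval into
    utilization percentages, attributing usage to applications."""
    capacity = interval_seconds * num_cpus
    return {app: 100.0 * secs / capacity
            for app, secs in cpu_seconds_by_app.items()}

# Over a 60-second interval on one CPU:
shares = attribute_cpu({"app1": 12.0, "compression": 18.0},
                       interval_seconds=60)
# app1 used 20% of the CPU, compression used 30%
```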

Capacity Planning

There are two types of capacity planning:

Long-Term Capacity Planning

Long-term capacity planning is intended to match hardware needs (whether real or virtualized) with workload requirements.

In both cases, the input to the capacity planning process is a characterization of the current workload gathered from monitoring data and a projection of the future workload based on business considerations and the current workload. Based on the future workload, the desired throughput and latency for the future workload, and the costs of various provisioning options, the organization will decide on one option and provide the budget for it.
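A back-of-the-envelope version of this sizing decision can be sketched as follows; the growth rate, per-server capacity, and headroom factor are hypothetical inputs, not values from the chapter:

```python
import math

def servers_needed(current_peak_tps, annual_growth, years,
                   per_server_tps, headroom=0.25):
    """Project the peak workload forward from the monitored baseline and
    size the fleet, keeping spare headroom for bursts."""
    projected = current_peak_tps * (1 + annual_growth) ** years
    return math.ceil(projected * (1 + headroom) / per_server_tps)

# e.g., 1,000 tps measured today, 30% annual growth, planning 2 years
# out, 100 tps per server, 25% headroom:
fleet = servers_needed(1000, 0.30, 2, 100)
```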

Short-Term Capacity Planning

In the cloud, short-term capacity planning means creating a new virtual machine (VM) for an application or deleting an existing VM.
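The decision short-term capacity planning automates can be sketched as a threshold rule over monitored utilization (the thresholds are illustrative):

```python
def scaling_decision(avg_utilization, vm_count,
                     high=0.80, low=0.30, min_vms=1):
    """Decide whether to create or delete a VM based on the average
    utilization reported by the monitoring system."""
    if avg_utilization > high:
        return vm_count + 1          # scale out: add a VM
    if avg_utilization < low and vm_count > min_vms:
        return vm_count - 1          # scale in: delete a VM
    return vm_count                  # steady state
```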

User Interaction

User satisfaction is an important element of a business. It depends on four elements that can be monitored:

  1. The latency of a user request. Users expect decent response times. Depending on the application, seemingly trivial variations in response can have a large impact.
  2. The reliability of the system with which the user is interacting. Failure and failure detection are discussed earlier.
  3. The effect of a particular business offering or user interface modification. A/B testing is discussed in Chapters 5 and Chapter 6. The measurements collected from A/B testing must be meaningful for the goal of the test, and the data must be associated with variant A or B of the system.
  4. The organization’s particular set of metrics. These metrics should be important indicators of either of the following:
    • User satisfaction,
    • The effectiveness of the organization’s computer-based services.

There are generally two types of user interaction monitoring.

  1. Real user monitoring (RUM). RUM essentially records all user interactions with an application.
    • RUM data is used to assess the real service level a user experiences and whether server-side changes are being propagated to users correctly.
    • RUM is usually passive: it does not affect the application payload, exert extra load, or change the server-side application.
  2. Synthetic monitoring. It is similar to developers performing stress testing on an application.
    • Expected user behaviors are scripted either using some emulation system or using actual client software (such as a browser). However, the goal is often not to stress test with heavy loads, but to monitor the user experience.
    • Synthetic monitoring allows you to monitor user experience in a systematic and repeatable fashion, not dependent on how users are using the system right now.
    • Synthetic monitoring may be a portion of the automated user acceptance tests discussed in Chapter 5.

Intrusion Detection

Intruders can break into a system by subverting an application (for example, through incorrect authorization or a man-in-the-middle attack). Applications can monitor users and their activities to determine whether the activities are consistent with the users’ role in the organization or their past behavior.

For instance, if user John has a mobile phone using the application, and the phone is currently in Australia, any log-in attempts from, say, Nigeria should be seen as suspicious.
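This kind of check can be sketched as a rule over recent login metadata. The country codes and the rule itself are illustrative; a real detector would combine many such signals:

```python
def is_suspicious_login(login_country, known_device_countries):
    """Flag a login originating from a country where none of the
    user's known devices currently are."""
    return login_country not in known_device_countries

# John's phone is in Australia; a login attempt from Nigeria is flagged:
assert is_suspicious_login("NG", {"AU"})
assert not is_suspicious_login("AU", {"AU"})
```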

Intrusion detector *

An intrusion detector is a software application that monitors network traffic by looking for abnormalities. These abnormalities can be caused by:

Intrusion detectors use a variety of different techniques to identify attacks. They frequently use historical data from an organization’s network to understand what is normal. They also use libraries that contain the network traffic patterns observed during various attacks. Current traffic on a network is compared to the expected (from an organization’s history) and the abnormal (from the attack history) to decide whether an attack is currently under way.

Intrusion detectors can also monitor traffic to determine whether an organization’s security policies are being violated without malicious intent.

Intrusion detectors generate alerts and alarms as discussed in Section 7.5. Problems with false positives and false negatives exist with intrusion detectors as they do with all monitoring systems.

How to Monitor

Monitoring systems interact with the elements being monitored, as shown in the figure below.

Figure 7.1 Monitoring system interacting with the elements being monitored [Notation: Architecture]

The system to be monitored can be as broad as a collection of independent applications or services, or as narrow as a single application:

  1. Agent-based monitoring. If the system is actively contributing to the data being monitored (the arrow labeled "agent-based"), then the monitoring is intrusive and affects the system design.
  2. Agentless monitoring. If the system is not actively contributing to the data being monitored (the arrow labeled "agentless"), then the monitoring is nonintrusive and does not affect the system design.
  3. Health checks. A third source of data is indicated by the arrow labeled "health checks". External systems can monitor system- or application-level states through health checks, performance-related requests, or transaction monitoring.

The data collected either through agents or through agentless means is eventually sent to a central repository ("Monitoring data storage" in Figure 7.1). The central repository is typically logically central but physically distributed. Each step from the initial collection to the central repository can perform filtering and aggregation.

The considerations in determining the amount of filtering and aggregation are:

Retrieving the data from local nodes is important because a local node may fail, making its data unavailable. Sending all of the data directly to a central repository may congest the network. Thus, selecting the intermediate steps from the local nodes to the central repository, and the filtering and aggregation done at each step, are important architectural decisions when setting up a monitoring framework.
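One common form of such aggregation is a local node rolling raw samples up into per-minute summaries before shipping them upstream. A minimal sketch (the one-minute bucket size is an illustrative choice):

```python
from collections import defaultdict

def aggregate_per_minute(samples):
    """samples: (unix_timestamp, value) pairs collected on the local node.

    Returns one (minute_start, count, mean) record per minute: far fewer
    bytes to ship to the central repository than the raw stream, at the
    cost of losing within-minute detail."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts // 60 * 60].append(value)
    return [(minute, len(vals), sum(vals) / len(vals))
            for minute, vals in sorted(buckets.items())]
```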

Once monitoring data is collected, you can do many things:

The traditional view of the monitoring system (as discussed so far) is increasingly being challenged by new interactions between the monitoring system and other systems, which are shown outside of the dotted areas in Figure 7.1.

You can perform stream processing and (big) data analytics on monitoring data streams and historical data. Not only can you gain insights into system characteristics using system-level monitoring data, you may also gain insights into user behaviors and intentions using application- and user-level monitoring data.

Because of these growing different uses of monitoring data, many companies are starting to use a unified log and metrics-centric publish-subscribe architecture for both the monitoring system and the overall application system. More and more types of data, including nontraditional log and metrics data, are being put into a unified storage, where various other systems (whether monitoring-related or not) can subscribe to the data of interest. Several implications of the unified view are:

The line between the monitoring system and the system to be monitored is getting blurred when application and user monitoring data are treated the same as system-level monitoring data: data from anywhere and at any level could contribute to insights about both systems and users.

The following sections discuss the method of retrieving monitoring data, monitoring operations, and data collection and storage:

Agent-Based and Agentless Monitoring

In some situations, the system to be monitored already has internal monitoring facilities that can be accessed through a defined protocol. For example:

Agentless monitoring is particularly useful when you cannot install agents, and it can simplify the deployment of your monitoring system.

The agent-based and agentless approaches both have their strengths and weaknesses:

Questions to be considered when designing a system include:

Monitoring Operation Activities

Some operations tools (such as Chef) monitor resources such as configuration settings to determine whether they conform to prespecified settings. We also mentioned monitoring resource specification files to identify changes. Both of these types of monitoring are best done by agents that periodically sample the actual values and the files that specify those values.

Treating infrastructure-as-code implies that infrastructure should contribute monitoring information in the same fashion as other applications, which can be through any of the means discussed: agents, agentless, or external.

Chapter 14 discusses how to perform fine-grained monitoring of the behavior of operations tools and scripts. This can include assertions over monitoring data. For instance, during a rolling upgrade a number of VMs are taken out of service to be replaced with VMs running a newer version of the application. Then you can expect the average CPU utilization of the remaining machines to increase by a certain factor.
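The rolling-upgrade expectation above can be sketched as an assertion over monitoring data: if the same load is spread over fewer in-service VMs, average utilization should rise roughly in proportion. The tolerance value is illustrative:

```python
def rolling_upgrade_utilization_ok(before_avg, before_vms,
                                   in_service_vms, observed_avg,
                                   tolerance=0.15):
    """During a rolling upgrade, check that the average CPU utilization
    of the remaining VMs rose by about the expected factor."""
    expected = before_avg * before_vms / in_service_vms
    return abs(observed_avg - expected) <= tolerance * expected

# 10 VMs averaging 40% utilization; 2 are taken out of service, so the
# remaining 8 are expected to average about 50%.
```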

Collection and Storage

The core of monitoring is recording and analyzing time series data (a sequence of time-stamped data points):

The monitoring system can conduct direct measurement or collect existing data, statistics, or logs and then turn them into metrics (with time and space). The data is then transferred to a repository. The incoming data streams need to be processed into a time series and stored in a time series database.

Three key challenges are: [p138]

The Round-Robin Database (RRD) is a popular time series database, which is designed for storing and displaying time series data with good retention policy configuration capabilities. Big data storage and processing solutions are increasingly used for monitoring data. You can treat your monitoring data as data streams feeding into streaming systems for real-time processing, combined with (big) historical data. You can load all your data into big data storage systems such as Hadoop Distributed File System (HDFS) or archive it in relatively inexpensive online storage systems such as Amazon Glacier.
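The retention idea behind RRD, where a fixed-size store lets new points overwrite the oldest, can be sketched with a ring buffer. This is a conceptual illustration, not the actual RRDtool data model (which also consolidates older points into coarser resolutions):

```python
from collections import deque

class RoundRobinSeries:
    """Fixed-capacity time series: once full, recording a new point
    silently discards the oldest, so storage never grows."""

    def __init__(self, capacity):
        self.points = deque(maxlen=capacity)

    def record(self, timestamp, value):
        self.points.append((timestamp, value))

    def latest(self, n):
        """Return the n most recent (timestamp, value) points."""
        return list(self.points)[-n:]
```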

When to Change the Monitoring Configuration

Monitoring is either time-based or event-based. Timing frequency and the generation of events should be configurable and changeable in response to events occurring in the datacenter.

Some examples of events that could change the monitoring configuration are:

Interpreting Monitoring Data

Assume that the monitoring data (both time-based and event-based) has been collected in a central repository. This data is continually being added to and examined, by both other systems and humans.

Logs

A log is a time series of events. Records are typically appended to the end of the log. Logs usually record the actions performed that may result in a state change of the system.

[p140]

Logs are used:

Some general rules about writing logs are:

Graphing and Display

Once you have all relevant data, it is useful to visualize it:

You can set up a dashboard showing important real-time aspects of your system and its components at an aggregated level. You can also dive into the details interactively or navigate through history when you detect an issue. An experienced operator will use visual patterns of graphs to discern problems.

[p141]

Alarms and Alerts

Monitoring systems inform the operator of significant events. This information can be in the form of either an alarm or an alert:

Alarms and alerts can be triggered by any of the following:

[p141]

The typical issues are:

A problem for operators is receiving false positive alarms or a flood of alerts from different channels about the same event. Under such conditions, operators will quickly get "alert fatigue" and start ignoring alerts or simply turn some of them off. On the other hand, if you try to reduce false positives, you may risk missing important events, which increases false negatives.
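One common tactic for reducing false positives is to require several consecutive threshold breaches before firing, trading a little detection delay for fewer spurious alerts. A minimal sketch (the threshold and streak length are illustrative):

```python
def should_alert(samples, threshold, consecutive=3):
    """Fire only after `consecutive` samples in a row breach the
    threshold, so a single noisy spike does not wake an operator."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= consecutive:
            return True
    return False
```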

If your alarms are very specific in their triggering conditions, you may be informed about some subtle errors early in their occurrence. However, you may risk rendering your alarms less effective when the system undergoes changes over time, or when the system momentarily exhibits interference from legitimate but previously unknown operations. [p142]

Some general rules to improve the usefulness of alerts and alarms are:

Diagnosis and Reaction

Operators often use monitoring systems to diagnose the causes and observe the progress of mitigation and recovery. However, monitoring systems are not designed for interactive or automated diagnosis. Thus, operators, in ad hoc ways, will try to correlate events, dive into details and execute queries, and examine logs. Concurrently, they manually trigger more diagnostic tests and recovery actions (such as restarting processes or isolating problematic components) and observe their effects from the monitoring system.

The essence of the skill of a reliability engineer is the ability to diagnose a problem in the presence of uncertainty. Once the problem has been diagnosed, frequently the reaction is clear although, at times, possible reactions have different business consequences. [p142-143]

Monitoring DevOps Processes

DevOps processes should be monitored so that they can be improved and problems can be detected.

Five things that are important to monitor:

  1. A business metric
  2. Cycle time
  3. Mean time to detect errors
  4. Mean time to report errors
  5. Amount of scrap (rework)

Challenges

Challenge 1: Monitoring Under Continuous Changes

Deviation from normal behavior *

In operation, a deviation from normal behavior is a problem. Normal behavior assumes the system is relatively stable over time. However, in a large-scale complex environment, changes are the norm. Besides varying workloads or dynamic aspects of your application, which are often well anticipated, the new challenges come from both of the following:

Deploying a new version into production multiple times a day is becoming a common practice:

How can past monitoring data be used for performance management, capacity planning, anomaly detection, and error diagnosis of the new system? *

In practice, operators may turn off monitoring during scheduled maintenance and upgrades as a work-around to reduce false positive alerts triggered by those changes. However, this can lead to periods with no monitoring at all (i.e., flying blind).

The following techniques can help address this:

  1. Carefully identify the non-changing portions of the data.
    • For example, use dimensionless data (i.e., ratios). You may find that although individual variables change frequently, the ratio of two variables is relatively constant.
  2. Focus monitoring on things that have changed.
  3. Compare performance of the canaries with historical performance. (As discussed in Chapter 6, canary testing is a way of monitoring a small rollout of a new system for issues in production.) Changes that cannot be rationalized because of feature changes may indicate problems.
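Technique 1 can be sketched with an error ratio: the raw counts swing with load and deployments, but the dimensionless ratio stays comparable across periods (the counts below are hypothetical):

```python
def error_ratio(error_count, request_count):
    """Dimensionless health signal: stable under workload changes
    that make raw counts hard to compare."""
    return error_count / request_count if request_count else 0.0

# Raw counts differ by 10x across a quiet and a busy period,
# yet the ratio is unchanged:
quiet = error_ratio(5, 1_000)
busy = error_ratio(50, 10_000)
```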

Specification of monitoring parameters *

The specification of monitoring parameters is another challenge related to monitoring under continuous changes [p144].

The complexity of setting up and maintaining a monitoring system consists of:

Continuous changes in the system infrastructure and the system itself complicate the setting of monitoring parameters. Your monitoring may need to be adjusted for variance on the infrastructure side. [p144]

As a consequence, it makes sense to automate the configuration of alarms, alerts, and thresholds as much as possible. The monitoring configuration process is just another DevOps process to be automated:

Challenge 2: Bottom-Up vs. Top-Down and Monitoring in the Cloud

Challenge 3: Monitoring a Microservice Architecture

Challenge 4: Dealing with Large Volumes of Distributed (Log) Data

Tools