Chapter 6. Deployment

Introduction

Deployment is the process of placing a version of a service into production. The initial deployment of a service can be viewed as going from no version of the service to the initial version of the service. Because an initial deployment happens only once for most systems and new versions happen frequently, this chapter discusses upgrading a service.

The overall goal of a deployment is to place an upgraded version of the service into production with minimal impact on the users of the system, whether from failures or downtime.

There are three reasons for changing a service:

The initial discussion assumes that deployment is an all-or-nothing process: at the end of the deployment either all of the virtual machines (VMs) running a service have had the upgraded version deployed or none of them have. Later in this chapter, partial deployments are discussed.

Figure 6.1 Microservice 3 is being upgraded. (Adapted from Figure 4.1.) [Notation: Architecture]

In the figure above, Microservice 3 is being upgraded (shown in dark gray). Microservice 3 depends on microservices 4 and 5, and microservices 1 and 2 (clients of microservice 3) depend on it. For now, assume that any VM runs exactly one service. Other options are discussed later in this chapter.

The goal of a deployment is to move from the current state that has N VMs of the old version, A, of a service executing, to a new state where there are N VMs of the new version, B, of the same service in execution.

Strategies for Managing a Deployment

There are two popular strategies for managing a deployment: blue/green deployment and rolling upgrade. They differ in terms of costs and complexity.

Before discussing these strategies in more detail, we need to make the following two assumptions:

  1. Service to the clients should be maintained while the new version is being deployed.
    • Maintaining service to the clients with no downtime is essential for many Internet e-commerce businesses.
    • Organizations that have customers primarily localized in one geographic area can afford scheduled downtime, but scheduling downtime during off-hours requires system administrators and operators to work in the off-hours.
  2. Any development team should be able to deploy a new version of their service at any time without coordinating with other teams.
    • This may have an impact on client services developed by other teams, but removes one cause for synchronous coordination.

Placing a new VM of a version into production takes time. To place an upgraded VM of a service into production, the new version must be loaded onto a VM, initialized, and integrated into the environment, sometimes only after certain other services have been placed first. This can take on the order of minutes. Consequently, depending on how much of this work can be done in parallel and on its impact on the portion of the system still serving clients, upgrading hundreds or thousands of VMs can take hours or, in extreme cases, even days.
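As a rough, purely illustrative calculation (the numbers are assumptions, not measurements from any particular system): if each VM takes 5 minutes to provision and initialize, and 10 VMs can be upgraded in parallel without degrading service, then upgrading 1,000 VMs takes roughly (1,000 / 10) × 5 = 500 minutes, or a little over 8 hours.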

Blue/Green Deployment

A blue/green deployment (sometimes called big flip or red/black deployment) consists of maintaining the N VMs containing version A in service while provisioning N additional VMs containing version B.

Once N VMs have been provisioned with version B and are ready to service requests, then client requests can be routed to version B. This is a matter of instructing the domain name server (DNS) or load balancer to change the routing of messages. This routing switch can be done in a single stroke for all requests. After a supervisory period, the N VMs provisioned with version A are removed from the system. If anything goes wrong during the supervisory period, the routing is switched back, so that the requests go to the VMs running version A again.
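A minimal sketch of this single-stroke switch and its rollback is shown below. The LoadBalancer class, its set_backend_pool method, and the healthy check are hypothetical stand-ins for whatever DNS or load balancer mechanism you actually use; they are not part of any particular product.

    # A toy load balancer that routes every request to one active pool
    # (hypothetical names; not a real product API).
    class LoadBalancer:
        def __init__(self, initial_pool):
            self.active_pool = initial_pool      # e.g., the "blue" pool running version A

        def set_backend_pool(self, pool):
            # The switch is a single, atomic routing change: after this call,
            # all new requests go to the given pool.
            self.active_pool = pool

        def route(self, request):
            return self.active_pool.handle(request)

    def blue_green_switch(lb, blue_pool, green_pool, healthy):
        """Cut traffic over to green (version B); cut back if probation fails."""
        lb.set_backend_pool(green_pool)          # all requests now go to version B
        if not healthy():                        # supervisory-period check
            lb.set_backend_pool(blue_pool)       # roll back: version A serves requests again
            return False
        return True                              # only now would the blue pool be decommissioned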

This strategy is conceptually simple, but it has some disadvantages:

A variation of this model is to do the traffic switching gradually. A small percentage of requests is first routed to version B, effectively conducting a canary test. Canary testing is mentioned in Chapter 5 and discussed in more detail in the Canary Testing section later in this chapter. If everything goes well for a while, more version B VMs can be provisioned and more requests can be routed to this pool of VMs, until all requests are routed to version B. This increases confidence in your deployment, but it also introduces a number of consistency issues (discussed in Section 6.3).

Rolling Upgrade

A rolling upgrade consists of deploying a small number of version B VMs at a time directly to the current production environment, while switching off the same number of VMs running version A. For example, we deploy one version B VM at a time. Once an additional version B VM has been deployed and is receiving requests, one version A VM is removed from the system. Repeating this process N times results in a complete deployment of version B.

This strategy is inexpensive but more complicated. It may require a small number of additional VMs for the duration of the deployment, but it introduces a number of consistency issues and more risk of disturbing the current production environment.

The following figure provides a representation of a rolling upgrade within the Amazon cloud:

Figure 6.2 Representation of a rolling upgrade [Notation: BPMN]

  1. Each VM (containing one service) is decommissioned (removed, deregistered from the elastic load balancer (ELB), and terminated).
  2. Then, a new VM is started and registered with the ELB.
  3. This process continues until all of the VMs containing version A have been replaced with VMs containing version B (a code sketch of these steps follows).
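The following sketch walks through one pass of these steps, assuming a classic ELB and the boto3 AWS SDK; the AMI ID, instance IDs, load balancer name, and instance type are hypothetical, and a real script would also need error handling and health checks.

    import boto3

    elb = boto3.client("elb")    # classic Elastic Load Balancing
    ec2 = boto3.client("ec2")

    def replace_one_instance(old_instance_id, new_ami, lb_name, instance_type="m3.medium"):
        # Step 1: decommission the version A VM (deregister from the ELB, then terminate).
        elb.deregister_instances_from_load_balancer(
            LoadBalancerName=lb_name,
            Instances=[{"InstanceId": old_instance_id}],
        )
        ec2.terminate_instances(InstanceIds=[old_instance_id])

        # Step 2: start a version B VM from the new image, wait until it is running,
        # and register it with the ELB so that it starts receiving requests.
        new_instance_id = ec2.run_instances(
            ImageId=new_ami, InstanceType=instance_type, MinCount=1, MaxCount=1
        )["Instances"][0]["InstanceId"]
        ec2.get_waiter("instance_running").wait(InstanceIds=[new_instance_id])
        elb.register_instances_with_load_balancer(
            LoadBalancerName=lb_name,
            Instances=[{"InstanceId": new_instance_id}],
        )
        return new_instance_id

    # Step 3: repeat for every version A VM (hypothetical IDs shown).
    for old_id in ["i-0123456789abcdef0"]:
        replace_one_instance(old_id, new_ami="ami-0abcdef1234567890", lb_name="my-service-elb")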

The additional cost of a rolling upgrade can be low if you conduct your rolling upgrade when your VMs are not fully utilized, and your killing of one or a small number of VMs at a time still maintains your expected service level. It may cost a bit if you add a small number of VMs before you start the rolling upgrade to mitigate the performance impact and risk of your rolling upgrade.

During a rolling upgrade, one subset of the VMs is providing service with version A, and the remainder of the VMs are providing service with version B. This creates the possibility of failures as a result of mixed versions. This type of failure is discussed in the next section.

Logical Consistency

Several types of logical consistency issues can arise:

Multiple Versions of the Same Service Simultaneously Active

The following figure shows an instance of an inconsistency because of two active versions of the same service. Two components are shown: the client and two versions (versions A and B) of a service.

  1. The client sends a message that is routed to version B.
  2. Version B performs its actions and returns some state to the client.
  3. The client then includes that state in its next request to the service.
  4. The second request is routed to version A, and this version does not know what to make of the state, because the state assumes version B. Therefore, an error occurs.

This problem is called a mixed-version race condition.

Figure 6.3 Mixed-version race condition, leading to an error [Notation: UML Sequence Diagram]
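The failure can be made concrete with a small sketch; the two service classes and the state field names below are purely illustrative.

    # Version B returns state containing a field that version A does not understand,
    # so the client's follow-up request, routed to version A, fails.
    class ServiceA:
        def handle(self, request, state=None):
            if state and "session_token_v2" in state:
                raise ValueError("unknown state field: session_token_v2")
            return {"result": "ok", "state": {"session_token": "abc"}}

    class ServiceB:
        def handle(self, request, state=None):
            # Version B introduces a new state field.
            return {"result": "ok", "state": {"session_token_v2": "xyz"}}

    # The client's first request happens to be routed to B, the second to A.
    state = ServiceB().handle("request 1")["state"]
    ServiceA().handle("request 2", state=state)   # raises ValueError: the race condition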

Several techniques can prevent this situation:

These options are not mutually exclusive; some of them can be used together. For example, you can use feature toggles within a backward compatible setting: within a rolling upgrade you will have installed some VMs of the new version while not yet having activated the new features. This requires the new version to be backward compatible.

Feature Toggling

[p107]

The activation of a feature must be coordinated in two directions:

  1. All of the VMs for the service you just deployed must have the service’s portion of the feature activated.
  2. All of the services involved in implementing the feature must have their portion of the feature activated.

Feature toggles (described in Chapter 5) can be used to control whether a feature is activated. A feature toggle is a piece of code within an if statement where the if condition is based on an externally settable feature variable. Using this technique means that the problems associated with activating a feature are:

Both of these problems are examples of synchronizing across the elements of a distributed system. The primary modern methods for performing such synchronization are based on the Paxos or ZAB algorithms, which are difficult to implement correctly. However, standard implementations are available in systems such as ZooKeeper.

Example of feature toggling with ZooKeeper

Assume the service being deployed implements a portion of a single feature, Feature X. When a VM of the service is deployed, it registers itself as being interested in FeatureXActivationFlag. If the flag is false, then the feature is toggled off; if the flag is true, the feature is toggled on. If the state of the FeatureXActivationFlag changes, then the VM is informed of this and reacts accordingly.

An agent (which can be human or automated) external to any of the services in the system being upgraded is responsible for setting FeatureXActivationFlag. The flag is maintained in ZooKeeper and thus kept consistent across the VMs involved. As long as all of the VMs are informed simultaneously of the toggling, the feature is activated simultaneously and there is no version inconsistency that could lead to failures. The simultaneous information broadcast is performed by ZooKeeper. This particular use of ZooKeeper for feature toggling is also implemented in other tools. For example, Netflix's Archaius tool provides configuration management for distributed systems; the configuration being managed can be feature toggles or any other property.
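A minimal sketch of a VM registering interest in the flag is shown below, assuming the kazoo Python client for ZooKeeper; the znode path and the string encoding of the flag are assumptions, not prescribed by the pattern.

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zookeeper:2181")
    zk.start()

    FLAG_PATH = "/flags/FeatureXActivationFlag"
    zk.ensure_path(FLAG_PATH)

    feature_x_enabled = False

    @zk.DataWatch(FLAG_PATH)
    def on_flag_change(data, stat):
        # Called once at registration and again whenever the flag's value changes.
        global feature_x_enabled
        feature_x_enabled = (data == b"true")

    def handle_request(request):
        # The feature toggle: an if statement whose condition is the externally
        # settable flag, kept consistent across VMs by ZooKeeper.
        if feature_x_enabled:
            return "response using Feature X"     # new behavior
        return "response without Feature X"       # old behavior

The external agent would activate the feature by writing the flag, for example with zk.set(FLAG_PATH, b"true").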

The agent is aware of the various services implementing Feature X and does not activate the feature until all of these services have been upgraded.

One complication comes from deciding when the VMs have been "sufficiently upgraded." VMs may fail or become unavailable, and waiting for these VMs to be upgraded before activating the feature is not desirable. The use of a registry/load balancer as described in Chapter 4 enables the activation agent to avoid these problems. Recall that each VM must renew its registration periodically to indicate that it is still active. The activation agent examines the relevant VMs that are registered to determine when all VMs of the relevant services have been upgraded to the appropriate versions.
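A sketch of that check follows; the registry API, service names, and version numbers are hypothetical, and versions are compared as tuples for simplicity.

    # Activate Feature X only when every live (currently registered) VM of the
    # involved services is running at least the required version.
    REQUIRED_VERSIONS = {"billing-service": (2, 0), "catalog-service": (5, 3)}

    def sufficiently_upgraded(registry):
        for service, required in REQUIRED_VERSIONS.items():
            for vm in registry.live_instances(service):   # only VMs with a fresh registration
                if vm.version < required:
                    return False
        return True

    def maybe_activate_feature_x(registry, zk):
        if sufficiently_upgraded(registry):
            # Broadcast the activation to all VMs via the ZooKeeper flag.
            zk.set("/flags/FeatureXActivationFlag", b"true")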

Backward and Forward Compatibility

Maintaining backward compatibility can be done using the pattern depicted in the figure below:

Figure 6.4 Maintaining backward compatibility for service interfaces [Notation: Architecture]

The service being upgraded makes a distinction between internal and external interfaces:

As far as a client is concerned, the old interfaces are still available for the new version. If a client wishes to use a new feature, then a new interface is available for that feature.

One consequence of using this pattern is that obsolete interfaces may be maintained beyond the point where any clients use them. Determining which clients use which interfaces can be done by monitoring and recording all service invocations. Once an interface has had no usages for a sufficiently long time, it can be deprecated. Deprecating an interface may result in additional maintenance work, so it should not be done lightly.
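As a concrete illustration of keeping old external interfaces available while exposing new ones, here is a sketch using Flask; the URL scheme, field names, and the choice of Flask itself are assumptions for the example, not part of the pattern.

    from flask import Flask, jsonify

    app = Flask(__name__)

    # Hypothetical record; the old schema keeps the address in a single field.
    CUSTOMER = {"id": "42", "address": "12 Main St, Springfield, 12345, US"}

    @app.route("/v1/customers/<cid>")
    def get_customer_v1(cid):
        # Old external interface, preserved unchanged for existing clients.
        return jsonify({"id": cid, "address": CUSTOMER["address"]})

    @app.route("/v2/customers/<cid>")
    def get_customer_v2(cid):
        # New interface for clients that want the new, structured address.
        street, city, postal_code, country = [p.strip() for p in CUSTOMER["address"].split(",")]
        return jsonify({"id": cid,
                        "address": {"street": street, "city": city,
                                    "postalCode": postal_code, "country": country}})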

Forward and backward compatibility allows for independent upgrade for services under your control. Not all services will be under your control. In particular, third-party services, libraries, or legacy services may not be backward compatible. In this case, there are several techniques you can use, although none of them are foolproof:

Figure 6.5 Portability layer with two versions of the external system coexisting [Notation: Architecture]

Compatibility with Data Kept in a Database

Besides the compatibility of services, some services must also be able to read from and write to a database consistently. For example, suppose the data schema changes: In the old version of the schema, there is one field for the customer address; in the new version, the address is broken into street, city, postal code, and country. Inconsistency, in this case, means that a service attempts to write the address as a single field to a database whose schema has the address broken into portions.

Inconsistencies are triggered by a change in the database schema. A schema can be either explicit such as in relational database management systems (RDBMSs) or implicit such as in various NoSQL database management systems.

The most basic solution to such a schema change is not to modify existing fields but only to add new fields or tables, which can be done without affecting existing code. The use of the new fields or tables can be integrated into the application incrementally. One method for accomplishing this is to treat new fields or tables as new features in a release. That is, either the use of the new field or table is under the control of a feature toggle or the services are forward and backward compatible with respect to database fields and tables.

If a change to the schema is absolutely required you have two options:

  1. Convert the persistent data from the old schema to the new one.
  2. Convert data into the appropriate form during reads and writes. This could be done either by the service or by the database management system.

These options are not mutually exclusive. You might perform the conversion in the background and convert data on the fly while the conversion is ongoing. Modern RDBMSs provide the ability to reorganize data from one schema to another online while satisfying requests, although at a storage and performance cost. NoSQL database systems typically do not provide this capability, so if you use them, you have to engineer a solution for your particular situation. [p111]
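A sketch of option 2 performed in the service itself is shown below, continuing the address example; the record layout, field names, and the toy in-memory database are assumptions made for illustration.

    def read_customer(db, customer_id):
        record = db.get(customer_id)
        if "address" in record and "street" not in record:
            # Old-schema record: convert on the fly and (optionally) write it back
            # so the data migrates incrementally as it is touched.
            street, city, postal_code, country = [p.strip() for p in record["address"].split(",")]
            record.update(street=street, city=city, postal_code=postal_code, country=country)
            db.put(customer_id, record)
        return record

    # Usage with a plain dictionary standing in for the database:
    class DictDB(dict):
        def get(self, key):
            return dict.__getitem__(self, key)
        def put(self, key, value):
            self[key] = value

    db = DictDB({"42": {"address": "12 Main St, Springfield, 12345, US"}})
    print(read_customer(db, "42")["city"])        # -> Springfield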

Packaging

This section discusses consistency of the build process in terms of getting the latest versions of the services into production. Deciding that components package services and that each service is packaged as exactly one component (discussed in Chapter 4) does not end your packaging decisions. You must decide on the binding time among components residing on the same VM and on a strategy for placing services into VMs. Packaging components onto a VM image is called baking, and the options range from lightly baked to heavily baked (discussed in Chapter 5). What we add to that discussion here is the number of processes loaded into each VM.

A VM image runs on top of a hypervisor, which enables a single bare-metal machine's processor, memory, and network to be shared among multiple tenants, or VMs. The VM image is loaded onto the hypervisor, which schedules it.

A VM image could include multiple independent processes, each a service. The question is: Should multiple services be placed in a single VM image? The following figure shows two options:

Figure 6.6 Different options for packaging services [Notation: Architecture]

One minor difference in these two options is the number of times that a VM image must be baked:

A more important difference occurs when service 1 sends a message to service 2:

This means the latency for messages will be higher when each service is packaged into its own VM.

However, packaging multiple services into the same VM image opens up the possibility of deployment race conditions, because different development teams do not coordinate over their deployment schedules and they may be deploying their upgrades at (roughly) the same time.

The examples below assume the upgraded services are included in the deployed portion of the VM (heavily baked) and not loaded later by the deployed software.

Figure 6.7 One type of race condition when two development teams deploy independently [Notation: UML Sequence Diagram]

Figure 6.8 A different type of race condition when two development teams deploy independently [Notation: UML Sequence Diagram]

The tradeoff for including multiple services into the same VM is between reduced latency and the possibility of deployment race conditions.

Deploying to Multiple Environments

As long as services are independent and communicate only through messages, deployment to multiple environments (e.g., VMware and Amazon EC2) is possible with essentially the design we have presented. The registry/load balancer discussed in Chapter 4 needs to be able to direct messages to different environments.

Business Continuity

Introduced in Chapter 2, business continuity is the ability of a business to maintain service when facing a disaster or serious outage. It is achieved by deploying to sites that are physically and logically separated from each other. This section differentiates between deploying to a public cloud and deploying to a private cloud, although the essential element, the management of state, is the same in both. Disaster recovery is discussed in Chapter 10, and a case study is presented in Chapter 11.

Public Cloud

Public clouds are extremely reliable in the aggregate. They consist of hundreds of thousands of physical servers and provide extensive replication and failover services. Still, failures do occur, whether to particular VMs of your system or to other cloud services.

Amazon EC2 has multiple regions (nine as of this writing) scattered around the globe. Each region has multiple availability zones. Each availability zone is housed in a location that is physically distinct from other availability zones and that has its own power supply, physical security, and so forth:

Two considerations to keep in mind when you deploy to different availability zones or regions are state management and latency:

  1. State management. Making services stateless has several advantages, as discussed in Chapter 4.
    • If a service is stateless then additional VMs can be created at any time to handle increased workload. Additional VMs can also be created in the event of a VM failure.
    • The disadvantages of stateless services are that state must be maintained somewhere in the system and latency may increase when the service needs to obtain or change this state.
  2. Latency. Sending messages from one availability zone to another adds a bit of latency; messages sent from one region to another add more latency to your system.

Private Cloud

[p116]

Partial Deployment

Up to this point the discussion has been focused on all-or-nothing deployments. Now we discuss two types of partial deployments: canary testing and A/B testing.

Canary Testing

A new version is deployed into production after having been tested in a staging environment, which is as close to the production environment as possible. There is still a possibility of errors existing in the new version. These errors can be either functional errors or errors that affect quality. Performing an additional step of testing in the real production environment is the purpose of canary testing. A canary test is conceptually similar to a beta test in the shrink-wrapped software world.

One question is to whom to expose the canary servers. This can be a random sample of users. An alternative is to decide the question based on the organization a user belongs to, for example, the employees of the developing organization, or particular customers. The question could also be answered based on geography, for example, such that all requests that are routed to a particular datacenter are served by canary versions.

The mechanism for performing the canary tests depends on whether features are activated with feature toggles or whether services are assumed to be forward or backward compatible. In either case, a new feature cannot be fully tested in production until all of the services involved in delivering the feature have been partially deployed.

Messages can be routed to the canaries by making the registry/load balancer canary-aware and having it route messages from the designated testers to the canary versions. More and more messages can be routed until a desired level of performance has been exhibited.

In either case, you should carefully monitor the canaries, and they should be rolled back in the event an error is detected.
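The routing decision can be sketched as follows; the router class, its pool objects, and the idea of mixing designated testers with a configurable fraction of general traffic are illustrative assumptions, not a specific product's API.

    import random

    class CanaryRouter:
        def __init__(self, production_pool, canary_pool, testers=(), canary_fraction=0.0):
            self.production_pool = production_pool
            self.canary_pool = canary_pool
            self.testers = set(testers)            # e.g., employees of the developing organization
            self.canary_fraction = canary_fraction # raised gradually as confidence grows

        def route(self, request):
            # Designated testers always hit the canaries; other users hit them
            # with probability canary_fraction.
            if request["user"] in self.testers or random.random() < self.canary_fraction:
                return self.canary_pool
            return self.production_pool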

A/B Testing

A/B testing is introduced in Chapter 5. It is another form of testing that occurs in the production environment through partial deployment. The "A" and "B" refer to two different versions of a service that present either different user interfaces or different behavior. In this case, it is the behavior of the user when presented with these two different versions that is being tested.

If either A or B shows preferable behavior in terms of some business metric such as orders placed, then that version becomes the production version and the other version is retired.

Implementing A/B testing is similar to implementing canaries. The registry/load balancer must be made aware of A/B testing and ensure that a single customer is served by VMs with either the A behavior or the B behavior but not both. The choice of users that are presented with version B (or A) may be randomized, or it may be deliberate. If deliberate, factors such as geographic location, age group (for registered users), or customer level (e.g., "gold" frequent flyers) may be taken into account.
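One simple way to guarantee that a customer consistently sees the same variant is to derive the assignment deterministically from the customer's identity, as in the sketch below; the experiment name, the 50/50 split, and the hashing scheme are assumptions for illustration.

    import hashlib

    def ab_variant(user_id, experiment="checkout-flow", b_fraction=0.5):
        # Hash the (experiment, user) pair so the same user always lands in the
        # same bucket for this experiment, without storing any state.
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        bucket = int(digest[:8], 16) / 0xFFFFFFFF   # stable value in [0, 1]
        return "B" if bucket < b_fraction else "A"

    # The registry/load balancer would route the request to a VM serving this variant.
    print(ab_variant("customer-123"))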

Rollback

The new version of a service is on probation for some period after deployment. It has gone through testing of a variety of forms but it still is not fully trusted.

Rolling back means reverting to a prior release. It is also possible to roll forward, that is, to correct the error and generate a new release with the error fixed. Rolling forward is essentially just an instance of upgrading.

Because of the sensitivity of a rollback and the possibility of rolling forward, rollbacks are rarely triggered automatically. A human should be in the loop who decides whether the error is serious enough to justify discontinuing the current deployment. The human then must decide whether to roll back or roll forward.

Rollback for blue/green deployment

If you still have VMs with version A available, as in the blue/green deployment model before all of the version A VMs have been decommissioned, rolling back can be done by simply redirecting the traffic back to them. One way of dealing with the persistent state problem is to keep the version A VMs receiving a replicated copy of the requests that version B receives during the probation period.
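A sketch of this request mirroring follows; the pool objects and their handle method are hypothetical.

    def handle_with_mirroring(request, green_pool, blue_pool):
        # Version B (green) serves the client; version A (blue) receives a
        # replicated copy so its state stays current in case of a rollback.
        response = green_pool.handle(request)
        try:
            blue_pool.handle(request)
        except Exception:
            pass   # a failure in the mirror must not affect the client's response
        return response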

Rollback for rolling upgrade deployment

With a rolling upgrade model, you cannot simply replace version B with version A as a whole; you have to replace version B VMs with version A VMs in more complicated ways. The new version B can be in one of four states during its lifetime:

The strategy for rolling back (version B is partially installed, or fully installed but on probation) depends on whether feature toggles are being used and have been activated. This pertains to both of the remaining states:

The remaining case deals with persistent data and is the most complicated. Suppose all of the version B VMs have been installed and version B’s features activated, but a rollback is necessary. Rolling back to the state where version B is installed but no features activated is a matter of toggling off the new features, which is a simple action. The complications come from consideration of persistent data.

A concern when an error is detected is that incorrect values have been written into the database. Dealing with erroneous database values is a delicate operation with significant business implications.

[p119-120]

Identifying and correcting incorrect values in the database is a delicate and complicated operation requiring the collection of much metadata.

Tools

One method for categorizing tools is to determine whether they directly affect the internals of the entity being deployed. As discussed in Chapter 5, if a VM image contains all the required software, including the new version, you can replace a whole VM of the old version with a whole VM of the new version. This is called a heavily baked deployment approach.

Alternatively, you can use tools to change the internals of a VM, deploying the new version in place of the old one without terminating the VM. Even if you do terminate the VM running the old version, you can start a new, lightly baked VM and then access the machine from the inside to deploy the new version at a later stage of the deployment process.

Summary

Strategies for deploying multiple VMs of a service include blue/green deployment and rolling upgrade:

Solutions to the problems of logical consistency involve using some combination of feature toggles, forward and backward compatibility, and version awareness.

Deployments must occasionally be rolled back. Feature toggles support rolling back features, but the treatment of persistent data is especially sensitive when rolling back a deployment.

Deployment also plays an important role for achieving business continuity. Deploying into distinct sites provides one measure of continuity. Having an architecture that includes replication allows for a shorter time to repair and to resume processing in the event of an unexpected outage.

A variety of tools exist for managing deployment. The emergence of lightweight containers and image management tools is helping developers to deploy into small-scale production-like environments more easily for testing.