As you may know, a quote that shaped the way I think about architecture is from Werner Vogels, CTO at Amazon.com. He said:
“Failures are a given, and everything will eventually fail over time.”
Having worked on large-scale systems for more than a decade, if I could summarize in a single animation what I think about managing systems at scale and failure, it would be something like this. (this is a real video and the base jumper survived that failure)

But why? Good question Vincent!
The art of managing systems at scale lies in embracing failure and being at the edge — pushing the limits of your system and software performance ‘almost’ to breaking point, yet still being able to recover. From the outside it looks both impressive and ridiculous. That’s what resiliency is all about.
What is resiliency, and why you shouldn’t be afraid of failing.
Resilient systems embrace the idea that failures are normal, and that it’s perfectly OK to run systems in what we call partially failing mode.
When we are dealing with smaller systems of up to a few tens of instances, 100% operational excellence is often the normal state, and failure is an exceptional condition. However, when dealing with large-scale systems, probabilities are such that 100% operational excellence is near impossible to achieve. Therefore, the normal state of operation is partial failure.
While not suitable for life-critical applications, running in partially failing mode is a viable option for most web applications, from e-commerce services like Amazon.com to video-on-demand sites such as Netflix, to music services like Soundcloud (shameless plug to my playlists).
Of course, I’m not saying it doesn’t matter if your system fails. It does, and it might result in lost revenue, but it’s probably not life-critical.
This is the balance you should aim for: the cost of being highly resilient to failure versus the potential loss of money due to outages.
For me, building resilient architectures has been an amazing engineering journey, with ups-and-downs, some 1am wake-up calls, some Christmases spent debugging, some “I’m done, I quit” … but most of all, it’s been a learning experience that brought me to where I am today — and for that, I am forever grateful to Pagerduty 🙂
This blog post is a collection of tips and tricks that have served me well throughout my journey, and I hope they will serve you well too.
It’s not all about software.
One important thing to realize early on is that building resilient architecture isn't all about software. It starts at the infrastructure layer, progresses to the network and data, influences application design and extends to people and culture.
The following is a list of patterns that may help you build resilient architectures, so feel free to comment and share your own experiences below.
Redundancy
One of the first things — and also one of the most important things — to do when you deploy an application in the cloud is to architect your application to be redundant. Simply put, redundancy is the duplication of components of a system in order to increase the overall availability of that system.
In the AWS cloud, that means deploying it across multiple availability zones (multi-AZ) or, in some cases, across multi-regions.

While I discussed availability in one of my previous blog posts, it is worth remembering the mathematics behind this pattern.

As you can see in the previous picture, if you take component X with 99% availability (which is pretty bad by today's standards) and put two of them in parallel, you increase your overall system availability to 99.99%!!! That's right: simply by putting them in parallel. Now, if you put three component X's in parallel, you get to the famous six nines! That is just ridiculous in terms of availability, and it's why AWS always advises customers to deploy their applications across multiple AZs, preferably three of them (to avoid overload).
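To make the math concrete, here is a tiny Python sketch (purely illustrative) that computes the availability of n identical components in parallel, assuming their failures are independent:

def parallel_availability(component_availability, n):
    # Availability of n redundant components in parallel, assuming independent failures.
    return 1 - (1 - component_availability) ** n

print(parallel_availability(0.99, 2))  # ~0.9999   -> 99.99%
print(parallel_availability(0.99, 3))  # ~0.999999 -> the famous six nines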
This is also why using AWS regional services like S3, DynamoDB, SQS, Kinesis, Lambda or ELB just to name a few, is a good idea — they are, by default, using multiple AZs under the hood. And this is also why using RDS configured in multi-AZ deployment is neat!
To be able to design your architecture across multi-AZ, you must have a stateless application and potentially use an elastic load balancer to distribute requests to the application.
It is important to note that not all requests have to be stateless, since in some cases we want to maintain ‘stickiness’ — but a vast majority of them can and should be, otherwise you will end up with balancing issues.
If you are interested in building multi-region redundancy, I have posted a series of posts on the topic here.
Auto scaling
Once you have the multi-AZ capability sorted out, you need to enable Auto Scaling. Auto scaling means that your applications can automatically adjust capacity to demand.

Scaling your resources dynamically according to demand — as opposed to doing it manually — ensures your service can meet a variety of traffic patterns without anyone needing to plan for it.
Instances, containers, and functions all provide mechanisms for auto scaling either as a feature of the service itself, or as a support mechanism.
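As an illustration, here is a minimal boto3 sketch (the group name and target value are made up) that attaches a target-tracking scaling policy to an EC2 Auto Scaling group so that capacity follows average CPU utilization:

import boto3

autoscaling = boto3.client("autoscaling")
autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-web-asg",   # hypothetical Auto Scaling group name
    PolicyName="keep-cpu-around-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,             # scale out/in to stay near 50% average CPU
    },
)

With a policy like this in place, the group adds instances when CPU climbs above the target and removes them when traffic quiets down, without anyone having to plan for it.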

Until recently, auto scaling in AWS was linked to AWS services — but now, Application Auto Scaling can be used to add auto scaling capabilities to any services you build on AWS. That’s a big deal!!!
This feature allows the same auto scaling engine that powers services like EC2, DynamoDB, and EMR to power auto scaling in a service outside of AWS. Netflix is already leveraging this capability to scale its Titus container management platform.
Considerations for auto scaling:
The speed at which you can tolerate auto scaling will define what technology you should use. Let me explain:
I started using auto scaling with EC2 like most early adopters (back then, there wasn’t a choice). One of the first mistakes I made was to launch a new instance and have scripts (bash, puppet, or chef) do the software configuration whenever a new instance launched. Needless to say, it was very slow to scale and extremely error prone.
Creating custom AMIs that are pre-configured (known as Golden AMIs) enabled me to shift the heavy configuration setup from the instance launch to an earlier, ‘baking’ time — built within the CI/CD pipeline.
It is often difficult to find the right balance between what is baked into an AMI and what is done at scaling time. In some cases, particularly if you're smart with service discovery, you don't have to run or configure anything at startup time. In my opinion, this should be the goal: the less you have to do at startup time, the faster your scaling will be. Speed aside, the more scripts and configurations you run at startup time, the greater the chance that something will go wrong.
Finally, what became the holy grail of the Golden AMI was to get rid of the configuration scripts and replace them with Dockerfiles instead. That way, we didn’t have to maintain chef or puppet recipes — which at scale were impossible to manage — but more importantly, we could test the applications from the laptop all the way to production systems — using the same container — and do so quickly.
Today, scaling with container platforms like ECS or Lambda functions is even faster, and should be reflected in any architectural design. Large scale systems often have a combination of all these technologies.
Infrastructure as Code
One benefit of using infrastructure as code is repeatability. Let's examine the task of configuring a datacenter, from configuring the networking and security to the deployment of applications. Consider for a moment the amount of work required to do that manually, for multiple environments across several regions. Not only would it be a tedious task, it would most likely introduce configuration differences and drift over time. Humans aren't great at undertaking repetitive, manual tasks with 100% accuracy, but machines are. Give the same template to a computer and it will execute it 10,000 times in exactly the same way.
Now, imagine your environment being compromised, suffering an outage, or even being deleted by mistake (yes, I saw that happen once). You have data backup. You have the infrastructure templates. All you have to do is re-run the template in a new AWS account, a new VPC or even a new region, restore the backups, and voilà, you’re up-and-running in minutes. Doing that manually, and at scale, will result in a nightmare scenario, with a lot of sweat and tears — and worse: unhappy customers.
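As a minimal illustration (the stack name, template file, and region list are hypothetical), the same versioned CloudFormation template can be replayed with boto3 into any region or account you have credentials for:

import boto3

REGIONS = ["eu-west-1", "us-east-1"]        # hypothetical target regions

with open("infrastructure.yaml") as f:      # the template you keep under version control
    template_body = f.read()

for region in REGIONS:
    cloudformation = boto3.client("cloudformation", region_name=region)
    cloudformation.create_stack(
        StackName="my-service",             # hypothetical stack name
        TemplateBody=template_body,
        Capabilities=["CAPABILITY_IAM"],    # only needed if the template creates IAM resources
    )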
Another benefit of infrastructure as code is knowledge sharing. Indeed, if you version control your infrastructure, you can treat it the same way you treat application code. Teams can commit changes to it and ask for improvements or changes in configuration. If that process goes through a pull request, the rest of the team can verify, challenge, and comment on that request, promoting better practices.
Another advantage is history preservation of the infrastructure evolution — and being able to answer “Why did we change that?” two months later.
Immutable Infrastructure
The principle of immutable infrastructure is fairly simple: Immutable components are replaced for every deployment, rather than being updated in place.
- No updates should ever be performed on live systems
- You always start from a new instance of the resource being provisioned: EC2 instance, Container, or a Lambda Function.
This deployment strategy supports the principle of the Golden AMI and is based on the Immutable Server pattern, which I love since it reduces configuration drift and ensures deployments are repeatable anywhere from source.
A typical immutable infrastructure update creates fresh resources from the updated image, shifts traffic over to them, and then retires the old resources.
To support application deployment in an immutable infrastructure, you should preferably use canary deployment: a technique that reduces the risk of failure when new versions of an application enter production by gradually rolling out the change to a small subset of users before slowly making it available to everybody.
According to Kat Eschner, the origin of the name canary deployment comes from an old British mining tradition where miners used canaries to detect carbon monoxide and toxic gases in coal mines. To make sure mines were safe to enter, miners would send canaries in first, and if the canary died or got ill, the mine was evacuated.

The benefit of canary deployment is, of course, the near immediate rollback it gives you — but more importantly, you get fast and safer deployments with real production test data!
There are a few considerations to keep in mind during canary deployment:
- Being stateless: You should preferably not be dependent on any state between the different old and new versions deployed. If you need stickiness, keep it to the bare minimum.
- Since fast rollback is a feature, your application should be able to handle it gracefully.
- Managing canary deployment with database schema changes is hard — really hard. I often get asked how to do it. Unfortunately there’s no silver bullet. It depends on the changes, the application, and the architecture around the database.
Canary deployment using Route 53 isn't the only option. If you use auto scaling groups behind a load balancer, you can also use a red/black deployment. Reusing the load balancer avoids any DNS record set changes in Route 53. It goes as follows: attach a new auto scaling group running the new version to the existing load balancer, validate it with a portion of the traffic, and then detach and drain the auto scaling group running the old version.
For your serverless applications, you also have the option of using Amazon API Gateway, since it now supports canary release deployments, or you can implement canary deployments of AWS Lambda functions with alias traffic shifting, where an alias is configured to shift traffic between two function versions based on weights.
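For example, a boto3 sketch along these lines (the function and alias names are made up) keeps 90% of the traffic of the 'live' alias on version 1 while shifting 10% to version 2:

import boto3

lambda_client = boto3.client("lambda")
lambda_client.update_alias(
    FunctionName="my-function",                  # hypothetical function name
    Name="live",                                 # the alias your callers invoke
    FunctionVersion="1",                         # the stable version keeps the remaining traffic
    RoutingConfig={
        "AdditionalVersionWeights": {"2": 0.10}  # send 10% of invocations to version 2
    },
)

If the new version misbehaves, rolling back is just another update_alias call that removes the weight.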
To experiment with canary deployment using AWS Lambda, here is a GitHub repository that will get you started.
Stateless application
As mentioned earlier, stateless applications are a prerequisite to auto scaling and immutable infrastructure, since any request can be handled by any available computing resource, regardless of whether that's EC2, a container platform, or a set of Lambda functions.

In a stateless service, the application must treat all client requests independently of prior requests or sessions, and should never store any information on local disks or memory.
Sharing state across the resources within the auto scaling group should be done using in-memory object caching systems such as Memcached, Redis, or EVCache, or distributed databases like Cassandra or DynamoDB, depending on the structure of your objects and your performance requirements.
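As a small sketch of that idea (the endpoint, key names, and TTL are arbitrary), session state can be pushed out of the instance and into Redis so that any instance in the group can serve the next request:

import json
import redis

r = redis.Redis(host="my-cache.example.com", port=6379)  # hypothetical ElastiCache endpoint

def save_session(session_id, data, ttl_seconds=1800):
    # Any instance can write the session...
    r.setex("session:" + session_id, ttl_seconds, json.dumps(data))

def load_session(session_id):
    # ...and any other instance can read it back.
    raw = r.get("session:" + session_id)
    return json.loads(raw) if raw else None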
Putting it all together!
In order to have a resilient, reliable, auditable, repeatable, and well-tested deployment and scaling process with a fast spin-up time, you'll need to combine the ideas of Golden AMI, infrastructure as code, immutable infrastructure and stateless application. And to do that at scale, you'll have to use automation. In fact, it's your only option. So, my last piece of advice for this post is: Automate! Automate! Automate!
In this post, I will focus on cascading failures, or what I like to call The Punisher: that superhero living in the darkness of your architecture, taking revenge on those responsible for not thinking enough about the small details. I have been a victim several times, and The Punisher is bad, real bad!
Avoiding cascading failures
One of the most common triggers for outages is cascading failure, where one part of a system experiences a local failure and takes down the entire system through inter-connections and failure propagation. Often, that propagation follows the butterfly effect, where a seemingly small failure ripples out to produce a much larger one.
A classic example of a cascading failure is overload. This occurs when the traffic load distributed between two clusters shifts abruptly because one of the clusters fails. The sudden peak of traffic overloads the remaining cluster, which in turn fails from resource exhaustion, taking the entire service down.
To avoid such scenarios, here are a few tricks that have served me well:
Backoff algorithms
Typical components in a distributed software system include servers, load balancers, databases, and DNS servers. In operation, and subject to the failures discussed in part 1, any of these can start generating errors.
The default technique for dealing with errors is to implement retries on the requester side. This technique increases the reliability of the application and reduces operational costs for the developer.
However, at scale, if requesters attempt to retry the failed operation as soon as an error occurs, the network can quickly become saturated with new and retried requests, each competing for network bandwidth. This pattern will continue until a full system meltdown occurs.
To avoid such scenarios, backoff algorithms such as the common exponential backoff must be used. Exponential backoff algorithms gradually increase the wait time between retries, reducing the retry rate and thus avoiding network congestion.
In its simplest form, a pseudo exponential backoff algorithm looks like this:
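Here is a minimal Python sketch of the idea (do_request, base, and cap are placeholders):

import time

def call_with_backoff(do_request, max_attempts=5, base=0.1, cap=10.0):
    for attempt in range(max_attempts):
        try:
            return do_request()
        except Exception:  # replace with the transient error type of your client library
            if attempt == max_attempts - 1:
                raise
            # Double the wait on every failed attempt, but never wait more than 'cap' seconds.
            time.sleep(min(cap, base * 2 ** attempt))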

Luckily, many SDKs and software libraries, including those from AWS, implement a version (often more sophisticated) of these algorithms. However, never assume — always verify and test this is the case.
It is important to realize that this backoff alone is not enough. As explained by Marc Brooker in the post Exponential Backoff and Jitter, the wait() function in the exponential backoff should always add jitter to prevent clusters of retry calls. One approach is to modify wait() as follows:

wait = random_between(0, min(cap, base * 2 ** attempt))
Timeouts
To illustrate the importance of timeouts, imagine that you have steady baseline traffic, and that suddenly your database decides to 'slow down' and your INSERT queries take a few seconds to complete. The baseline traffic doesn't change, so more and more request threads hold on to database connections and the pool of connections quickly runs out. Other APIs then start to fail because there are no connections left in the pool to access the database. This is a classic example of cascading failure; if your API had timed out instead of holding on to the database, it could have degraded the service instead of failing.
The importance of thinking about, planning, and implementing timeouts is frequently underestimated. Many frameworks don't expose timeouts in request methods to developers, or even worse, have infinite default values.
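The popular Python requests library is a good example: it will happily wait forever unless you pass an explicit timeout (the URL and values below are arbitrary):

import requests

try:
    # (connect timeout, read timeout) in seconds; without this argument, requests can wait forever.
    response = requests.get("https://api.example.com/orders", timeout=(3, 10))
except requests.exceptions.Timeout:
    # Degrade or fall back instead of letting threads pile up behind a slow dependency.
    response = None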
Idempotent operations
Now that you have implemented retries and timeouts correctly, imagine for a moment that your client sends a message over HTTP to your application and that, due to a transient error, the client receives a timeout. What should you do? Was the request processed by the application before it timed out?
Often with timeouts, the default behavior for the user is to retry, e.g. clicking on the submit button until it succeeds. But what if the initial message had already been received and processed by the application? Now, suppose the request was an INSERT to the database. What happens when the retries are processed?
It’s easy to understand why retry mechanisms aren’t safe for all types of requests: if the retry leads to a second INSERT
of the same data, you’ll end-up with multiple records of the exact same data, which you probably don’t want. Often, this will lead to uniqueness constraint violation, resulting in a DuplicateKey exception thrown by the database.
Now what? Should you READ before or after putting new data in the database to verify or prevent duplication exceptions? Probably not, as this would be expensive in terms of application latency. And which part of the application should handle this complexity?
Most of these scenarios can be avoided if the service implements idempotent operations. An idempotent operation is an operation that can be repeated over and over again without side effects or failure of the application. In other words, idempotent operations always produce the same result.
One option is to use unique traceable identifiers in requests to your application and reject those that have been processed successfully. Keeping track of this information in the cache will help you verify the status of requests and determine whether you should return a canned response, the original response, or simply ignore it.
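A minimal sketch of that idea (the cache client, database client, and request shape are hypothetical):

def handle_request(request, cache, database):
    # The client generates an idempotency key once and resends the same key on every retry.
    key = "request:" + request["idempotency_key"]
    cached_response = cache.get(key)
    if cached_response is not None:
        # Already processed: return the original response instead of inserting twice.
        return cached_response
    response = database.insert(request["payload"])
    cache.set(key, response)
    return response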
Awareness is key, even more so than having a bulletproof system, because it forces developers to think about potential issues at development time, preventing unhandled exceptions before they happen.
Service degradation & fallbacks
Degradation simply means that instead of failing, your application degrades to a lower-quality service. There are two strategies here: 1) Offering a variant of the service which is easier to compute and deliver to the user; or 2) Dropping unimportant traffic.
Continuing with the above example, if your INSERT queries get too slow, you can time out and then fall back to a read-only mode on the database until the issue with the INSERT is fixed. If your application does not support read-only mode, you can return cached content.
Imagine you want to return a personalized list of recommendations to your users. If your recommendation engine fails, maybe it's OK to return generic lists like trending now, popular, or often bought together. These lists can be generated on a daily basis and stored in the cache, and are therefore easy to access and deliver. Netflix does this remarkably well, and at scale.
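A sketch of that fallback logic (the recommender client and function names are made up):

def get_home_screen_rows(user_id, recommender, cache):
    try:
        # Personalized content is best effort...
        return recommender.recommendations_for(user_id, timeout_seconds=0.2)
    except Exception:
        # ...but a generic, pre-computed list from the cache beats an error page.
        return cache.get("trending_now") or []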
“What we weren’t counting on was getting throttled by our database for sending it too much traffic. Fortunately, we implemented custom fallbacks for database selects and so our service dependency layer started automatically tripping the corresponding circuit and invoking our fallback method, which checked a local query cache on database failure.” Source here.
Rejection
Rejection is the final act of ‘self-defense’: you start dropping requests deliberately when the service begins to overload. Initially, you’ll want to drop unimportant requests, but eventually you might have to drop more important ones, too. Dropping requests can happen server side, on load-balancers or reverse-proxies, or even on the client’s side.
Resilience against intermittent and transient errors
An important thing to realize when dealing with large-scale distributed systems in the cloud is that, given their size and complexity, intermittent errors are the norm rather than the exception. This is exactly why we want to build resilient systems.
Intermittent errors can be caused by transient network connectivity issues, request timeouts, or when dependency or external services are overloaded.
The question with intermittent errors is, how should you react? Should you tolerate errors or should you react immediately? This is a big issue since reacting too fast might create risk, add significant operational overhead and incur additional cost.
Take the case of running across multi-regions for example. When do you decide that it’s a good time to initiate a complete failover to the other region?
This is a hard problem to solve. Often, it’s a good idea to use thresholds. Collect statistics about intermittent errors in your baseline, and based on that data, define a threshold that will trigger the reaction to errors. This takes practice to get right.
This doesn’t mean you shouldn’t observe and monitor intermittent errors by the way — quite the opposite. It simply means that you shouldn’t react to every one of them.
An important thing to remember when handling transient errors, is that often, transient exceptions are hidden by outer exceptions of different type. Therefore, make sure you inspect all inner exceptions of any given exception object in a recursive fashion.
Circuit breaking
Circuit breaking is the process of applying circuit breakers to potentially-failing method calls in order to avoid cascading failure. The best implementations of circuit breaking combine the techniques used to prevent cascading failure (backoff, degradation, fallback, timeout, rejection, and intermittent error handling) and wrap them in an easy-to-use object for developers implementing method calls. The circuit breaker pattern works as follows.
A circuit breaker object monitors the number of consecutive failures between a producer and a consumer. If that number passes a failure threshold, the circuit breaker trips, and all further attempts by the producer to invoke the consumer fail immediately or return a defined fallback. After a waiting period, the circuit breaker allows a few requests to pass through. If those requests pass a success threshold, the circuit breaker resumes its normal state. Otherwise, it stays open and continues monitoring the consumer.
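Below is a deliberately simplified Python sketch of the pattern; real implementations such as Hystrix add fallbacks, success thresholds, metrics, and thread isolation:

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # The waiting period has elapsed: let a trial request through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # trip the breaker
            raise
        self.failures = 0                      # success: close the breaker again
        self.opened_at = None
        return result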

Circuit breakers are great since they force developers to apply resilient architecture principles at implementation time. The most famous circuit breaker implementation is, naturally, Hystrix from Netflix.
What are we talking about?
In part 1, I discussed how to build resilient architectures using the principle of redundancy, which is the duplication of components of a system in order to increase the overall availability of that system.
Such a principle implies that you use a load balancer to distribute network or application requests across a number of server instances. In order to determine whether a downstream server instance is capable of handling a request successfully, the load balancer uses health checks.

Health checks gather both local and dependency information in order to determine if a server instance can successfully handle a service request. For example, a server instance might run out of database connections. When this occurs, the load balancer should not route requests to that server instance, but instead, distribute the service traffic to other available server instances.
Health checks are important because if you have a fleet of 10 server instances behind a load balancer, and one of them becomes unhealthy but still receives traffic, your service availability will drop to, at best, 90%.
Continuing the above example, if a new request hits the server instance, it will often fail fast since it can’t get a new database connection opened. And without clear handling of that error, it can easily fool the load balancer into thinking that the server instance is ready to handle more requests. This is what we call a black hole.
Unfortunately, it gets more complicated.
Imagine for a second that you have learned your lesson and implemented a health check on each server instance, testing for connectivity to the database. What happens if you have an intermittent network connectivity issue? Do you take the entire fleet of server instances out of service, or wait until the issues go away?
This is the challenge with health checks — striking the balance between failure detection and reaction — and it’s a topic we’ll explore in this post.
Server instance failure types
There are two types of failures that can impact a server instance: local or host failures, which are on the server instance itself; and dependency failures, which are the failure of critical dependencies necessary to successfully handle a request.
Host or local failures are triggered by different things, for example: hardware malfunctions such as disk failures; OS crashes after upgrades or updates; security limits such as open file limits; memory leaks; and, of course, unexpected application behavior.
Dependency failures are caused by external factors. For example: the loss of connectivity to data stores such as the external distributed caching layer or the database; service broker failures (queues); invalid credentials to object storage S3; a clock not syncing with NTP servers; connectivity issues with third party providers (SMS, notifications, etc.); and other unexpected networking issues.
As noted above, there are plenty of reasons why a server instance might break down. The type of failures you are probing defines the type of health check you are using: a shallow health check probe for host or local failures; and a deep health check probe for dependency failures.
Shallow and Deep health checks
(1) Shallow health checks often manifest themselves as a simple 'ping', telling you only whether a server instance is reachable. They don't tell you much about the logical dependencies of the system or whether a particular API will successfully return from that server instance.
(2) Deep health checks, on the other hand, give you a much better understanding of the health of your application since they can catch issues with dependencies. The problem with deep health checks is that they are expensive: they can take a lot of time to run, incur costs on your dependencies, and be sensitive to false positives when issues occur in the dependencies themselves.
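As an illustration, a pair of health check endpoints might look like this Flask sketch (the dependency checks are placeholders and should each use their own tight timeouts):

from flask import Flask, jsonify

app = Flask(__name__)

def database_is_reachable():
    # Placeholder: e.g. run a 'SELECT 1' with a short timeout.
    return True

def cache_is_reachable():
    # Placeholder: e.g. PING the Redis or Memcached cluster with a short timeout.
    return True

@app.route("/health/shallow")
def shallow():
    # Am I up and able to answer at all?
    return "OK", 200

@app.route("/health/deep")
def deep():
    checks = {"database": database_is_reachable(), "cache": cache_is_reachable()}
    status = 200 if all(checks.values()) else 503
    return jsonify(checks), status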

Act smart, fail open!
A major issue with health checks occurs when server instances make the wrong local decision about their health — and all of the other hosts in the fleet make that same wrong decision simultaneously — causing cascading failures throughout adjacent services.
To prevent such a case, some load balancers have the ability to act smart: when individual server instances within a fleet fail health checks, the load balancer stops sending them traffic, but when all (or a large proportion of) the server instances fail health checks at the same time, the load balancer fails open, letting traffic flow through all of them.

In AWS, there are two options for load balancing between resources using health checks: AWS Elastic Load Balancers (ELB) provide health checks for resources within an AWS region, while Route 53 DNS failover provides health checks across AWS regions. For customers not using an AWS ELB, DNS failover with Weighted Round Robin routing policies in Route 53 can provide the load balancing and failover capability.
It is worth noting that ELBs health check backend EC2 instances, whereas Route 53 can health check any internet-facing endpoint inside or outside of AWS regions, while supporting several routing policies: WRR of course, but also Latency Based Routing (LBR) and Geolocation/proximity Based Routing (GBR), and all of them honor health checking.
Most importantly for our case on resilient architectures, AWS Elastic Load Balancers (both Network and Application) and Route 53 DNS failover support failing open — that is, routing requests to all targets even if no server instances report as healthy.
Such an implementation makes a deep end-to-end health check safer to implement — even one that queries the database and checks to ensure non-critical support processes are running.
Finally, if you use either Route 53 or AWS ELBs, you should always favor deep health checks over shallow ones, or use a combination of them — as together, they can more easily locate where failure happens.
A look at Route 53 health checks
Route 53 supports three different types of health checks: (1) endpoint monitoring, the most common type; (2) calculated health checks, used when you have a combination of different health checks (e.g. shallow and deep); and (3) health checks monitoring CloudWatch alarms.
Route 53 uses health checkers located around the world. Each health checker sends requests to the endpoint that you specify to determine whether the endpoint is healthy: in intervals of 30 seconds, called ‘standard’; or in intervals of 10 seconds, called ‘fast’.
Fun fact: Route 53 health checkers in different locations don't coordinate with one another, so your monitoring system might see several requests per second regardless of the interval you chose. Don't worry, that's normal: randomness at work.
Health checkers in every location evaluate the health of the target based on two values: (1) the response time (hence the importance of the timeouts explained earlier); and (2) a failure threshold, the number of consecutive checks an endpoint must pass or fail before its status changes.
Finally, the Route 53 service aggregates all the data from health checkers in different locations (8 by default) and determines whether the target is healthy.

If more than 18% of health checkers report that an endpoint is healthy, Route 53 considers the target healthy. Conversely, if 18% or fewer health checkers report that an endpoint is healthy, Route 53 considers the target unhealthy.

Why 18%? It’s a value that was chosen to avoid transient errors and local network conditions. AWS reserves the right to change this value in the future 🙂
Fun fact: 18% is not that magic. It comes from the fact that most of the time there are 16 health checkers, spread over 8 regions, probing a specific target. Since we check that the target is reachable from at least 3 of them before considering it healthy, that gives you 3/16 ≈ 18%.
Last but not least, in order for Route 53 DNS failover to happen fast, DNS records should have TTLs of 60 seconds or less. This will minimize the amount of time it takes for traffic to stop being routed to the failed endpoint.
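To make this concrete, here is a hedged boto3 sketch (the domain, hosted zone ID, and IP address are placeholders) that creates a fast-interval health check and attaches it to a primary failover record with a 60-second TTL:

import uuid
import boto3

route53 = boto3.client("route53")

health_check = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "api.example.com",  # placeholder endpoint
        "ResourcePath": "/health",
        "RequestInterval": 10,   # the 'fast' interval
        "FailureThreshold": 3,
    },
)

route53.change_resource_record_sets(
    HostedZoneId="Z123EXAMPLE",                          # placeholder hosted zone
    ChangeBatch={"Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "api.example.com",
            "Type": "A",
            "SetIdentifier": "primary-eu-west-1",
            "Failover": "PRIMARY",
            "TTL": 60,                                   # keep TTLs low so failover happens fast
            "ResourceRecords": [{"Value": "203.0.113.10"}],  # placeholder IP
            "HealthCheckId": health_check["HealthCheck"]["Id"],
        },
    }]},
)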
Demo: Route 53 failing open with failed health checks
To continue the example I presented in my series of blog posts on multi-region architecture, let's take a look at Route 53 failing open.
This is a reminder of the overall architecture of the multi-region backend where Route 53 is being used to serve traffic to each region while supporting failover in case one of the regions experiences issues.

I configured the Route 53 DNS failover health checks using the health API deployed here.

And I did the same for each region. You can see the routing policy between the two regional API endpoints from the API Gateway as well as the target health check.

Finally, I altered my application so that both my regional health checks are returning 400, therefore registering as unhealthy and failing.

However, as you can see, requests are still being routed to the endpoints — Route 53 is successfully failing open.

Considerations when implementing health checks:
- Build a map of critical request path dependencies that are mandatory for each request, and separate operational vs non-operational dependencies. The last thing you want is your load balancing policies to be sensitive to non-critical services (e.g. failing because of the monitoring services).
- When probing multiple dependencies over the network, call them concurrently if possible. This will speed up the process and prevent your deep health check from being timed-out by health checkers.
- Don’t forget to timeout, degrade or even reject dependency requests if necessary, and don’t over-react — that is, don’t break at the first transient error. It is good practice to enforce an upper bound response time for each of the dependencies, and an even better practice to use a circuit breaker for all dependencies methods calls. More info can be found here.
- When dependencies are failing, remember to inspect all inner exceptions of any given error in a recursive fashion since transient exceptions are often hidden by outer exceptions of different types.
- Beware of the “startup” time of your deep health checks. You don’t want non-critical dependencies to prevent your newly launched server instances from being served traffic. This is especially important in situations where your system is trying to recover from outages. One solution is to probe a shallow health check during startup, and once all the dependencies are green, turn that shallow health check into a deep health check.
- Make sure your server instances have resources reserved and prioritized to answer health checkers in time (remember, health checkers have timeouts too). The last thing you want in production is for your fleet of server instances to shrink during a surge of traffic because your server instances failed to answer the health check in time. That could turn into a devastating cascading failure through overload.
- When load testing your server instances, work on understanding your maximum number of concurrent connections per instance, and probe that number while health checking (remember to account for the health checkers' connections too). Often, that maximum concurrent connection count is the limiting factor for scaling.
- Oh, I almost forgot — be sure to secure your health check API endpoints!!!!!! 🙂
As noted in the New York Times article "For Impatient Web Users, an Eye Blink Is Just Too Long to Wait," a user's perception of quality and a good experience is directly correlated to the speed at which content is delivered to them. Speed matters, and four out of five users will click away if loading takes too long. In fact, research shows that a difference of 250ms is enough to give the faster of two competing solutions a competitive advantage.
In 2007, Greg Linden, who previously worked at Amazon, reported that, through A/B testing, he had delayed a retail website's page loading time in increments of 100ms and found that even small delays resulted in substantial and costly drops in revenue: with every 100ms increase in load time, sales dropped one percent. At the scale of Amazon, it's clear that speed matters.
Content providers use caching techniques to get content to users faster. Cached content is served as if it is local to users, improving the delivery of static content.
A cache is a hardware or software component that stores data so future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation, or the duplicate of data stored elsewhere. More at Wikipedia.
While caching is often associated with accelerating content delivery, it is also important from a resiliency standpoint. We’ll explore this in more detail in this post.
Type of Caching
Static and Dynamic Content Caching
Traditionally, it’s been standard practice to consider caching for static content only — content that rarely or never changes throughout the life of the application, for example: images, videos, CSS, JavaScript, and fonts. Content providers across the globe use content delivery networks (CDN) such as Amazon CloudFront to get content to users faster, especially when the content is static. Using a globally distributed network of caching servers, static content is served as if it is local to users, improving delivery. While CloudFront can be used to deliver static content, it can also be used to deliver dynamic, streaming, and interactive content. It uses persistent TCP connections and variable time-to-live (TTL) to accelerate the delivery of content, even if it cannot be cached.
From a resilience standpoint, a CDN can improve the DDoS resilience of your application when you serve web application traffic from edge locations distributed around the world. CloudFront, for example, only accepts well-formed connections, and therefore prevents common DDoS attacks such as SYN floods and UDP reflection attacks from reaching your origin.
Since DDoS attacks are often geographically isolated close to the source, using a CDN can greatly improve your ability to continue serving traffic to end users during larger DDoS attacks. To learn more about DDoS resiliency, check out this white paper.
Another caching technique used to serve static web content is called page caching. This is where rendered output on a given web page is stored in files in order to avoid potentially time-consuming queries to your origin. Page caching is simple, and is an effective way of speeding up content delivery. It is often used by search engines to deliver search queries to end users faster.
From a resiliency point of view, the real benefit of caching applies to caching dynamic content: data that changes fast but is not unique per request. Caching short-lived but not unique-per-request content is useful since, even within a short timeframe, that content might be fetched thousands or even millions of times, which could have a devastating effect on the database.
You can use in-memory caching techniques to speed up dynamic applications. These techniques, as the name suggests, cache data and objects in memory to reduce the number of requests to primary storage sources, often databases. In-memory caching offerings include caching software such as Redis or Memcached, or they can be integrated with the database itself, as is the case with DAX for DynamoDB.
Caching for Resiliency
Here are some benefits of caching related to resiliency:
- Improved application scalability: Data stored in the cache can be retrieved and delivered faster to end-users, thus avoiding a long network request to the database or to external services. Moving some responsibilities for delivering content to the cache allows the core application service (the business logic) to be less utilized and take more incoming requests — and thus scale more easily.
- Reduce load on downstream services: By storing data in the cache, you decrease load on downstream dependency and relieve the database from serving the same frequently-accessed data over and over again, saving downstream resources.
- Improved degradation: Cached data can be used to populate an application's user interface even if a dependency is temporarily unavailable. Take Netflix's home screen (see below): if the recommendation engine fails to deliver personalized content such as "My List" in the image below, Netflix can, and will, always show me something else like "Netflix Originals" or "Trending Now". These last two content collections, like most of the home screen on Netflix, are served directly from the cache. This is called graceful degradation.

- Lower overall solution cost: Using cached data can help reduce overall solution costs, especially for pay-per-request types of services. It also preserves the precious results of expensive and complex operations so that subsequent calls don't have to recompute them.
All the benefits listed above not only contribute to improved performance, but also to improving the resiliency and availability of applications.
However, using caches is not without its challenges. If not well understood, caching may severely impact application resiliency and availability.
Dealing with (In)consistency
Caching dynamic content means that changes performed on the primary storage (e.g. the database) are not immediately reflected in the cache. The result of this lag, or inconsistency between the primary storage and the cache, is referred to as eventual consistency.
To cope with that inconsistency, clients, and especially user interfaces, must be tolerant to stale state. Stale state, or stale data, is data that does not reflect reality. Cached data should always be considered and treated as stale. Determining whether your application can tolerate stale data is the key to caching.
For every application out there, there is an acceptable level of staleness in the data.
For example, if your application is serving a weather forecast to the public on a website, staleness is acceptable, with a disclaimer at the bottom of the page that the forecast might be a few minutes old. But when serving that same forecast as in-flight information to an airline pilot, you want the latest and most accurate information.
Cache Expiration and Eviction
One of the most challenging aspects to designing caching implementations is dealing with data staleness. It’s important to pick the right cache expiration policy and avoid unnecessary cache eviction.
An expired cached object is one that has run out of time-to-live (TTL), and should be removed from the cache. Associating a TTL with cached objects depends on the client requirements, and its tolerance to stale data.
An evicted cached object, however, is one that is removed from the cache because the cache has run out of memory, often because the cache was not sized correctly. Expiration is a regular part of the caching process. Eviction, on the other hand, can cause unnecessary performance problems.
To avoid unnecessary eviction, pick the right cache size for your application and request pattern. In other words, you should have a good understanding of the volume of requests your application needs to support, and the distribution of cached objects across these requests. This information is hard to get right and often requires real production traffic.
Therefore, it is common to first (guess)estimate the cache size, and once in production, emit accurate cache metrics such as cache-hits and cache-misses, expiration vs. eviction counts, and requests volume to downstream services. Once you have that information, you can adjust the cache size to ensure the highest cache-hit ratio, and thus the best cache performance.
Never forget to go back and adjust that initial (guess)estimate: failing to do so can lead to capacity issues, which can in turn lead to unexpected behavior and failures.
Caching Patterns
There are two basic caching patterns, and they govern how an application writes to and reads from the cache: (1) Cache-aside; and (2) Cache-as-SoR (System of Record) — also called inline-cache.
Cache-aside
In a cache-aside pattern, the application treats the cache as a component separate from the SoR (e.g. the database). The application code first queries the cache; if the cache contains the queried piece of data, it is retrieved from the cache, bypassing the database. If the cache does not contain it, the application fetches the data directly from the database, stores it in the cache, and then returns it. In Python, a cache-aside pattern looks something like this:
def get_object(key):
    object = cache.get(key)
    if not object:
        object = database.get(key)
        cache.set(key, object)
    return object
The most common cache-aside systems are Memcached and Redis. Both engines are supported by Amazon ElastiCache, which works as an in-memory data store and cache to support applications requiring sub-millisecond response times.
Inline-cache
In an inline-cache pattern, the application uses the cache as though it were the primary storage, making no distinction in the code between the cache and the database. This pattern delegates reading and writing activities to the cache. Inline-caches often use a read-through and write-through pattern (sometimes, but rarely, write-behind), transparently to the application code. In Python, an inline-cache pattern looks something like this:
def get_object(key):
    return inline_cache.get(key)
In other words, with an inline-cache pattern, you don’t see the interaction with the database — the cache handles that for you.

Inline-cache patterns are easier to maintain since the caching logic resides outside the application and the developer doesn't have to worry about the cache. However, it reduces the observability of the request path and can lead to hard-to-deal-with situations: if the inline-cache becomes unavailable or fails, the client has no way to compensate.
On AWS, you can use standard inline-cache implementations for HTTP caching such as nginx or varnish, or implementation-specific ones such as DAX for DynamoDB.
Considerations on Write-Around Strategy
When you write an item into an inline-cache, the cache will often ensure that the cached item is synchronized with the item as it exists in the database. This pattern is helpful for read-heavy applications. However, if another application needs to write directly to the database table without going through the write-through cache, the item in the cache will no longer be in sync with the database: it will be inconsistent. This will remain the case until the item is evicted from the cache, or until another request updates it through the write-through cache.
If your application needs to write large quantities of data to the database, it might make sense to bypass the cache and write the data directly to the database. Such a write-around strategy reduces write latency; after all, the cache introduces intermediate nodes and thus increases latency. If you decide to use a write-around strategy, be consistent and don't mix write-through with write-around, or the cached item will become out of sync with the data in the database. Also, remember that an inline-cache only populates the cache at read time. That way, you ensure that the most frequently-read data is cached, not the most frequently-written data.
Final Considerations on Resiliency
Soft and hard-TTL
While stale content eventually needs to expire, from a resilience point of view it should also be served when the origin server is unavailable, even if the TTL has expired. This provides resiliency, high availability, and graceful degradation at times of peak load or origin failure. To cope with this requirement, some caching frameworks support a soft-TTL and a hard-TTL: cached content is refreshed once the soft-TTL expires, but if refreshing fails, the caching service continues serving the stale content until the hard-TTL expires.
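A rough sketch of the soft/hard-TTL idea (the cache layout and the refresh function are hypothetical):

import time

def get_with_soft_ttl(key, cache, refresh, soft_ttl=60, hard_ttl=3600):
    entry = cache.get(key)  # entry is expected to look like {"value": ..., "stored_at": ...}
    now = time.time()
    if entry and now - entry["stored_at"] < soft_ttl:
        return entry["value"]                   # fresh enough, serve as-is
    try:
        value = refresh(key)                    # try to fetch from the origin
        cache.set(key, {"value": value, "stored_at": now})
        return value
    except Exception:
        if entry and now - entry["stored_at"] < hard_ttl:
            return entry["value"]               # origin is down: serve stale until the hard-TTL
        raise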
Request coalescing
Another consideration is when thousands of clients request the same piece of data at almost the same time, and that request is a cache miss. This can cause a large number of simultaneous requests to hit the downstream storage, which can lead to exhaustion. These situations often occur at startup, at restart, or during fast scale-ups. To avoid such scenarios, some implementations support request coalescing, or "waiting rooms", where simultaneous cache misses collapse into a single request to the downstream storage. DAX for DynamoDB supports request coalescing automatically (within each node of the distributed caching cluster fronting DynamoDB), while Varnish and Nginx have configurations to enable it.
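Here is a toy illustration of request coalescing within a single Python process (real caches do this per node, across all of that node's clients):

import threading

_in_flight = {}   # key -> Event signalling that a fetch for this key is already in progress
_results = {}
_lock = threading.Lock()

def coalesced_get(key, fetch_from_origin):
    with _lock:
        event = _in_flight.get(key)
        if event is None:
            # The first caller for this key becomes the leader and does the actual fetch.
            event = threading.Event()
            _in_flight[key] = event
            leader = True
        else:
            leader = False
    if leader:
        try:
            _results[key] = fetch_from_origin(key)  # a single request to the origin
        finally:
            event.set()
            with _lock:
                del _in_flight[key]
    else:
        event.wait()                                # followers wait for the leader's result
    return _results.get(key)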
Final thoughts
When deploying applications in the cloud, you will likely use a combination of all the different types of caches and techniques mentioned in this article. There is no silver bullet, so you will have to work with all the people involved in designing, building, deploying and maintaining the application, from the UI designer to the backend developer. Everyone should be consulted so that you know the what, the when, and the where of caching.

Source: https://medium.com/@adhorn/patterns-for-resilient-architecture-part-1-d3b60cd8d2b6