Architecting for High Availability in the Cloud

July 22nd, 2014

An introduction to multi-cloud distributed application architecture

In this blog, we’ll explore how to architect a highly available (HA) distributed application in the cloud. For those new to the concept, high availability refers to the availability of the application cluster as a whole, including its ability to fail over or scale out horizontally to meet demand. Examples of applications that benefit from HA architectures include database applications, file-sharing networks, social applications, health-monitoring applications, and eCommerce websites. So, where do you start? The easiest way to understand the concepts is simply to walk through the 3 steps of a web application setup in the cloud.

Step 1: Setting up a distributed, fault-tolerant web application architecture

In general, the application architecture can be pretty simple: perhaps just a load-balanced web front end running on multiple servers and maybe a NoSQL database like Cassandra. When you’re developing, you can get away with a single server, but once you move into production you’ll want to snapshot your web front end and spread the application across multiple servers. This approach lets you balance traffic and scale out the web front end as needed. In GoGrid, you can do this for free using our Dynamic Load Balancers. Point and click to provision the servers as needed, and then point the load balancer(s) to those servers. The process is simple, so setting up a load-balanced web front end should only take a few minutes. Any data captured or used by the servers will of course be stored in the Cassandra cluster, which is already designed to be HA.
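Conceptually, the load balancer simply rotates incoming requests across the web front-end pool, skipping any server that fails its health check. Here's a minimal sketch of that round-robin behavior (the server addresses and the `make_round_robin` helper are illustrative, not part of any provider's API):

```python
from itertools import cycle

# Hypothetical pool of web front-end servers behind the load balancer.
WEB_SERVERS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

def make_round_robin(servers):
    """Return a function that yields the next healthy server in rotation."""
    pool = cycle(servers)

    def next_server(healthy=lambda s: True):
        # Skip unhealthy servers; give up after one full rotation.
        for _ in range(len(servers)):
            server = next(pool)
            if healthy(server):
                return server
        raise RuntimeError("no healthy servers in pool")

    return next_server

next_server = make_round_robin(WEB_SERVERS)
print([next_server() for _ in range(4)])
# → ['10.0.0.11', '10.0.0.12', '10.0.0.13', '10.0.0.11']
```

Because the front end is stateless (all data lives in the database cluster), adding capacity is just a matter of adding another server to the pool.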


Deploying the Cassandra cluster. In GoGrid, you can use our 1-Button Deploy™ technology to set up the Cassandra cluster in about 10 minutes. Cassandra is built to be HA, so if one server fails, the load is distributed across the remaining nodes and your application isn’t impacted. Below is a sample Cassandra cluster. A minimal deployment has 3 nodes to ensure HA, and the cluster is connected via the private VLAN. It’s a good idea to firewall the database servers and eliminate connectivity to the public VLAN; with our production 1-Button Deploy™ solution, the cluster includes an on-demand firewall for free. In another blog post I’ll discuss how to secure the entire environment: setting up firewalls around your database and web application, as well as working with IDS and IPS monitoring tools and DDoS mitigation services. For the moment, however, your database and web application clusters would look something like this:


At this point you essentially have an HA application. And of course you don’t have to use Cassandra. You could also use a common SQL database and Block Storage to create a shared storage environment that scales. I chose Cassandra as an example because it’s considered one of the best HA database architectures on the planet. In short, with this design, any one of the servers or nodes can fail and the application will continue to be accessible. But what happens if the data center, availability zone, or service provider goes down? Those things can happen, and you should have a plan ready to go.
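The "3 nodes minimum" guideline follows from Cassandra's quorum arithmetic: with a replication factor of 3, a QUORUM read or write needs responses from 2 replicas, so exactly one node can be down without impacting the application. A quick sketch of that math:

```python
def quorum(replication_factor):
    """Minimum replicas that must respond for a QUORUM read or write."""
    return replication_factor // 2 + 1

def tolerated_failures(replication_factor):
    """Replicas that can be down while QUORUM operations still succeed."""
    return replication_factor - quorum(replication_factor)

for rf in (1, 2, 3, 5):
    print(f"RF={rf}: quorum={quorum(rf)}, tolerates {tolerated_failures(rf)} down")
# With RF=3, quorum is 2, so one node can fail and the app keeps running.
# Note that RF=2 tolerates zero failures -- hence 3 nodes as the HA minimum.
```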

Step 2: Planning for Multi-Cloud

Planning for multi-cloud entails 3 components:

1. Creating a multi-cloud backup strategy

2. Standardizing architectures and deployment tools

3. Designing for failover scenarios across providers

Let’s start with backup. At a minimum, anyone running applications in the cloud should plan for multi-cloud backup. Employing a multi-cloud backup strategy simply means not putting all your eggs in one basket. Take the time to create relationships with multiple providers and store your backups on multiple clouds. Even if you don’t distribute your workloads across multiple providers, distributing your backups and archives across them ensures you can recover from a catastrophe.
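The "multiple baskets" idea can be expressed in a few lines: push every backup to more than one destination, and never let a failed copy to one provider abort the others. In the sketch below, local directories stand in for buckets at different providers; in a real setup each copy would be an upload through that provider's storage API:

```python
import shutil
from pathlib import Path

def distribute_backup(archive, destinations):
    """Copy one backup archive to every destination; report per-copy results.

    `destinations` stand in for storage buckets at different cloud
    providers -- in practice each copy would be an API upload.
    """
    results = {}
    for dest in destinations:
        try:
            dest = Path(dest)
            dest.mkdir(parents=True, exist_ok=True)
            shutil.copy2(archive, dest / Path(archive).name)
            results[str(dest)] = "ok"
        except OSError as exc:
            # A failure at one provider must not block the others.
            results[str(dest)] = f"failed: {exc}"
    return results
```

The key design point is isolation: each destination is attempted independently, so losing one provider still leaves you with a recoverable copy elsewhere.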

One thing that makes it easier to employ a multi-cloud strategy is to leverage standard architectures. This approach is often overlooked, but just imagine if you’re working with backup tools and a set of server images that aren’t consistent across providers. Recovering from an incident instantly becomes much harder. For example, proprietary services like Kinesis with Hadoop and EMR on Amazon Web Services (AWS) aren’t things you can spin up at another provider in the event you need to do so. An alternative would be to set up Cassandra with Hadoop on AWS and also set up Cassandra with Hadoop on GoGrid using the same out-of-the-box application technology. This way, when you distribute your backups and recover, you don’t have to re-invent the wheel for each cloud provider with which you’re engaged.

Last, let’s look at designing for failover scenarios. There are a few reasons you may want to direct traffic to alternate locations: perhaps you need to optimize performance, deliver custom content to a specific region, or fail over. At some point, you’ll need Geographic Load Balancing and failover services. Geographic Load Balancing lets you direct traffic for your websites to the servers or data centers closest to visitors based on their geographic location. This approach provides shorter load times because visitors’ requests are routed to the closest server or data center. A Spanish website visitor would be sent to a Washington, D.C. (Ashburn) data center rather than to a data center in San Francisco, for example. You can also provide custom content to site visitors: in the same example, a San Francisco visitor could receive content in English and a Spanish visitor could receive content in Spanish.
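At its core, geographic load balancing is a lookup from the visitor's region to the nearest data center. A toy sketch of the routing decision (the region-to-data-center mapping is purely illustrative, not an actual routing table):

```python
# Hypothetical mapping of visitor regions to the nearest data center.
NEAREST_DC = {
    "us-west": "sfo",   # San Francisco
    "us-east": "iad",   # Ashburn (Washington, D.C.)
    "europe":  "iad",   # Ashburn is the closer choice for European visitors
}

def route(region, default="sfo"):
    """Pick the data center closest to the visitor's region."""
    return NEAREST_DC.get(region, default)

print(route("europe"))   # → iad  (a Spanish visitor lands in Ashburn)
print(route("us-west"))  # → sfo
```

In production this decision happens at the DNS layer, resolving the same hostname to different addresses depending on where the query originates.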

In the event you need to fail over, adding failover capabilities to Geographic Load Balancing provides a mechanism for continuous availability in the case of a cluster or data center failure. You can designate a cluster as a “failover” cluster, for example, where the primary cluster responding to web traffic is in a San Francisco data center and the failover cluster is in the Ashburn data center. Should the cluster in San Francisco become unresponsive, the failover service will detect this event and automatically route traffic to the failover cluster. What’s the difference between this service and our Dynamic Load Balancing services? Dynamic Load Balancing spreads load across a cluster as needed, while Geographic Load Balancing and failover distribute load across data centers, clusters, regions, or service providers. Each Dynamic Load Balancer can handle 10,000 concurrent connections at a time; if you need to scale beyond that point to handle larger amounts of traffic, you can put Geographic Load Balancing in front of multiple Dynamic Load Balancers.
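The failover logic itself is simple: health-check the primary cluster and redirect traffic when it stops responding. A minimal sketch with a simulated health check (the cluster names and `is_healthy` callback are hypothetical stand-ins for a real monitoring probe):

```python
def pick_cluster(primary, failover, is_healthy):
    """Route traffic to the primary cluster unless its health check fails."""
    if is_healthy(primary):
        return primary
    # Primary unresponsive: the failover service redirects traffic.
    return failover

# Simulated health check: pretend the San Francisco cluster is down.
down = {"sfo-cluster"}
healthy = lambda cluster: cluster not in down

print(pick_cluster("sfo-cluster", "iad-cluster", healthy))  # → iad-cluster
```

Real failover services also debounce the check (requiring several consecutive failures) so a single dropped probe doesn't trigger a spurious switch.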

Step 3: Designing to Scale Out

This is the area where the true value of a cloud provider comes into play. As your website becomes more popular, there’s a tendency to want to scale up the boxes running the databases and applications. The problem with this approach is that you’ll eventually hit a limit. The nice thing about working with a Cassandra cluster, on the other hand, is that the cluster can scale out within a data center, and you can even distribute clusters across multiple data centers if you use our Cloud Link product. Pair that with the ability to distribute loads using the load-balancing techniques described above and you can now scale out your application as needed, either within a data center or across multiple data centers, zones, and even cloud providers. Plus, if you decide to leverage standard application architectures, you can also do things like orchestrate your deployments across data centers and zones, so you can failover and spin up an entire environment in just a few minutes. (Look for more on orchestrating deployments in a future blog post.)
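The advantage of scaling out is that capacity becomes a head count rather than a ceiling: to serve more traffic, you add nodes instead of buying a bigger box. A back-of-the-envelope sizing sketch (the per-node capacity and headroom figures are made up for illustration; real values come from load-testing your own application):

```python
import math

def nodes_needed(requests_per_sec, capacity_per_node=500, headroom=0.25):
    """How many nodes to provision so each one runs below capacity.

    Reserving `headroom` keeps each node under full load, so the cluster
    absorbs a node failure or a traffic spike without degrading.
    """
    usable = capacity_per_node * (1 - headroom)
    return max(1, math.ceil(requests_per_sec / usable))

for load in (300, 1500, 9000):
    print(f"{load} req/s → {nodes_needed(load)} node(s)")
```

With scale-up, the 9,000 req/s case would require a machine 30× the size of the smallest one; with scale-out, it's the same machine provisioned more times, optionally spread across data centers.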

Below is an example of what we’ve built. As the figure shows, you can route traffic as needed to multiple locations. Leveraging cloud infrastructure, you can scale out nodes, servers, databases, load balancers, firewalls, and storage in multiple locations, and using Managed Monitoring and Managed Backup services, you can ensure your operation is healthy and can recover easily in the event of disaster.


And that’s it. In a few simple steps, you can ensure your architecture is HA. Just keep in mind that you need to build 3 things:

1. A distributed, fault-tolerant application architecture

2. A multi-cloud strategy for disaster recovery and failover

3. A design that enables the ability to scale out across data centers and providers

By leveraging these 3 techniques, you’ll be able to architect your application to be as highly available as you need.


Kole Hicks

Senior Director of Product Management at GoGrid
Kole Hicks is the Senior Director of Product Management for GoGrid, the leader in Open Data Services (ODS), committed to delivering purpose-built, non-opinionated Big Data solutions and services for the management and integration of open source, commercial, and proprietary technologies across multiple platforms.
