 

You Don’t Need a Superstorm: Disaster Recovery Basics

November 12th, 2012 by Scott Pankonin

In this blog post, I’m going to discuss disaster recovery. In the wake of Superstorm Sandy, parts of the East Coast were without power for weeks after the storm, and data centers were affected as well. Although GoGrid’s East Coast data center didn’t experience an outage, some providers did. So now is a good time to consider geographically redundant solutions rather than wait for the next superstorm.

Geographic Redundancy

There are three basic strategies you can implement today on GoGrid to make your application better able to recover from a data center outage: cold standby, warm standby, and full geographic redundancy with multiple active data centers. Let’s start off with a definition:

Redundancy: (noun) the ability of an application or system to resist the failure of one or more constituent parts, or recover quickly from such failure.

Systems administration and IT management boil down to that proverbial 3:00 a.m. phone call. Your application is down. How do you respond? Having the proper plan and appropriate recovery assets in place is the key to surviving this all-too-real scenario. How current are your backups? Do you have standby servers already in place? If not, how quickly can you bring new ones online?

It’s pretty standard to have offsite backups. If those backups are stored in a secondary data center, they can serve as a springboard for reconstituting your application. GoGrid offers two products that make this process easy to implement:

  • CloudLink is a redundant, dark-fiber connection (separate from the public Internet) between GoGrid’s US data centers. It’s available on a flat monthly subscription basis for a given amount of bandwidth, starting at 10 Mbps and scaling all the way up to 1 Gbps.
  • Cloud Storage is redundant, scalable NAS-based storage available in all three GoGrid data centers.

With these building blocks, and with the goal of having a plan and recovery assets in place for a smooth recovery from an outage, let’s look at the three basic disaster recovery solutions. They’re defined by the state of the data stored in the secondary site: cold, warm, or hot.

Cold Standby

Cold standby is the most basic form of geo-redundancy. It involves executing a backup strategy appropriate to your application (weekly full backups plus daily “diffs,” or differential backups, for example), then copying those backups across CloudLink to the secondary data center and storing them in Cloud Storage. The strategy is called “cold” standby because the data exists only in backup form and must be restored to bring the database online. You need a small Cloud Server in the secondary data center to create the link to the remote Cloud Storage. That server can also host a simple web page with “Under Maintenance” messaging to display while you bring your application back online. Finally, use GoGrid Server Image (GSI) functionality to create “gold masters” for each server type in your application. These GSIs are also stored in Cloud Storage, and you can use them to templatize the deployment of Cloud Servers.
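As a concrete illustration, here’s a minimal Python sketch of the nightly “ship the latest backup offsite” step. Everything specific in it is an assumption: the local backup directory, the Cloud Storage mount point, and the archive naming are placeholders, and the backups themselves are assumed to come from whatever backup tool you already run.

    #!/usr/bin/env python3
    """Ship the most recent backup archive across CloudLink to Cloud Storage
    mounted from the secondary data center. Paths are illustrative placeholders."""

    import datetime
    import shutil
    from pathlib import Path

    BACKUP_SRC = Path("/var/backups/app")                  # output dir of your backup tool (assumed)
    REMOTE_MOUNT = Path("/mnt/cloudstorage-east/backups")  # Cloud Storage mount in the secondary DC (assumed)

    def latest_backup() -> Path:
        """Pick the newest archive the backup tool produced (full or diff)."""
        archives = sorted(BACKUP_SRC.glob("*.tar.gz"), key=lambda p: p.stat().st_mtime)
        if not archives:
            raise FileNotFoundError(f"no backup archives found in {BACKUP_SRC}")
        return archives[-1]

    def ship_offsite(archive: Path) -> None:
        """Copy the archive to the remote mount, then verify the copied size."""
        REMOTE_MOUNT.mkdir(parents=True, exist_ok=True)
        dest = REMOTE_MOUNT / archive.name
        shutil.copy2(archive, dest)
        if dest.stat().st_size != archive.stat().st_size:
            raise IOError(f"size mismatch after copying {archive.name}")

    if __name__ == "__main__":
        archive = latest_backup()
        print(f"{datetime.datetime.now().isoformat()} shipping {archive.name}")
        ship_offsite(archive)

Run from cron on the database server (or a small dedicated backup host), a script along these lines means a restorable copy of your data is already sitting in the secondary data center before anything goes wrong.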


Figure 1: Cold Standby – Primary environment in West data center; backups shipped via CloudLink to East data center.

To execute the failover, change your application’s public DNS to point to the secondary data center. The DNS change will take some time to propagate, but as end users pick up the updated record, they’ll see the “Under Maintenance” page rather than an error. In the meantime, you’re spinning up a database server and application servers from your GSIs, restoring the data, and reconstituting your environment.
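During the cutover it helps to know when the DNS change has actually reached resolvers. Here’s a small sketch, using only the Python standard library, that polls until the public hostname resolves to the secondary environment; the hostname and IP are placeholders, and the result only reflects whichever resolver the machine running the script happens to use.

    #!/usr/bin/env python3
    """Poll DNS until the application's public name points at the secondary
    data center. Hostname and IP below are placeholders, not real addresses."""

    import socket
    import time

    APP_HOSTNAME = "app.example.com"   # your application's public name (placeholder)
    SECONDARY_IP = "203.0.113.10"      # public VIP of the standby environment (placeholder)
    POLL_SECONDS = 60

    def resolves_to_secondary() -> bool:
        """True once this host's resolver returns the secondary data center's IP."""
        try:
            _, _, addresses = socket.gethostbyname_ex(APP_HOSTNAME)
        except socket.gaierror:
            return False
        return SECONDARY_IP in addresses

    if __name__ == "__main__":
        while not resolves_to_secondary():
            print(f"{APP_HOSTNAME} not yet pointing at {SECONDARY_IP}; retrying in {POLL_SECONDS}s")
            time.sleep(POLL_SECONDS)
        print("DNS now resolves to the secondary data center.")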


Figure 2: Recovery in Cold Standby – Spin up servers from GSIs and restore the database from backup.

The value proposition here is that recovery from a catastrophic outage can occur within a single day for most of your customers. The downside is that there will be some downtime, though it’s mitigated by the branded “Under Maintenance” messaging you previously set up. Depending on the complexity of the application, the volume of data to be restored, and DNS propagation time, application downtime could stretch to 12–48 hours. That’s still far less time than reconstituting from traditional offsite backups in a totally new data center. And if you had no offsite backups at all, it might take weeks to get back online, and you might never fully recover from the outage. The biggest “gotcha” in this scenario is data loss: the database can only be restored to the point of the last backup, so any data captured between that backup and the outage will be lost if the original data in the primary data center is unrecoverable. Despite these limitations, cold standby is a viable, entry-level disaster recovery strategy.

Warm Standby

Warm standby is a substantially better, but only slightly more advanced, form of geo-redundancy. In this scenario, you synchronize live data to a standby database server in the secondary data center via replication, log shipping, or database mirroring. It’s called “warm” standby because the data on the standby is warm: current and ready to go. You still need to save GSIs to Cloud Storage for the application servers that would need to be spun up, and it’s also a good idea to have simple “Under Maintenance” messaging in place and ready to display while you bring your application back online.
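How you keep the standby warm depends entirely on your database engine, so treat the following as a sketch under stated assumptions: it assumes MySQL asynchronous replication and the PyMySQL driver, and it simply checks how far the standby has fallen behind the primary. The host, credentials, and alert threshold are placeholders; a log-shipping or mirroring setup would need its own equivalent of this check.

    #!/usr/bin/env python3
    """Replication-lag check for a warm-standby MySQL database in the secondary
    data center. Connection details are placeholders; adapt for your engine."""

    from typing import Optional

    import pymysql

    STANDBY = dict(
        host="db-standby.east.example.com",   # standby DB server in the secondary DC (placeholder)
        user="monitor",
        password="secret",
        cursorclass=pymysql.cursors.DictCursor,
    )
    MAX_LAG_SECONDS = 300   # how large a data-loss window you're willing to tolerate (assumed)

    def standby_lag_seconds() -> Optional[int]:
        """Return replication lag in seconds, or None if replication is broken."""
        conn = pymysql.connect(**STANDBY)
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW SLAVE STATUS")
                status = cur.fetchone()
        finally:
            conn.close()
        if not status or status["Slave_IO_Running"] != "Yes":
            return None
        return status["Seconds_Behind_Master"]

    if __name__ == "__main__":
        lag = standby_lag_seconds()
        if lag is None or lag > MAX_LAG_SECONDS:
            print(f"ALERT: warm standby unhealthy (lag={lag}); potential data-loss window is growing")
        else:
            print(f"standby healthy, {lag} seconds behind the primary")

Run on a schedule, a check like this tells you before an outage whether the data in the secondary site is actually current enough to fail over to.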


Figure 3: Warm Standby – Primary environment in West data center; live data synchronized to standby database server in East via CloudLink.

To execute the failover, change your application’s public DNS to point to the secondary data center. The DNS change will take some time to propagate, but as end users pick up the updated record, they’ll see the “Under Maintenance” page (if you created one) rather than an error in their browser. In the meantime, you’re spinning up application servers from your GSIs and reconstituting your environment. Spinning up only application servers is much faster than provisioning a database server and restoring data as well, so your recovery time is correspondingly shorter. If your public DNS provider supports it, you can even configure automatic failover on loss of connectivity to your primary data center. In that case, rather than simple “Under Maintenance” messaging, you could keep a portion of the application environment running in the secondary data center. It would deliver your application in an over-subscribed state on a short-term basis until additional application servers, spun up from GSIs, bring it to the point where it can comfortably handle the full application load.
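A DNS provider that offers automatic failover runs health checks like this for you, but if you’re scripting a manual runbook instead, the moving parts look roughly like the sketch below. The health URL, the thresholds, and especially switch_dns_to_secondary() are hypothetical stand-ins; that last function is where you’d call your DNS provider’s API or page the on-call engineer.

    #!/usr/bin/env python3
    """Health-check loop that triggers a failover decision after repeated failures
    to reach the primary data center. All names and thresholds are placeholders."""

    import time
    import urllib.request

    HEALTH_URL = "https://app.example.com/healthz"   # lightweight endpoint in the primary DC (placeholder)
    FAILURES_BEFORE_FAILOVER = 3
    CHECK_INTERVAL_SECONDS = 30

    def primary_is_healthy() -> bool:
        """True if the primary environment answers its health endpoint with HTTP 200."""
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=10) as resp:
                return resp.status == 200
        except OSError:   # covers URLError, HTTPError, timeouts, refused connections
            return False

    def switch_dns_to_secondary() -> None:
        """Hypothetical hook: call your DNS provider's update API or alert the on-call engineer."""
        print("primary unreachable -- initiating DNS failover to the secondary data center")

    if __name__ == "__main__":
        consecutive_failures = 0
        while True:
            if primary_is_healthy():
                consecutive_failures = 0
            else:
                consecutive_failures += 1
                if consecutive_failures >= FAILURES_BEFORE_FAILOVER:
                    switch_dns_to_secondary()
                    break
            time.sleep(CHECK_INTERVAL_SECONDS)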


Figure 4: Recovery in Warm Standby – Spin up application servers from GSIs; database is already in place.

The value proposition here is that recovery from a catastrophic outage can occur in as little as 1–2 hours, with minimal data loss. With DNS failover and a partial application environment in the standby data center, recovery could take only minutes. The downside is that there would still be downtime, but far less than in the cold standby model, even with a manual DNS switch-over. The cost to implement warm standby is greater, in both dollars and engineering resources, but not excessively so. As discussed, there are different levels of warm standby, so there is a tradeoff between cost/complexity and recovery time.

Full Geo-Redundancy (Hot Secondary)

The gold standard for geographically redundant disaster recovery is an active/active data center deployment. With DNS load balancing, both application environments serve end users simultaneously, and users are routed to the nearest available data center, which should also improve application performance. Both databases are active and accepting application data simultaneously, so they must be synchronized via master-master replication to keep each environment aware of the other’s data changes.
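One practical wrinkle with master-master replication is avoiding primary-key collisions when both sites insert rows at the same time. A common precaution (MySQL exposes it as the auto_increment_offset and auto_increment_increment settings) is to give each site its own interleaved key sequence. The toy Python below only illustrates the numbering scheme; it isn’t tied to any particular engine, and the site names are placeholders.

    #!/usr/bin/env python3
    """Demonstrate interleaved primary-key sequences so two active masters
    never hand out the same ID. Site names and counts are illustrative."""

    DATA_CENTERS = {"west": 1, "east": 2}   # per-site offset (1-based)
    INCREMENT = len(DATA_CENTERS)           # step size = number of active masters

    def id_sequence(site: str):
        """Yield the primary keys the given site's master would assign."""
        next_id = DATA_CENTERS[site]
        while True:
            yield next_id
            next_id += INCREMENT

    if __name__ == "__main__":
        west, east = id_sequence("west"), id_sequence("east")
        print("west assigns:", [next(west) for _ in range(5)])   # [1, 3, 5, 7, 9]
        print("east assigns:", [next(east) for _ in range(5)])   # [2, 4, 6, 8, 10]

Because the two sequences never overlap, rows created in either data center replicate to the other without key conflicts.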


Figure 5: Full geographic redundancy – Active/active data centers with public traffic DNS-balanced between them and live data being synchronized on the back end bi-directionally via CloudLink.

In this scenario, “gold master” GSIs are stored in Cloud Storage in both data centers. If one data center goes offline, you can use the GSIs to spin up additional capacity in the remaining data center to serve its increased application load. Failover is automatic: the DNS provider’s “keep-alive” monitors detect when a data center has gone offline, and its DNS servers stop sending traffic to it.


Figure 6: Automatic recovery in full geographic-redundancy – DNS detects outage at West data center and directs traffic to East; spin up additional application servers from GSIs, as needed.

The benefits of this scenario are obvious: the highest possible availability and automatic recovery from an outage with little or no downtime. The downside is that a solution with multiple active data centers costs more than the two standby strategies discussed previously (cold and warm), and it’s more demanding of engineering resources to implement. The tradeoff is definitely worthwhile, however: greater levels of availability and redundancy are simply going to cost more.

Conclusion

With the appropriate plan and recovery assets in place, you can smoothly recover from an outage and minimize, or even eliminate, downtime. This blog post outlined three strategies, from entry-level to gold standard, for implementing a geographically redundant disaster-recovery solution. GoGrid provides the tools to add geographic redundancy: CloudLink links its two US data centers via redundant dark-fiber connections, and Cloud Storage provides a secure, scalable repository for recovery assets. GoGrid customer Martini Media implemented a disaster recovery solution as part of its Big Data implementation; you can read more about this success story in their case study. Lastly, for an estimate of how much your coast-to-coast disaster recovery might cost, contact your sales account manager today. Just mention this blog post and get a 10 Mbps CloudLink free for three months!


Scott Pankonin

Solutions Architect at GoGrid
Scott has spent 20+ years in information technology, with a strong focus on architecture, software development, and operations. Currently, he devotes this expertise to collaborating with customers to develop solutions on the GoGrid IaaS platform.

2 Responses to “You Don’t Need a Superstorm: Disaster Recovery Basics”

  1. Jakob Bohm says:

    A few comments about this:

    1. I wouldn’t use the term “dark fiber” for GoGrid CloudLink, as each customer does not rent his own physical optical fiber. Like everything else at GoGrid, it is virtual, and the right term is either VPN or “private WAN partition” (depending on what GoGrid uses as a transport for the combined CloudLinks of multiple customers).

    2. A supplemental article on how to get various popular database engines to play nice with the hot-secondary approach, given the longer packet delay caused by physical distance and the speed of light, would be welcome. I have seen at least one high-end database engine choke on even small transmission delays between active masters.

    3. A supplemental article on how to prevent the MyGSI system from attempting to “sysprep” the gold master in ways that break the application setup would also be useful.

  2. It's all about being prepared. Waiting for something to go wrong with a system before taking action is not the way to go about things. If you already have measures in place, at the very least, the downtime won't be anywhere near as long.
