Learn How a GoGrid Customer Created a Multiple Data Center Routing & Failover Infrastructure Environment

March 22nd, 2011

We absolutely LOVE hearing how GoGrid customers are using our cloud solutions to create unique “cloud fingerprints” and environments using the features and data centers of GoGrid. Paul Trippett just published a very interesting write-up of an infrastructure environment that addresses many of the common concerns facing any company looking to provide a highly-redundant infrastructure while also ensuring a solid Service Level Agreement (SLA) for their customers.

You can find Paul’s original write-up titled “Utilizing GoGrid’s Multiple Data Centers for Routing and Failover” on his site. With his permission, we have reposted the article so that others can learn, mimic and build upon his unique scenario.

At the beginning of the year one of our customers asked us if we could provide an SLA for StormRETS, and with it the sound of gritting teeth suddenly echoed around the room. As you can imagine, this raised more questions than we actually had answers for:

What kind of SLA did we want to provide and what could we realistically provide?

Our hosting provider at the time had an SLA that amounted to “We don’t give any guarantee that your servers will be available, but if for any reason they are unavailable we will get them back up and running as soon as we can.” Erm, how on earth can we build an SLA on that? We decided to migrate our servers to another hosting provider, one with an SLA we could build on and a company we could actually contact directly should a problem arise.

Any migrations we did needed to address the following issues:

  • System Monitoring — Constant monitoring for potential problems.
  • Network Downtime — If it’s a localized network issue, secondary DR fail-over site.
  • Server Downtime — Multiple Servers or Secondary DR fail-over site.
  • Software foul ups — Unit Testing and run the system on multiple servers.
  • Server Software Updates — Multiple Servers.
  • Maintenance Schedules — Multiple Servers.
  • High Traffic Spikes — Multiple Servers.
  • Botched Deploys — DR fail-over.
  • Drunk SysAdmins

We decided it would be a good idea to set up a second data center in case the primary failed. But what’s the point of having all this duplicated capacity sitting unused? It makes sense to put it to good use by directing traffic to the nearest data center and failing all traffic over in case of a major outage.

DNS Failover and Load Balancing
The problem with this plan is that DNS fail-over or geo-aware DNS was extremely expensive. We just couldn’t justify spending more than $200 a month on something we could set up ourselves on a few VPS boxes scattered around the globe for $50 a month. Anycast DNS is severely overrated; it makes sense, yes, but not at the prices being asked. Sometimes answers come from strange places. While doing some whois searches on start-up companies that we knew would be looking at this same problem, we found Zerigo, who have recently started offering geo-aware DNS at prices starting at $20 per month. After running some tests, their response times aren’t too shabby either!
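The routing behaviour we wanted from a geo-aware DNS service can be sketched in a few lines of Python. This is only an illustration of the idea (nearest healthy data center wins, everything fails over if one site is down), not how Zerigo implements it. The data center names, the coordinates and the `pick_data_center` helper are all assumptions made up for the example.

```python
import math

# Hypothetical data centers: names and coordinates are illustrative only.
DATA_CENTERS = {
    "west": (37.77, -122.42),
    "east": (39.04, -77.49),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def pick_data_center(client, healthy):
    """Return the nearest healthy data center, or None if all are down."""
    candidates = [name for name in DATA_CENTERS if healthy.get(name, False)]
    if not candidates:
        return None
    return min(candidates,
               key=lambda name: haversine_km(client, DATA_CENTERS[name]))
```

A client near New York, for example `pick_data_center((40.71, -74.01), {"west": True, "east": True})`, would be sent to the hypothetical "east" site, and to "west" as soon as "east" is marked unhealthy.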

Cloud Hosting
There are more than enough blog posts about choosing a cloud provider. We looked at the more common providers, including Amazon, Rackspace and GoGrid, and in the end we decided on GoGrid. GoGrid offer a really good SLA, they have telephone support and multiple data centers ready for you to use.

With every GoGrid account come two /28 blocks of routable IP addresses, one in each data center. This is an awesome feature. Usually when you create a new VPS you are assigned a random IP address from a huge pool, and deleting the VPS means you lose that IP address. With GoGrid you can delete a VPS and re-assign its IP address to another VPS, so all your IP addresses are always contiguous and easy to remember. Due to GoGrid’s network setup you can’t use all the IP addresses: some are reserved for the default gateway and network broadcast, and a further 3 IP addresses in the middle of the pool are reserved for the active and standby load balancers.
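To get a feel for how many addresses that leaves, here is a rough count using Python's standard `ipaddress` module. The block `192.0.2.0/28` is a documentation range standing in for a real allocation, and the exact positions of the gateway and load balancer addresses are assumptions for illustration; only the counts follow from the description above.

```python
import ipaddress

# Documentation range (RFC 5737) standing in for a real /28 allocation.
block = ipaddress.ip_network("192.0.2.0/28")

total = block.num_addresses      # a /28 holds 16 addresses
hosts = list(block.hosts())      # drops the network and broadcast addresses

# Assumed positions: first host is the gateway, three mid-pool
# addresses belong to the active and standby load balancers.
gateway = hosts[0]
mid = len(hosts) // 2
load_balancers = hosts[mid - 1: mid + 2]

reserved = {gateway, *load_balancers}
assignable = [ip for ip in hosts if ip not in reserved]

print(total, len(hosts), len(assignable))
```

Under these assumptions, 10 of the 16 addresses in each block remain for your own servers.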

GoGrid provides free fault-tolerant F5 load balancers with their service, allowing you to set up as many as 3 load-balanced IP addresses per data center. In our old setup we ran our own load balancers on CentOS and NGINX, but using GoGrid for this saves us time and money and gives us one less thing to worry about and manage ourselves.

Network Setup
Our network setup is nothing new, but the data we hold needs to be retrieved and processed quickly via our API. Our average API response time is currently 0.07 seconds against approximately 1,000,000 property records, and we don’t want any redundancy we put in place to affect that time. Additional locations should be as autonomous as possible, with little or no inter-site communication caused by an API request, and each should be able to handle another data center going down.

[Image: Cluster14 network diagram]

Above you can see the basic network diagram of what we have set up and running:

  • Zerigo DNS to load balance between the two data centers and fail over requests to the other data center in case of a failure.
  • GoGrid F5 load balancers in each location to load balance requests across the web servers in each location.
  • OpenVPN Servers to bridge the two networks for passing replication data between the data centers.
  • MySQL circular replication between the two sites.
  • CouchDB multi-master replication between the two sites.
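One thing to watch with MySQL circular replication is that both sites accept writes, so auto-generated primary keys can collide. The write-up does not say how this was handled here. A common approach, and purely an assumption in this sketch, is to give each server a different `auto_increment_offset` with a shared `auto_increment_increment`, so the sites hand out interleaved IDs. The Python below only simulates the resulting sequences; `id_sequence` is a made-up helper, not a MySQL API.

```python
from itertools import count, islice


def id_sequence(offset, increment):
    """IDs one server would hand out: offset, offset + increment, ..."""
    return count(start=offset, step=increment)


# Two sites, so increment = 2 and offsets 1 and 2 (assumed values).
site_a = list(islice(id_sequence(offset=1, increment=2), 5))
site_b = list(islice(id_sequence(offset=2, increment=2), 5))

print(site_a)  # [1, 3, 5, 7, 9]
print(site_b)  # [2, 4, 6, 8, 10]
```

Because the two sequences never overlap, rows inserted at either site can replicate to the other without primary-key conflicts.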

A lot of the fail-over is left up to the platform to decide. Every minute the monitoring system calls into the system via a URL. The script behind that URL checks a few key things, such as MySQL and CouchDB availability. If any of these tests fail, the script returns a failure status and the DNS automatically switches. We use 3-minute TTLs on the majority of our DNS records, so in theory fail-over should take no longer than 3-5 minutes to complete.
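The logic of such a check script might look something like the sketch below. It is not the actual StormRETS script: `health_status`, the check names and the choice of HTTP 200/503 as success/failure codes are assumptions for illustration, and the real checks would open connections to MySQL and CouchDB instead of calling stub functions. The second helper gives a back-of-the-envelope upper bound on fail-over time from the one-minute polling interval and the 3-minute TTL.

```python
def health_status(checks):
    """Run each named check; return (HTTP status, names of failed checks)."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as a failure
        if not ok:
            failed.append(name)
    return (200 if not failed else 503, failed)


def worst_case_failover_seconds(poll_interval, ttl):
    """Failure just after a poll, plus a DNS record cached just before the switch."""
    return poll_interval + ttl


def couchdb_down():
    # Stub standing in for a real connectivity test.
    raise ConnectionError("CouchDB unreachable")


print(health_status({"mysql": lambda: True, "couchdb": couchdb_down}))
print(worst_case_failover_seconds(poll_interval=60, ttl=180))
```

With a 60-second poll and a 180-second TTL the bound comes out at 240 seconds, which sits inside the 3-5 minute estimate. It ignores any delay at the DNS provider and resolvers that do not honour TTLs.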

With our new setup we can now redirect traffic to our second data center while performing maintenance on the other, and we are in a much better position to provide an SLA to our customers. Even after these major first steps, though, we are not quite ready to provide one yet. Over the coming weeks we will be running various performance and fail-over tests to verify that each data center can work independently of the other. Based on these tests we will be able to work out how much capacity we have and at which points we need to start considering upgrades and adding capacity.

Do you have an environment running on GoGrid that you think is unique, that would be helpful for others to see and learn from, or that you are particularly proud of? If so, drop me an email: michael [at] GoGrid.com.

Michael Sheehan

Michael Sheehan, formerly the Technology Evangelist for GoGrid, is a recognized technology, social media, and cloud computing pundit and blogger who writes regularly about technology news and trends.
