I guess I could have jumped on the bandwagon and started criticizing Amazon for the issues it had on Friday with its S3 offering. But part of me wanted to see how it all shook out: what the cause was, what would change, if anything, and how their customers would take it. According to the blogosphere, the Amazon service was down for over 2 hours (which seems to be accurate after reading through this thread on the Amazon developer forum).
This excerpt from the official AWS response on the forum seems to explain the outage:
Early this morning, at 3:30am PST, we started seeing elevated levels of authenticated requests from multiple users in one of our locations. While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests. Importantly, these cryptographic requests consume more resources per call than other request types.
Shortly before 4:00am PST, we began to see several other users significantly increase their volume of authenticated calls. The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place. In addition to processing authenticated requests, the authentication service also performs account validation on every request Amazon S3 handles. This caused Amazon S3 to be unable to process any requests in that location, beginning at 4:31am PST. By 6:48am PST, we had moved enough capacity online to resolve the issue.
As we said earlier today, though we’re proud of our uptime track record over the past two years with this service, any amount of downtime is unacceptable. As part of the post mortem for this event, we have identified a set of short-term actions as well as longer term improvements. We are taking immediate action on the following: (a) improving our monitoring of the proportion of authenticated requests; (b) further increasing our authentication service capacity; and (c) adding additional defensive measures around the authenticated calls. Additionally, we’ve begun work on a service health dashboard, and expect to release that shortly.
The Amazon Web Services Team
What this boils down to is that it seems Amazon was hit by a DoS (Denial of Service) attack, intentional or not, that came in through these authenticated requests. Good for Amazon for coming right out (albeit only on the developer forum initially), admitting the issue, and listing some actionable solutions. Admitting mistakes this openly is an important move on their part.
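The monitoring gap described in Amazon's response can be pictured with a small sketch. It is only an illustration: the `Window` class, the `alerts` function, the cost weights, and the thresholds are all made up for this example and say nothing about how Amazon actually monitors S3. The point is simply that total request volume can stay flat while the share of expensive authenticated requests, and therefore the real load, climbs.

```python
from dataclasses import dataclass

# Hypothetical relative costs. The AWS response only says that authenticated
# (cryptographic) requests consume more resources per call than other types.
AUTH_COST = 5.0
OTHER_COST = 1.0


@dataclass
class Window:
    """Request counts seen in one monitoring interval (illustrative only)."""
    authenticated: int
    other: int

    @property
    def total(self) -> int:
        return self.authenticated + self.other

    @property
    def auth_share(self) -> float:
        # Proportion of requests that are authenticated.
        return self.authenticated / self.total if self.total else 0.0

    @property
    def load(self) -> float:
        # Weighted load: authenticated calls count more than other calls.
        return self.authenticated * AUTH_COST + self.other * OTHER_COST


def alerts(window: Window, max_total: int, max_auth_share: float) -> list:
    """Return the names of the checks that this window trips."""
    tripped = []
    if window.total > max_total:
        tripped.append("total volume")
    if window.auth_share > max_auth_share:
        tripped.append("authenticated share")
    return tripped


# Same total volume in both windows, very different mix.
normal = Window(authenticated=2_000, other=8_000)
spike = Window(authenticated=7_000, other=3_000)

for name, w in (("normal", normal), ("spike", spike)):
    print(name, w.total, f"{w.auth_share:.0%}", w.load, alerts(w, 12_000, 0.4))
```

A check on total volume alone stays quiet for both windows; only the check on the authenticated share flags the second one, which is the kind of monitoring Amazon says it is adding.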
I have mentioned this before: it’s tough being a hosting provider (as Joyent and Twitter experienced recently). But then I remember that we (all of the ISPs and hosting providers) who are working on providing grid/virtualized/cloud computing, or any new product offering, are breaking new ground. That means, as with any new or innovative product, that there will be issues, bugs, downtime and other problems. That is the price you pay with emerging technology. Software is developed by humans, and to err is human; therefore, applying simple high school logic, software will have errors.
S3 is an incredible product and might be a good match for many companies. Just remember to have a back-up strategy. Amazon’s core competency is selling books and other products, not necessarily hosting. They are making a good run at it though. My thought is that S3 spun out of a glut of extra computing power and resources being available during non-holiday times. If so, what a great idea to offload extra server capacity outside the holiday shopping crunch. But again, hosting is not their core business.
So I ask you, the reader and customer/potential customer of any new, bleeding edge technology, to be forgiving when things don’t work as expected. Being an early adopter of technology means that you accept the risks associated with that. Your rewards may be great (new products and better pricing), but the potential for “disaster” is also much higher than with more traditional routes.
Some points to make sure you don’t get caught by surprise:
- Develop a backup solution
- Use your backup solutions regularly
- Understand your SLA (Service Level Agreement)
- Look for ways to set up redundancy. Set up a High Availability Network.
- If you can afford it, diversify your network; set up various mirrored POPs (points of presence) with different service providers
- Develop a contingency plan: if your network goes down, can you:
  - easily inform your clients
  - get a temporary site up quickly
  - get timely, useful information from your ISP
- Put it all down in writing
- Do a disaster recovery dry run to work out the kinks
- Do cross-training of core skills
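The redundancy and mirrored-provider points in the list above could look roughly like this in code. This is a minimal sketch under stated assumptions: `fetch_with_failover`, `AllProvidersFailed`, and the two stand-in provider functions are hypothetical names invented for the example, not part of any real storage API. In practice each callable would wrap a real client for a different service provider.

```python
from typing import Callable, Iterable, Tuple


class AllProvidersFailed(Exception):
    """Raised when none of the configured providers could serve the request."""


def fetch_with_failover(
    key: str,
    providers: Iterable[Tuple[str, Callable[[str], bytes]]],
) -> Tuple[str, bytes]:
    """Try each (name, fetch) pair in order and return the first success."""
    errors = []
    for name, fetch in providers:
        try:
            return name, fetch(key)
        except Exception as exc:  # any failure means: move on to the next mirror
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stand-ins for real provider clients (hypothetical, for illustration only).
def primary_down(key: str) -> bytes:
    raise ConnectionError("service unavailable")


def mirror(key: str) -> bytes:
    return b"backup copy of " + key.encode()


source, data = fetch_with_failover(
    "logo.png", [("primary", primary_down), ("mirror", mirror)]
)
print(source, data)
```

The order of the list is the contingency plan written down: the primary is tried first, and the mirror only takes over when the primary fails. Running this kind of path regularly doubles as the dry run suggested above.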
The most important suggestion I can make is one that Douglas Adams articulated so clearly in his book, The Hitchhiker’s Guide to the Galaxy. The two words “Don’t Panic” are boldly inscribed on the cover of The Guide. You might want to follow the same lead and put those words on your IT strategy and contingency plans. And be sure to do some research on a variety of hosting providers. They can make or break your business!
Latest posts by Michael Sheehan (see all)