The Amazon Crash and Burn, And How It Affected an Ordinary Blogger
What a week for the cloud. On April 19, Sony took down the PlayStation Network in the face of hacker attacks that compromised the network. Two days later, large portions of Amazon Web Services crashed and burned, due to technical glitches. Amazon has since recovered, but the PlayStation Network is still offline. Much has been reported about the effect of the Amazon outage on major sites like Quora, reddit, and Foursquare. But many “ordinary” bloggers use Amazon services as well. How did it affect them? We spoke to a blogger whose sites were taken offline by the outage, and learned a few things.
First, let’s look at what took large parts of Amazon Web Services offline. Amazon published an explanation of the outage about eight days after it happened. It all started with human error during a configuration change intended to upgrade the capacity of the primary network. The traffic shift was carried out incorrectly, and traffic was sent to a router that was not equipped to handle it. Normally, different sections (“nodes”) of Amazon’s network replicate to other nodes, so that the same data is available across the network. With this router unable to handle the traffic, several nodes were isolated and unable to replicate.
When the initial mistake was corrected and traffic was sent to the correct router, the affected nodes tried to catch up on their replication all at once. The space available for replication filled up, and nodes were stuck in a loop, looking for space that didn’t exist. API requests to the network started piling up, and the looping problem cascaded across a section of the network. Eventually, Amazon technicians were able to dampen the replication requests, and they also physically relocated servers to handle the load.
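To get a feel for why dampening the requests mattered, here is a toy model of the retry loop described above. It is only a sketch: the round counts and node counts are made up, and exponential backoff is used here as one common way to dampen retries, not as a description of what Amazon's technicians actually did.

```python
def retry_rounds(total_rounds, backoff):
    """Return the rounds in which one stuck node sends a 'find me space' request.

    Without backoff the node asks again every round. With backoff (an
    illustrative assumption) it doubles its wait after each failed attempt.
    """
    rounds = []
    current_round = 0
    delay = 1
    while current_round < total_rounds:
        rounds.append(current_round)
        current_round += delay
        if backoff:
            delay *= 2  # wait twice as long before the next attempt
    return rounds


def request_load(stuck_nodes, total_rounds, backoff):
    """Total requests sent by all stuck nodes over the whole period."""
    return stuck_nodes * len(retry_rounds(total_rounds, backoff))


if __name__ == "__main__":
    # Hypothetical numbers: 100 stuck nodes observed over 16 rounds.
    print("immediate retries:", request_load(100, 16, backoff=False))
    print("with backoff:     ", request_load(100, 16, backoff=True))
```

In this toy setup, the nodes that retry immediately send more than three times as many requests as the ones that back off, which is the kind of pile-up the explanation describes.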
The Impact On the Ordinary Blogger
We spoke to the owner of Lazy Man and Money, who uses what Amazon bills as a “High-CPU Medium EC2 Instance Type” for his main site and some other sites. We’ll call him “Lazy Man” for this post. He had transferred his sites to Amazon just a month earlier, only to see them go down for over 30 hours. His experience highlights a few lessons from an outage like this one.
1. Uptime Is Everything - Your Traffic Suffers, Even After You Get Back Online
At the risk of stating the obvious, the Amazon outage showed how crucial uptime is, and how long the effects of downtime can last. The Lazy Man and Money site, according to its owner, has seen a lingering hit on traffic, days after service was restored. “Originally there didn’t seem too much traffic loss,” Lazy Man reported. “However, over the last 5 days, it seems like traffic has tailed off significantly.” It’s too early to tell if this will continue, but being offline is a compound problem: users who try to visit you not only won’t get through while your site is down, but might never return. Think about how often you stumble upon a site by chance, are interested, and become a regular visitor. Those potential regular visitors are lost when your site is down.
2. Cloud Providers Must Do a Better Job At Communicating With Their Customers When Outages Do Occur
One of the issues touched upon in the coverage of the outage is the poor communication between Amazon and its customers. The Web Services blog went four days without an update. The CEO of Big Door complained that “Amazon’s updates read as if they were written by their attorneys and accountants who were hedging against their stated SLA rather than being written by a tech guy trying to help another tech guy.”
Lazy Man’s experience was no better. “Amazon gave no helpful estimates as to when they’d have the problem fixed,” he reported. “The language made one think it would be a couple of hours, but it lingered on for more than 30.”
According to the Wall Street Journal, Amazon has recognized that it can improve how it communicates with its customers. Amazon’s official statement was that it “switched to more regular updates part of the way through this event and plan[s] to continue with similar frequency of updates in the future.”
Granted, a provider such as Amazon might not know how long a service will be down, but it should at least be open about what it knows, and doesn’t know. It should at least share the nature of the problem. That way, a customer with a fallback plan in place can decide whether to point its traffic elsewhere, through a DNS change, shifting to a different Amazon cluster, or otherwise.
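For a site owner who does have a fallback in place, the "wait or switch" decision can be partly automated. The sketch below is one possible approach, not anything Amazon provides: it treats the site as down after a run of consecutive failed checks. The URL and the threshold of three failures are hypothetical, and the actual switch (a DNS change, for example) is left out because it depends on your DNS provider.

```python
import urllib.request


def site_is_up(url, timeout=5.0):
    """Return True if the URL answers with a 2xx or 3xx status before the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # Covers connection errors, timeouts, and HTTP error statuses.
        return False


def should_fail_over(check_results, threshold=3):
    """Return True once `threshold` consecutive checks have failed.

    `check_results` is a sequence of booleans, oldest first (True = site was up).
    """
    consecutive_failures = 0
    for was_up in check_results:
        consecutive_failures = 0 if was_up else consecutive_failures + 1
        if consecutive_failures >= threshold:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical address; replace with your own site.
    results = [site_is_up("http://www.example.com/") for _ in range(3)]
    print("fail over now?", should_fail_over(results))
```

Requiring several failures in a row keeps a single slow response from triggering a switch. It cannot tell you how long the outage will last, which is exactly the information a provider has to supply.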
3. Cloud Providers Must Educate Their Users About Redundancy and Backup, and Make It Dead Simple
For the first several hours, Lazy Man had no access to his server, and visitors couldn’t access his site. At some point thereafter, Amazon made it possible to create new instances and install any back-ups that its users happened to have available. Lazy Man found out, however, that Amazon doesn’t make it easy to make backups on a regular basis. “You can do it manually through their console,” he explained. “That requires logging into the website and clicking on a few buttons every time you want to make a back-up. There are some command-line tools, but they aren’t pre-installed. The tools also require pre-requisites like Java that is not also pre-installed on the server.”
Lazy Man eventually was able to take some steps to get back online, at least to some extent. “During a portion of the outage, I was able to put up an older backup, but I had to shut down any kind of user activity as it would be lost when/if Amazon was able to restore the instance as it was before the shut-down.”
Is there an answer to this, for the average blogger? One writer claims that site owners only have themselves to blame if their sites didn’t withstand the outage. Yes, users could have moved their operations to parts of the Amazon cloud that were working. And yes, a user needs to be responsible for his or her site. However, Amazon can better educate its users on how to do this. If Amazon is going to market its services to the average user, then it needs to make redundancy and backups dead simple.
If you google “how to back up amazon ec2,” you’ll find some suggestions concerning how to back up your Amazon-hosted site, such as this one. The Amazon forums also discuss backing up from EC2 to Amazon’s S3. Yesterday, we looked at how to mount an Amazon S3 bucket locally on your Mac, which might enable you to then pull that data down to your local machine, once it is backed up to S3. None of these methods fall into the “dead simple” category, though.
Even if it were dead simple, this comes back to Amazon’s poor communication. If you didn’t have redundancy built in prior to the crash, but did have backups, you didn’t possess enough information to know whether to wait out the outage, or to get your site live elsewhere, using backups.
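As a rough illustration of what a regular backup routine involves, here is a minimal sketch that archives a site directory and keeps only the most recent few archives. It uses only Python's standard library, and the paths, the `site-` file prefix, and the retention count are all assumptions made for the example. It is not an Amazon tool, and it does not copy anything off the server. That last step (to S3 or to your own machine) is the part that actually protects you in an outage like this one.

```python
import tarfile
import time
from pathlib import Path


def make_backup(site_dir, backup_dir, keep=7, stamp=None):
    """Archive `site_dir` into `backup_dir` and keep only the newest `keep` archives.

    `stamp` defaults to the current time; passing it explicitly is useful for tests.
    `backup_dir` should live outside `site_dir` so archives do not include each other.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    site_dir, backup_dir = Path(site_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    archive = backup_dir / f"site-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(site_dir, arcname=site_dir.name)

    # Timestamped names sort chronologically, so the oldest archives come first.
    archives = sorted(backup_dir.glob("site-*.tar.gz"))
    for old_archive in archives[:-keep]:
        old_archive.unlink()
    return archive


if __name__ == "__main__":
    # Hypothetical paths; adjust to wherever your site and backups live.
    print("wrote", make_backup("/var/www/mysite", "/var/backups/mysite"))
```

Run from a scheduler such as cron, a script like this would cover the "regular basis" part without any clicking through a console.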
What Does This Mean For The Cloud?
My opinion: the Amazon outage will have zero detrimental effect on the move towards the cloud. The cloud is where the world is heading. If anything, cloud providers and users will learn from this, and outages like this will be less likely to have a substantial detrimental effect on users in the future.
Personally, I’m a proponent of the cloud, but, when possible, I try to live in both worlds. Any piece of information that I keep in the cloud, I also try to keep locally. That won’t help when entire services go offline, as in a complete Gmail outage. In my mind, though, that’s a price that I’m willing to pay and a risk that I’m willing to take, if the alternative is not using those services at all. How about you?
Thank you to Lazy Man, from Lazy Man and Money, for talking to us about his experience with the Amazon outage.