Who gets blame for Amazon outage?

Published: April 26th, 2011

Amazon.com has promised to provide a “detailed post-mortem” on the root causes of the prolonged outage of its cloud services in recent days. Users of the Amazon services, meanwhile, may also have to explain how they got caught up in the outage.

The ensuing conversations may be uncomfortable for both Amazon and its cloud customers — perhaps even more so for users of the services.

Cloud services overall have been remarkably reliable, which may be fostering a dangerous complacency among customers who are putting too much trust in them. This is another old and familiar story of technology hubris, one that was famously illustrated by another tech marvel, the unsinkable Titanic.

In this case, it is IT managers who will have to explain to their users — and to their company’s executives — why they didn’t have a lifeboat.

Amazon’s partial outage, which began Thursday and seemed largely resolved today, was an exceptional event.

Data compiled by AppNeta on the uptime reliability of 40 of the largest providers of cloud-based services, including Amazon, Google, Azure and Salesforce.com, shows how well cloud providers are delivering uninterrupted services. The performance management and network monitoring firm, known as Apparent Networks until this week, captures minute-by-minute uptime and other data from cloud providers used by its customers.

The overall industry yearly average of uptime for all the cloud services providers monitored by AppNeta is 99.9948 per cent, which is equal to 273 minutes or 4.6 hours of unavailability per year.

The worst providers clock in at 99.992 per cent or 420 minutes or seven hours of unavailability a year.

The best providers are at 99.9994 per cent or three minutes or 0.05 hours of unavailability a year.
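For readers who want to check how an uptime percentage translates into time offline, here is a minimal sketch of the conversion. It assumes a 365-day year, and the function names are illustrative rather than anything AppNeta publishes.

```python
# Assumption: a 365-day year (525,600 minutes), ignoring leap years.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Minutes of unavailability per year implied by an uptime percentage."""
    return (100.0 - uptime_percent) / 100.0 * MINUTES_PER_YEAR


def uptime_percent_from_downtime(minutes: float) -> float:
    """Uptime percentage implied by a yearly downtime figure in minutes."""
    return 100.0 * (1.0 - minutes / MINUTES_PER_YEAR)


# The best providers' figure quoted above: 99.9994 per cent uptime.
best = downtime_minutes_per_year(99.9994)
print(f"{best:.2f} minutes, {best / 60:.2f} hours")  # 3.15 minutes, 0.05 hours
```

The same two functions can be applied to any provider's published uptime or downtime figure to compare service levels on a common scale.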

The takeaway for cloud users looking at the AppNeta data is often that the risk of an outage is very low.

AppNeta runs its company on Amazon’s cloud technology and was thus affected by the outage. However, its problems were short-lived because its service is architected to respond to a data centre failure in Amazon’s cloud.

Matt Stevens, the chief technology officer of AppNeta, said its system was able to fall back to an alternative availability zone in another data centre in Amazon’s cloud.

“You still need to plan for worst-case scenarios,” said Stevens, who said Amazon advises its customers to plan for a potential data centre interruption. “It was actually their guidance that helped us avoid this from being more painful.”

Amazon has built the system with multiple levels of disaster recovery, including a design for high availability across virtual infrastructure within a zone, such as the ability to fail over between servers, as well as planning to fail over to another data centre, as AppNeta did.

AppNeta has redundant mirroring of its data in Amazon’s S3 storage service, which allowed it to pull that data into a second data centre. Its problem was limited to a couple of hours Thursday morning, said Stevens.
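The fallback pattern Stevens describes can be sketched roughly as follows. This is an illustration under stated assumptions, not AppNeta's or Amazon's code: the zone names, the ZoneUnavailable exception and both functions are hypothetical, and the outage is simulated rather than detected through any real cloud API.

```python
from typing import Callable, Sequence, Tuple


class ZoneUnavailable(Exception):
    """Hypothetical error raised when a zone cannot serve a request."""


def fetch_with_failover(
    zones: Sequence[str], fetch: Callable[[str], str]
) -> Tuple[str, str]:
    """Try each zone in order and return (zone, result) from the first that responds."""
    failed = []
    for zone in zones:
        try:
            return zone, fetch(zone)
        except ZoneUnavailable:
            failed.append(zone)  # remember the failure and try the next zone
    raise ZoneUnavailable(f"all zones failed: {failed}")


def simulated_fetch(zone: str) -> str:
    """Stand-in for a real request; pretends the primary zone is down."""
    if zone == "zone-a":
        raise ZoneUnavailable(zone)
    return f"data served from {zone}"


zone, result = fetch_with_failover(["zone-a", "zone-b"], simulated_fetch)
print(zone, result)  # zone-b data served from zone-b
```

The sketch only works if the second zone already holds a copy of the data, which is the role the mirrored S3 storage plays in AppNeta's account.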

Stevens believes that Amazon’s outage will cause people to step back and ask some questions about their internal architecture, as well as ask whether to adopt a multi-cloud strategy to do more to spread the risk. “That’s certainly got to be top of mind for a lot of CIOs today,” he said.

