Microsoft and Amazon Outages – The Need for More Redundancy

Tuesday, August 23, 2011

Ben Kepes


So now the world knows that the recent outages Amazon and Microsoft suffered in Europe were caused by lightning strikes on their Dublin data centers.

The outages caused downtime for users of both Amazon EC2 and Microsoft BPOS services.

I’ll not delve into the issues around failover – clearly the lightning strike was a catastrophic event that overcame the protection that both providers have against upstream events and caused the usual power supply backups to fail.

For those not in the know, Ireland has become a key hub for technology providers – it’s got a good climate (read cold and wet, good for temperature control), it has good internet connectivity and it has a ready supply of IT staff.

Another big factor is that the Irish Government offers very attractive incentives to technology companies to relocate there… it’s all about costs after all.

Colleague Phil Wainewright covered the Amazon outage, pointing out that:

EU-WEST-1 is Amazon’s only data center in Europe, which means that customers who have to keep their data within the European region for data protection compliance have no available failover to another Amazon location...

It’s a very good point, and it does raise some questions about redundancy among European-located cloud providers.

Wainewright did point out that the Amazon data center does in fact have three distinct Availability Zones within one location, and that the outage affected only one of those zones. However, it appears that recovery efforts are affecting the other Availability Zones as well.

Adrian Cockcroft, resident cloud guru at Netflix, was pretty upbeat about the fact that the incident only hit one availability zone, saying:

They lost one AZ, that’s why there are several AZ’s. We are testing a global Cassandra cluster and it didn’t go down #doinitright. only use EU for global testing but Cassandra data is repl to all 3 AZ, lose zone it still works and recovers when zone comes back. we use excess reserved instances in prod (in US) to get priority for capacity in zone level outages…

So it seems the cascading issues on the other two zones did not affect Netflix’s infrastructure at all.
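To make the idea in that quote concrete, here is a minimal Python sketch of data replicated to three zones that keeps serving when one zone is lost and catches up when it returns. It is a toy model only: the `ReplicatedStore` class, its methods and the zone labels are hypothetical names made up for illustration, and the simple majority rule and copy-on-recover step stand in for whatever Cassandra and Netflix actually do.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    """One availability zone holding a full replica (toy model)."""
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)


class ReplicatedStore:
    """Hypothetical key-value store with one replica per zone."""

    def __init__(self, zone_names):
        self.zones = {name: Zone(name) for name in zone_names}
        # Assumption: a simple majority of zones must be reachable.
        self.quorum = len(self.zones) // 2 + 1

    def _live(self):
        live = [z for z in self.zones.values() if z.up]
        if len(live) < self.quorum:
            raise RuntimeError("too few zones up to reach quorum")
        return live

    def write(self, key, value):
        # Write to every zone that is currently reachable.
        for zone in self._live():
            zone.data[key] = value

    def read(self, key):
        # Every live zone holds the same data in this toy model.
        return self._live()[0].data[key]

    def fail(self, name):
        self.zones[name].up = False

    def recover(self, name):
        # Copy the current state from a live peer, then rejoin.
        peer = self._live()[0]
        self.zones[name].data = dict(peer.data)
        self.zones[name].up = True


# Illustrative zone labels, not real Amazon identifiers.
store = ReplicatedStore(["zone-a", "zone-b", "zone-c"])
store.write("title", "v1")
store.fail("zone-a")            # one zone is lost
store.write("title", "v2")      # still works with two of three zones
print(store.read("title"))
store.recover("zone-a")         # the lost zone catches up on return
```

With two of the three zones down the sketch refuses reads and writes, which is the single-location risk in miniature: replication across zones only helps while most of them survive.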

Either way, this does highlight issues. Imagine an uber-catastrophic event that knocked out the entire Dublin Amazon data center.

With only one physical location in Europe, companies relying on Amazon to host their Euro-centric data have a couple of unpalatable options:

  • Move to another provider, which in an open standards world might be easy enough, but for a company built entirely upon Amazon’s proprietary standards isn’t much fun
  • Move data to Amazon’s US DCs, and in doing so potentially breach data location regulations

It’s early days in the cloud, but events like this help us think about what the future needs to look like in order to ensure that cloud is safe for everyone…

Cross-posted from Diversity
