Leap Second, Public Cloud and a Lesson in Enterprise Resiliency

Friday, July 06, 2012

Rafal Los


I'm a big Louis C.K. fan, and when I think of the weekend we've had in technology my mind immediately pulls up one of his famous quotes (see video clip).

"...Like how quickly the world owes him something he knew existed only 10 seconds ago..." when he's talking about a person on an airplane disappointed when on-board WiFi goes out.  

How right is that, though?  Everything's amazing right now in technology, but nobody's happy!  Our users expect to be able to stream Netflix while traversing the continent at 35,000 ft. and when something fails they get indignant.  Hrmm ...

So recently as I was sitting down to watch some streaming content at my parents' house, I realized some of the services I had come to depend on were throwing 503: Service unavailable errors.  

I went straight to my source for information, which is Twitter, and after getting some strange API errors on my Twitter client I finally got the news: there was a massive outage of the Amazon AWS cloud out east, where many of the services I was trying to access live.

Darn those electrical storms... The problem is that many of the Amazon customers who were set up for same-vendor redundancy experienced a failure somewhere in the ELB (Elastic Load Balancing) service... and things sort of went downhill from there.

At midnight UTC, just as the cloud thing was starting to settle down and services were being restored to normal, the leap second bug wreaked havoc in some strange and unpredictable ways [See Wired write-up].  There are a lot of interesting notes, bug reports, and second-hand stories from this too... MySQL had an issue, as did Java.

What's interesting is the bug report from 2009 [Red Hat bug 479765] that seemed to be dismissed as so highly improbable it wasn't in need of a formal fix... a hint of things to come maybe?

So anyway, the Public Cloud crashed (Amazon is the public cloud in many people's minds still) and Linux had a freak out.  I know some of us were awaiting the third shoe to drop and the zombies to come pouring in off the streets...

No, but seriously, these were two very serious issues which (as many press articles put it) cast doubt on the ability of the public cloud and IT to deliver stability in this ultra-scale IT world.

On Twitter, one of my followers [@michaelhood] made this interesting observation, which I can't help but think is amusing as well...

Sometimes Murphy and chaos just team up against us, and things break.  Like I've been saying for a while now, "Everything Fails"... it's not whether things fail in technology, it's how we recover that matters.  This applies to getting hacked, blowing a storage array, or losing a public cloud.  Things will fail... seriously, we need to get over it.

You'll undoubtedly read articles on many prominent news outlets that tell you that this outage means Public Cloud is not a viable option for consumers and enterprises.

That's simply not true, categorically silly, and potentially misleading.  Ask those who are telling you that public cloud is bad whether their internal (or private cloud) would have survived an outage of this caliber and complexity.

As a rule, I don't pick on organizations that are experiencing issues, so I won't even touch whether Amazon AWS should have been able to stand up to a failure of this nature, or whether, as some suggest, other providers in the area of the lightning storms survived and stayed available.  So did Amazon fail somehow?  I have no idea.  Does it matter to you, the customer?  Yes, but probably not the way you're thinking.

Learn this lesson: Resiliency is ultimately your problem

Consumers of cloud services are still failing to understand that building resiliency into their critical services is their responsibility.  If you are pushing a critical service, and I mean really critical, to the public cloud and you're dependent on a single provider then I would argue you've done a terrible job of understanding and mitigating your risks. Period.

No single provider is fail-proof.  I don't care if you're using our service (HPCloud.com) or some other provider - every provider has an equal chance of being struck by natural disaster, human error, or a stray meteor... and if this completely kills your critical cloud-based service you are to blame for failing to account for this.  No one else.

Critical services should be architected to withstand failures of all kinds; we call this resiliency.  You want to build in resiliency across providers, across software, across hardware platforms, and across data - where you can and where it makes sense.  You should absolutely have diverse technology and a multi-vendor strategy... no doubt in my mind.
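To make the multi-provider idea concrete, here is a minimal sketch in Python of failing over from one provider to the next. Everything in it is an assumption for illustration: `call_with_failover`, `provider_a`, and `provider_b` are hypothetical names, and the two providers are simulated with plain functions rather than real cloud APIs. Real resiliency also has to cover data replication, DNS, and health checking, which this does not attempt.

```python
from typing import Callable, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every configured provider fails for a request."""


def call_with_failover(
    providers: Sequence[Tuple[str, Callable[[str], str]]],
    request: str,
) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, response)."""
    errors = []
    for name, handler in providers:
        try:
            return name, handler(request)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{name}: {exc}")
    # Nobody answered: surface every failure so the outage is diagnosable.
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for two independent providers.
def provider_a(request: str) -> str:
    # Simulates a provider that is down.
    raise ConnectionError("503 Service Unavailable")


def provider_b(request: str) -> str:
    # Simulates a healthy second provider.
    return f"served {request}"


if __name__ == "__main__":
    providers = [("provider-a", provider_a), ("provider-b", provider_b)]
    name, response = call_with_failover(providers, "/video/42")
    print(name, response)
```

The point of the design is that the caller never depends on one provider being up; the ordering of the list is the only place where a preference is expressed.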

Don't kid yourself - everything in life will fail you at some point, so the smart thing is to plan for failure and architect for it in your applications and services.  Undoubtedly this type of strategy costs more - having multiple providers, more complex software, more data shipping and redundancy isn't cheap - but if your services are critical to your business or the lives of others this is one lesson you must learn.

Take that weekend's unpredictable, erratic, and odd failures as a sign. It's happened before, it will happen again - just make sure you're not crying about it next time it happens, and it's your business that's unreachable or down.

Cross-posted from Following the White Rabbit
