The EC2 Fleet Upgrade Tests our "Cloud Abilities"
Posted by Greg Arnette on Fri, Dec 16, 2011 @ 12:19 PM
In the past I have written about the secret to successful cloud deployments and how to architect for the cloud. Being successful requires a "designed-for-the-cloud" architecture, best operational practices and DevOps on steroids.
A couple of weeks ago Amazon notified a majority of its customers about an upcoming event that we early-to-the-cloud pioneers hadn't seen before: a forced reboot of the host operating system. On a massive scale. For Sonian, 72% of our currently running EC2 instances will need to be restarted before Amazon's deadline. There is no reprieve. There is no deferment. Welcome to Infrastructure as a Service!
Our AWS business development contact gave us an early heads-up, and Twitter lit up when the first email notices started to arrive for the US-West region. Something big was afoot, and there were a lot of groans from the EC2 user community. First, let me state flat out that Amazon did a pretty good job getting the word out and provided several ways to find out which EC2 instances would need to be restarted: an email was sent with the list, the EC2 Management Console displays the information, and the EC2 API's 'ec2-describe-instance-status' command returns it. Fortunately Joe Kinsella, Sonian's VP Engineering (@joekinsella), enhanced our Cloud Control Viewer and provided a report showing the exact instances and their reboot schedule.
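As a rough illustration of what such a report involves, here is a minimal Python sketch that groups pending events by type and orders them by start time. It is not Sonian's Cloud Control Viewer. It assumes a dictionary shaped like an EC2 DescribeInstanceStatus result (for example, what boto3's `describe_instance_status(IncludeAllInstances=True)` returns); the function name, instance IDs and dates are made up for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def scheduled_events_by_code(response):
    """Group pending scheduled events by event code.

    `response` is assumed to look like a DescribeInstanceStatus result:
    {"InstanceStatuses": [{"InstanceId": ..., "Events": [...]}, ...]}
    """
    grouped = defaultdict(list)
    for status in response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            grouped[event["Code"]].append(
                (status["InstanceId"], event.get("NotBefore"))
            )
    # Earliest start first; events without a start time go last.
    for entries in grouped.values():
        entries.sort(key=lambda item: (item[1] is None, item[1]))
    return dict(grouped)


# Hypothetical sample data, for illustration only.
SAMPLE_RESPONSE = {
    "InstanceStatuses": [
        {"InstanceId": "i-0000000a", "Events": [
            {"Code": "system-reboot",
             "NotBefore": datetime(2011, 12, 20, tzinfo=timezone.utc)}]},
        {"InstanceId": "i-0000000b", "Events": [
            {"Code": "system-reboot",
             "NotBefore": datetime(2011, 12, 18, tzinfo=timezone.utc)}]},
        {"InstanceId": "i-0000000c", "Events": [
            {"Code": "instance-reboot",
             "NotBefore": datetime(2011, 12, 19, tzinfo=timezone.utc)}]},
        {"InstanceId": "i-0000000d", "Events": []},  # nothing scheduled
    ]
}

if __name__ == "__main__":
    for code, entries in scheduled_events_by_code(SAMPLE_RESPONSE).items():
        for instance_id, not_before in entries:
            print(code, instance_id, not_before)
```

Grouping by event code keeps the different restart types apart, which matters because they do not all carry the same risk.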
Of the various reboot types, the most invasive is the one that moves the virtual host to new hardware. That forces a change in IP address, and ephemeral storage is lost. This activity will certainly shake out any bugs in automated deployments, hard-coded settings, and sloppy shortcuts. We had to scramble to assess the impact. All we learned from the email notice was that a portion of our EC2 instances would need to be restarted. There were actually two types of restart: an operating system reboot, which would preserve the non-persistent ephemeral storage, and a more invasive full instance restart (meaning the hardware under the hypervisor would power-cycle), which would not.
One of the major mistakes cloud customers can make is to get complacent and treat the cloud like traditional co-located hosting. The cloud has different operating characteristics, what one could call the "cloud laws of physics," and this forced restart is a good example of that principle in action. It's also a wake-up call not to get lazy. A large-scale forced restart is like an earthquake drill. Practice makes perfect, and if this were an actual unscheduled emergency, we would be scrambling.
Despite the headache, this event has some upsides. First, it's encouraging that there is an "EC2 fleet upgrade," which means newer underlying hardware, perhaps with faster NICs in the hosts. For companies like Sonian that started in the cloud circa 2007, some of our original instances had been running for more than a year and needed a "freshening." This event reminds us there is a "hardware" center to every amorphous cloud. Amazon does a great job of letting us not think about that too often, except at times like these. A stale part of the cloud gets a refresh. The second "benefit" is the forced fire drill. I know, there's never a good time for a fire drill. But this type of event has similar qualities to an unexpected outage.
There is some luxury in being able to plan ahead, but the shake-out will be the same. Something will be discovered in your architecture or deployment practices that will be improved by this reboot activity. Clusters may be too hard-coded. Config settings may be too restrictive. Reboot scripts may not work as you think. Sonian survives unscathed thanks to our maniacal focus on 100% automated deployments, our 100% commitment to "infrastructure as code," and an investment in cloud control tools that allowed us to triage the situation and develop an action plan relatively quickly. We also employ the best darn DevOps team the cloud has seen.