We devoted the Palomino Newsletter this month to the important topic of disaster recovery, in light of the challenges posed by Hurricane Sandy. If you're not already receiving our newsletter, you can subscribe here.
Hurricane Sandy has been on many people's minds of late; mine not least. Having lived the last four years of my life in Manhattan and on the Jersey Shore, I am in shock at the loss of lives, the destruction of homes, businesses and memories, and the disruption of so much. I grew up in Louisiana, where hurricanes were a way of life. You didn't do something hoping that a hurricane would not come by. You assumed a hurricane would come. At least, that's how I was taught. That's the mentality I try to bring to my architectures, my process and my planning as well. So, when Hurricane Sandy bore down on the East Coast, my alarm bells started ringing, just as my email started exploding. Every one of our US-East Amazon customers was in danger. Who knew when power would go out? And when it would come back?
Palomino is proud to be an Amazon Web Services consulting partner. That said, we recognize that Amazon has had its share of instability. A few weeks ago, US-East experienced significant EBS latency and unavailability. We've lost availability zones. We've lost regions. We've found availability zones inexplicably unpredictable in terms of latency and availability. Amazon forces us to think resiliently: not about preventing disasters, but about weathering them, bouncing back, and being ready. Some say this unreliability is a drawback of Amazon. Perhaps I'm the eternal optimist, but I simply see it as a way to force rigor in anticipating, documenting and practicing our availability and business continuity plans.
None of this is new or incredibly enlightening. Any operations person worth their salt thinks of failure and what can go wrong, and they think of it often. So what's the point here? I thought I'd share the war stories of the weekend to help cast a light on varying degrees of preparation.
Client One
Client One contacted us. They had anticipated the problem and had already been preparing to create multi-region EC2 environments; Sandy just accelerated things. This client is on RDS, Amazon's relational database as a service - in this case, MySQL as a service. RDS is such a convenient tool, until it isn't. One of the big drawbacks? No cross-region support. Yes, you can use Multi-AZ replication for master availability across availability zones. Yes, you can also create read replicas in multiple availability zones. If you do both of these things, you've got a certain level of fault tolerance in place. You can still get hurt if your master does a Multi-AZ failover: all of your replicas will break, because RDS doesn't let you manually move a replica to the next binlog when a master crashes before closing its binlogs. Thus, you are without replicas. Not great, but you have a working master. Similarly, you can keep multiple replicas across AZs to tolerate replica failures. But cross-region? Nothing.
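For illustration, here's a minimal sketch of that within-region setup - a Multi-AZ master plus read replicas in other availability zones - written against today's boto3 SDK rather than the tooling we actually used at the time; all identifiers, sizes and AZs are placeholders.

```python
# Sketch: Multi-AZ master plus read replicas in other AZs, all inside one region.
# Modern boto3 shown for illustration; identifiers, sizes and AZs are placeholders.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Master with Multi-AZ failover (a synchronous standby in another AZ).
rds.create_db_instance(
    DBInstanceIdentifier="app-master",       # hypothetical name
    Engine="mysql",
    DBInstanceClass="db.m1.large",
    AllocatedStorage=100,
    MasterUsername="admin",
    MasterUserPassword="change-me",
    MultiAZ=True,
)

# Read replicas spread across availability zones to tolerate replica/AZ failures.
for az in ("us-east-1a", "us-east-1c"):
    rds.create_db_instance_read_replica(
        DBInstanceIdentifier="app-replica-" + az[-1],
        SourceDBInstanceIdentifier="app-master",
        AvailabilityZone=az,
    )

# Everything above lives in us-east-1; nothing in this setup covers a region loss.
```

Every resource in that sketch lives in a single region, which is precisely the limitation that forced the work described next.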
So, we had to dump all of our RDS instances and load them into RDS in another region. Parallel dumps and loads were kicked off, accelerating the very painful process of a logical rebuild of a system. We used SSD ephemeral storage on EC2 to speed this up as well. The process still took two days. OpenVPNs had to be set up with mappings for port 3306 to allow replication. If this hadn't already been in process before Sandy was a threat, we never would have been ready in time. We still had, and have, issues. You can't replicate from RDS in one region to another, so custom ETL must be created to keep each table as in sync as possible. We'd done this work in a previous plan to move off of RDS, mapping tables to one of three categories: static (read-only), insert-only, or transactional (update/delete). Static tables just need to be monitored for changes. Insert-only tables can be kept close to fresh with high-water marks and batch inserts. Transactional tables require keys on updated_at and created_at fields, and confidence in the values in those fields. Deletes present even bigger problems. Digging in further is out of scope here, but consider it a future topic.
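To make the table-category approach concrete, here's a rough sketch of how an insert-only table can be kept close to fresh with a high-water mark and batch inserts. The connection details, table and column names are hypothetical, and a real job would need retries, monitoring and chunked paging; this is a sketch of the idea, not our production ETL.

```python
# Sketch: keep an insert-only table "close to fresh" across regions using a
# high-water mark and batch inserts. Hosts, credentials, table and column
# names (events, id) are hypothetical placeholders.
import pymysql

SRC = dict(host="source.us-east-1.example.com", user="etl", password="...", database="app")
DST = dict(host="target.us-west-2.example.com", user="etl", password="...", database="app")

def sync_insert_only(table="events", key="id", batch=5000):
    src, dst = pymysql.connect(**SRC), pymysql.connect(**DST)
    try:
        # High-water mark: the newest row already present on the target side.
        with dst.cursor() as c:
            c.execute(f"SELECT COALESCE(MAX({key}), 0) FROM {table}")
            high_water = c.fetchone()[0]

        # Pull the next batch of rows beyond the mark from the source.
        with src.cursor() as c:
            c.execute(
                f"SELECT * FROM {table} WHERE {key} > %s ORDER BY {key} LIMIT %s",
                (high_water, batch),
            )
            rows = c.fetchall()
            width = len(c.description)

        # Batch-insert the new rows on the target and commit.
        if rows:
            placeholders = ", ".join(["%s"] * width)
            with dst.cursor() as c:
                c.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
            dst.commit()
        return len(rows)
    finally:
        src.close()
        dst.close()
```

Transactional tables get a similar pass keyed on updated_at, with upserts instead of plain inserts, which is why confidence in those fields matters; deletes, as noted, are another story.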
Summary: Client One was in-process for multi-region disaster recovery (DR). A fire drill occurred, and people had to work long, long hours doing tedious work. But had Sandy hit their region with the force it hit further north, we'd have been ready.
Client Two
Client Two also contacted us. They had known that they were at risk, but they were small, they were pushing new features and refactoring applications, and DR was far out on their roadmap. They, too, were on RDS. They could not afford the amount of custom work our larger clients requested, so we had to create a best-effort approach. RDS instances were created in Portland, along with cache servers, transaction engines, web services and the rest of the stack. Amazon Machine Images (AMIs) were built, and we created a dump-and-copy process across regions. There would be data loss, up to many hours' worth, if the region went down and never came back up. But they would not be dead in the water. Data loss can be mitigated by more frequent dumps and copies, but not eliminated completely.
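A best-effort dump-and-copy job might look roughly like the following - a periodic logical dump shipped to storage in the standby region. Host names, bucket names and paths are placeholders, and the schedule determines how many hours of data you can lose.

```python
# Sketch: a periodic best-effort dump-and-copy job. The dump lands on local
# (ephemeral) storage, then gets shipped to an S3 bucket in the standby region.
# Host, bucket and path names are hypothetical; MySQL credentials are assumed
# to come from an option file such as ~/.my.cnf.
import datetime
import gzip
import subprocess

import boto3

def dump_and_ship(db_host="prod-rds.us-east-1.example.com",
                  database="app",
                  bucket="dr-dumps-us-west-2"):
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    dump_file = f"/mnt/ssd/{database}-{stamp}.sql.gz"

    # Logical dump; --single-transaction keeps InnoDB reads consistent without locking.
    dump = subprocess.Popen(
        ["mysqldump", "-h", db_host, "--single-transaction", database],
        stdout=subprocess.PIPE,
    )
    with gzip.open(dump_file, "wb") as out:
        for chunk in iter(lambda: dump.stdout.read(1 << 20), b""):
            out.write(chunk)
    if dump.wait() != 0:
        raise RuntimeError("mysqldump failed")

    # Copy the compressed dump into a bucket that lives in the standby region.
    s3 = boto3.client("s3", region_name="us-west-2")
    s3.upload_file(dump_file, bucket, f"{database}/{stamp}.sql.gz")
```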
Summary: Client Two had no plans for multi-region DR. They had taken a conscious risk. Luckily they had the talent and agility of a small company and could move fast with our help. Failing over would have hurt, but they'd still be alive.
Client Three
We reached out proactively to Client Three. They had put together a multi-region plan for critical systems last year, before we started working with them, which included scripts to rapidly build out new clusters of Hadoop-based systems. It was supposed to just work. When we started working with Client Three, we’d scheduled our DR review, testing and modernization for our Q4 checklist. Too little, too late, right? Sure enough, things didn't "just work". It wasn't horrible, but it took a weekend of cleaning up, rescripting and fixing problems as they arose. But had we had to fail over? They would've been ready.
Summary: Client Three had anticipated and architected DR, but they hadn’t tested it. Luckily we had the days before the storm to test and to fix this. If they hadn't planned at all, I'm not sure we would've made it.
It’s also worth remembering that you are not alone in these shared environments. All weekend, shops were staking claims on instances and storage and building out. Rolling out resources got slower, and if you didn't claim yours, you'd lose out. This has to be considered in your plans.
To recap: Palomino loves AWS. We're a consulting partner and have helped many clients, across many different business models, deploy, scale and perform in AWS. But DR is not a luxury anymore. It's a necessity. Architectures have to take multi-AZ and multi-region plans into consideration from the beginning. Many people use AWS so they can save money on hardware. They get upset when you point out the labor and extra instances needed to guarantee they can weather these storms. But it's a hard reality. It's one of the reasons we only recommended RDS in early phases, when downtime is tolerable. Good configuration management also means you can deploy a skeleton infrastructure in another region and explode it into a full-blown install with ease. But you have to practice, and you have to move fast. If you think your region can go down, go to DEFCON and push the buttons. If you're wrong, you can always tear back down.
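As a sketch of what "explode and tear back down" can look like with pre-built AMIs and a scriptable API - the AMI ID, key pair and instance type below are hypothetical placeholders:

```python
# Sketch: grow a skeleton footprint in the standby region from pre-built AMIs,
# and tear it back down if the threat passes. AMI ID, key pair and instance
# type are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

def scale_out(ami_id="ami-00000000", count=4, instance_type="m1.large"):
    # Launch N instances from the pre-baked image in the standby region.
    resp = ec2.run_instances(
        ImageId=ami_id,
        MinCount=count,
        MaxCount=count,
        InstanceType=instance_type,
        KeyName="dr-keypair",
    )
    return [i["InstanceId"] for i in resp["Instances"]]

def tear_down(instance_ids):
    # If the region scare passes, terminating the extra capacity is one call.
    ec2.terminate_instances(InstanceIds=instance_ids)
```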
Anticipate.
Plan.
Build it early.
Automate it.
Test it.
Test it.
Test it.
Test it.
If you haven't been able to donate to the Red Cross or other institutions helping our fellow brothers and sisters in the Northeast and in the Caribbean, please take some time to do so. Having lost property and cared for loved ones displaced by Katrina, and now hearing so many horror stories from New Jersey and New York, I urge everyone to donate money, donate shelter, donate time and skills if you have them.