Infrastructure As Code: Business Continuity And Disaster Recovery With Cory O'Daniel

Episode Description

Chris Hill sits down with Cory O'Daniel to talk about how Infrastructure as Code can help with disaster recovery and business continuity. From the technology and personnel challenges to scenarios such as losing one of your regions and the importance of backup plans, learn how IaC can be used to help ensure data and operations are not affected.

Episode Transcript

I'm talking with Cory O'Daniel, CEO and Cofounder of Massdriver. We're talking about how IaC tools like Terraform can help with disaster recovery and business continuity. Let's talk about it. How can IaC help us with these scenarios?

I think infrastructure as code is probably one of the key tools for business continuity and disaster recovery scenarios. The recent State of CD Report had a pretty scary number in it, with only 27% of organizations using infrastructure as code. When you think about the different disasters that we can have, from regional outages to more catastrophic disasters with data loss, it's pretty scary thinking about how we are going to reproduce these systems. In a world where a lot of people tend to be using scripts or doing ClickOps, it's very hard to get a system up in a different region when you're under pressure.

It seems you need some consistency and reproducibility. Is that where IaC comes in?

I think it is. I think that doing good IaC that works well for disaster recovery is, in and of itself, a challenge. Things like hard-coding regions into your code versus accepting them as inputs are things you have to focus on when you're doing your development, but so is which services you use. When you look at AWS, Azure and GCP from region to region, you are going to have services that are available or not available, or configurations that are or aren't available yet. Part of it isn't just codifying the systems that you have in infrastructure as code but also being aware of where you can migrate them in the event of an outage, so that you aren't trying to replicate to a region where that service might not be available.
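A minimal sketch of that idea, written in Python rather than in an IaC tool's own language. Everything named here is hypothetical: the `Stack` type, the `plan` function and the `SERVICES_BY_REGION` table are illustrations only, and the table is made up rather than a reflection of any provider's real regional availability. The point is that the region arrives as an input and the plan fails early if the target region lacks a required service.

```python
from dataclasses import dataclass

# Hypothetical availability table. A real one would come from the provider's
# published region/service listings, not from a hand-written dict.
SERVICES_BY_REGION = {
    "us-east-1": {"postgres", "kafka", "object-storage"},
    "us-west-2": {"postgres", "object-storage"},
}


@dataclass(frozen=True)
class Stack:
    """An infrastructure definition that does not hard-code its region."""
    name: str
    required_services: frozenset


def missing_services(stack: Stack, region: str) -> set:
    """Return the services the stack needs that the region does not offer."""
    available = SERVICES_BY_REGION.get(region, set())
    return set(stack.required_services) - available


def plan(stack: Stack, region: str) -> dict:
    """Build a deployment plan for the given region, or fail before deploying."""
    missing = missing_services(stack, region)
    if missing:
        raise ValueError(f"{region} is missing: {sorted(missing)}")
    return {
        "stack": stack.name,
        "region": region,  # region is an input, not a constant in the code
        "services": sorted(stack.required_services),
    }


checkout = Stack("checkout", frozenset({"postgres", "object-storage"}))
pipeline = Stack("pipeline", frozenset({"postgres", "kafka"}))
```

Under these assumptions, `plan(checkout, "us-west-2")` succeeds, while `plan(pipeline, "us-west-2")` raises because the made-up table has no Kafka service in that region. That is the kind of gap you would rather discover in a drill than during an outage.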

Personnel Challenges

You're talking a bit about the technology challenges. What about the personnel challenges when you run into a situation like this? They call it the bus factor, staff changes, things like that. How can IaC help us with that stuff?

I think IaC is a great tool for collaboration, as it allows you to show another engineer what changes you plan to make to your infrastructure and have that approved through a pull request. One of the other things that's fantastic about it is that it works as a documentation tool for understanding what you have in the cloud, how it's configured, and where it's deployed. Whether it's a bus factor of one and you lose the one operations engineer your team might have, or your company is going through a RIF or layoffs, business continuity is there as well. It's about making sure that we have these tools in place so that teams that are coming on, or teams that are taking over responsibilities for people who have left, understand what is provisioned, how it's secured and the different configurations that are applied to different regions.

Data In Multiple Regions

Let's dig in on scenarios where you've got multiple regions. Let's say you lose one of your regions, and that's maybe the main region where you had data. How do you manage data in these scenarios? If you need to come up in another region, how do you deal with the data problem?

Data's the hardest. Besides being the hardest, it's also probably the most costly. IaC is fantastic for describing some clusters and some databases in, say, US West. Actually, let's say US East, because we know it's going to go down. Let's say you've described all of your infrastructure, your applications are there, and it goes down. You don't know how long it's going to be down. You want to replicate it to, let's say, the US West or Central region. That's great, but have you moved your data?

I think that one of the other key things with IaC is using it to actually manage backup plans. Being able to say things like, “Yes, we take backups that are in our East zone, and then we're going to replicate those to wherever we plan to restore them in case of an outage.” You want to think this through ahead of time. You don't want to be caught off guard trying to go from East to West during an outage, only to find out a service isn't there, and you also want to make sure that the data is replicated.

How hard that is varies from database to database. It's very easy with something like MySQL or Postgres, where you can keep snapshotting them and restore the snapshots over there. When you're talking about Kafka, event-driven pipelines and whatnot, you have to think about what is in those pipelines and which events have been processed.
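The replication check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Backup` record, the `unreplicated` helper, the region labels and the snapshot ids are all hypothetical, and a real check would read its inventory from the cloud provider rather than from a hard-coded list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    """One stored copy of a database snapshot (hypothetical record shape)."""
    database: str
    region: str       # region where this copy is stored
    snapshot_id: str


def unreplicated(backups, source_region, restore_region):
    """Return ids of snapshots in source_region with no copy in restore_region."""
    in_source = {b.snapshot_id for b in backups if b.region == source_region}
    in_restore = {b.snapshot_id for b in backups if b.region == restore_region}
    return sorted(in_source - in_restore)


# Made-up inventory: snap-1 has been copied to the restore region, snap-2 has not.
inventory = [
    Backup("orders", "us-east", "snap-1"),
    Backup("orders", "us-west", "snap-1"),
    Backup("orders", "us-east", "snap-2"),
]

gaps = unreplicated(inventory, "us-east", "us-west")
```

With this inventory, `gaps` contains only `snap-2`. Running a check like this on a schedule is one way to learn about a replication gap before the source region goes down. It says nothing about event pipelines such as Kafka, where you also need to know which events have already been processed.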

Disaster Recovery

What are your thoughts on the differences between, let's say, having a disaster recovery plan and running multi-region, with that being your disaster recovery?

I think that is an excellent option if you have the budget for it. You're looking at a couple of cost multipliers there. One is having dual-homed infrastructure. You're going to have twice the infrastructure. Yes, you might have the load spread across both of those regions, but you're going to have a baseline cost for having that infrastructure in place, even if the load's lower. You are going to experience higher costs. If you have the expertise to do it, that's also great, but it may be hard to find.

I think the harder thing there is actually bifurcating those services. Let's say that you have some geo-based DNS resolution that routes people to the closest region. That's fantastic. Now, if I get to a service that's storing data in Postgres in US West and I also have US East, it gets very difficult to synchronize across those Postgres instances.

I don't have multi-master, multi-primary Postgres. You have to start looking at things like how do I configure the WAL (write-ahead log) to replicate to this other region and make sure they're replicating in both directions, which can start to get a bit heady. That is a choice, but it does require a lot of effort to get there. It is a choice that might be necessary in some of the more high-uptime environments.

Assuming cost was an issue, would you actually recommend that over disaster recovery or do you think purely having a disaster recovery plan is better?

I think it depends on the stage of the business and what their uptime requirements are. If you are a hospital, you might need to make sure your systems are online all the time and actually do some dual-homed or multi-region infrastructure where your load is spread across both of those, or maybe you're doing a hot failover. If you are less critical, maybe you're an e-commerce site but you don't want to miss out on Black Friday, maybe you can handle an hour's outage, or maybe you can't and you want to be able to bring up infrastructure to at least serve checkout quickly.

I think which decision you make there depends on the stage of the business and the importance of that business to the world. I think a lot of organizations can go a long way using IaC to codify their entire environment and practicing standing it up a few times to make sure that everything works, so that when the day comes when it has to work, it does. But it comes down to the business.

Platform Engineering

How do you see platform engineering, particularly things like Massdriver, helping teams accomplish this and build towards disaster recovery scenarios, and making sure that they maintain uptime even when these things happen?

I'd say if you had asked me this question a year ago, I'd have had a very different answer. But seeing how IaC is still a concept that people want to try, while many organizations aren't adopting it or putting as much time into it, I think one of the biggest features is the fact that a lot of the IaC is already done for you. By grabbing something out of our marketplace, connecting it together and getting it deployed, you've passed the first hump that 73% of companies haven't passed. You've got infrastructure as code.

More importantly, we have built the concept of environment parity into the platform. Our environment parity is very agnostic about where it's replicating infrastructure. It could be staging and production, or it could be production US and production EU for fully isolated environments if you have concerns about data sovereignty, or it could be replicating your disaster recovery environment: production has gone down, and I want to replicate that over to US West or US Central.

Now, you still have the same problems as you did with doing IaC on your own. You do need to practice these disaster recovery scenarios to make sure that the environment you're rolling out to has the services. You do still have to design a backup strategy to make sure that you're getting your data over there within whatever time constraint you have on your data: can I live with an hour of missing data, five minutes of missing data, and so on.
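That time constraint is commonly called a recovery point objective (RPO). A minimal sketch of checking it in Python follows; the function names and the timestamps are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone


def data_loss_window(last_replicated_at: datetime, outage_at: datetime) -> timedelta:
    """Data written after the last replicated backup is what would be missing."""
    return max(outage_at - last_replicated_at, timedelta(0))


def meets_rpo(last_replicated_at: datetime, outage_at: datetime, rpo: timedelta) -> bool:
    """True if the amount of missing data stays within the allowed window."""
    return data_loss_window(last_replicated_at, outage_at) <= rpo


# Made-up scenario: the last backup reached the other region 20 minutes
# before the primary region went down.
outage = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_copy = outage - timedelta(minutes=20)

ok_for_one_hour = meets_rpo(last_copy, outage, timedelta(hours=1))
ok_for_five_minutes = meets_rpo(last_copy, outage, timedelta(minutes=5))
```

In this scenario a 20-minute gap satisfies a one-hour constraint but not a five-minute one, so a team with the tighter requirement would need to replicate more often or move to continuous replication.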

It also seems like the whole idea of backups and availability can be built into platforms like this. Not only do you enable teams to replicate the infrastructure, but you can also build in a lot of this expertise for backups and recoveries.

Backup and recovery is something that I think is key to a good platform engineering posture. It's one of those things that I feel like people don't talk about a lot. If you're in a larger organization or you've come from a data center, you think about backup and recovery. A lot of times in the cloud, we think about snapshots. RDS is snapshotting it. DynamoDB snapshots itself. But actually having good backup strategies and recovery strategies is absolutely critical for business continuity.

Speaking of disaster recovery, I'm going to try to recover from this disaster of an interview by saying thanks and goodbye.

Featured Guest

Cory O'Daniel

CEO & Co-Founder at Massdriver