Guest Host: Kelsey Hightower - Beyond Pipelines: Infrastructure As Data

Is your Git repo really the source of truth for infrastructure - or just a suggestion?

Guest host Kelsey Hightower sits down with Cory O’Daniel to unpack why many teams hit dead ends with CI/CD for provisioning, where GitOps struggles with drift, and when TicketOps helps or hurts. They explore a different model: infrastructure as data with typed contracts, shared artifacts, and workflows that embed policy, validation, and upgrades from the start. You’ll hear practical ways to reduce cognitive load for developers while giving operations reliable control and better day‑2 levers.

You’ll learn:

Why pipelines are a poor fit for infra provisioning and what to do instead
How to reason about drift as a three‑way merge with reality
When reconciliation helps, and when it breaks production firefights
How typed contracts and artifacts connect modules and teams without glue scripts
Ways to present safer self‑service without requiring everyone to learn Terraform
A simple mental model for treating TicketOps as a surface, not the workflow

Guest Host: Kelsey Hightower

Kelsey has worn every hat possible throughout his career in tech and enjoys leadership roles focused on making things happen and shipping software. Prior to his retirement, he was a Distinguished Engineer at Google, where he worked on Google Cloud Platform. He is a strong open source advocate with a focus on building great software as well as great communities around them. He is also an accomplished author and keynote speaker with a knack for demystifying complex topics, doing live demos and enabling others to succeed. When he is not writing code, you can catch him giving technical workshops covering everything from programming to system administration.

Guest: Cory O'Daniel, CEO and Co-Founder of Massdriver and Co-Founder of OpenTofu

Cory has been a software architect and engineer for 20 years, leading up to the founding of Massdriver. He's also a husband and the father of two kids.

Cory O'Daniel, X

Cory O'Daniel, Medium

Links to interesting things from this episode:

"Gitopscracy" video

Transcript

Cory: 00:00:14

Welcome to part three of our guest host, Kelsey Hightower, interviewing me about what I do during my day job at Massdriver and some of my maybe extreme opinions about infrastructure as code. The first two episodes were about cutting cognitive load and giving developers a real understanding of what's running in prod.

This one's about what actually holds under pressure. We dig into where CI/CD breaks down for infrastructure, the reality of drift and GitOps, and why TicketOps is just one interface, not the workflow.

Then we get into what better models look like. Infrastructure as Data, typed contracts between components and bundles that bake policy checks and upgrades right into the flow. Let's jump in.

No one's ever super happy. Nobody's like, "Man, this CI/CD stuff was super fantastic."

Like, devs weren't even really excited about it when it was just running their tests, right? Like the CI part: do we want to integrate the software with the rest of the software? Does it test good? Is the linting good? Does the code look good?

Like, it was already a painful place and then the delivery did turn into deployment. For a lot of organizations, it's like, oh, GitHub Actions deploys my thing. It's not putting something into some sort of an artifact registry, right.

And so I think we misused it a bit. Why I don't think it's necessarily great for infrastructure is you lose a lot of details when your Terraform is executing there, right?

And you see this solved by many tools. Some that I think are great, some that are competitors that I've absolutely recommended potential customers of Massdriver go buy instead of us.

I'm not, you know, super prescriptive on, like, you have to use my thing. But we're getting feedback now in a comment. And so the comment's like, hey, this is the plan, this is what we're going to do.

And you're like, great, that looks great. You merge it, and it's like, half of it didn't work. It's like, oh, well, now I've got something broken in main. Cool, right?

So now it's like, okay, well, hopefully people aren't cutting things off of main right now, right? Cutting something broken off of main. And it's also this, like, choke point.

Like, if I've got my infrastructure managed and it's being provisioned from a GitHub Action, what environment is that? Right? Like, can I get a preview environment of a VPC and a Kubernetes cluster?

Probably not. That's expensive, right? So, okay, I want to test this thing out and see how it changes. Like is it just a plan?

Is that all I'm actually getting in my GitHub Action, just the plan? Or are you going to do an apply and show me, like, what would have happened? And it's like, well, if you applied it, what was it? Is that a net new Kubernetes cluster?

Because that's very different than the one that's running in Prod that may have had some other services. You know, deploying things with Argo, right? Like it's not a real representation of your world.

CI/CD, I think, as a place to provision infrastructure from, is just fundamentally not the right fit. I think that D should be delivering some infrastructure as code to a registry, and that's how Massdriver works.

Like we're a registry for the infrastructure as code modules, right?

And that allows us, as this comes in and becomes data to us, to do a lot of really interesting things and expose a lot of interesting points for you to kind of hook into as an operations team. But you're just losing that data in the logs of GitHub Actions or in the logs of Jenkins, right?

It's funny to think about like the transition from like the three pillars of like observability, sorry observability monitoring to like OTEL and structured data that we're sending out. We all kind of agree like eh.

While structured data is a little bit of a pain in the ass, I will take structured logs or traces any day over just a dump of raw text. But what are we getting in GitHub Actions now?

You might have some structured output from your Terraform or whatever, but it's structured output that's just to a log someplace that you don't have access to, right? It's hard to analyze it, it's hard to get information from it, it's hard to glean things from it.

And you're always kind of like working on either an approximation of or maybe net new infrastructure or you're merging things that may be broken to main. I think it creates a lot more heartache than a system designed for executing infrastructure as code. So that's why I think CI/CD is a dead end.
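To make the point about structured plan data concrete, here is a minimal Python sketch. It assumes the plan has been exported as JSON (for example with `terraform show -json` on a saved plan file), where each entry under `resource_changes` carries an `address` and a list of `change.actions`. The sample document and the resource addresses in it are hypothetical and heavily trimmed; the idea is only that once the plan is data rather than log text, a question like "does this change replace anything?" becomes a few lines of code.

```python
import json
from collections import Counter

# Hypothetical, heavily trimmed plan JSON used only for illustration.
SAMPLE_PLAN_JSON = """
{
  "resource_changes": [
    {"address": "aws_sqs_queue.jobs", "change": {"actions": ["create"]}},
    {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_s3_bucket.assets", "change": {"actions": ["no-op"]}}
  ]
}
"""


def summarize_plan(plan: dict) -> tuple[dict, list[str]]:
    """Count planned actions and list the resources that would be replaced."""
    counts: Counter = Counter()
    replaced: list[str] = []
    for change in plan.get("resource_changes", []):
        actions = change["change"]["actions"]
        # A delete paired with a create means the resource is destroyed and rebuilt.
        if "delete" in actions and "create" in actions:
            counts["replace"] += 1
            replaced.append(change["address"])
        else:
            for action in actions:
                counts[action] += 1
    return dict(counts), replaced


plan = json.loads(SAMPLE_PLAN_JSON)
counts, replaced = summarize_plan(plan)
print(counts)
print("would be replaced:", replaced)
```

A summary like this could be stored alongside the module version in a registry and queried later, instead of being scrolled past in a CI log.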

Kelsey: 00:04:13

You know what's dope about that? A lot of people that are using CI/CD in the way you described, they're shell scripting again. Right.

They're just writing a bigger shell script with a bigger scripting execution framework. It has become their shell, and the pipeline has become their script. And I think about what made me super excited about Kubernetes. A lot of people didn't understand why I was so excited, right?

This is someone who used to work at Puppet Labs, used to work at, you know, worked at CoreOS, and had been a system administrator years prior.

When I saw Kubernetes, it's like, oh, I get it now. You finally have a better last-mile technology that says: if you give me what you would like your infrastructure to look like, let me be responsible for doing the deployment part. I will figure out how to get a container on a server and, if the server goes away, how to put the container back.

So that means your pipeline then becomes even more concrete. The artifact is maybe for some people the Docker image. But for a lot of people it's going to be those manifests. Those manifests become inputs.

And I think some people learned early on it was in your best interest to maybe resolve the manifest. Maybe that meant taking a container ID and putting it in a deployment object.

But then the object was the final artifact and some people stored it somewhere and then Kubernetes may watch to say, wow, there's a new artifact. And now I'm going to resolve the state of the world. But that only works for one cluster.

And maybe that gets us to the next kind of maybe dead end that you talked about, which is GitOps, right? Trying to describe your entire infrastructure using a YAML file delivered via git also becomes more of a kind of a dead end in your mind.

So CI/CD, we talked through how that's kind of a dead end. Trying to use that as a replacement for your shell scripts. But then GitOps falls short in some other places.

What does that look like on the GitOps side?

Because you think you're doing something now where you've got all these YAML files describing the world, but then you're still falling short somewhere.

Cory: 00:06:17

Yeah, I'll have to link in the show notes to my. Have you seen my Gitopscracy video? Did you see that?

Kelsey: 00:06:23

I did not see the Gitopscracy video.

Cory: 00:06:25

I took the executive scene from Idiocracy and I redubbed it all about GitOps. But it's funny because, like, I have a blog post I'm working on on this now. I've talked about GitOps not being the right solution in the past.

But the thing that's interesting with it is developers didn't want it. And I'd venture to say at your organization right now, if you're doing it, your developers didn't want it.

Like, GitOps is a choice that Ops has made. Ops made a decision and said, "This is what you're going to do, developers." Now, I have to say, separating the idea of deploying your application from your infrastructure, I think that it can absolutely make sense for deploying applications. But as far as, like, managing a database? What do I mean by "you made this decision for them"? Well, here's how I see a lot of teams get to this.

The reason they get to GitOps is they're like, ooh, I can have some reconciliation. Why do we want reconciliation? Because there was drift. Why was there drift? Because the thing that you gave developers was too hard to use.

Like, let's be real about it, like, the drift comes from a few different places.

Kelsey: 00:07:27

Describe drift for people.

Cory: 00:07:29

Yeah, so when your view of the world in your infrastructure as code does not match what is actually in the cloud, you have some drift. And drift comes from one of three places. It comes from just not knowing how to design IaC for autoscaling.

That's not your developer's problem, that's your problem. Ops team, sorry, that one's on you. So we'll just go ahead and knock that one off.
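The definition of drift above can be sketched in a few lines of Python. This is only an illustration: the field names and values are hypothetical, and a real tool would fetch `actual` from the cloud provider's API rather than hard-code it.

```python
def detect_drift(declared: dict, actual: dict) -> dict:
    """Return {field: (declared_value, actual_value)} for declared fields that differ.

    Only fields the IaC declares are compared; anything else the provider
    reports is ignored here.
    """
    drift = {}
    for field, wanted in declared.items():
        found = actual.get(field)
        if found != wanted:
            drift[field] = (wanted, found)
    return drift


# Hypothetical example: someone resized the database in the console.
declared = {"instance_class": "db.t3.medium", "multi_az": True}
actual = {"instance_class": "db.t3.large", "multi_az": True, "backup_retention_days": 7}

print(detect_drift(declared, actual))
```

Note that `backup_retention_days` is not reported, because the declared config never mentions it. Detection like this says what changed, not why someone found it easier to change it outside the process.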

The other two problems do come from your developers and they're both user experience problems. One is it's just too damn hard. And that's the one I'm going to like tear on in a second.

But it segues really well into the other one, where I see people say, well, drift happens because there's outages, and at 2am it's just easier to go into the AWS console and troubleshoot and fix something than to go through our process. And it's just like, wait, sorry, can you say that to me one more time? Your process is so hard that in an emergency you can't use it.

Like, that is also a user experience problem. Now, if you self-host Git and Git's down, great. Okay, that's an anomaly. That's not like the average 2am outage, right?

And so going back to, like, it's just too hard. Right? Everybody kind of starts out in the same place when they start getting into IaC.

Like you grab Terraform or whichever tool and you build a little thing, you're like, oh, that's pretty cool. It actually made a queue, right? And it's very easy for the Ops team to kind of go, okay, I get a request from ServiceNow.

Developer wants this, let me type them some Terraform. And it gets to the point where you're like, okay, this is actually, this is a lot of work for me.

Like, developers need to be doing this Terraform thing. Now we fall immediately into that problem we talked about earlier: the operational production knowledge and context that they're missing.

But then we hand them the Terraform.

So they're faced with that. They're faced with, like, trying to find where all this stuff is in your Git, where the modules are, where we keep all of the actions that we're supposed to use, right? A lot of decisions. And so the developer goes, well, I mean, I do have production access to the AWS console.

Like, this is a pain in the ass and I have a deadline like this, this right here is literally a waste of time for me. You're asking me to go through all of this. My PM's asking me to go do this thing that makes the company money and I have to have it done by Friday.

I have kids, so I'm not working past 5pm tonight. So guess what, Ops team, like you're going to fix my drift anyway. I'm going to go do it in AWS console.

And Ops goes, man, I really hate it when they do that. Let's do a drift detection tool. And you're like, oh shit, they're just drifting all the time. That tells you something you don't need to know.

And you ping developers, you're like, I'll send them the drift notification after they drift. It's like, okay, your tool that you gave me was, was hard to use and now it's annoying. I know I drifted. That's your problem now, Ops.

I got to get something out by Friday. Now that might not be in the spirit of DevOps and collaboration, but like, there's also business that has to be done, right?

And so the reaction is, okay, drift detection isn't enough. Drift alerting just kind of annoys people because they know that they're drifting in the first place. We got to get to reconciliation.

If they change something, we're going to change it right back. And that's what reconciliation is like. That's how we got to GitOps for infrastructure is devs are changing something in AWS.

And again, I'm being very distinct that I'm not talking about deploying your apps, I'm talking about managing infrastructure as code. Devs are doing stuff in AWS, we don't like it. It's creating drift, it's causing chaos. So we're going to reconcile and just switch it back.

Okay, then you have made the decision that they have to do GitOps by saying, we need reconciliation to solve drift. You've said, developers, you found a way around, you found the easiest path and we don't like it.

So I think your challenge is really if you've reached for reconciliation or drift detection. Drift detection might be a good indicator that, like, you're not doing a great job as an ops person presenting an interface to your developers.

But I think the solution is not saying, ah, you did it wrong. How dare you? The solution is how do I make this thing easier for you to use? Right?

And so I think a lot of the stuff that we have in the space, like, it's so much easier for us to be like, oh, I'm just gonna slap the hands and reset developers. That's great. Except for it's that 2am thing again. Prod's down.

Oh, gosh, this GitOps thing, and trying to figure all this shit out, and Terraform's too complicated. I'm just gonna go fix it in the AWS console. Production's back up, I'm going to bed. I lay down, it reconciles, switches back to the broken config. That sucks.

How do I turn off reconciliation at 2:00am? I'm going to page you, the ops person, and say, "Hey, I can't figure this thing out. Every time it reconciles, it breaks prod again." Right?

So how do we make these systems easier? How do we make the layer right? I think the silos are fine. The silos held our expertise, but how do we make them malleable?

How do we make them where we can actually interface with each other in these knowledge silos that we have, without putting that burden on a developer and then telling them what tool and what approach they have to use?

Kelsey: 00:12:00

When I was at GCP, we thought about this problem too. And you know, one thing: when Kubernetes came out, I remember saying something like, we are now at infrastructure as data.

And when you start to think about infrastructure as data, then we can talk about, well, who owns what field?

When you say infrastructure as code, I think that's when people get nervous about changing something because of the number of side effects that could occur. Yeah, right. Because you change a for loop to an if statement, it changes everything about that program.

And so infrastructure as code has always been seen as this fragile black box.

And you have to go through these control points because of what could happen due to side effects. When you get into infrastructure as data, the thing that needs to change, and hasn't changed yet, is that the cloud providers don't let you see the model. You're interacting with random APIs that also have side effects on the backend. Because the truth is your git repo is nothing.

It's not authoritative for anything. At best it's a suggestion.

And the reason why I say this is, and you know this well, if in your git repo you have a YAML file, or an HCL file, that says these are the 38 options for this VM. Well, I'm the cloud provider and I have 108 options. What is the value for the other 70?

You have not defined them and I'm free to change them from release to release. So even your thing is drifted.

Unless you specify every possible parameter in that git repository for every version of my API, then you're already drifting. It's just that you don't have a way to tell that you've drifted.

So what you're trying to do is say, basically the way I like to think about it is every time you push these configs, you're doing a three-way merge with reality.
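One way to picture the "three-way merge with reality" is the sketch below. It is a simplified Python illustration, not how any particular tool works: `base` stands for the config as last applied, `desired` for what the repo says now, and `actual` for what the provider reports. All field names are hypothetical, and resolving conflicts in favor of the repo is an arbitrary policy chosen for the example.

```python
def three_way_merge(base: dict, desired: dict, actual: dict) -> tuple[dict, list[str]]:
    """Merge repo intent with provider reality, relative to the last applied config.

    A missing key is treated as None. Returns (merged, conflicts).
    """
    merged: dict = {}
    conflicts: list[str] = []
    for key in sorted(set(base) | set(desired) | set(actual)):
        b, d, a = base.get(key), desired.get(key), actual.get(key)
        if d == a:
            value = d  # repo and reality already agree
        elif d == b:
            value = a  # only reality changed (console edit or provider default)
        elif a == b:
            value = d  # only the repo changed, so apply the new intent
        else:
            conflicts.append(key)  # both changed, differently
            value = d  # assumption for this sketch: the repo wins conflicts
        if value is not None:
            merged[key] = value
    return merged, conflicts


base = {"machine_type": "e2-small", "disk_gb": 50}
desired = {"machine_type": "e2-medium", "disk_gb": 50}
# Reality: someone grew the disk by hand, and the provider reports a
# setting the repo never mentioned.
actual = {"machine_type": "e2-small", "disk_gb": 100, "shielded_vm": True}

merged, conflicts = three_way_merge(base, desired, actual)
print(merged)
print(conflicts)
```

The `shielded_vm` field plays the role of one of "the other 70" options: the repo never defined it, so the only place its value can come from is reality.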

Cory: 00:13:54

That's it.

Kelsey: 00:13:55

Yeah, that's all you're ever doing. So this idea that Git is authoritative is false.

And if you believe that it's false, then what needs to happen going forward is that one thing a cloud provider could do.

If my developer goes to the UI, and we agree that, call it, a git repository is authoritative, then you'd allow them to say: if you use the AWS console, what you're really doing is updating the YAML file in our git repository, doing the three-way merge, and submitting it together to AWS. So what happens then is you're really editing our source of truth and then presenting it as a whole.

And I think that's what Massdriver is trying to do at the front end, right? Saying, you know what, there's only one source of truth, one state of the world, and it's always this.

So Ops, here's your contract. Dev, here's your contract. And we're always going to merge and then give it to the cloud provider. The moment someone uses a different interface that's not tied into the whole, that equals drift. That's the way I like to think about and simplify it.

So in theory there should be no issue with someone going to the AWS console and doing something. But the only reason why it is an issue is because it is the only authoritative API that we have and everything else is secondary.

So I think once people break that paradigm and get rid of the illusion that their git repository is authoritative for anything, they'll be a little bit more open to say infrastructure is data which can be manipulated by multiple surfaces. How do we then merge them back?

So in my world it would be there is no release on our side, but we've noticed a change upstream and we're going to merge it back. That way there's no drift. So that way five surfaces can interact all at the same time and last one wins.

You have an audit log to figure out what happened and who did what where, and then you can go have a discussion whether those surfaces should be allowed to manipulate the state of the world. So I think that is the last mile to all of this is to get all the surfaces to agree that there's a shared data model and we all get to manipulate it.
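The "shared data model, many surfaces, last one wins, audit log" idea can be sketched roughly as follows. This is a toy in-memory Python model with hypothetical surface names, actors, and fields; a real system would persist the state and also push changes to the cloud provider.

```python
from dataclasses import dataclass, field


@dataclass
class SharedModel:
    """One state of the world that several surfaces are allowed to edit."""

    state: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def apply(self, surface: str, actor: str, changes: dict) -> None:
        """Apply changes from a surface. The last write wins; every write is logged."""
        for key, new_value in changes.items():
            self.audit_log.append(
                {
                    "surface": surface,
                    "actor": actor,
                    "field": key,
                    "old": self.state.get(key),
                    "new": new_value,
                }
            )
            self.state[key] = new_value

    def history(self, key: str) -> list:
        """Who changed this field, from which surface, in order."""
        return [entry for entry in self.audit_log if entry["field"] == key]


model = SharedModel()
model.apply("git", "ops-pipeline", {"replicas": 3, "disk_gb": 50})
model.apply("console", "dev-on-call", {"disk_gb": 100})
model.apply("ticket", "pm-release", {"version": "1.4.2"})

print(model.state)
for entry in model.history("disk_gb"):
    print(entry["surface"], entry["actor"], entry["old"], "->", entry["new"])
```

Because every surface writes through the same model, the console edit is not drift. It is just another logged change that the team can later decide to allow or disallow.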

Cory: 00:16:06

Yeah, I think the third dead end, I will say, and I'm going to put a caveat on this one because I think it's. I think it's important. I don't want to like panic anybody. This one's going to sound the most reasonable to be a dead end.

But I'm also going to advocate for it not being a dead end just for fun. And that's TicketOps. Now, when is it not a dead end? When it works, right?

Like if you have relatively stable infrastructure and changes aren't happening that often, you might not need all the bullshit that I've just said. And TicketOps is fine. Right.

And I think that's one of the things that sucks so bad about, kind of, talking about the vendors in the space, and how some of them, I think, are pretty disingenuous. There is not one way or tool or process for doing all of this stuff.

And TicketOps might work absolutely well for a team that has relatively stable infrastructure and is not introducing a ton of changes. But if you are to the point where you're like, man, like, all my job is clicking approve in ServiceNow, that's a dead end.

I don't know what the end of your day feels like. You must need a real cig. Like, I don't know, that feels rough. Right?

And so the idea of self-service through requesting things from others is an absolute dead idea at scale. Like you will never scale that team.

And I think that if you survey a lot of those teams what you're going to see is like... and we work with a lot of those teams... this is where this is coming from. You still see like that, I have to ask you for this and then I have to wait on you.

That's where the animosity started.

At the very, very beginning, before we had DevOps, I was tired of waiting on these ops people, and the ops people were tired of me asking to change stuff. TicketOps, and going through the approval workflow where these blessed people over here must approve what I do, is where that animosity came from that started the movement.

And so I think that TicketOps is one of those things that can be absolutely fine at low scale, but it is the thing that caused all of this in the first place. And for a lot of orgs, like, we've just done a full circle and gotten all the way back to the beginning. We're like, hey, we solved it.

If you need some stuff, you just ask us in ServiceNow and we'll do it for you. And it's like, that is literally the way it worked 20 years ago, except we hit you up on IRC instead of ServiceNow or whatever. Right.

And I'm sorry, not to call them out in particular. I think there's definitely some valid places for that. But like, if that is your provisioning flow, that's a dead end.

And you're not going to get to this point where you have harmony between the teams. You're not going to get to a point where you can reason and understand and query your environment.

You're not going to be dynamic and nimble in the cloud. You're going to be stuck doing grunt work. So I think that's the other one that is the most dead end and also acceptable at small scale to me.

Kelsey: 00:18:57

So I'm going to challenge you. I think it's a dead end if you don't have all the other things too, because again, I'm going back to: it's just another surface.

So let's say you're dealing with people who have no interest in anything about the infrastructure. It's like when I buy an airline ticket, they don't show me anything else about what it takes to fly the plane and scheduling.

It's just like, look, you want a ticket, here's what's available. And I remember building a TicketOps system that did scale really well. When I worked in financial services, we had Puppet, we had all this stuff.

And I was like, you know what? For some people, they love the abstractions we built around Puppet.

We built, and every company, every administrator goes through this, the environment generation tool. Right. Like, what software do you need? And then create. Everyone's done that. I've done that multiple times. Oh, yeah.

And so look, that got us really, really far with like the engineering and technical teams. Right. It was great. But then there was another class of user which was like the product managers that were releasing software.

All they wanted to do was say, I heard QA tested something and it's ready for customers and I want to decide when it goes to prod.

And so what we would do with the TicketOps was they would open a ticket, and in that ticket I would write custom fields for, like, Jira, where it would only show you the software that has been signed and tested by QA. The PM would take that dropdown, pick a version that they liked, and maybe there were some other things that were for the customer.

Like, hey, credit card services needed to be there. And they hit save. Now, at this point, we know that everything in that dropdown is valid.

So the moment they hit save, if it was for production, they wanted to sign off to say, this person can do this.

They still work here, they haven't been fired and, you know, the thing hasn't been removed, or we're doing this during the change window, which is the only time we allowed that. And so that approval was for saying there is someone that just wanted to do it. But once you hit approve, everything went through Puppet.

We had the report, we had all the things, but for that person, their surface into our world was TicketOps. Yeah. And I think if you're going to do TicketOps, it almost has to be seen as an additional interface.

So if you were using Massdriver, I could imagine someone saying, I'm going to build a ServiceNow integration to Massdriver. Right. So maybe I don't want you looking at Massdriver or it's overwhelming for you.

You can say something like, what is the most expensive stack we have? Open a ticket to request this information, and then we query Massdriver, get that information, and respond in a ticket automatically.

So I think there's a way to do TicketOps, but it has to be more of a surface than the workflow. Right. Because I think what you're talking about is when people take TicketOps and say, all right, I got another thing to do today.

Read the ticket, start running commands, maybe even doing ClickOps. I'm done now. Ticket closed. And it's like, oh, no, this cannot be your workflow engine, because it's just not rich enough.

Cory: 00:22:01

Yeah. If you go through the efforts of building it into automation, I think it absolutely makes sense. Right.

And if you have a production environment that is extremely, like highly sensitive information or like critical uptime. Yeah. Like having an additional, like, separation of duty, like, make sure this is approved before going through.

What I meant is the back and forth kind of stuff where it's like, hey, I need a database. It's just like, I have a big text field that just goes on forever. So I'm like, I need a database. And then I get a comment like 24 hours later.

It's like, yeah, which one? Is it, like, Postgres? What version? Like, I don't know what to put in here. It's not structured. Now, if you can start putting in fields.

And I say, hey, I need all these things.

But again, this is where I think you start to get into some of the other stuff that's like, you did a lot of effort and then you just kind of fell back on a band aid. Right. It's like, okay, let's say that we go in and we say, okay, you know what? People are asking for databases all the time. I'm the DBA.

I make a lot of decisions around like, SLAs of these databases, the alerts, et cetera. I know the 10 fields to put on this thing, right? Okay, cool. How do I do that? I go in there.

Maybe Jira or whatever has the ability for me to, like, add these fields. And now developers are like, this is great. I can just go in here. Like, I don't have to go back and forth. Boom, I'm in there.

It takes five minutes to fill out this field or this form. This is great. I hit submit. But then on the other side, what are you doing?

If it's just a log, like in your thing, you're like, okay, now I got to take this and I got to go copy and paste it into, you know, a GitHub repo someplace to, like, provision this Terraform module that we've built. It's like, you're right back into the other two things that I think are dead ends, but also, like, you're the glue.

Like, and I refer to this as a meat gate. Like, it's just like this piece of meat that you have to pass through to like, get into production. And it's like, if you can take that.

I love this idea. If you can take that, those 10 fields and you can automate and provision that stuff. Like, that feels pretty good.

And I would argue that if your developers are able to go in and do that, and then if they want to make a change to that database, it has the same data there (and I feel like in a lot of these systems that might not be the case), then that feels pretty good.

And that actually feels fairly close to what is going on inside of Massdriver.

But I think the key there is, again, going back to: you can do this without necessarily buying something from somebody, by designing your infrastructure as code around the constraints of the developer.

Kelsey: 00:24:23

Right.

Cory: 00:24:23

I shouldn't be asking a developer how many zones they have. Is this a production database? What's the sla? I'm going to make the decision on that because we have cost constraints that we have to stay in.

Kelsey: 00:24:30

Right.

Cory: 00:24:31

How much data do you have in there? Like what is the expected like data like the growth of the volume, like what are we expecting here? Right.

That's going to help me figure out how to like auto scale and handle disk.

Kelsey: 00:24:41

Right.

Cory: 00:24:41

So there's questions that I will ask if I'm in a very TicketOps scenario where somebody's like, hey, I need this. There's a back and forth before I give them an answer for a config.

And if you can take that survey, those questions that you ask as the ops person trying to kind of qualify what this developer wants, and you can make that their interface, you've hit that 80%. Your developers are going, oh, this is sweet.

Like, these are all questions that I know. They're not asking me about zones and, like, what type of encryption. Like, I don't care, right? Just ask me the questions that you would ask me in Slack if we're going back and forth for you to go create the Terraform module.

Put that in a form. Like, you're going to reach that place where developers can self-serve now. And now you can say, okay, great.
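The idea of turning the ops team's qualifying questions into the developer's interface can be sketched like this. It is a hypothetical Python example: the form fields, the validation rules, and every value the ops side derives (storage headroom, backup retention, and so on) are assumptions made up for illustration, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseRequest:
    """The questions a developer can actually answer."""

    service_name: str
    environment: str  # "staging" or "production"
    expected_data_gb: int
    expected_monthly_growth_gb: int

    def validate(self) -> list[str]:
        errors = []
        if self.environment not in ("staging", "production"):
            errors.append("environment must be 'staging' or 'production'")
        if self.expected_data_gb <= 0:
            errors.append("expected_data_gb must be positive")
        if self.expected_monthly_growth_gb < 0:
            errors.append("expected_monthly_growth_gb cannot be negative")
        return errors


def derive_config(request: DatabaseRequest) -> dict:
    """Ops-owned decisions: zones, encryption, backups, and disk headroom."""
    errors = request.validate()
    if errors:
        raise ValueError("; ".join(errors))
    is_prod = request.environment == "production"
    # Assumed policy: provision enough disk for a year of expected growth.
    storage_gb = request.expected_data_gb + 12 * request.expected_monthly_growth_gb
    return {
        "name": f"{request.service_name}-{request.environment}",
        "multi_az": is_prod,
        "encrypted": True,
        "allocated_storage_gb": storage_gb,
        "backup_retention_days": 30 if is_prod else 7,
    }


request = DatabaseRequest("billing", "production", 40, 5)
print(derive_config(request))
```

The developer never sees zones or encryption settings, and the resulting dictionary could feed module inputs directly, so nobody has to copy a ticket into a repo by hand.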

Now, what is our biggest challenge or debt as a DevOps or platform team? Because now we have breathing room. Or you say, hey, you know what? How do I add value to the business? Which is something that we don't get to do very often. Right.

But there are plenty of places that we could actively go add value to the business as DevOps teams. We're so reactive most of the time because we never get the time to figure out how to add value.

We're always reacting to what developers need or what customers need or what production needs because it's falling over.

Kelsey: 00:25:48

Right.

Cory: 00:25:49

And I think if you use TicketOps in that way, if you take that mentality about the approach of what you want to ask the developer, that becomes a lot easier. Now the hard part is there's 200 and something different resources in AWS, right?

And so I can go out there and I can stamp out those forms for the most common things. But when you start getting into like some of the other stuff, it's like, oh, how do I do it for like literally everything? What do I do for a bucket?

Like a bucket is the most.

Kelsey: 00:26:17

You're describing where I think a lot of people, some of the more advanced, some of the more mature teams have gotten. Yeah, they have some type of homegrown interface, right. I think internal developer platforms is the word of fashion these days.

And people are trying to take these generic frameworks and build on a large array of tools, right, for their apps, for their infrastructure, for all kinds of things. And the debate is still out, will that even pan out? Because sometimes things that are too generic become impossible to use.

But I think there's this part where if you're listening to this and you got into that 80% thing and you're thinking about your roadmap, I think there's two big things that come into play. All the things you're going to end up building that all great user interfaces have, right?

Validation, tooltips, visualizations, graphs, audit logs, audit trails, RBAC. Like, you haven't even gotten to permissions. Who can deploy what field where? We haven't even gotten into that.

And then you get into the other part, which I think is, I don't know if it's more important, but equally important, which is the ecosystem play. Every time a new thing you want to do comes out, do you really want to have your team building some custom automation, then building a matching UI in ServiceNow, gluing it all together?

And remember, there's a learning factor. So you got to spend the next three to six months learning. Did you even do that correctly?

Like the first time you learn that if you change this field on a database object in HCL, it deletes the database first and then recreates it. Like, you need that period to learn and destroy. And I think what you're kind of making this argument around is like, look, it's been 20 years.

If we take everything we've learned from TicketOps to ClickOps to self-serve, and to some degree what IDPs are presenting now, what would you build?

What would you build that takes advantage of the ecosystem, which I think was a brilliant choice by you all by saying, look, Terraform, I don't care what y'all like or dislike about it. It is the lingua franca of this space. The ecosystem is rich. At GCP, I remember we were building this thing called Config Connector.

Cory: 00:28:24

Oh yeah, I remember Config Connector.

Kelsey: 00:28:25

Yeah, it was a Kubernetes-based thing and there was this vision that we would have a way to use the Kubernetes API to manage all of GCP. And we realized, even at GCP, that team could not keep up with all the changes across just Google services. Forget all the other providers.

And I remember one time we were adding some new features. They were like, yeah, we were just going to be calling Terraform on the back end. I'm like, what? You're just calling Terraform?

And look, it made sense from an engineering perspective.

And I think since that is the reality, watching Massdriver layer this one component above it, and I think we should talk about that before we finish off, is like, what is the delta between... There are some really great Terraform modules out there.

People with great experience, they get the idea of UI. People are using various tools to do a little bit of input validation. There are some people who've taken this to its logical course, and these modules are good; they work well for their organizations.

They've even built wrapper modules to do a little bit of what you say, like, let's not present raw low-level Terraform modules, let's give people more usable wrapper modules with very few fields that they have to fill out. What's the delta between that and what Massdriver's trying to do?

Cory: 00:29:43

Yeah, so the delta of what you get, what you yield from it, is fairly significant. The delta to get started is not. One of the key things with Massdriver is we support... Well, actually, I'll talk about our extension model first because, again, I'm an ops engineer. I've been in this. I'm CEO of the company.

But almost every decision that we make is me and my co-founder, we're both Ops engineers, have been in a long time, we take a very operational view on that. So I mean even down to our pricing, everything else painfully so in some places.

But one of the things we kind of decided early on is our extension model is infrastructure as code (notice I didn't say a company name there) and Docker.

Our goal was if you wanted to change the way this platform worked, had to be tools that operations teams are already comfortable with. We didn't want anybody to have to retool.

That's kind of like the big goal: how do we scale this IaC adoption and your organization's operational knowledge without either of your teams having to learn something new? Going back to that low fidelity thing, developers already get a low fidelity graph.

They can click things that they know what the tools are, database or Postgres, whatever. Like they grab it and they drag it on. So for people who aren't familiar with this, we're a visual tool. You drag it out.

So you're really diagramming infrastructure. Now what happens under the hood? Well, I go to my sidebar, I search, hey, I need Postgres. I've got my app deployed already. Maybe I need Redis.

I'm gonna do some caching. I type Redis in the search box. And at Massdriver's heart, it's a registry that stores your IaC modules.

These IaC modules can talk to each other through an artifact system that's reminiscent of maybe CRDs, except you don't have to build the controller. And so, for you as an operations engineer, it starts with just taking an IaC module and publishing it to our registry.

And as soon as you publish it to our registry, what will happen is you can drag it out on this canvas and essentially all of your Terraform variables or your Helm values get turned into a form, and we have this crawl, walk, run approach. It's ugly at first. Why? Because you might have a thousand fields in your S3 bucket module. Right.

But as you kind of think about, okay, what do I want the developer experience to look like? Here you have a Massdriver YAML file where you can essentially say, okay, it's region, it's a string in Terraform.

But I want to present it as an enum. I want these three. Or maybe I want to pull that data from someplace else. You can actually pull data from other sources.

It could be outside of Massdriver; it could be a Helm chart pulling the IAM policy from a DynamoDB table, right?
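The idea of taking a free-form string variable and presenting it as a short list of choices can be sketched in Python. This is only an illustration under assumptions: `EnumField` and the three region names are hypothetical and do not reflect Massdriver's actual YAML format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EnumField:
    """A string variable presented to developers as a fixed set of choices."""

    name: str
    choices: Tuple[str, ...]

    def validate(self, value: str) -> str:
        # Reject anything outside the curated list before it reaches the IaC tool.
        if value not in self.choices:
            allowed = ", ".join(self.choices)
            raise ValueError(f"{self.name} must be one of {allowed}; got {value!r}")
        return value


# Hypothetical example: the module accepts any string,
# but the form only offers three regions.
region = EnumField(name="region", choices=("us-east-1", "us-west-2", "eu-west-1"))

print(region.validate("us-west-2"))
```

The underlying module is unchanged; only the surface shown to the developer is narrowed.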

And so what's cool is you start to get into this, like, synergy, for lack of a cool word, between your IaC, where you can actually start to codify a lot of the rules in between things. Codify some of the stuff that's harder, like security groups. It feels very chicken and eggy when you're doing stuff in Terraform.

But in Massdriver, you just take this module and you say, hey, you know what? It produces a database. And this database produces a security group or firewall rule or whatever.

And we have this artifact system that you can emit stuff to. And so, like, that's the delta in what you have to change for Massdriver: it's this artifact system.

So you take your module as it is, you put it in, it may or may not look ugly, depending on how many fields you have.

You can kind of fine-tune it with this config: you can order things, reorder things, you can mark fields immutable, make that region immutable for your clusters, right? You can kind of put this additional layer on top of it. And then we have this artifact system.
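Marking a field immutable amounts to rejecting any change to it after the first deployment. A minimal sketch, assuming the deployed and requested settings are plain dictionaries; `immutable_violations` is a hypothetical helper, not a Massdriver API.

```python
from typing import Any, Dict, List, Set


def immutable_violations(
    current: Dict[str, Any], proposed: Dict[str, Any], immutable: Set[str]
) -> List[str]:
    """Return the immutable fields whose deployed value would change."""
    return sorted(
        name
        for name in immutable
        if name in current and proposed.get(name) != current[name]
    )


deployed = {"region": "us-west-2", "node_count": 3}
requested = {"region": "eu-west-1", "node_count": 5}

# Only "region" is locked after the first deployment in this example.
print(immutable_violations(deployed, requested, immutable={"region"}))
```

A form could run a check like this before any plan is generated, so the developer sees the problem as a validation message rather than a destroyed cluster.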

And the artifact system, if you're familiar with Terraform, it feels very much like the outputs in Terraform, except for they work between IaC tools.

And if you don't use Terraform, it gives you something very similar to Terraform's output system in tools like Helm, where you can say, hey, this is a Helm chart. It has 30 manifests in it. But what's most important is it's the billing API.

And so here's where you can find the billing API and here's the mTLS certificate for our side of that relationship. And so now what's cool is when I'm saying, hey, checkout needs to access the billing API.

I just draw a line from the billing API to my checkout app and that contract system says, hey, you know what, I'm going to push the host name into the billing host variable of your Helm chart for the checkout API. And I see this mTLS certificate. We'll be able to do some cool mTLS stuff for you in the middle, right?
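The contract described here, where one module publishes a typed artifact and a connection pushes its fields into another module's inputs, can be sketched roughly as follows. All names (`ApiArtifact`, `HelmRelease`, `connect`, `billingHost`, the hostname) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class ApiArtifact:
    """What a producer publishes: where the API lives and its certificate."""

    hostname: str
    mtls_certificate: str


@dataclass
class HelmRelease:
    """A consumer whose chart values are filled in by connections."""

    name: str
    values: Dict[str, str] = field(default_factory=dict)


def connect(artifact: ApiArtifact, target: HelmRelease, host_variable: str) -> None:
    """Drawing a line: copy the producer's hostname into the consumer's values."""
    target.values[host_variable] = artifact.hostname


# Placeholder values; a real artifact would carry an actual certificate.
billing = ApiArtifact(hostname="billing.internal.example", mtls_certificate="<PEM placeholder>")
checkout = HelmRelease(name="checkout")

connect(billing, checkout, host_variable="billingHost")
print(checkout.values)
```

Because the artifact has a fixed shape, the consumer never has to know which IaC tool produced it.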

And so it's very easy to kind of get started with it. But then the artifact system is where essentially is like the net new thing that you're doing.

But what's really cool is how do you do these artifacts in Terraform?

It's just a resource, so you can just bring it in. For things without a runtime, like a Helm chart, you just drop a jq file in there and it can parse a jq expression describing what you want to pull out of the Helm file. Works for Bicep, CloudFormation, et cetera. And so that's the first kind of delta there: it's very easy to get started.

Now the second delta, the second thing that's different, and I think this is one of the things that's pretty fundamentally different about Massdriver compared to the rest of the tools in the space, is, A, we're a registry. So we're actually storing this stuff in OCI; we feel like an npm or whatnot for Terraform.

The second thing is you as the person who's working on the VPC module or the database module, you're probably the expert or your team is the expert. You know what fields developers should and shouldn't set. You know what workflows they should be running.

So one of the things that's pretty key in Massdriver, and we're going to be open sourcing this very soon, is this thing that we call a bundle. And what a bundle is, is it's your infrastructure as code and maybe multiple infrastructure as code tools. Quick example, Helm.

I still need an IAM role for my app to run, so I might build that in Terraform and then run a Helm chart. Massdriver can actually compact that into one thing.

But what it can also compact into it is, A, the user experience, the UI, and B, and this is the part that's most important, your workflows. So what's really cool with Massdriver is I sit down, I have a Terraform module. I say, you know, mass bundle import, and it brings in my Terraform module.

I see this Massdriver YAML file get dropped into my module. And what I mean by module is, like, your shared module or a child module in Terraform parlance, not a root module.

There are no root modules in Massdriver. That's one of the biggest deltas. And, like, what you get: there are no root modules, there are no pipelines.

Kelsey: 00:35:37

What's the problem with root modules?

Cory: 00:35:38

They're snowflakes, they're all over the place. Where is this database deployed? Well, I know where the shared module is, but I don't know what's calling it.

How do I figure out what's calling it? I have to go to GitHub and find everybody's main.tf, which may or may not be at the root of their directory. I don't know where it is, right?

Like, that is the first part that is problematic. The second part is if I'm a developer and I'm like, hey, I have a main.tf that describes my environment. This is right alongside my app.

I don't know what to put in the right side of these fields, right? So I'm right back into this place of like, okay, you gave me Postgres, but what do I put for, like, the AZs, right?

So with Massdriver there's just no root modules. And so where are the root modules? They're ephemeral. You really don't care about them.

What you really care about is there's this shared module over here that you want people to execute, and there's inputs, right?

And so when you're diagramming your infrastructure, what happens is when you hit deploy, we're essentially taking the workflows that you now define in your Terraform module. So in your module you say, hey, this is my VPC.

With my VPC, I want to run Snyk and I want to offer a plan option and maybe I'll run Checkov, and Checkov will hard fail if this is a production environment. And Massdriver's bundle system actually compacts that into one thing. So now I don't have Terraform and 17 GitHub Actions. I have a VPC, right?
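A rough sketch of workflow steps travelling with the module, so the pipeline is assembled at deploy time rather than maintained separately. `Step` and `build_pipeline` are hypothetical names, and treating Snyk as advisory is an assumption; only the "Checkov hard fails in production" rule and the plan option come from the conversation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    tool: str
    hard_fail: bool  # True: a failure stops the deployment


def build_pipeline(environment: str, plan_only: bool = False) -> List[Step]:
    """Assemble the steps for one deployment of a hypothetical VPC bundle."""
    steps = [
        # Assumption: Snyk findings are advisory in every environment.
        Step(tool="snyk", hard_fail=False),
        # Checkov findings only block the deployment in production.
        Step(tool="checkov", hard_fail=(environment == "production")),
        Step(tool="plan", hard_fail=True),
    ]
    if not plan_only:
        # The plan option stops before anything is changed.
        steps.append(Step(tool="apply", hard_fail=True))
    return steps


for step in build_pipeline("production"):
    print(step.tool, "blocks on failure" if step.hard_fail else "advisory")
```

Swapping a scanner then means editing this one definition and publishing a new module version, instead of asking every team to update its own pipeline.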

And so you as the developer now say, oh, I need a VPC and I need a database and I need this and I need that. And what happens when you deploy in Massdriver? We say, okay, well, we're a registry.

We have access to the container images that you've described for running Terraform or Snyk or whatever. We have access to your code. We essentially ephemerally build the CI/CD pipelines for doing the provisioning when that developer hits deploy.

And so now what's really cool is day two, I decide I want to go from Terraform to OpenTofu. I can go to that module, the VPC module. I can switch it from Terraform to OpenTofu. I can push it up.

I don't have to ping anybody about a new workflow. When developers upgrade to the next version, all the workflows are upgraded, right?

So, like, you're actually able to take your IaC and put it in lockstep with, like, all the tooling you'd have in your workflow. Scanners, security scanners, compliance scanners, et cetera. And now you get to change it in one place. You don't have developers.

That's one of the things that really sucks: even if your team's done this all very, very well, as a developer, I still have to find all these pieces and piece them together in my environment. And if I miss the Snyk one for whatever reason, we're deploying stuff that hasn't been compliance or security scanned. Oops. Like, that sucks.

Like, it's very easy for me to sidestep security and compliance by just leaving something off, right?

And day two for you as the ops engineer, if you have 75 Postgres databases and you ping 75 teams and you say, hey, we're upgrading the module, like, can you all bump that? You're doing that, like they're not doing that. If they do it, it's not happening anytime soon, right?

If you say, hey, we're switching our tool from Snyk to Checkov or Checkov to Snyk or whatever, nobody is prioritizing that work.

And this allows you to kind of take control of hey, we want to offer developers databases and this is how we provision them, this is how we secure them, this is how we make them compliant. This is all that we need to know from you to be able to do it. Developers diagram and that's what the tools. It's really funny.

It's like so much of it is so centered around developers getting infrastructure, but so much of what we've done is trying to make the operations team's day two so much simpler, because that's really where teams start to get choked up. There's just so much stuff that happens on day two versus day one. It's easy to throw together a Terraform module.

So I think the delta for getting started is pretty straightforward. The delta for what your collaboration and team looks like looks significantly different.

Kelsey: 00:39:07

I think the dope part about that is a lot of times people see ClickOps and they just get to this fear of like, "Man, this thing's going to lock us into a certain way of doing things. It's going to be so limited that the minute we try to do anything complex, we've got to get rid of this whole tool and go back to custom doing-things land." And I think what you're talking about is more like what Kubernetes brought to the table when it came out.

It's like, look, we have some built-in things for deploying containers, et cetera, but CRDs make you do a lot of work, right? You do a lot of upfront work. You have to build a controller, you build a specification, but once you do, then you get kubectl for free.

You get this thing where it says whatever you can imagine you could build.

I think what you're saying is if you've already made an investment in things like Helm or Terraform, you can turn that into a UI-able thing now and, you know, get a little ClickOps out of it if you're just doing nothing. But if you add a little bit of semantic sugar on top, you can give people policy validation.

And I think people may not even understand what we're saying.

Typically, if you're using something like Terraform, most people never make it far enough along the maturity curve that they're actually running unit tests and running Snyk. Like, people aren't really getting that far. Most people are just like, dude, we've just gotten to Terraform apply. We haven't even gotten to all these other levels.

But what you're saying is let's skip these ad hoc pipelines, because you end up with a hundred million build pipelines trying to glue all these utilities together. I think what you're saying is like, hey, let's lean more into the declarative approach. Give me your Terraform modules. You know what's best: put in policy constraints, workflow functions that need to make sure things are right, maybe custom validations, et cetera. And then what you end up presenting is what you want people to do.

Cory: 00:40:54

Yeah.

Kelsey: 00:40:54

Not learn Terraform. Learning Terraform was never the mission. It was to use what Terraform provides.

And since people were missing that entire layer you described, we all stopped at: here's the Terraform module. Good luck with that.

Now we have the last mile, which is, here's how you create a thing that happens to be powered by Terraform. But that's irrelevant to you because if that ever leaks up, that means we've done something wrong in some case.

Cory: 00:41:25

Yeah, for sure.

And it's interesting too, because like so many teams getting started with us, we find them in these really interesting places where we have a customer right now we're very popular on Azure, which was surprising to us, but we kind of weaseled our way into an Azure partnership a while back. And so we get a lot of companies from Azure. And it's interesting, Azure has their own IaC tool called Bicep. Perfectly fine tool.

Now you can't manage stuff outside of Azure with it, right? So you immediately run into places where you're like, oh, we started to use some CDN, that's a third-party CDN.

And we can't manage it. You know, we need Terraform or OpenTofu to configure that, right? And it's like, okay, well, guess we've got to start everything over again. Are we moving everything to Terraform?

And it's like, no, we'll do some of this stuff in Terraform and some of it in Bicep.

And what's cool about a platform that is able to speak both of those languages and we speak it through this artifact system is not only can you use both of them, you can say, hey, you know what? This one thing we have to do through Terraform because reasons.

And that doesn't mean we have to, like, throw the entire rest of the world into disarray. We can say, hey, if you need the CDN tool that we use, that's a third party, it's provisioned with Terraform. That shouldn't matter to you whatsoever.

But what's cool is through the artifact system, they can actually talk to each other. So we have a customer right now whose networking team on Azure uses Bicep.

They only manage networks, and it's one of these companies where, like, I don't even know what you're doing with all these networks. A lot of networks. But the ops team's like, this tool is fine. This is what we need. Now, other teams need to provision networks sometimes.

So they're calling these Bicep things. And these teams also are doing things in Azure and other networks with like, we're using Terraform over here.

And so to them, all of a sudden, these developers are presented with multiple tools and multiple compliance scanners, et cetera, et cetera. It's like, oh, why did I put Checkov on that? Why didn't it work so well? Checkov doesn't work with...

I don't know if this is actually true, but Checkov doesn't work with Bicep. It works with Terraform. We use Checkov with Terraform. With Bicep, we use this. And it's just like, oh, my God, how do I just... How do I make a bucket?

Like, I gotta get back to work, guys, right? And so what's really cool is like, the ability to, like, create these abstractions and actually have them, like, interconnect and work together.

Like, Bicep can push network information out and it pushes it into the artifact system, and the artifact system can react to that and say, hey, you know what? We got something that maybe wants to automatically be deployed if the downstream does right, or the. Or the upstream has been right.

So it allows you to put in a lot of really cool rules to kind of truly automate and integrate a lot of these different tools that you have, which is another part that just becomes extremely, extremely difficult.

And you're getting all of this while speaking the language of a developer without ever, as you pointed out, like, presenting Terraform to them or presenting anything to them.

What I think is a great experience is when developers that are the end users of Massdriver, so not our customers, but our customers' customers, just have no idea that there's even infrastructure as code happening to them. They're like, this is ClickOps. And it's like somebody finally got ClickOps right. Like, I need a preview environment.

I hit the preview environment button and it stands everything up. I can test something crazy, tear the whole thing down. Right? Like that is just so hard to get to.

And I think that when you get to this level of power and self service that you can push to developers where they just have no idea what tool the ops team is using, that's nailing DevOps. That to me is tearing down the silos. The silos to me were never about knowledge and making sure everybody knew everything.

It was making sure that these teams could get together and work. And I think to me that is the most perfect form of DevOps and platform engineering is where I can do my job without interrupting you doing your job.

Right. And it's just very hard to get to.

Kelsey: 00:44:57

I think a lot of people struggle with this concept of strong binding, late binding to other systems.

Like if you think about most of what we do when we're programming, we're literally getting data from other systems and munging it back together before making another decision. And before CI/CD, I don't think anyone had experience with it. Right.

When you write a shell script, you're opening the file yourself and trying to figure out what data you're reading because most of it wasn't typed. Yeah, I think that's the big unlock. Now you're saying I want the database URL. That's a type that you can count on.

You don't have to go and call the API yourself or get it from some file. You're literally getting a first class API. And once you have that, this whole thing feels so much easier.

When I can write my logic and just say, give me this input and you figure out how. The first time I had that aha moment was when I was using CI/CD for, like, what's the tag? And in the CI/CD system, it's in this environment variable.

It will always be in this environment variable. You don't need to read a file, you don't need to run any subversion command line tool. It will just be in this variable.

And when you get that, you're like, oh my God, I just checked the variable and then put it here and it always works. And that was a form of interoperating between the various tools.

So if you go from Subversion to Git, it didn't matter. Just look at that variable and the value you're looking for is going to be there.
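The environment-variable contract can be sketched as follows. `BUILD_TAG` is a hypothetical variable name standing in for whatever a given CI/CD system provides, and `image_reference` is an illustrative helper.

```python
import os
from typing import Mapping


def image_reference(repository: str, env: Mapping[str, str] = os.environ) -> str:
    """Build an image reference from the tag the CI/CD system provides."""
    # The logic only depends on the variable, not on which version control
    # system or tool produced its value.
    tag = env["BUILD_TAG"]  # hypothetical variable name
    return f"{repository}:{tag}"


# Passing the environment explicitly keeps the example independent of a real CI run.
print(image_reference("registry.example/checkout", {"BUILD_TAG": "v1.4.2"}))
```

The deployment logic stays the same when the system behind the variable changes, which is the same property the typed artifacts give to infrastructure inputs.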

I think you've gone through an amazing journey around like the challenges people have and more importantly, why they have them and all the tools that we've been building over the last couple of decades to try to solve all of these problems.

And I think it's safe to say that there is a collection of tools out there that do solve the give-me-a-VM problem. We're now just starting to scratch the surface on: but give me a workflow so that anyone can create these things the way we do it here. That's the next level of conversation. So it's not really ClickOps.

I don't want anyone to get the idea that ClickOps is this uncustomizable, unextendable tool where you're just stuck in whatever UI gives you. No, we're talking about now having mature interfaces to customize things and allow people to use them without breaking everything.

We're finally there. So I appreciate you giving us this kind of big view of the world. Where can people follow you? Where can people follow up on the discussion?

Cory: 00:47:28

Ooh, LinkedIn. I'm Cory O'Daniel on LinkedIn. I'm coriopostrophe on Reddit because I'm anonymous. That's it. That's it. That's the only place.

It's also Massdriver.cloud. For anybody who listens to the show and didn't know I also had a company, that's what I do during the day.

Kelsey: 00:47:45

Awesome. All right, thanks for letting me host the podcast and thanks for all the information. Catch you next time.

Cory: 00:47:51

Yeah man, I really appreciate you coming on. It was awesome.

Thanks for sticking with us through this three-part series with Kelsey Hightower. It was very exciting for me to be on this side of the microphone talking about what I do at Massdriver (I know I don't usually bring it up on the show) and some of the philosophies that I have around Infrastructure as Code.

If this has helped you rethink CI/CD, GitOps, TicketOps, or just infrastructure management in general, please share the episodes with your friends and feel free to send me any questions. I'll see you next time.

Episode 39

5th Nov 2025
