Episode 33

Published on:

30th Jul 2025

From React to Dagster: Pete Hunt on Data, Infra, and AI-Ready Platforms

Is Postgres actually a better message queue than Kafka? This provocative question is just one of many insights Pete Hunt shares in this conversation about data orchestration, platform engineering, and the evolution of infrastructure.

Pete Hunt, CEO of Dagster Labs and a co-creator of React at Facebook, brings his unique perspective from working at tech giants like Instagram and Twitter to discuss how different platform team approaches impact product development. Having witnessed both Facebook's clear delineation between product and infrastructure teams and Twitter's DevOps-style ownership model, Pete offers a valuable comparison of these contrasting philosophies.

The conversation explores:

  • How Dagster provides a higher-level abstraction for data teams, making it easier to track and debug data assets rather than just managing workflows
  • The challenges of modern data platforms and why many organizations struggle with complex, distributed systems that could be simplified
  • A practical approach to migrating from Airflow to Dagster with their "Airlift" toolkit that allows for incremental, low-risk transitions
  • How AI development is fueling demand for better data orchestration as companies build applications that rely on properly managed data pipelines

Pete also shares his thoughtful approach to balancing technical debt and product development with a "quarter on, quarter off" cadence that allows teams to both ship features and clean up the inevitable corners that get cut under deadline pressure.

For platform engineers, data teams, and technical leaders navigating the intersection of infrastructure and AI, this episode provides practical insights on creating abstractions that deliver real operational value without unnecessary complexity.

Guest: Pete Hunt, CEO of Dagster

Pete is the CEO of Dagster Labs, where he first joined as Head of Engineering in early 2022 and transitioned into the CEO role later that same year. Before Dagster, Pete co-founded Smyte, an anti-abuse startup acquired by Twitter, where he continued as a senior staff engineer.

Earlier in his career, Pete was one of the first engineers to work on Instagram after its acquisition by Facebook in 2012. There, he led development on Instagram’s web and analytics teams and became a co-founder of the React.js project, helping transform an internal experiment into one of the most widely used front-end frameworks in the world. He was also part of the early community around GraphQL and has remained deeply engaged in open source and developer tooling.

Pete brings a pragmatic, hands-on perspective to modern data infrastructure. Having been both a founder and an engineer, he focuses on reducing complexity and fatigue in data teams by building tools that actually work together. At Dagster, he remains close to the code and actively involved in technical decisions, combining leadership with deep technical fluency.

Links to interesting things from this episode:

Pete Hunt, X

Dagster

Dagster Pipes

Dagster Airlift

Transcript
Intro:

You're listening to the Platform Engineering Podcast, your expert guide to the fascinating world of platform engineering.

Each episode brings you in-depth interviews with industry experts and professionals who break down the intricacies of platform architecture, cloud operations, and DevOps practices.

From tool reviews to valuable lessons from real-world projects, to insights about the best approaches and strategies, you can count on this show to provide you with expert knowledge that will truly elevate your own journey in the world of platform engineering.

Cory:

Welcome back to the Platform Engineering Podcast. Today I've got Pete Hunt, CEO of Dagster Labs. If you've ever touched React, you've felt Pete's work.

He helped co-found and open source it while at Facebook. He's also led web engineering at Instagram and built anti-abuse tools at Smyte, which was acquired by Twitter.

And now he's running Dagster, an open source data orchestration platform that sits somewhere between your infra team and your data team. He's been thinking a lot about how the chaos of modern data stacks and AI impacts how we build, and why platform engineering just hits a bit differently.

I am super excited to talk to Pete. Thanks for coming to the show, dude.

Pete:

Thanks, Cory. Let's get into it.

Cory:

Yeah, let's get into it. So you've got... You've been everywhere. And so I'd love to hear like a bit about your background.

Like you're in data and in frontend, like I feel like those worlds are so far apart, but you've touched them both. I'd love to hear how you got into React, how you got into the data world and then... yeah, let's go from there.

Pete:

Been programming since I was a kid, so I've always loved technology. I went to college for it and I got my master's actually in distributed systems. So you can add a third one to that.

Cory:

We love them.

Pete:

Listen, I came out of my master's in twenty ten, and back then big data was all the rage and we were still kind of in that MapReduce era. And so I thought I was going to work on the next great distributed database or the next great caching layer.

And I took a job at the coolest tech company I knew of at the time, which was Facebook, kind of pre-IPO. I was really excited to work there - believe it or not, facebook.com was a cool product for young people back then.

And I got there and they kind of put you to work on a bunch of different parts of engineering. And I started working on photos and I basically pushed a bug to production that sent a bunch of notifications to people that use the product in a certain way. And one of those people was Michael Arrington at TechCrunch.

And so he wrote an article about how stupid Facebook was for sending him all these hundreds of push notifications. That was my bug.

Cory:

Yeah.

Pete:

First couple of weeks of working there, and I was like, wow. Like, I got written up in a publication, and I'm, you know, 23 years old. That's pretty cool.

That really got me into working on product, actually. So I was just slinging product for a couple of years, doing what it took to make it happen. Whether that was modeling the data in the distributed MySQL tier at Facebook, or writing PHP code, or building CSS - it was just kind of whatever it took to get the product built. That's really where I got started.

Cory:

Yeah, I feel like that is such a two thousand to twenty ten story. Where it's just like, I just kind of did everything because everyone just did everything back then.

Pete:

You see these grays in my beard?

Cory:

Yeah. That's where they come from.

Pete:

I still have the IE6 box model hacks stored in my brain.

Cory:

Oh, yeah.

Pete:

And I still remember all the limitations of MySQL 3.23. It was that era, right?

Cory:

Was it MySQL 3?

Pete:

I think Facebook at that time was MySQL 4.1. But in school, I had used MySQL 3.23, and there were all sorts of weird indexing limitations to that thing. And I remember everything that you would read in the database textbook didn't quite work in MySQL because it wasn't a complete implementation of the standard.

Cory:

My master's degree was in healthcare information systems, but a lot of the focus was in database administration. That's what I really kind of honed in on.

We used Oracle for pretty much everything at school. But I remember at that time... this was two thousand and five, two thousand and six-ish... I think that was the first time I saw MySQL. I can't remember exactly - it was such a long time ago.

But I remember I worked through the MySQL 3 era. And the number of teams that I still see running MySQL 5 in prod, they're just like, "We're using this bad boy."

That's pretty cool. And so you've worked at a lot of companies - Instagram... were you at Twitter, or was your product acquired by Twitter?

Pete:

Company was acquired by Twitter and then I stuck around for three, three and a half years.

Cory:

Yeah. So you've been in a lot of huge companies in the social space, like teams that have large platform teams. I definitely want to get into what are some of the differences you saw in team and culture dynamics and what you think made some of those teams flourish.

Before we do, you wrote a blog post that I've got to talk about - or I want you to talk a bit about - because I'd love to hear (A) the pitch of the post and (B) the reactions you got to it. But the title of the post was... do you have it memorized? I just had it up a minute ago, but I think... oh, there it is, I've got too many tabs... "Postgres: a Better Message Queue than Kafka?".

Pete:

With a question mark at the end.

Cory:

I think that should be an interrobang - the exclamation point and the question mark in, like, one symbol, if you've ever seen that. But I love Postgres.

I'm a Postgres fanboy. I have so many car-slapping memes about Postgres. But I also have an annual tweet that goes out every year... it's the Tim and Eric "Psst, free real estate" meme, but it says, "You don't need Kafka."

I saw this post and I was like, this is... Pete is my guy. So tell us a bit about this post and what the reception of it was like.

Pete:

Yeah, man, that's really funny.

Okay, I don't know exactly where to begin with this one, but you obviously know the "innovation tokens" use-boring-technology post, right? So it's kind of a rehash of that. And by the way, first of all, full disclosure, I'm actually a MySQL guy.

I know, I know. Being a MySQL guy is like being an Emacs guy - there are fewer of us every year.

Cory:

You should be a protected class.

Pete:

No, I know, I know. I'm not going to defend it one way or the other. It's just the thing that I know better, but we use Postgres at Dagster.

And by the way, you know, at Dagster, I started out as a head of engineering and now I'm the CEO. So it's not like I go hands on keyboard to the extent that engineers are going hands on keyboard.

But I guess to kick things off with this one where this really came from was, first of all, I was coming out of Twitter and Twitter is a place built for scale, right? You know, asynchronous message queues, like totally appropriate at Twitter scale. You want your thing to be super, super cheap and be able to handle really, really bursty workloads and have the right back pressure semantics built right into the clients. So that makes sense.

One of the orgs that I had while I was at Twitter was the internal tools team, where we would basically get spam and abuse tickets. So if there was a tweet that you thought violated the rules - doxxing, sharing personal information - you could click a report button, and that lands in a ticket management system at Twitter. Then a team picks it up and does manual review. There's some machine learning assistance that happens, but usually a human goes and takes a look at it... they might escalate it, you know, there's a whole workflow around it.

But the point is, we would get hundreds of these reports a minute - something like that. Which kind of sounds like a lot, but this is something a single Postgres box can handle no problem, or a single MySQL box can handle no problem. It's not really that complicated.

It's a couple of tables, some pretty clean inserts in there, and you're not even really doing much in the way of full text search. So it's pretty bread and butter relational work. And the system that I inherited there was like this giant microservice monstrosity.

It was like nine different services to implement this thing, with asynchronous queues between each service. And each service was backed by a key-value store that didn't have transactions. So you basically needed a PhD in distributed systems in order to, like, transactionally store a ticket in a ticketing system.

It's completely ridiculous.
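The simple alternative Pete is arguing for can be sketched in a few lines - a hypothetical schema, not Twitter's actual system. SQLite stands in here so the sketch is self-contained; on a real Postgres (or MySQL) box you would claim tickets with `SELECT ... FOR UPDATE SKIP LOCKED` so concurrent reviewers never grab the same row:

```python
# A rough sketch of the "just use one relational database" ticket queue -
# hypothetical schema, not Twitter's. SQLite stands in so this runs anywhere;
# on Postgres the claim query would add FOR UPDATE SKIP LOCKED for safe
# concurrent workers.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tickets (
        id INTEGER PRIMARY KEY,
        tweet_id TEXT NOT NULL,
        reason TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'open',  -- open -> claimed -> resolved
        claimed_by TEXT
    )
""")

def report(tweet_id: str, reason: str) -> None:
    # Filing a report is a single transactional insert - no queue hops,
    # no eventually-consistent reaper processes.
    with conn:
        conn.execute(
            "INSERT INTO tickets (tweet_id, reason) VALUES (?, ?)",
            (tweet_id, reason),
        )

def claim_next(reviewer: str):
    # Claim the oldest open ticket in one transaction.
    with conn:
        row = conn.execute(
            "SELECT id, tweet_id, reason FROM tickets "
            "WHERE status = 'open' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE tickets SET status = 'claimed', claimed_by = ? WHERE id = ?",
            (reviewer, row[0]),
        )
        return row

report("tweet/123", "doxxing")
report("tweet/456", "spam")
print(claim_next("reviewer-a"))  # → (1, 'tweet/123', 'doxxing')
```

Hundreds of inserts and claims a minute is trivial load for a single node, which is the whole point of the post.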

Cory:

Were there distributed locks involved?

Pete:

It was eventually consistent.

So there was... there were basically like reaper processes that would run and clean this stuff up. And I got a whole rant about why that system was built that way. You know, first of all, there's like... that's how you build a Twitter scale system. And that makes sense for Twitter scale parts of that system. But this was not like a real Twitter scale, this was such a subset of that. Right?

Cory:

Yeah.

Pete:

So you don't have the same constraints.

The second thing is, like, you know, a lot of companies, unfortunately, if you want to get that staff engineer promotion, you've got to deliver a complex distributed system. And that is a really bad incentive, I think.

And this is not unique to Twitter, this is like everywhere I've ever worked and everywhere anyone's ever worked. You've got to have these objective criteria, and the goal is fairness, and that's important, but it often leads to these technical decisions that are a little bit questionable, I would say.

Does this make sense?

Cory:

I'm following.

Pete:

I was coming fresh off of that, and, you know, coming into Dagster and we had this nice, really simple architecture. I was like, number one, I really want to preserve this.

Number two, I want to give other engineers, especially more junior engineers, the license to do, like, what seems like the simple and stupid thing. It's like, why does this thing have to be so complicated? Like, why don't we just keep it as a single-node Postgres instance for my application?

And the third thing that I had to do is like... I came in there, they were like, "Listen, we're just starting to sell our commercial product. We want to make some buzz in the community." And I was like, "I know how to do that."

I wrote a list of potential blog posts that we could do that was just like, "What would get both sides riled up on Hacker News?" I was like, "Well, you know, people love talking about Postgres."

The distributed systems people will get pissed off because, you know, the single-node Postgres instance doesn't scale. You know, there's another subset that will be like, "You know, Postgres and Kafka, they're two different technologies. They're not comparable."

Cory:

Oh, yeah.

Pete:

"You know, so why would you..." so they'll get fired up to reply to that. And then the Postgres evangelism strike force will come in and also, you know, hop on this thing.

So it was a little bit of a gimmick, for sure. But I think there was actually this interesting point that we were trying to make, which was that oftentimes you don't want to use the "right tool for the job".

If you already have a tool in your toolbox that you have operational rigor around - you know how to operate the thing - and it solves the problem well enough given your constraints, use that instead. That was our main point, and it held true.

You know, we only adopted Kafka at Dagster a couple of months ago, and only for one really focused use case. We actually had one or two incidents where we were like, Kafka actually would have saved us here - we wouldn't have had to load-shed in that situation, we would have been able to just delay processing.

So that's the story with that blog post. And it's like very long-winded, but that's where it came from.

Cory:

No, it's a good one - I will put it in the show notes.

But I mean, I honestly think, like, Postgres is an amazing piece of technology. I feel like it was like one of the things stolen from the Area 51 crash. It's like space technology. But I mean, the JSONB storage is pretty dope, partitions work well, you can fit a ton of use cases in that bad boy.

Pete:

Definitely.

Cory:

Let's get back to platform engineering - I just had to take a little tangent on that.

So you've worked on these huge teams at Facebook, Instagram and Twitter. What are common traits that you've seen between some of these platform teams that you think really help those teams excel? And what are some of the differences that you saw on those teams that really made one team shine maybe over one of the other teams?

Pete:

Yeah, so I ran a platform team at Twitter and I was a customer of platform teams at Facebook. So I didn't run a platform team at Facebook, so it's a little hard for me to really do like a full compare and contrast.

I learned a lot at Twitter and I think that the operational rigor at Twitter from the infrastructure perspective was really, really great. You know, I left before all the Elon stuff went down. Like, that wasn't a reason why I left, I left like months before that happened.

But, so the legend goes, he went to the Santa Clara data center - or was it the Sacramento data center - and just, like, turned it off.

And there had been all of these disaster recovery exercises run at Twitter, and all this redline testing and stuff like that, that really, I think, contributed to their ability to take down an entire data center without a ton of thought while the system continued to operate pretty well.

So I learned a lot about the checklist that you run before you put a system into production at Twitter, and I thought that was really, really great.

What I also learned... and this was a big difference, I think, between Twitter and Facebook. Twitter was very into this DevOps style of the product team owning their infrastructure front to back. And Facebook was the opposite, where there was a clear delineation between the product people, who were slinging PHP and CSS and JavaScript, and the infrastructure people, who were slinging C++ and presenting the interface between those two systems.

Cory:

Yeah.

Pete:

And I really think that there are different trade-offs there. But at Facebook you were able to move a little bit faster on the product, because there was a set of abstractions that were pretty reusable and composable - the expectation was that product engineers are not going to run their own complex stateful backend service.

Cory:

Yeah.

Pete:

And at Twitter it was different. So the abstraction layers were pretty different. I thought that was like an interesting difference between the two companies.

Cory:

Yeah. That is really like... the abstractions there, right? It sounds like the Facebook one feels a lot more like a PaaS - it's concerned with your workload versus the underpinnings. And then at Twitter... focusing on Twitter scale, everybody owns their own stuff. You have to own your own infrastructure as well.

In that scenario, where people are owning their own infrastructure as well, did you typically see the teammates there... like, everybody knew everything about infrastructure and their applications? Or did you typically see clustering - like, "Hey, there is this team," but it's a small subset that actually understands the underlying infrastructure? Or was it rewarded to say, "Everybody literally knows everything front to back"?

Pete:

Yeah, you know, I mean, what is infrastructure?

Like, it's an interesting question to ponder. But what, what I'll say is that at Twitter it was pretty routine for you to be an engineer that's working with a product person. And so like you're solving a business problem for users, and then you're going back and you're running container orchestration commands.

Like we used Aurora but we were like transitioning to Kubernetes. But like you're operating at kind of like the container level. It's like you get Key Value Store and you get containers and you get an RPC framework in Scala and it's like "Go forth and build your application", right?

As opposed to, like, at Facebook, you just kind of land PHP code into a monorepo - well, it's Hack, technically, but you land your PHP code into the monorepo - and you have this graph database abstraction. And it deploys at some point... you don't know exactly when. Some other team runs the deployment schedule for you, and they pull your changes out and git revert your broken stuff if it's going to hold back the deployment.

So it's like a different... it's a very different way of working. Different trade-offs.

Cory:

Ownership model.

Pete:

Yeah, that's a totally different ownership model.

Cory:

Yeah. So the funny... I love that "What is infrastructure?" because I feel like it's changed significantly in the past 16 years. Like what the abstractions of infrastructure are, right? Like when we first think about like AWS coming into the world, the abstraction layer was a VM, right?

You had SQS and you had EC2 and it was like that is infrastructure. And now infrastructure could be like wedged right in the middle of your app - it's not necessarily what you run on anymore, it's the services that your application's calling. It could be managed services by the cloud or whatever. I love that, "What is infrastructure?" That is astute.

Pete:

Yeah, yeah, yeah.

At Twitter, I keep thinking that, you know, you own your own infrastructure - but if you have container orchestration, are you really that close to the metal? When I ponder that question, it comes down to: do you think in terms of SLAs and SLOs, right?

I think if you have a chart with a burn rate on it, like you're managing infrastructure. And if you don't, then you're probably working on product. So like, you know, we had tons of those charts at Twitter on like every team that I saw. And at Facebook it was a daily active users chart and engagement metrics and stuff that looks a little more like growth metrics.

So it's just, you know, that's kind of how I think about it. It's like what outcomes are you trying to drive?
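Pete's litmus test can be made concrete: a burn rate chart is just the observed error rate divided by the error budget your SLO allows. A quick sketch, with made-up numbers:

```python
# A hypothetical sketch of the "burn rate" number Pete uses as his litmus test.
# Burn rate = (observed error rate) / (error budget implied by the SLO).
# At 1.0 you exhaust the budget exactly at the end of the SLO window;
# above 1.0 you are burning it faster than that.

def burn_rate(failed: int, total: int, slo: float) -> float:
    """slo is the target success ratio, e.g. 0.999 for 'three nines'."""
    error_budget = 1.0 - slo   # allowed error ratio
    observed = failed / total  # actual error ratio
    return observed / error_budget

# 50 failures out of 100,000 requests against a 99.9% SLO:
# 0.05% observed errors vs. a 0.1% budget -> burning at half speed.
print(round(burn_rate(50, 100_000, 0.999), 3))  # → 0.5
```

If that number stays under 1.0 for the window, you are inside your error budget; alerting on sustained high burn rates is the usual way those charts get used.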

Host read ad:

Ops teams, you're probably used to doing all the heavy lifting when it comes to infrastructure as code - wrangling root modules, CI/CD scripts, and Terraform, just to keep things moving along. What if your developers could just diagram what they want, and you still got all the control and visibility you need?

That's exactly what Massdriver does. Ops teams upload your trusted infrastructure as code modules to our registry.

Your developers, they don't have to touch Terraform, build root modules, or even copy a single line of CI/CD scripts. They just diagram their cloud infrastructure. Massdriver pulls the modules and deploys exactly what's on their canvas.

The result? It's still managed as code, but with complete audit trails, rollbacks, preview environments, and cost controls. You'll see exactly who's using what, where, and what resources they're producing - all without the chaos. Stop doing twice the work. Start making infrastructure as code simpler with Massdriver. Learn more at Massdriver.cloud.

Cory:

Well, let's get into Dagster because I think this is another place that's going to be really interesting, right? Like in this world of AI and more and more data, I'd love to learn like what Dagster is and how it fits into platform engineering.

I've had some guests on the show that have worked in the data space and it's really interesting. Some companies are definitely focusing their data teams as a part of the platform and delivering data services to the engineering customers. And others are - the data team is on the platform just like everybody else and they're getting abstractions for their infrastructure.

So I'd love to hear what is Dagster and how are you seeing teams using it in their platform engineering journey?

Pete:

Dagster is an open source project and it's in the data orchestration category.

So think about container orchestration: it's responsible for scheduling the containers and making sure that they're running and hitting their operational requirements, while what those containers do is fairly open-ended. Data orchestration is pretty similar, except it's about data assets instead of containers.

So I've got this table in a data warehouse. I want to make sure that's up-to-date. I want to make sure that it's passing its quality checks. I want to understand its current status and debug issues with it. And if the job that populates it crashes, there should be some process monitoring that, restarting it, alerting under certain conditions and things like that.

And then we've got a commercial product too, with the Dagster orchestrator at its core. There's a bunch of tooling around that - you can either bring in individual point solutions and put them together on top of Dagster, or you can just get it all out of the box with Dagster Plus.

Cory:

Very cool. There are a lot of tools in this space - there's Airflow... Are Kubeflow and Airflow the same? I can't remember if they're the same.

Pete:

They're different.

Cory:

I slammed my head against Kubeflow pretty hard a few years ago and I wandered away from it dizzy and never came back to it.

Pete:

We hear about it a lot less these days, yeah.

Cory:

Great, I'm sorry, Kubeflow team, but that is great news.

So there's a lot of tools in this space. How does Dagster differentiate against like the Airflows of the world?

Pete:

Airflow is fundamentally a workflow orchestrator. So you have these things called workflows, or DAGs. You have a bunch of different steps; Airflow runs each step; those steps can do whatever you want them to do. Then it'll restart your pipeline if it breaks.

Problem with that is that there is an impedance mismatch between that way of looking at the world and the data team's way of looking at the world.

So if you are a data team, your customers don't complain to you about a broken pipeline. They say, "My report is missing the last week of data. What's up with that?" So you start with a dashboard - which is an asset - powered by some tables in a data warehouse - those are assets too. And then you work your way upstream until you figure out where your data broke.

And so with a tool like Airflow or another workflow orchestrator, there's this many-to-many relationship between the workflow steps that manipulate the data and the data assets that you actually care about. And they're kind of tangled together.

And so what you end up doing is bringing in a bunch of additional tooling and manual instrumentation and documentation pages in a wiki and stuff like that, to try to assemble something that resembles a coherent platform - one you can then give to stakeholders of varying levels of technical skill so they can self-serve on it. And, at the end of the day, platform engineering is in many ways about balancing autonomy versus safety.

You want to let your stakeholders do their own thing in their own way with some guardrails. And the existing kind of tools didn't really do a great job of that. They could restart your workflows and keep those up and running, but in order to deliver a self-service platform like that, you need a higher level of abstraction.
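The debugging flow Pete describes - start at the stale dashboard and walk upstream through its dependencies - can be sketched as a toy asset graph. The asset names and health flags here are hypothetical, and this is an illustration of the idea, not Dagster's actual API:

```python
# A toy sketch of asset-oriented debugging: start from the stale dashboard and
# walk upstream through asset dependencies to find the first broken asset.
# Hypothetical asset names - not Dagster's real API.

# Each asset maps to its upstream dependencies.
upstream = {
    "revenue_dashboard": ["revenue_by_week"],
    "revenue_by_week": ["cleaned_orders"],
    "cleaned_orders": ["raw_orders"],
    "raw_orders": [],
}

# Pretend health status reported by quality checks / freshness policies.
healthy = {
    "revenue_dashboard": False,  # the stakeholder's complaint
    "revenue_by_week": False,
    "cleaned_orders": False,     # <- the actual root cause
    "raw_orders": True,
}

def root_causes(asset: str) -> list[str]:
    """Deepest unhealthy assets reachable upstream of `asset`."""
    if healthy[asset]:
        return []
    bad_upstream = [c for dep in upstream[asset] for c in root_causes(dep)]
    # If nothing upstream is unhealthy, this asset itself is the root cause.
    return bad_upstream or [asset]

print(root_causes("revenue_dashboard"))  # → ['cleaned_orders']
```

With a workflow-only view, that same question means mapping steps back to tables by hand; an asset-aware orchestrator can answer it directly from the dependency graph.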

Cory:

Yeah. So in an Airflow scenario, I'm pretty much console logging every step, trying to figure out where that last week of data went. Like that's my option.

Pete:

Yeah.

Or you could bring in another vendor and instrument your code manually - and it's really probably going to be two or three other vendors to paper over the missing abstraction, the one we ended up bringing to the table.

Cory:

Yeah. And so Dagster ties into just a ton of data sources - so pretty much all the data sources that your teams are familiar with.

So how are teams like exposing that at the platform level? Do you see them a lot starting to like have data services as a part of the platform they're offering to their teams?

Or are you still seeing people run Dagster on top of whatever their internal platform is, as something kind of owned by the data team?

Pete:

What is a platform? Right?

Cory:

This episode is going to get real meta.

Pete:

Yeah, yeah, we can get really out there, but no... let's get concrete.

So the team that adopts Dagster is usually called the data platform team. They're the team that is trying to put together a bunch of tools and best practices for people to build data pipelines and run them.

So you'll get like some users that are using dbt to do analytics jobs that power dashboards. You get some other teams that are using machine learning models to do some forecasting. And you want them all to be able to like work in the same platform, get the same level of observability, hit their SLAs, stay in compliance, all that stuff.

There's usually like an infrastructure team or like a platform infrastructure team that owns container orchestration, public cloud - like when we say platform engineering, what you would originally think of.

Those teams are in close collaboration because basically the data platform gates access to those resources. It basically lets somebody that doesn't know how to use those resources take advantage of those resources. So it's really important that they are collaborating super well and that they understand each other.

In many ways, Dagster is kind of about helping those data platform teams grant access to those resources in a way that everybody is happy with.

Cory:

Yeah, yeah.

So I'm coming in, I'm building a new pipeline, and it's given me this abstraction where I get to focus on my workload, and it's giving me the debugging tools that I need. But it's probably sitting on top of either a traditional DevOps and SRE team's work, or on top of the typical platform that we're thinking of - where we're managing containers and whatnot. So it's almost like a layer that sits on top of that.

Pete:

Yeah. So like, very concretely, most of our customers sit on top of Kubernetes or Amazon ECS.

So we present this nice abstraction to the people building data pipelines where they can think in terms of data assets.

And then we turn around and we spin up jobs on K8s, and we monitor those processes, and we interpret the logs, and we render it in a way that is understandable to these data practitioners.

Cory:

Oh, very cool. So, in the internals, you're actually working with Kubernetes primitives - it's not a whole bunch of newly introduced concepts on top of CronJobs and Jobs. You're working on top of the... How does it engage with Kubernetes? Is it watching those, or is it implemented as, like, a custom scheduler in Kubernetes?

Pete:

We have a pluggable system called Step Launchers where you can plug in different backends.

So, for example, we have a serverless product where, if you're just a data or analytics team and you don't want to think about any of this stuff - you just want to run Snowflake queries and populate Power BI dashboards - we will run that all on Fargate for you. You don't even know it's Fargate. It just runs in some magical place.

Cory:

Yeah.

Pete:

If you're running this in an enterprise, though, you might have Kubernetes in one place and ECS in another place and Azure in another place. And so we make all of this pluggable. And the way that you use Kubernetes might be different from one org to another. You know, we just get access to the Kubernetes cluster and we can manually schedule these things.

What's also interesting is 2 years ago we built something called Dagster Pipes, which is a protocol for orchestrating compute on some alien system.

Cory:

Okay.

Pete:

And it's almost like kind of our replacement for Step Launchers in many cases. It's pluggable, just like the other system - you have a Kubernetes one and you have an Azure one and an ECS backend - but then the question is like, "Okay, I've got this container that's running in some system. How do I get metadata back out of this thing? Because I want to know how many rows were processed. I want to know how many LLM tokens were burned as part of this process."

And again, you could wire up a separate observability tool, but then you've got to go into Datadog, you have to know what to look for. Then you have to bounce back into Dagster to see the operational state of your thing. And this is what I mean about this impedance mismatch, right? Like you've got to go from like container world into workflow world, if you're not using something like Dagster. With Dagster and Dagster Pipes, we have the ability to kind of stream that metadata back out and leverage that inside the core orchestration engine.

It doesn't sound like rocket science, but being able to have an abstraction that works for processes on your localhost, Kubernetes pods, ECS, and serverless requires a lot of design taste. And I think our CTO did a great job with it.
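
To make the "stream that metadata back out" idea concrete: here's a toy sketch of the *shape* of such a protocol - not the real Dagster Pipes protocol or API, and the function names and payloads are invented - where the external workload emits structured JSON events that the orchestrator collects, instead of burying metadata in free-form logs:

```python
import json

# Toy sketch of the idea behind streaming metadata from an "alien" system
# back to an orchestrator. This is NOT the real Dagster Pipes protocol.

def external_job(emit):
    """Pretend workload running in some external system (a container, say)."""
    emit({"event": "metadata", "rows_processed": 12345, "llm_tokens": 987})
    emit({"event": "done", "status": "success"})

def run_and_collect():
    """Orchestrator side: capture structured events as they arrive."""
    events = []
    # Round-trip through JSON to mimic serialization over a real channel
    # (stdout, a file, a message bus, etc.).
    external_job(lambda payload: events.append(json.loads(json.dumps(payload))))
    return events

events = run_and_collect()
print(events[0]["rows_processed"])  # → 12345
```

The design point is that both sides agree on a structured event format, so the orchestration engine can surface "rows processed" or "tokens burned" next to the run - no bouncing between container world and workflow world.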

Cory:

The thing that is really interesting with that is... I feel like that is one of those areas where you're like, "Ah, it's not rocket science," but it's one of those areas where you just, like, get gumshoed. Once you get into it, you're just like, "Man, I'm not doing product work. I'm not doing anything that adds value. I'm in here troubleshooting logging stuff that's going to Datadog for the 800th time."

Once you're in that scenario, nobody's like, "Wow, this is the part of my job that I've been waiting for." It's like, "How do I get out of this?" right?

It's very interesting, because I feel like those data teams are also the teams that, in many organizations I've seen, struggle to get attention from the DevOps and infrastructure teams. So that work doesn't land in some DevOps team's queue - "troubleshoot why this metadata I wish was coming back to me isn't" - it falls on me as the data engineer. I've got to figure that out, because the DevOps team is drowning in debt trying to surface stuff for the product team, right? So I'm just over here doing stuff on my own.

So like, that is one of those things. It's like, yeah, it might not be rocket science, but it's like one of those areas that just sucks for every engineer when they land in it.

That's awesome. It feels like it's a part of this, like, phase that we go through every few years around, like, bundling and unbundling stuff.

Pete:

Oh, yeah.

Cory:

It's like we're starting to bring it back together and like, create this abstraction that makes more sense than just an acyclic graph. Yes, that's what it is under the hood, but what we're doing is we're processing data and there's real metadata that matters about that. And I shouldn't have to be... like, given that this is my use case... I shouldn't have to be some internet detective trying to dig through logs to get this information. Like, you understand my use case, send me back the telemetry so I can understand how my use case is going.

Pete:

That's totally right.

Cory:

That's pretty cool. I really like that.

And so with this pluggable system, I can kind of... if I'm a company, we're buying companies left and right. I've got some people on AWS. I've got some teams on Azure. We bought a company, and they've got an on-prem data center in, you know, Tucson or whatever, right? I don't have to go and look at 35 different tools that run on Azure/AWS. I can take Dagster.

I can have the workloads running wherever makes sense for the current teams. And I know that by adopting this tool, I'm also not like, jumping into... we're going to do all this additional operations work to get these people good information.

Pete:

Can I give you some nightmares here?

Cory:

Oh, dude, yes, you absolutely can. If they're... I love nightmares. I love production nightmares. Like, I want a whole episode on hearing how you spammed TechCrunch to death. Like that... This is my happy place.

Pete:

We could talk about that one too.

That system that you just described - multiple M&As, multiple different data stacks - that's probably a global multinational corporation, right?

Cory:

Yeah.

Pete:

You think the EU data protection regulator doesn't come in and ask them, "Hey, you collect phone numbers for two-factor authentication in France. How are you using that data? Please provide the documentation for that." And then they've got to go figure out where that data flowed to, right?

It's like, okay, we ingested it here and then, oh, it actually popped over to our subsidiary over here and then it popped back over here, and in between that, it went from Azure to AWS to like our, you know, IBM mainframe... you know, that does happen, you see that every once in a while at old financial institutions. And so it's really important to be able to get that global lineage across all these different environments.

This comes back to that like bundling versus unbundling. Those like Lego blocks of composability. It's like if you could stitch that together in a way that produces operational value, it's incredibly important for the organization.

Cory:

Yeah.

Pete:

Especially for these kind of like, you know, P99 projects where like that legal request doesn't come every day, but when it does come, man, it throws your roadmap out the window while you, you know, figure that out.

Cory:

The amount of times I've been on a team or I've met teams where like we're working through all this stuff, we're seeing so much progress and then they just get hit by the bus of like an audit. And it's just like, well, our entire roadmap for the year is gone while we spend the next three months just like assuring some 23-year-old Deloitte employee that we didn't screw it up.

Pete:

Yep, yep, yep, exactly.

I gotta say, you know what I think is an interesting human mistake that a lot of platform teams make? They spend so much time talking about how median latency doesn't tell you much - P99 latency really tells you the full story, right? Like, this is kind of Infrastructure Engineering 101.

But then when they're doing their reorg or they're planning their roadmap, they think in terms of the median project - like what is the median project that we're going to go take on this year? And then they do all their estimates based on that. When in reality they have to be optimizing for the P99 project.

Like, okay, we should anticipate some sort of audit coming this year that is going to have some tail risk of taking a year. Like how do we design an org structure? How do we design a set of practices of planning to be able to absorb something like that?

And it's just so interesting to me that we take what we really understand in the technical domain and we don't often apply it to the less technical domain, like project management, for example.
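
Pete's median-vs-P99 point is easy to see with a few lines of stdlib Python. The numbers below are made up purely for illustration: 99 routine requests and one pathological one - the median is unchanged while the tail tells the real story:

```python
import statistics

# 99 routine 20ms requests plus one 4-second outlier - the "audit project"
# of latency distributions.
latencies_ms = [20] * 99 + [4000]

median = statistics.median(latencies_ms)
p99 = statistics.quantiles(latencies_ms, n=100)[98]  # 99th percentile

print(median)  # → 20   (the outlier is invisible here)
print(p99)     # dominated by the outlier - orders of magnitude larger
```

Swap "request" for "project" and you have Pete's argument: planning around the median project is like monitoring only median latency.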

Cory:

Yeah, I mean, it's funny too, because I saw a Reddit thread. I'll mark and put the thread in here. It is extremely vulgar. I mean, I like vulgar things, but eh...

Pete:

I'm holding back right now.

Cory:

Yeah. Oh no, this show's marked explicit. We tried to do non-explicit early and everybody was like, "It's too much to bleep from you, Cory. We're just going to mark it explicit." So it's explicit now.

But I saw this Reddit thread yesterday on a very similar concept. Like, you know, when we approach a lot of projects, we go, "Oh, let's not overengineer," right? And yeah, we don't want to overengineer anything. You saw the effects of this on that Twitter project. That seemed like a bit of overengineering.

But it's so funny when we take on and plan projects, many times we take on and plan them as like, "Oh, we'll just do something to get it working right." And it's just like, okay, but like if this is a line of business... like if this isn't a startup, this isn't your first week... this is like an established business moving towards making some money. And it's like, "We'll half ass it" and it's like, okay, but then you have to come back and fix it. Right? And like that's never accounted for in the planning.

Pete:

Right.

Cory:

And like all of that just kind of gets shoved off into, "Well, that's technical debt now and somebody else will figure it out down the road."

I feel like that is such a hard part of planning - trying to figure out the right amount of effort to put into a thing, and where the acceptable corners to cut are versus where they're not. Such a hard thing to plan. It's just a recurring nightmare, I feel like, in all aspects of engineering.

Pete:

Yeah.

As a CEO of a Series B infrastructure company, you know, my approach to this is like... and maybe I should be explicit about telling the team that this is kind of how I think about things... but in my brain I'm always like, you know, quarter on, quarter off in terms of intensity for me.

It's like we've got to ship some user facing stuff that's going to drive revenue, like, "Push, push, push, let's hit the deadline, let's ship the V1 of this thing." And then I like to kind of fade away from the team a little bit. Not put another project like that on their plate until like a quarter later. It takes the pressure off them a little bit.

It lets them recharge, first of all. But also, the corners that were cut would never be, like, communicated. And you want to give the team the power and autonomy to make their own technical decisions and to be able to push on quality. There are going to be times when the business really needs something, like, yesterday. But you've also got to plan to, you know, take your foot off the gas a little bit. Let them say, "Okay, we delivered the feature for the customer. We cut corners in places X, Y, and Z. Let's go take some time to clean that up and make sure we're not introducing operational risk down the line."

Cory:

Yeah, yeah. I like that idea of quarter on, quarter off. I feel like that's pretty, pretty decent cadence.

But it also gives you that gap too, right? Where it's like the customers are figuring out how they feel about it, the project managers are getting feedback on, like, what can be made better. There's probably incremental work to do in that time. But also gives you a good buffer of like, okay, well, like, customers are liking it. There is some debt that we took that may be related to features that are coming, or it might just be debt that we took just to get to where we are.

That is a very nice cadence.

Pete:

Yeah, yeah. And you touched on it. Exactly. It's like you put the product out there and then there's a period when you're like hearing from customers and you're maybe iterating. Oftentimes it's like frontend tweaks initially and those are a little harder to predict because it's based on customer feedback.

Cory:

Yeah, yeah. And tastes, which are always extremely hard to predict and hard to make everyone happy. That's what I've learned.

Awesome. Okay, so there's the Dagster open source product. Anybody can just go out and grab it. It's on GitHub.

Pete:

Right.

Cory:

I can stand it up locally. I've got lots of plugin options for where it runs. So let's say I brought Dagster into my environment. Got it. I love it.

Like, where do I go from Dagster open source to Dagster like the enterprise product? What's that path like?

Pete:

So first of all, we are big believers in open source. I've been work... you know, I used to work on React, so I've been very into open source for well over a decade now.

Nick Schrock - our founder, CTO, co-creator of GraphQL - he's very into open source as well. So we're like kind of true believers I think.

You know, you've also got to make money to be able to fund all this open source too.

First of all, we want customers to feel like they can always leave our commercial product and continue to operate their business. So we try to not introduce proprietary APIs that keep you hostage to us.

There's an altruistic aspect of that, but at the end of the day it's good for business. If you're going to go sell a deal to a major enterprise, they're going to go run a risk analysis on all their vendors. They're going to say what's going to happen if this vendor raised prices or goes out of business or something. And the technologies that have true open source alternatives or true open source versions of them are going to win those RFPs, right?

So customers demand it and we're happy to oblige.

In that case, like how do you make money in open source then? And how do you afford to continue to fund this work?

And our approach has been to, number one, focus on the needs of accelerating enterprises. There's a lot of integration work that sits better in a cloud service and is directly correlated with willingness to pay and enterprise value. These are things like, you know, SAML integration - this is a huge pain in the ass to do, especially in an open source product, and it's very much correlated to willingness to pay. Then there's a bunch of related stuff around access control, authentication, authorization.

The second thing is the open source orchestrator throws off a ton of interesting data. We have a ton of structured event logs, ton of metrics. And you know, in our open source product that's kind of just like oil that's sitting underneath the ground. Our commercial product basically taps those oil reserves and like makes them available to the customer. So we can do things like help you understand your spend, help you understand your core KPIs - like time to resolution of an incident.

We can create subsets of the platform so you can give them to one of your stakeholder teams and be like, "Listen, you guys manage your own spend, you guys manage your own operational excellence KPIs - we, the central data platform team, have too much other stuff to do," and we give them the ability to self-serve their operational KPIs and metrics and stuff like that.

You know, the other thing that we do is we offer up some really nice quality of life features. Like we do full staging environment per pull request, for example. We have a very deep GitHub and GitLab integration that lets you do some interesting stuff around staging environments.

So that's kind of been our approach to building a commercial product. Most customers, they try Dagster open source. They get a lot of value out of it.

Then they say, "Okay, we're about to make this a mission-critical part of our infrastructure" and they'll come to us and they'll say "Listen, we're just going to have a conversation" and they'll learn a little bit more about our offering. Oh yeah, we host it too by the way. That's another reason why people come here.

And so a lot of companies say, "There are some of these access control and observability features that we really want." Others just want a relationship with the vendor. Others don't want the operational risk of hosting a new service. And so we're happy to partner with those customers. And sometimes they say, "Listen, you know, we want to stay on open source for a while, it's working great," and we say, "Awesome. At some point in the future, hopefully you'll come on board." But we see losing to open source in a sales process as not necessarily a loss for the business, right? It's just, "It's not now."

Cory:

I feel like lock-in is always such a concern of operations and data teams and it's just like... there's a lot of businesses where you can like, "Haha, I got you and I can lock you in and I can crank the price" and like you can't really do much about it besides pay more or suffer an extreme burden to leave.

And like I always feel like you see some of the best innovation and some best features come out of companies where they're not afraid of their customers leaving. Because then it's competing not on, you know, keeping you here - it's "Hey, I have to compete with this open source product by giving you things that are valuable" versus like forcing you to stay here.

That anti-lock-in mindset, I feel like, is something that's few and far between in a lot of open source and enterprise versions of products. It's very cool to see you guys do that.

Pete:

I'd say we're number two in the category. Number one is obviously open source Apache Airflow - number one in the sense that it's the most widely deployed.

And when you're the number two in the category you've got to be like 10x better than the number one on like every dimension. And so we really strive for that.

And at the same time, I think that life is too short for me personally to go build a business that is not going to be successful and a win-win for all the parties around the table - employees, customers, investors.

And so we do try to make an effort to have a 10x better product and also have real healthy gross margins at this stage of the game. So we're not selling dollars for 50 cents, locking up the market and then doubling prices in order to break even. Like, that's just not... it's not something I want to do. I know it's been done in the past, but it's not something I want to do.

Cory:

Very successful business model for the folks doing it. But no one's super enthralled by it besides the people counting the cash.

Pete:

But again, it's like... I like to get the willingness to pay answer early, right?

Cory:

Yeah.

Pete:

Like, is the market going to pay for this product or not? Because if it's not, then I'm going to go move on with my life. The good news for us is that it's going great. So I'm excited and motivated and keyed up to grow this thing.

I think that the amount of time it takes to get to the answer... and one of the things that I've learned in my career is getting the bad news as early as possible is great.

Cory:

So for those teams... so I'm sure lots of folks that are listening, like, they're probably running Airflow. They probably have Airflow jobs they're running. So, like, for teams that are already on an Airflow - they're number one in the category - what is the migration to Dagster where they're going to be able to run this stuff anywhere? They're going to have more telemetry and information, like less of the operational glue that they have to put into it.

How do they go from Airflow to Dagster? Is it a complete rewrite of everything that they've done or is there like a smooth migration process?

Pete:

Our founder and CTO, Nick Schrock, made a name for himself at Facebook as, like, the technology migrations guy. Or one of them. There was a small number of people who were like, "All right, we've got a V2 of the system - and it's a V2 of a core software abstraction - how do we migrate a thousand engineers and the, you know, 10 gigabytes of PHP code over to this thing?" I guess it's a 10-gigabyte compiled binary, but you get what I mean - a lot of code.

And so the answer is incremental migration. It's completely unreasonable to ask people to like do a pause the world and rewrite everything. So we have a toolkit that we call Airlift. And our approach here is, I think, pretty interesting.

Step one is we do this thing called federating, where we land on top of the existing Airflow cluster via the Airflow API, and we can give you observability into the Airflow cluster better than they can - just because we've invested a lot of time and effort into building good observability tools, we're able to use the Airflow API and get a good chunk of that functionality for Airflow users. So now you've got, number one, a better observability tool for Airflow.

Number two, you can start to build downstream of Airflow because we're observing the activity in Airflow. So you can say, "When this job finishes, run my new Dagster stuff."

There's also a way where we can incrementally, one by one, pick off parts of the pipeline and move them step by step into Dagster. And there's a Boolean config file, so if you move a step into Dagster and you're like, "Oh shit, we broke something," you set, like, migrated equals false, and then Dagster will continue to run that step back in Airflow.
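
To illustrate the per-step rollback idea Pete describes - and this is purely a toy sketch, not Airlift's actual config format or API, with invented task names - the essence is a per-task Boolean that routes each step to whichever system should run it:

```python
# Illustrative sketch only - Airlift's real configuration may differ.
# The idea: a per-task "migrated" flag decides whether a step runs in
# Dagster or stays in Airflow, so a migration can be rolled back step
# by step instead of pausing the world.
migration_state = {
    "ingest_invoices": {"migrated": True},
    "annotate_invoices": {"migrated": False},  # flipped back after a breakage
}

def execution_target(task_id):
    """Route a pipeline step to the system that should run it."""
    return "dagster" if migration_state[task_id]["migrated"] else "airflow"

print(execution_target("ingest_invoices"))    # → dagster
print(execution_target("annotate_invoices"))  # → airflow
```

The low-risk property comes from the flag being cheap to flip: breaking a step means toggling one Boolean, not reverting a rewrite.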

Cory:

Oh, dude.

Pete:

It's really, really neat. Yeah.

Cory:

That is awesome.

Pete:

It's super, super cool. Nobody's got anything like this.

Cory:

Yeah.

Pete:

So, you know, migrations are migrations, but this is a pretty easy one.

Cory:

Yeah. So there are, like, baby steps in. I can take my Airflow, I can toss it in... it was Airlift. Start getting some metrics. I'm like, "Oh, I'm actually getting more information out of my Airflow via Dagster," and then essentially baby-stepping jobs and workflows from there, with this fine-tuned control.

That is pretty slick. That is pretty slick.

Pete:

I'm telling you, man, Nick Schrock did a great job with that one. And the rest of the team that worked with him, Chris Decarolis and a bunch of other people, did a great job.

Cory:

That's very cool.

I know we're coming up on time here. Before we go, I'd love to know... with all the things that are happening in AI - I assume it's going very well for you all - how are people tying LLMs into Dagster? Do you see a lot of people using this to push data into the big LLMs? Are people tending to do their own training and bringing models in-house? What are the big use cases you're seeing right now?

Pete:

Oh man, lots. I keep using fossil fuel metaphors here, but I do think that they are useful.

It's like data is in many ways the fuel for these gen AI applications and we're the thing that builds and operates the pipelines and gets the data where it needs to go. So obviously business has been great with the rise of all these AI enabled applications. How people are using it though, it's very interesting.

We have one large AI lab training their foundation model on us. We have a couple of really trendy gen AI, like, vertical AI SaaS businesses - they're not training models, but they're calling these other model providers and using Dagster to orchestrate those calls and produce the data assets that power their application. And we have kind of old-fashioned shops that suck in a bunch of invoices from dental offices - they used to send those to an outsourced team to annotate, and now they're able to do that with LLMs.

What people are doing with the Dagster side of the house - there's a lot of innovation. And there's also a lot of innovation on how people are writing code. We see Claude Code and Cursor and these sorts of tools pretty universally adopted.

I think that's a very interesting trend and one that we're paying very close attention to.

Cory:

Awesome. Well, this has been super exciting. Thanks for coming on the show.

I'm going to drop links to a couple of the things you mentioned, the blog post, as well as Airlift and Dagster open source in the show notes. So definitely check those out.

And then where can people find you and follow you online? Are you on LinkedIn? Twitter still?

Pete:

Yeah, I'm on Twitter. Floydophone - F L O Y D O P H O N E.

Cory:

Okay, we'll put that in the show notes too. Awesome.

Pete:

Cool.

Cory:

Well, thanks for coming on the show, I really appreciate the time. And everyone, thanks for listening and we'll see you soon.


About the Podcast

Platform Engineering Podcast
The Platform Engineering Podcast is a show about the real work of building and running internal platforms — hosted by Cory O’Daniel, longtime infrastructure and software engineer, and CEO/cofounder of Massdriver.

Each episode features candid conversations with the engineers, leads, and builders shaping platform engineering today. Topics range from org structure and team ownership to infrastructure design, developer experience, and the tradeoffs behind every “it depends.”

Cory brings two decades of experience building platforms — and now spends his time thinking about how teams scale infrastructure without creating bottlenecks or burning out ops. This podcast isn’t about trends. It’s about how platform engineering actually works inside real companies.

Whether you're deep into Terraform/OpenTofu modules, building golden paths, or just trying to keep your platform from becoming a dumpster fire — you’ll probably find something useful here.