Episode 32

Published on: 16th Jul 2025

Building Better Platforms with Dapr: Abstractions, Portability, and Durable Systems with Mark Fussell

Cloud lock-in isn't just about where your data lives—it's about how deeply cloud-specific code permeates your applications. Mark Fussell, co-creator of Dapr and CEO of Diagrid, joins Cory O'Daniel to explore how Dapr provides clean abstractions for common distributed system patterns, enabling teams to build portable applications without sacrificing cloud-native capabilities.

The conversation covers:

  • How Dapr creates a clean separation between application code and underlying infrastructure services like messaging, state management, and secrets
  • Why platform teams struggle with tight coupling between applications and infrastructure, and how Dapr solves this problem
  • The benefits of Dapr's sidecar architecture for local development, testing, and production environments
  • How Dapr automatically handles cross-cutting concerns like security, observability, and resiliency without boilerplate code
  • Introduction to Dapr's workflow engine for durable execution and the emerging world of stateful AI agents

Whether you're a platform engineer struggling with cloud lock-in or a developer tired of rewriting code for different infrastructures, this conversation demonstrates how Dapr can simplify your distributed systems while maintaining access to the unique capabilities of each cloud provider.

Guest: Mark Fussell, Co-founder of Dapr and CEO of Diagrid

Mark Fussell is the CEO of Diagrid, a cutting-edge company that simplifies building and scaling cloud-native applications. As the co-founder of Dapr (Distributed Application Runtime), Mark has played a pivotal role in shaping the future of modern application development by empowering developers to build resilient, distributed systems with ease. With decades of experience in the software industry, Mark has been a driving force behind innovative solutions that bridge the gap between developers and complex infrastructure.

Links to interesting things from this episode:

  • Diagrid
  • Dapr

Transcript
Intro:

You're listening to the Platform Engineering Podcast, your expert guide to the fascinating world of platform engineering.

Each episode brings you in depth interviews with industry experts and professionals who break down the intricacies of platform architecture, cloud operations and DevOps practices.

From tool reviews to valuable lessons from real world projects, to insights about the best approaches and strategies, you can count on this show to provide you with expert knowledge that will truly elevate your own journey in the world of platform engineering.

Cory:

Welcome back to the Platform Engineering Podcast. I'm your host, Cory O'Daniel, and today I have with me Mark Fussell, co-creator of Dapr and CEO of Diagrid.

Before starting Diagrid, he led platform strategy at Azure and spent years thinking about how developers build distributed systems. With Dapr, he helped create a standard runtime for cloud-native apps, one that's now widely adopted by platform teams.

At Diagrid, he's focused on helping teams operate these systems more reliably in production. Mark, thanks for coming on the show today. Super excited to talk about Dapr.

Mark:

It's fantastic to be here, Cory, and I'm excited to talk to you about all things distributed.

Cory:

Yeah, yeah.

So why don't you give us a little bit about your background, like what you were doing at Azure and like what kind of drove you towards, you know, creating Dapr and Diagrid.

Mark:

Yeah, yeah. Actually I spent many years at Microsoft. I spent 20 plus years there.

I started out working on XML, which was like the hot thing at the time.

Then, after working on databases, I ended up on the team building Service Fabric.

And it was all about building a distributed systems platform for Azure to build and host and run their hardest services, which were their database services like SQL Azure and Cosmos DB and things like this.

Cory:

Very cool.

Mark:

And anyway, it taught me a lot about building distributed systems, building a platform, running things at scale, and understanding how you really manage and operate these things.

Everything from sort of replication technologies all the way up to what it provided, which was not only an application programming model but also a description of how the applications had dependencies between each other.

And so yeah, Service Fabric was a great understanding of learning and transforming from what was in the days of client server and how you built things to kind of the modern way now of effectively distributed compute and running it on multi-machines. It was a fun, fun journey.

Cory:

Yeah, that sounds very cool. It's funny that you say xml. I had... I don't know if you remember this... I remember I got it, I think... Not Barnes and Noble.

It was the XML Bible, back in the early days.

And I was like, "Man, once I read this thing, I'm going to be the best programmer on the planet." I did not make it past the first chapter.

Mark:

There's XML, XML transforms, XSLT... there's all this sort of stuff. And actually I also spent five very difficult years working on all the W3C protocols. So W3C Schema, WS-SecureConversation and all that sort of thing, which is now kind of in the past.

Cory:

Yeah, yeah, very cool. So what I'd love to do is I'd love to just kind of start from the top and like get... I feel like you're in a place where you're interacting with a lot of platform teams today.

Mark:

Yes.

Cory:

And it's interesting because I think that, you know, the idea... I'm a big fan of creating abstractions...

I think that, honestly, that's what I've always felt DevOps was about, was creating abstractions to act as the collaboration versus like making other people do your job - which is how it's surfaced for many teams.

Seeing what Dapr does and seeing the types of teams that you're working with in the platform engineering space, I'd love to just get what you think the current state of platform engineering is today. And what are some of the biggest problems some of these platform teams are trying to solve with and without Dapr?

Mark:

I mean, let me give you a little bit of overview of what Dapr is and then we'll sort of dive in and answer that question.

And just to kind of continue my background: when I was at Microsoft, so six years ago, we started the Dapr project. And then I left Microsoft three and a half years ago to form Diagrid because we sort of wanted to continue innovating with it. So today, you know, I'm the co-founder as well as the co-creator of Dapr.

But I mean, what Dapr is in essence is a set of developer APIs that allow developers to not reinvent the pattern, as we say, or not reinvent the wheel for common abstractions that they need. For example, decoupled messaging. So if you're building an event-driven system - an event-driven application - you do Pub/Sub messaging, and it provides an API for Pub/Sub messaging. Or if you want to communicate with request-reply semantics between two services, it has this thing called service invocation which does discovery, secure calls, and if an application fails and falls over to another machine, it will reconnect it all. Or things like secrets management.

So these common APIs are really what developers need in order to sort of be productive building their applications running on top of any different type of compute platform like Kubernetes for example.

So to go to your point of like, what do we see in platform engineering today? What we see a lot is platform engineering teams focus on, "Okay, I'm building a platform, it's got some services in it, like a database service, or a messaging service like Kafka or RabbitMQ, or you can get some secrets management from it as well, like HashiCorp Vault or one of the secrets management services from, say, AWS or Azure." And they want to provide all these services, but they somewhat ignore the actual contract to the developers themselves.

So what happens is you see like a developer goes, "Well, okay, I'm just going to use the Kafka SDK." And so they pull in the Kafka SDK.

Now if you want to do Pub/Sub messaging between services, first the Kafka SDK isn't very nice. You have to sort of build this sort of Pub/Sub semantics on top of it all - how I publish a message and how I subscribe - because it's a sort of streaming thing to begin with. So you have to sort of do that first.

Then you sort of bake the Kafka SDK into your code and then you start using it all. And then you know, lo and behold, kind of in about three months or four months time, someone says, "Oh, we have to move everything to AWS now" or some other cloud, and you have to rip out all that Kafka code and instead, you know, replace it with AWS SNS or Azure Service Bus or one of these other ones.

And all of a sudden you've got this huge pain point where developers have tightly coupled their code to the underlying platform. You know, you don't want to really do this.

One of the key benefits that Dapr provides is a very clean abstraction between the Pub/Sub behavior and semantics of an API and the ability to swap out the underlying infrastructure - which are called components. So you can literally keep the code exactly the same, with the right behaviors, and swap between Kafka and say RabbitMQ or SNS literally within hours, with a component that's been very well tuned for that particular message bus because the underlying SDK has been wrapped with the right behaviors. It has a YAML metadata description... I know you're a YAML fan.

Cory:

I love it.

Mark:

So you have this description of this component and it has all the metadata about it all.

So you literally go and take the Kafka YAML description and swap it for the AWS YAML SNS description, drop it in place, and boom, you've literally migrated from Kafka to SNS within hours. And off you go.
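To make that concrete - as a rough sketch, not taken from the episode, with broker addresses and component names invented for illustration - a Dapr Pub/Sub component is just a small YAML file, and migrating means swapping the `spec.type` and its metadata while the component name your code references stays the same:

```yaml
# kafka-pubsub.yaml -- the component the application sees as "orderpubsub"
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: orderpubsub
spec:
  type: pubsub.kafka
  version: v1
  metadata:
  - name: brokers
    value: "kafka-broker:9092"
  - name: consumerGroup
    value: "orders"
---
# To migrate, drop in the AWS SNS/SQS component instead; the name stays
# "orderpubsub", so application code is untouched.
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: orderpubsub
spec:
  type: pubsub.aws.snssqs
  version: v1
  metadata:
  - name: region
    value: "us-east-1"
```

The application keeps publishing to "orderpubsub"; only the YAML behind it changes.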

And so that's one of the things that we see happen a lot today, this sort of tension and fight between the application developers and what the platform provides and this sort of tight coupling. Also it provides really for portability of code.

So to address the problem... we see a lot of financial companies that have to deploy multi-cloud - they have to deploy on this cloud and that cloud. We also see a lot of people who need choices in a platform - say, Kafka and SNS both available. And it also provides design flexibility: as you move forward and decide one particular message broker doesn't satisfy you, and you want to adapt and change, you can.

So those are kind of the benefits we get from a platform engineering team perspective. Does that make sense?

Cory:

Yeah, yeah, yeah, it does. So it's funny that, like, as you're describing this Kafka thing... the word I like is calcify. Like it just gets crusted in there and it's all over the place. And it's funny because, like, when people talk about cloud lock-in, like, I very rarely feel like cloud lock-in is your data.

It's like, "Oh, I've got some stuff in an event broker somewhere." Okay, you write a little tiny lambda that just attaches and pulls that out to another one. Like you can get that data elsewhere.

I feel like the real cloud lock-in happens in two places. Security level, because you just can't see it.

Mark:

Yeah.

Cory:

And you forget about it. And then when you move to the new cloud, you're like, "Oops."

Then the other place I feel like you really feel cloud lock in is in the code base.

Mark:

Yes.

Cory:

Because their SDKs have just kind of permeated every single file. And it's like, man, to move off of S3. It's not hard to sync an S3 bucket to GCS or Azure Blob Store - it's very trivial.

To change the code...

Mark:

Yes.

Cory:

To change the code is going to take so long.

Mark:

It's so long.

Cory:

And it's like one of those things. It's like, it's funny. Like if your ops person sits down to move the data and the developer sits down to change the SDK, the part that we think about as cloud lock-in happens way faster. Like moving that data is done...

Mark:

Exactly.

Cory:

But changing that SDK, it's like, "Man, all of our test mocks are toast. Like, so much stuff is toast."

Mark:

Yes.

Cory:

That is pretty cool. It's funny, I feel like this is kind of covered a little bit by like, you know, hexagonal architectures or something like that. But like that also feels like, for many teams, a big, hard concept to swallow.

Like that's one of those things that we haven't really DRYed up. And it sounds like that's kind of what Dapr is doing. It's saying, "Like, look, there are these abstractions that make sense to developers." I buy this. I know many people do. I know there's some Ops people that still struggle with this, but developers tend to not. I don't want to say they don't care, but they're not as concerned.

Mark:

No. Yeah. I mean, you get a lot of, "I want to try and solve this problem myself."

So the biggest thing you rub against is the developers want to build a platform themselves and get all very excited about this. But that's not, you know, the business value to the company.

Cory:

No.

Mark:

And you know, when the business value to the company is like a level of portability, cross cloud.

I mean, there's also the same thing, comes down to the fact that developers can build things locally. So you can build against this API and you can run it.

For example, often we see that people use Redis as a local Pub/Sub testing environment, and then they can just swap out Redis and put AWS SNS in, and they move from local development to cloud with no code changes.

And like that concept of like, I just switch between this - I mean the developers love all this. And it's a very clean API around these things.

Pub/Sub epitomizes this but you've got things like jobs as well, a cron job API. You've got secrets management, you've got this concept of sort of bindings where I can talk to an underlying state store or a SQL database and then just swap that out. So, you know, all of these common API definitions make it very clean for developers to build these apps.

Cory:

The local development test one's another one that's really interesting to me. So we lean heavily into this idea internally, but it's a ton of boilerplate for us.

I've written about this a bit on LinkedIn. Like we just moved from AWS to just completely cloud native... we can define that term later, I think... like fully into Kubernetes. And when we did this, we were just heavily into adapters, because I'm very into TDD and I don't want to hit S3 every time I write like a Blob store test. And so it's like we have an abstraction that's like Blob storage. In development and test, it's actually just Erlang in-memory stuff. In production it's S3, right? And it's like, for us, that's really, really nice.

So it sounds like this gives you that, without every team having to be just like spending tons of effort. Because we do spend a lot of effort on that boilerplate. Like every time we introduce a cloud service, we create an abstraction around it, we create a test version, a prod version. And it sounds like this just like gets you going.
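The boilerplate Cory describes - an abstraction with an in-memory adapter for development and test, and a cloud-backed one for production - might look something like this minimal sketch. The `BlobStore` interface and all names here are hypothetical, not from Massdriver's codebase:

```python
from abc import ABC, abstractmethod

class BlobStore(ABC):
    """The 'port': the only blob-storage interface application code sees."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryBlobStore(BlobStore):
    """Dev/test adapter: a plain dict, so TDD never has to hit S3."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]

# In production you'd register an S3-backed adapter implementing the same
# interface; application code depends only on BlobStore, never on boto3.
def store_report(store: BlobStore) -> bytes:
    store.put("report.csv", b"a,b\n1,2\n")
    return store.get("report.csv")

print(store_report(InMemoryBlobStore()))  # b'a,b\n1,2\n'
```

Dapr's components effectively standardize this pattern, so each team doesn't maintain its own port-and-adapter layer per cloud service.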

Mark:

Exactly. That's exactly what it's designed for. And we just saw this time and time again repeated across all these companies.

And it's not only that, you've got to start putting behaviors inside it all. So let's take messaging as an example.

Often when you start to build these distributed applications and you want to publish messages, you want to be able to say, "Well, I only want these other services, from a security perspective, to be able to receive them all and not these ones." So you want to be able to lock down who can receive the messages, rather than anyone to be able to see them all.

So in the messaging behavior of the Pub/Sub API, you can say only these applications are allowed to subscribe to these messages and these ones can't. So you can be very explicit about security.

In fact, security is prevalent across most of the concepts in Dapr. A very key element of Dapr, which again kind of gets missed, is that the actual pieces of code have an identity associated with them.

And the fact that you have identity associated with your code means you can say these applications can't receive these messages, or this piece of code is the only one that's allowed to connect to the Kafka service, and these ones can't.

So the platform team can now deny certain applications attaching to them or can approve certain ones as well.

Cory:

Hmm, that's cool.

Mark:

Or the application developer can say, "Only these messages or this application can talk to this", because you have Identity.

And this is another key thing that gets missed in application development. Identity allows you to do lots of super cool things, including flowing the identity of the actual process down to the underlying infrastructure itself. And we do this in Dapr with SPIFFE as an identity. So a SPIFFE identity gets associated with a piece of code.

Like in the AWS world, you can actually federate a SPIFFE identity with an underlying service like SNS. And now a developer literally has no concerns whatsoever about this piece of code providing its identity and talking to the underlying service. And the platform team can approve or disallow which applications talk to it, all based upon that identity.

This is something that's also super key that the platform teams love because now they have control over what applications can communicate. Plus the developers love it because security just flows naturally.

Cory:

Yeah, we'll put a note in the show notes and a link to SPIFFE - this is the Secure Production Identity Framework For Everyone.

Mark:

Yes.

Cory:

I'll tell you what, that's a spiffy acronym. I'll give it to them - they landed that one.

Mark:

Yeah.

Cory:

Yes. So this is the first time I'm hearing about SPIFFE. I was googling it real quick as you're saying it, so I could put a link.

Can you kind of tell us what SPIFFE is and how that actually integrates? Like, is it at the... so it's at the code level, not at the service level. So if I had like two different...

Mark:

Yes.

Cory:

You know, two different domains in my project that are storing data in different ways. Like each of those domains - like my user space and maybe whatever my other business domain is - would have its own identity, even if it's in the same code base.

Mark:

See, what developers want to be able to do is just say, "I want to talk to the order processing application and I want to call the order process on it. And that's it."

Invariably what you don't want is developers to say, "I have to look at this DNS location and open up a port and think about everything at the networking level." And all of that's just horrible for a developer. So instead, developers simply think in terms of identities of pieces of code.

So now I make my application, it has identity... like it has Mark's application, Cory's application. On Cory's application it has method A, B and C. And so from a developer's perspective I just say, "Go and call the order method on the order processing application wherever it's running in my Kubernetes cluster." I don't care where it is, it can be on a machine further away.

I'm not thinking about ports and endpoints and anything else like this, which is sort of the natural behavior of a platform team. And so all of a sudden you've got easy ways to send messages, communicate, discover things.

And that identity... going back to the SPIFFE identity that gets allocated... can be flowed down by federation, because you can actually go into the underlying infrastructure itself and federate that back to the identity that was given to the actual process itself... it can now be federated to the underlying service, like for example in the case of the AWS SNS service.

And now you can make sure that only this application with this SPIFFE identity... which is basically an X.509 certificate - SPIFFE is just a way of wrapping a piece of identity, like an X.509 certificate or any other identity credentials, and then flowing that down to the infrastructure...

This concept of identity means that it's very easy for an application developer to call other services and to have security in the underlying services.

And it's something that, you know, just prevents application developers having to think about ports and networks and other sorts of horrible things like this, which is really the natural knowledge base of a platform engineering team, not of application developers.

Cory:

It's also just like, it's that undifferentiated heavy lifting.

It's like if everybody on the team has to think about it all the time, it's just like... that's just like slices of brain that are like not put towards the product. Right?

Mark:

Exactly. Well, there are three cross-cutting concerns.

So as well as APIs, there are three things that a platform team has to think about to service an application team. They are security, end to end and what that looks like. There's observability of how you track data and how it flows. And then there's resiliency in terms of what you do in the event of network failures.

And so this is another thing that becomes a burden on the application teams. So let's take an example of observability.

Now if you can get all your APIs flowing through Dapr, every single one of them is tracked with OpenTelemetry calls. So you can make a call to, say, a secret store to get hold of a secret. You can send a message with Pub/Sub messaging and call to a database. And in that whole call flow, Dapr writes out OpenTelemetry events, which you can push into your OpenTelemetry store. OpenTelemetry is another CNCF protocol standard that's pretty well adopted now in terms of events.

And then you can see the whole call graph of what that looks like inside your Datadog, New Relic, AWS CloudWatch - whatever you choose as the environment around these things.

That's beautiful as well, because without the observability, you can't run production applications. I mean, you have no idea what's happening because you can't see. And so now you can sort of pinpoint, "Oh, this happened here, this happened here, this happened here." You can see the latency between the calls and this sort of thing.

And invariably this is something teams struggle with a lot when developers pull in the Kafka SDK, and then they pull in some other SDK, and they're sort of disconnected, having to try and find a continuous way of doing diagnostics and observability across it all. So it solves the observability problem as well.

Host read ad:

Ops teams, you're probably used to doing all the heavy lifting when it comes to infrastructure as code - wrangling root modules, CI/CD scripts and Terraform just to keep things moving along. What if your developers could just diagram what they want, and you still got all the control and visibility you need?

That's exactly what Massdriver does. Ops teams upload your trusted infrastructure as code modules to our registry.

Your developers, they don't have to touch Terraform, build root modules, or even copy a single line of CI/CD scripts. They just diagram their cloud infrastructure. Massdriver pulls the modules and deploys exactly what's on their canvas.

The result? It's still managed as code, but with complete audit trails, rollbacks, preview environments and cost controls. You'll see exactly who's using what, where, and what resources they're producing - all without the chaos. Stop doing twice the work. Start making infrastructure as code simpler with Massdriver. Learn more at Massdriver.Cloud.

Cory:

So I feel like this is one of those things. It's hard. I love OpenTelemetry. We actually... it is disturbing to new folks that like work on our application.

We have a quiet app, like if you attach to its logs, it doesn't make a peep. And it's because we've just leaned into events and OpenTelemetry, like spans and events, heavily. And so it's like everything... we just send it all there and we don't care.

It's funny. People are like, "Oh, isn't that expensive?" It's like, "No, your logs are..." It's expensive when it's an additional expense, but when you're not talking about logs, you can send 100% of stuff to OpenTelemetry and it costs about the same. And it's more useful.

Mark:

Exactly.

Cory:

One of the things I think is hard, though, is if somebody's getting into OpenTelemetry, or just tracing in general, is like where to put it. And so does the call graph... like, how Dapr is kind of this SDK that sits in between me making calls and handling the distributed nature of it... Is it automatically putting on those traces? Like on the calls for...

Mark:

Yeah, every call creates a span, the spans get flowed and does all that for you.

Cory:

It's like a freebie. I mean, you can probably do some more in your code, but like, you're not making tons of changes, like adding spans around every function. It's just like, boom, your service input came in.

Mark:

You turn it on and all of a sudden it's like, boom. You just get beautiful traces across all your APIs.

If you think about it, there you are, you know, you're calling some message broker with like Kafka, and then you're wanting to get hold of a secret and, let's say, call a database. And like, I want to see what that looks like.

And so, you know, developers want to see the call graph operations, operations people want to see what's happening. And all of these come back with metrics, information as well.

So you can see the latency, the error rates, the throughput around all of these things as well. So that whole thing is just a freebie.
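For reference - and as a sketch, not something from the episode - enabling this in Dapr is a small Configuration resource rather than a code change, roughly along these lines, with the collector endpoint as a placeholder:

```yaml
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: tracing
spec:
  tracing:
    samplingRate: "1"                         # trace every call
    otel:
      endpointAddress: "otel-collector:4317"  # your OTLP collector
      isSecure: false
      protocol: grpc
```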

Cory:

Yeah.

Mark:

Then the other freebie you get is resiliency as well. Because what happens is an application developer will call onto something like Pub/Sub service and it'll fail. And then they're like, "Oh, it failed. What do I do?" Well, you have to build retry mechanisms in where I failed and let me just retry the call again. And you can just say, retry it three times before you do something.

Or, for example, you want to put things like circuit breakers in - where something's going crazy and you want to sort of break it off for a while until it recovers around these things.

So the resiliency can be put onto any API as well. So you can do retries on the Pub/Sub API or you can do circuit breakers if things are misbehaving around this sort of thing. And so you end up building these on top of a lot of Pub/Sub systems and all these sort of things as a whole.

These cross-cutting concerns of security, resiliency, observability, you just get for free - it's great.
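As a sketch of what declaring those policies looks like (names and thresholds here are illustrative, not from the episode), Dapr resiliency is defined declaratively and targeted at components, rather than written into application code:

```yaml
apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: myresiliency
spec:
  policies:
    retries:
      pubsubRetry:
        policy: constant
        duration: 5s
        maxRetries: 3            # retry three times, then give up
    circuitBreakers:
      pubsubCB:
        maxRequests: 1
        interval: 8s
        timeout: 45s             # break off until the broker recovers
        trip: consecutiveFailures > 5
  targets:
    components:
      orderpubsub:               # apply both policies to this Pub/Sub component
        outbound:
          retry: pubsubRetry
          circuitBreaker: pubsubCB
```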

Cory:

So let's say I've got just two services - not cloud services, two of my own services. I know one's an old flaky one - it's written in Java 2 from a decade ago, but for whatever reason we can't... we have to deal with it, like it's there. Now the retries - is that happening from the SDK caller, or is something intercepting the calls and handling the retries on the last leg?

Mark:

Well this is where Dapr comes in. So maybe I should just go into the next thing. What Dapr does is Dapr runs as a sidecar process to your application.

So the way it works is every single one of your applications that you launch... say I launch 100 applications... they've all got their own little Dapr sidecar running next to them all. And so Dapr has its APIs that you're calling and it's doing all the heavy lifting for you.

And so it's doing not only the behavior of like the Pub/Sub messaging, but it's providing a resiliency policy and doing that retry on your behalf.

Cory:

Oh cool.

Mark:

If you're doing a call, it is logging that telemetry call. Or if you're doing security, it's checking that you're authorized before it hands the call on. So it's doing all those things on your behalf that you get for free.

So instead of you having to like bake this all into your application code or try and mix and match all these different SDKs together, Dapr just does this as a general purpose library, as it were, but launched as a little sidecar process next to your application.

Cory:

So you don't have like a hundred apps all calling back to one Dapr server - it's all sitting right inside your Kubernetes deployments. Right there, fast localhost.

Mark:

Exactly.

Cory:

Everybody's got their own. That's tight. That's awesome.

Mark:

Yeah, yeah. So when you launch your pod for your application, Dapr runs a sidecar inside your pod. So you know, it's just a local host call around these things.

There is a version of Dapr that you can deploy per machine if you want. You know, because some people like the idea of having fewer resources, or they want to have a central place for this. But the most common deployment is, you know, a single pod with a Dapr sidecar running next to it for your application, and you have, you know, 10 applications in 10 pods, and that's running on Kubernetes and it all just scales out for you.

Cory:

Yeah, that's very cool. So let me ask you a question about the abstractions because I've seen this in a few places where people have tried to do multi-cloud. I'm curious how Dapr handles it. So what's the SDK for like Blob Storage called?

Mark:

Well, right now there's a Blob Storage binding we have for that. But let's just talk about... probably the best one to go for is state management, where we sort of do key value storage into any form of database that you want.

Cory:

There are so many of these in S3 that are great illustrations, but how does it deal with the lowest common denominator of the providers behind the scenes? Like, let's say that I need a feature - can I get that feature through the Dapr bindings and SDK, and I just only get it if the backend supports it? Or is that feature just not accessible to me because the lowest common denominator doesn't offer it?

Mark:

Yeah, so you nailed one of the most important questions that people ask. It's like, "Oh, you know, do I not get the features of the underlying platform?" And the answer is you do, because although the API is consistent, you can still choose the features of the underlying provider that gets surfaced in the component itself.

Let's take an example around this. So Dapr has an API for saving state as key value pairs. So you say, here's key value pairs, and then you can plug in any underlying database for that. And there's like 30 different databases, but some of those databases support transactions and some of them don't. And so you can decide that when I'm using the key value API and I have a transactional database and I want to do sort of multi-update writes, I can - but this other one, if I swap it out, can't. And so the API will still stay the same - here's a set of key value pairs I want to store - but some databases can do a transaction and other ones can't. And so you do get all the behaviors.

Or, for example, another common one, more so, is with the Pub/Sub message providers. Some of them provide sort of different capabilities on their consumer groups or they might provide their own retry mechanisms. And so all the behaviors of the SDK itself for the particular infrastructure do get surfaced up in a YAML description in a component. But the interface itself, I should say the API itself is very clean. For Pub/Sub it's just publish and subscribe - literally, that's all it is. But you can choose the behaviors in a YAML format for the underlying message broker that you want.

So for example, Kafka has its own retry mechanism and you can choose that if you want, or you can choose its own security authentication mechanism, things like this. So you do get those behaviors of the underlying component to choose from as well.
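As an illustration of broker-specific behavior surfacing in the component YAML (the values here are placeholders, not from the episode), the Kafka Pub/Sub component exposes Kafka's own knobs - authentication, consumer group, its retry interval - without those details leaking into application code:

```yaml
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: orderpubsub
spec:
  type: pubsub.kafka
  version: v1
  metadata:
  - name: brokers
    value: "kafka-broker:9092"
  - name: consumerGroup
    value: "orders"
  - name: authType
    value: "password"            # Kafka-specific SASL authentication
  - name: saslUsername
    secretKeyRef:                # pulled from a secret store, not hardcoded
      name: kafka-secrets
      key: username
  - name: consumeRetryInterval
    value: "200ms"               # Kafka component's own retry knob
```

Swap this component for another broker's and these Kafka-only fields disappear, but the publish/subscribe API the application calls stays identical.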

And people often submit updates to the components for a particular SDK just to surface it in the component YAML to use. So to answer your question, yes, you can use a component... the underlying behaviors of the component or infrastructure.

Cory:

Yeah, that's one of the things when you look at the three major clouds, they all have services that compete, but then you'll look at one service and you're like, they spent a lot more time working on this than the other team did.

And S3 is one of those where it's just like, moving the data is easy, but then you get over there and you're like, "Ah, there's no versioning, there's no locking, there's no multipart uploads" - like, so much stuff.

Mark:

Yes. So you do get that if you do choose between moving from one message broker to another, for example. There is a forced choice of like, okay, you might have had a feature in one that isn't inherent in another. And so you'll lose it or gain it, whatever.

Cory:

Yeah, but my code doesn't change.

Mark:

But your code doesn't change. Yes.

Cory:

Which is still... that's the lock-in hard part.

Mark:

Yes.

Cory:

Okay, so I think the other hard part is... So Dapr is definitely not like, "Hey, I'm a startup, I don't have code yet, let's go to Dapr." Like, I'm going to Heroku or whatever first, right?

So you're pitching teams that have debt, code, just tons of stuff deployed. Like what does their 0 to 1 look like with Dapr? Like what are the baby steps to, like, how do we introduce this to an environment? And how do we start using it without overhauling all of our services to start getting the benefits?

Mark:

Yeah, exactly. I mean, just to be clear here, Dapr isn't just for new greenfield applications. In fact, most commonly it's used in the modernization or migration of existing applications.

And so, I mean... I think the bigger challenge more than anything is that Dapr, the open source project itself, is mostly run in production on top of Kubernetes. And Kubernetes is just generally hard for startups as a whole, I would say. But we also get people who just run Dapr on VMs - you can do that too.

But Kubernetes is still the more challenging environment, though there's a very easy integration to it in that you deploy a control plane service that does all of the management and launching of the Dapr sidecar for you. So actually from a Kubernetes perspective, it's very easy to deploy, run, upgrade, and manage.

But it more depends on: are you prepared to take on Kubernetes as your underlying hosting system to host and run your code? To get started, there are some great getting started guides. It's very easy to install Dapr on your local machine.

There's just a CLI; you can deploy it all locally. You just do "dapr init" and it sets up a development environment for you with Redis and Zipkin and a bunch of other local services. It sets up Redis as a local Pub/Sub broker and state store for you - you can run and test all your services locally. And then what really happens is you deploy Dapr as a control plane service into your Kubernetes environment, switch out these components for the environment you want there, and you're off to the races.

So it's pretty easy to set up and manage itself. I say the greater challenge is a Kubernetes environment around these things.

Cory:

Well, so for teams that are already like... let's say they're running Kubernetes. I mean, I'm sure there's plenty of getting started use cases, but do you typically see teams like, "Hey, we're going to start using Dapr and we have a new service that we're building, so we'll build it using Dapr"? Or do you see them saying, "Hey, you know what? We know that we want to move off of Kafka to RabbitMQ or something else, and so we're going to use Dapr to make it where we don't have to go through the code thrashing a second time"?

Mark:

I would generally say, it would be my opinion that you should not build anything on Kubernetes as an application without Dapr. I would make that claim. If you were trying to do it yourself...

Cory:

Ooh baby. Hard take.

Mark:

Yeah, I think it solves so many more of the problems that you will just end up reinventing the wheel. And there was a good case study that actually came out just a few days ago from a company called DataGalaxy. They were modernizing an existing... they called it their own spaghetti-code thing. And so they went to a more... still a single binary, but a modular architecture inside it all. So they didn't necessarily break it up into lots and lots of different pieces of code, but they went from a spaghetti architecture to a modular architecture using Dapr Pub/Sub messaging. And they did introduce another new service that they were sending messages to.

So if you look up the DataGalaxy case study, I think that's a great example of how they modernized existing code and split it and started using Dapr Pub/Sub APIs. Introduced Dapr into a new piece of code that they wrote that was receiving the messages, because it was effectively going off and doing some remote processing and then sending a message back.

I would strongly recommend that anyone who's starting to build on Kubernetes should start to use Dapr from the get-go because it just really solves a whole bunch of problems. Not just in terms of behavior, but in terms of these cross-cutting concerns I talked about.

Cory:

Yeah, I mean honestly I feel like... that quote's going in the video preview. We're going to get views on this one.

But I feel like also, going back to that good, well-tested hexagonal code, there's so much pain in boilerplating the adapters around every single cloud service you use. And I feel like as soon as you start getting off of a PaaS and into the cloud, that proliferation of services can explode quickly.

Mark:

Exactly right. Here's another thing that we encounter a lot - there are a lot of frameworks out there. And I will point particularly to Spring Boot as one that would also say, "Yeah, we do a lot of the things that Dapr does." And it's true, they do. They have discoverability, and they have messaging, and they have abstractions over Pub/Sub and things like this. And the Spring Boot community is very nice. But that's just the Spring community. And what we've seen is that most development nowadays has a mixture of languages.

In fact, you'd be amazed at how much Python and JavaScript is now coming into existing applications.

And so Dapr also plays across different language boundaries, where the Python developers can work nicely with the existing Java developers and send messages between them all. So we see a lot of that happening as well.

Cory:

Yeah, and that's the hard part. I feel like, especially at bigger companies, they're buying other organizations and you have no idea what language and cloud just get introduced overnight.

So tell me more about Diagrid. Let's say I'm using Dapr today. This is great. I love it. My Devs love it. It's super easy for us to swap out underlying layers. Their code's not changing. I'm having a great time. When do I start looking at Diagrid as an enterprise version of this?

Mark:

Great question. So just to be clear, by the way, Dapr is part of the Cloud Native Computing Foundation (CNCF).

Cory:

Hell yeah.

Mark:

It's a graduated project there, which is the highest level of endorsement that you get, which means that they did all their due diligence around it all.

And actually, myself and my co-founder, Yaron Schneider, we were the ones who actually started the Dapr project when we were at Microsoft and left to form Diagrid. And today at Diagrid, we are actually the primary maintainers of the Dapr open source project, along with lots of other companies - Microsoft and Nvidia and Alibaba and Intel all contribute to the project.

There's a number of contributing companies to the open source project, but what we do at Diagrid is that we sort of provide two things.

First, we provide Diagrid Enterprise, which is effectively enterprise support for organizations who've adopted Dapr but need the core skills to help not only with architecture guidance but also with incident support and fixing things upstream. So we provide support.

We also provide a very cool tool called Diagrid Conductor that allows you to manage Dapr on top of Kubernetes. It lets you do rollouts of new versions of the Dapr control plane, gives you great visualizations, and gives you an advisor tool that looks across all your components and infrastructure and finds things you may have misconfigured. Effectively it makes an awesome experience for managing Dapr in a Kubernetes environment - taking all of this incredible data coming out of Dapr in terms of metrics and behaviors, and letting you visualize and manage it all.

Then we also provide our own Dapr distribution, which is just a more secure version with some other features inside it. But most of our engagement and effort goes into making upstream Dapr an amazing product to use. So that's what we do there.

And then where we transitioned to recently... which is also about taking people down the journey... is that we've created a server product for Dapr. Rather than taking the open source project and hosting and running it on Kubernetes yourself, our server product, called Diagrid Catalyst, can be deployed into your environment.

And now you can run and deploy your application on any form of compute and call into Diagrid Catalyst as a Dapr server and use all the APIs there without you having to manage and upgrade and deploy Dapr itself, because we take care of all that for you.

As you really go down your journey, this helps you use Dapr in a multitude of environments rather than just being bound to Kubernetes.

Cory:

Yeah. So then I can run VMs, or if I've got like a Nomad cluster I can schedule work there.

Mark:

Exactly, yes. And then that sort of thing.

I also think that one of the most important APIs that we introduced into Dapr in the last release was a workflow engine. Are you familiar with workflow engines?

Cory:

Baby, I am a workflow engine. Sorry, yeah. I mean there's a lot of different ones. There's like CI workflows, ML workflow engines, like the general purpose ones, like the Argos of the world.

Mark:

Yeah. Well, I mean here this is kind of like a business activity workflow engine. So if you're looking at the likes of Camunda and Airflow and Temporal... you know, if you love those, Dapr has a workflow engine built into it all that you write your business logic into. So it goes, "Call this service here, send a message to this person here, check that this order is in process." Imagine that you've got a hundred-step process. Think of it as durable execution.

Durable execution is probably the most important thing in an application, because irrespective of the failure of the underlying infrastructure or the hardware, durable execution guarantees that your application and its code and the state machine will complete, or the series of steps will complete.

So if you kick off an order process and you're at step 50 of 100 and it dies because the machine died, you don't want to have to restart the whole thing and go through all 50 steps again. And how do you track that? It's a very, very hard problem. Durable execution saves all of the previous state of the execution context, knows where it was, recovers, and carries on.

Workflow in business is critically important, and it allows you to do coordination. So Dapr's workflow engine is kind of like... it's hot at the moment; everyone is picking it up. It's code-first, so you write it in code yourself rather than in a declarative model, and then you can easily debug it and things like that.
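The recovery behavior described here - resuming at step 50 rather than re-running steps 1 through 49 - comes from replaying a persisted history of completed steps. This is a toy illustration of that replay idea in plain Python, not Dapr's actual engine:

```python
# Toy sketch of durable execution via replay (NOT Dapr's engine): each
# completed step's result is appended to a persisted history, and after a
# crash the workflow replays from history instead of re-running the steps.

def run_workflow(steps, history, crash_after=None):
    """Run steps in order, reusing results already in `history`."""
    results = []
    for i, step in enumerate(steps):
        if i < len(history):            # finished before the crash:
            results.append(history[i])  # replay the saved result, no rerun
            continue
        if crash_after is not None and i >= crash_after:
            raise RuntimeError("machine died")  # simulate a failure
        out = step()
        history.append(out)             # "persist" the result durably
        results.append(out)
    return results

calls = []  # records which steps actually executed (vs. replayed)
steps = [lambda i=i: (calls.append(i), f"step-{i}")[1] for i in range(4)]

history = []                                      # the durable log
try:
    run_workflow(steps, history, crash_after=2)   # dies at step 2
except RuntimeError:
    pass
resumed = run_workflow(steps, history)            # recovers, carries on
```

On the second call, steps 0 and 1 are replayed from the saved history rather than re-executed, and the run carries on from where it died - so each step runs exactly once overall.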

Cory:

Yeah, so like inventory systems, logistics systems... this is a good fit here.

Mark:

Exactly, yes. And being code-first means you can write it in Python or Java or C# or Go or any one of these languages. And you can set breakpoints on it. At its simplest level, you have a workflow and a set of activities - I do activity one, activity two, I might do activities in parallel and, when they've all completed, come back. So there are a number of different workflow or state machine patterns in the business process that it enables.

And it's very, very powerful when combined with the other Dapr APIs like Pub/Sub and Secrets and State and things like this.

Cory:

Yeah, I feel like one of the use cases that I'm kind of hearing that I think I'll get a lot of value out of Dapr - and I'm definitely going to be playing with it in the next couple of days - I see a lot of stuff in our code base where I'm like, this would have made moving so much easier. The question is, is there an Erlang or Elixir version? So that's what I'll have to... well, we have stuff in Go as well.

I feel like one of the things that can be difficult for a developer who's working in the cloud is like, I've gotten to the point where I'm like, "Oh, you know what? I realize that I need event services for this or I need Blob Store and my app right now just has Postgres... 13, 14, whatever. I need to add this new thing." But like, I don't know, just looking at AWS, like how many message services do they offer? There's a lot depending on how you decide you want to cut that thing up. Right? And so like, I'll sometimes spend some time, like, "Let me see if I can just do this with like SNS and SQS." It's like, "Okay, does it have the properties that I need?" Or maybe I had to go over and look at MSK, right? And I'm just trying to figure out like, which one of these services is going to give me the properties that I need so I can build my business logic.

And I feel like Dapr, if I'm building around this SDK, that makes the code part a lot easier. And now I'm just kind of switching out the config like, "Hey, let's see. Oh, you know what? This doesn't support first in, first out." Like if I need first in, first out for whatever reason in this app, that one's just gone. I don't have to rewrite my code. I just swap the config to something that supports FIFO, right?
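The FIFO example maps neatly onto the publish side of Dapr's Pub/Sub API. The sketch below builds the request the sidecar expects (3500 is Dapr's default port; the component and topic names are made up), and only builds it - sending requires a running sidecar. Repointing "orders-pubsub" at a FIFO-capable broker would be purely a component-YAML change:

```python
# Sketch of Dapr pub/sub publishing over the sidecar's HTTP API.
# "orders-pubsub" is a hypothetical component name; swapping the broker
# behind it is a config change, and this code stays exactly the same.
import json

DAPR_PORT = 3500
PUBSUB = "orders-pubsub"

def publish_request(topic: str, event: dict) -> tuple[str, str]:
    """Request for POST /v1.0/publish/{pubsub}/{topic}."""
    url = f"http://localhost:{DAPR_PORT}/v1.0/publish/{PUBSUB}/{topic}"
    return url, json.dumps(event)

url, body = publish_request("orders", {"id": 42, "status": "created"})
```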

I feel like a place where this might be an amazing use case to introduce in some greenfields for probably some pretty big companies is just around all the stuff that's happening in AI. Internally, we've built our own stuff on SageMaker, we've used some MCP stuff, we've used some Claude things, and we're trying to figure out which one's the right thing for it. And it's like every time we change what we're using, it's like, "Man, it's reflecting that in code."

Is Dapr doing anything in the LLM abstractions, like MCP abstractions?

Mark:

Oh, yeah, yeah. I mean, we're big in this space and I think that there are two kind of very key trends that are happening.

So one is just the integration of LLMs into your code - here's how you call an LLM. We introduced, in the last release, a conversation API. And the conversation API is literally a prompt API - here's a prompt - and then you can plug in the different components for any one of the underlying language models. So you can plug in OpenAI or Hugging Face or Anthropic or DeepSeek or whatever else you want behind it all.

I can just call the conversation API with a prompt, and just like we talked about swapping out message brokers, I can swap out the underlying LLM API calls without having to change my code.

Plus it provides additional capabilities on top of that. For example, it does obfuscation of any data that comes back, and it does prompt caching generically across all of them, even if the provider doesn't have prompt caching. So it adds those sorts of features.
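The conversation API is still alpha-stage, so treat the exact path and body shape below as assumptions based on its v1.0-alpha1 HTTP endpoint; "openai" is a hypothetical component name we'd have configured. As with state and Pub/Sub, the snippet only builds the request - and swapping the model provider would be a component change, not a code change:

```python
# Sketch of a call to Dapr's (alpha) conversation API via the sidecar.
# Assumptions: default port 3500, alpha endpoint path, and a conversation
# component we named "openai" (could be Anthropic, Hugging Face, etc.).
import json

DAPR_PORT = 3500
COMPONENT = "openai"

def converse_request(prompt: str) -> tuple[str, str]:
    """Request for POST /v1.0-alpha1/conversation/{component}/converse."""
    url = (f"http://localhost:{DAPR_PORT}"
           f"/v1.0-alpha1/conversation/{COMPONENT}/converse")
    body = {"inputs": [{"content": prompt}]}
    return url, json.dumps(body)

url, body = converse_request("Summarize today's unshipped orders.")
```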

I think the more interesting and exciting thing that's happening is this rise of agents. And I don't know how much you're hearing talk about agents generally, but agents are where we're effectively trying to automate human processes a lot more, with language models doing those things on our behalf.

So we've always done lots of automation, we've liked to automate things and we have very deterministic automation. But now the automation of business processes using language models to make decisions is kind of becoming a hot thing around all this.

Let me give you an example. I was talking to a logistics company a couple of weeks ago and they have a warehouse manager who sits in a warehouse, and his job is literally to go and query a database every hour and look at what orders haven't been sent and what needs to be sent for the rest of the day. He's doing a bunch of queries and looking these things up. And they're like, "Well, it would be so much easier if he could just have a little agent doing all these things on his behalf - 'What orders are critical? What do I need to do next because this order hasn't gone? You need to push this one to the top of the queue, because if it doesn't go out tomorrow, they're going to be pissed.'"

And so they wanted to build a little agent that was basically a warehouse manager that's looking at all the orders and looking at their database and all these things. Working on their behalf with the language model, making those decisions and basically taking hold of that process or helping this warehouse manager. That's just one example.

The way I look at this is that all this talk about agents is really a distributed systems problem with LLMs. It's distributed systems, but everyone's made up lots of new names like "memory" and "agentic systems" and things like this. Dapr is very well grounded in key distributed systems principles like workflow and messaging and state management. So a couple of months ago, we introduced a framework in the Python SDK called Dapr Agents. It really helps you build agents that are stateful, long-running, and durable, because it's built on workflow. They communicate really well with Pub/Sub messaging. And they're very much, I would say, enterprise ready.

To a large degree, I think a lot of the agent systems or frameworks out there today are still very immature in their enterprise readiness in terms of being durable and recoverable. And most of them are sort of like - as soon as the thing fails, I have no idea what I did before and I've got to start again.

So yeah, I see the rise of this agents thing as part of the continuation of distributed computing with models being the sort of decision makers amongst us all. And we're going to be plugging these things everywhere.

But you know, the Dapr agents introduction that builds on the core APIs of Dapr is an amazing piece of technology to help you build these sort of long-running, durable, stateful agents.

Cory:

That is very cool.

Well, I know we're at time. I really appreciate you coming on. This sounds super cool. I'd say if you're a developer and you're listening, and you're like, "This sounds pretty awesome. How can I start using it?" - it's a CLI. You can just drop the CLI locally on your machine and start playing with it through your tests. Like, let's try using the Dapr SDK.

If you're an Ops person and this is exciting to you - Platform Ops, DevOps person - it might be good to find an engineer on your team who you know is struggling with a cloud service that may need to be swapped, and that might be a good getting-started point for you.

Mark:

Exactly, yes. Yeah. Super easy to deploy, try out, test. Yeah, exactly. You summed it up beautifully.

Ops people should care about it because it makes their lives easy with the application developers. Application developers should care about it because it takes away all the boilerplate code.

Cory:

Yeah. And Ops people, we know... we all know that you have it the hardest.

Mark:

Yeah, you do.

Cory:

We know that you don't have enough time. You're outnumbered. You need somebody on your side. Dapr's gonna put you in a top hat and a tie, it's gonna make you all professional. Like, you're gonna love it.

Mark:

You're squashed in the middle there between the application developers, who get all the demands, and the infrastructure below you, and you're like, "I'm in the middle here. Help me."

Cory:

And the CFO's mad at you because the cloud's expensive. Like, everybody's on your back. Compliance team's after you.

Like, we're running and we need help, and it sounds like Dapr is one of those tools that can definitely get you there.

Mark:

Yes.

Cory:

Where can people find Dapr? Where can they find Mark? Where can they find Diagrid?

Mark:

Yeah, if you go to Dapr.io, start there and dive in. And I would suggest that that's where you go and look at some of the case studies and then sort of try out the quick starts.

If you want to come to Diagrid, it's Diagrid.io. We have a set of services there. We talk about enterprise support for Dapr or our Catalyst Server, which is a Dapr server around these things.

Reach out to us and feel free to send me an email anytime at mark@diagrid.io. Happy to hear from you. Happy to hear about what you're interested in. Happy to talk about Dapr being used in any one of your solutions.

I just love talking to developers and platform engineers and just hearing their stories. So please reach out.

Cory:

Yeah. And that Catalyst project actually sounds really interesting. I mean, there's so many teams that are stuck like, hybrid. Like, they've started moving things to Kubernetes, but they've still got a bunch of VM workloads. That sounds like a really good fit for many operations teams that are kind of straddling these two environments today.

Mark:

Exactly. Yes. I mean, Dapr is just about helping you ease your journey for building and running these applications at scale around these things. And the Catalyst server just eases that whole burden.

Cory:

Awesome. Well, thanks so much for coming on the show, Mark. I really appreciate it. This was super exciting to learn about Dapr. And I'm honestly... I wish I'd learned about Dapr like two and a half years ago before I just took this like...

Mark:

You need to do some re-engineering.

Cory:

Oh, my gosh.

Mark:

A mini Dapr

Cory:

Yeah. I mean, I could probably kill half our code base by just deleting all the abstractions that we've built, man. I'm going to thin this thing up.

Awesome. Well, thanks so much. And yeah, hit that subscribe button. Please rate and review on Spotify or whatever podcasting platform you're using and I'll see you next time.


About the Podcast

Platform Engineering Podcast
The Platform Engineering Podcast is a show about the real work of building and running internal platforms — hosted by Cory O’Daniel, longtime infrastructure and software engineer, and CEO/cofounder of Massdriver.

Each episode features candid conversations with the engineers, leads, and builders shaping platform engineering today. Topics range from org structure and team ownership to infrastructure design, developer experience, and the tradeoffs behind every “it depends.”

Cory brings two decades of experience building platforms — and now spends his time thinking about how teams scale infrastructure without creating bottlenecks or burning out ops. This podcast isn’t about trends. It’s about how platform engineering actually works inside real companies.

Whether you're deep into Terraform/OpenTofu modules, building golden paths, or just trying to keep your platform from becoming a dumpster fire — you’ll probably find something useful here.