You Need AI Sysadmins Can Trust, With Cribl's Nikhil Mungel
What happens when a non-deterministic AI system is asked to touch production telemetry or generate changes for an SRE pipeline? The cost of being “close enough” can be lost data, downtime, or a security incident.
Cribl’s Nikhil Mungel joins Cory to break down what it takes to build AI that sysadmins can actually trust. The conversation digs into harness engineering and the practical guardrails that turn probabilistic models into repeatable, verifiable outcomes. They cover why breaking work into small chunks matters, how validation and testing become the real leverage point for AI-native development, and what “code factories” mean for review, CI, and platform reliability when teams can generate a thousand PRs an hour.
Platform engineers will also hear a pragmatic take on the future of the job. The focus shifts away from typing code and toward building systems for verification, simulation, and safe deployment at scale, plus clearer ways to decide what needs human scrutiny and what can ship automatically.
Guest: Nikhil Mungel, Head of AI R&D at Cribl
Nikhil Mungel is the Head of AI R&D at Cribl, where he's building LLM-powered systems for IT and Security data transformation and analysis. Before Cribl, he spent over a decade developing distributed systems across the observability and consumer social tech landscape. He lives in San Francisco with his wife and two kids. His current focus is applying AI to make complex infrastructure more intuitive and explainable.
Links to interesting things from this episode:
- Cribl Guard
- “Open source died in March. It just doesn't know it yet.” by Dan Lorenc, CEO of Chainguard
Transcript
Welcome back to the Platform Engineering Podcast. I'm Cory O'Daniel. This week I sat down with Nikhil Mungel, Head of AI R&D at Cribl. We got into what it actually takes to build AI into a product where your users are sysadmins who can't tolerate hallucinations, harness engineering, code factories that spit out a thousand PRs an hour, and how platform engineering teams figure out which code gets human reviews and what just ships.
My guest today has spent over a decade building distributed systems across observability and consumer social, and now heads up AI R&D at Cribl. For anyone not familiar, Cribl is the data pipeline platform for logs, metrics and security telemetry. His team isn't just integrating LLMs into the product, they're actually training their own models and building their own AI platform underneath.
Nikhil Mungel, welcome to the show.
Nikhil: Thank you, Cory, appreciate you.

Cory: Yeah, I would love to know, like, you've been building distributed systems for a decade, what has steered you towards AI and R&D? Like, what's going on there?

Nikhil: Yeah, so I've been... you know, my first brush with technology was all the way back when I used to do Windows 95 administration and printer setup for, you know, small businesses and stuff like that. It was a pretty fun gig. That's how I got into technology.
I remember, you know, fixing point of sale stations. These were like Windows XP, by that time it was not Windows 95 anymore. Trying to like, make sure coffee's not like dropping on me or something. And usually the problem used to be like a sticky cable or something of that sort, you know, in the back of the computer.
Cory: Oh yeah.

Nikhil: Those were pretty interesting times, you know, with Windows. And then when I first got introduced to Linux, that's when like a light bulb went off in my mind.
And that's when I really, you know, got into technology. And then one thing led to another and I was, you know, soon found myself working as a professional software engineer.
Happy to go into a little bit more about how I came up here. And then I spent a lot of time at Splunk, which was a big data observability company.
Cory: Oh yeah.

Nikhil: And spent a bunch of time optimizing. So most of these tools have a pretty complex language that you have to type in order to get back your search results. It's a structured query language, you know, kind of like SQL or something else.

And it's kind of a pain because customers and users have to learn this new language in order to retrieve data, which seemed like the perfect opportunity to, you know, use AI and, you know, use some form of machine learning to recommend auto-completions. And that was... gosh, that was like 10 years ago now. So that was my first brush with AI.

And then I've just been fascinated by neural networks and, you know, how you can use pattern matching instead of just rules-based approaches to make technology easier. And then, I mean, the biggest unlock was ChatGPT, obviously, you know, when it came out. Or GPT... I was using one of the completion models back in 2022, 2021. And that just... that felt like magic.
I mean it still does, you know, every time you use any tool like ChatGPT or Claude, it's just absolutely fantastic. I can't believe computers are really doing this.
Cory: When did you first have that realization that this stuff could potentially work? Like, were you very early? Like you saw ChatGPT back in the early 2020s and you were like, "Holy moly." Or were you a late... like, I'm a latecomer. It was maybe like November of last year before I was like, "Oh, this actually may have legs to it."

Nikhil: Yeah.

My first aha moment was, I'm going to say, back in 2021, where it was not ChatGPT, it was just GPT, you know, two or two point... maybe it was like... it may have been three. It was, gosh... I don't remember the name of the model, but it was a completion model. It was not an instruction-tuned model, which means you couldn't chat with it as such. You gave it, you know, some content and then it completed that content for you. So it's just a completion model.

Cory: Oh, old school.

Nikhil: Yeah, old school. And then you could give it a few examples. They could be like, you know, roses are red and, you know, violets are blue, and then it would continue that. Right? And if you gave it something in the style of Shakespeare, then it would continue, you know, just rambling off of that.
And that was like the biggest unlock, right? Because then if you gave it a couple of examples of structured queries, then it would help you formulate the next one. And its ability to just generalize information from the provided context was fascinating.
And obviously OpenAI and Anthropic have done tremendous amounts of work to make that instruction tuned and make it so that you can actually have a conversation with it now, which has just been awesome.
Cory: I agree. I've actually... me and my coworkers, we've kind of leaned into AI-native development over the past few months. And I feel like the hardest part, you know, is doing the stuff that's always been good engineering. Like writing the code.
Like we spend so much time writing the code, but like so much of our job is the stuff around it, right? Like the guardrails around like linting, tests that run quickly and are true, formatting, you know, security scanners, like all the CI/CD stuff that like people typically aren't super excited about.
And I feel like a lot of folks when they have bad experiences with AI code generation, it's because they haven't done that work around their current code base.
They may have code that is hard to follow and when it's reproducing these patterns, it's reproducing patterns that are hard to follow. Right? When you have code that is very well designed and organized, it is reproducing very good patterns.
With your core audience at Cribl... I assume it's mostly sysadmin folk, right?
Nikhil: Yeah, a lot of practitioners: sysadmins, IT admins, SREs, DevOps, SecOps.

Cory: Yeah. So for folks where determinism and high accuracy are the name of the game, LLMs are inherently non-deterministic.

How do you build an AI-native product for them when they can't tolerate the kind of errors that we might get in something that comes back from ChatGPT, where it makes up a Python library that doesn't exist?
Nikhil: Yeah, that's such a good point.
The trade-offs are pretty asymmetric when it comes to systems administration, SRE ops, and any system that's really in charge of controlling, like, the operationalization of software. So, you know, software goes through multiple stages. You probably know this really well, where at some point it's just a little nascent thing on your computer, where it's pretty contained and you have a pretty high tolerance for, you know, hallucinations or anything that's not quite working well.

And then as it goes through the stages, and, you know, as it goes through CI/CD, and then gets merged, and then it gets deployed... now it's the same software that's running on production hardware, where it's servicing, you know, customers. In our case it's maybe processing hundreds of terabytes of data every single day. Right? So the stakes are really high and you don't really want anything messing that up, right?
So AI can make you a little bit more productive, but then if it gets something wrong, there's going to be a pretty steep downside. Right.
So most of our use cases, and most of the users that are in these sorts of roles, tend to be optimizing for, "Hey, let's not drop data, let's not lose production data, let's not lose uptime," because there's nothing more important in that role. They're not necessarily optimizing for, "How productive can I be?"
It was like, well, if I drop transaction data, that's going to bring the business down. And it's a very different modality of applying AI to those sort of problems.
And over the last three or four years, we have perfected a set of techniques that have enabled us to do exactly that. I mean, there were so many things.

One thing is you try to minimize the stuff that you're using AI for and surround it with a lot of harness. So things that, you know, deterministically check the output.
We also just break it up into really small chunks as opposed to one-shotting an entire file or, you know, an entire pipeline. Right? You just break it up into small functions and, you know, let's do it one by one.

You take that to the LLM, you see what you get back from the LLM. You validate it, you run it, and then you insert it into the stack. Just giving an example, but really just breaking things up into small pieces.
Lots of guardrails, lots of harness that's written to validate the output, to make sure that it's performant and it doesn't have any security vulnerabilities, all those sort of things, and then stitching it all together in a deterministic way.
And that has actually been a pretty cool unlock in that we have been now able to use AI and agentic user interface paradigms to produce code or to produce configuration objects that are very close to being deterministic. So if you wanted to convert A to B, you could write it by hand or you could use one of these agentic experiences that we have within the product.
And it will do a very good job of it, not just once, but repeatedly. Like if you try to do it multiple times, you're going to get the same output.
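The loop Nikhil describes, generate a small chunk, deterministically validate it, behavior-test it, and only then insert it, can be sketched roughly like this. This is a toy illustration, not Cribl's actual harness: the `generate_snippet` stub stands in for a real model call, the banned-call policy is invented, and `exec` is used only to load the candidate for testing.

```python
import ast

# Hypothetical stand-in for an LLM call; a real harness would hit a model API.
def generate_snippet(prompt: str) -> str:
    return ("def mask_ip(ip):\n"
            "    parts = ip.split('.')\n"
            "    return '.'.join(parts[:2] + ['x', 'x'])\n")

BANNED_CALLS = {"eval", "exec"}  # example policy: no dynamic code execution in output

def validate(code: str) -> list[str]:
    """Deterministic checks that gate LLM output before it enters the pipeline."""
    errors = []
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return [f"syntax error: {e}"]
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in BANNED_CALLS):
            errors.append(f"banned call: {node.func.id}")
    return errors

def harness(prompt: str, test_cases: list[tuple[str, str]], max_attempts: int = 3):
    """Generate one small function, validate it, and behavior-test it before accepting."""
    for _ in range(max_attempts):
        code = generate_snippet(prompt)
        if validate(code):
            continue  # reject this candidate and regenerate
        scope = {}
        exec(compile(code, "<generated>", "exec"), scope)  # load candidate for testing
        fn = scope["mask_ip"]  # assumes the prompt pinned down the function name
        if all(fn(inp) == expected for inp, expected in test_cases):
            return fn
    raise RuntimeError("no valid candidate within attempt budget")

fn = harness("mask the last two octets of an IPv4 address",
             [("10.1.2.3", "10.1.x.x"), ("192.168.0.1", "192.168.x.x")])
print(fn("172.16.99.5"))  # -> 172.16.x.x
```

The key property is that the AI touches only the small generated function; everything around it, parsing, policy checks, behavior tests, is deterministic code.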
Cory: Yeah, it's funny because like, what you're saying there is like, it's a lot... it's guardrails again. Even in a tool like this, it's a lot of engineering around the thing, right?
And I feel like a lot of folks, and I know... I'll call it the Anthropic guy right here... but like a lot of folks are calling out like, "Oh, white collar jobs are toast." Right? And it's like, I don't know man, like, I mean I feel like the job has changed quite a bit, but like, I still need more hands.
Like, I need more hands in this like world of like generating AI code. Like, because we still have all this other stuff to do around it, right?
We're not necessarily laboring over every single line of code, but we're thinking about how to do things like this. How do we generate something and stack a bunch of tools together that it can use? I'm still writing that software, I'm still building the guardrails around it.
Is that how your team is experiencing it? Are you still seeing we do need as many engineers, they're just doing different things and they're more efficient at it, or are you seeing I don't need as many hands to get this done because we've got AI here?
Nikhil: Yeah, I think it's a mix of both. We have certainly felt... I mean, I'm actively hiring right now, I'll even make a little plug for it right now. If you're interested, please reach out. We are always hiring top-quality engineers and candidates.
But the job, or the role, or the set of tasks that a software engineer was responsible for, is evolving. A typical software engineer would spend a certain fraction of their time writing code and a lot more time thinking about the problem, thinking about how that problem, or how their solution, fits into other systems that already exist within an enterprise. What are the trade-offs and all those sorts of things. I feel like that's still around.
It's the code authoring part that has been accelerated with the existing set of agentic software that we have like Claude Code and Cursor and everything else. And a new task family has expanded on an engineer's set of things to do, which is verification and validation.
Right now you have AI generating so much code and there's so many things that are, you know, engineered. Everyone is like suddenly 10x more productive. So you have a lot more code being generated. Well, who's going to validate that?
Who's going to verify that it's going to work, you know, just the way you intend for it to work in a hardcore production setting, you know, where there are millions or billions at stake?

The approach that we have taken is we have started to develop tooling that helps human software engineers do that a little bit better, do that a little bit faster, right? To make that verification and validation a little bit easier.
Cory: That's like most of the job now, right?
I was talking to one of my friends the other day and it's like... yeah, I feel like our job for the longest time was, you know, we get a task assigned to us by a product manager, from a Scrum Master, whoever... and we sit and we think long and hard about how all these little pieces plug into a library or a standard library or an API.
And now the job is like that stuff... which, I don't know, personally, I like coding, but I've never really found the rewarding part of coding to be figuring out all the things that I need. It's the elegance of the solution and the experience afterwards, right? And like, that's the whole job now, right?
Like that's... your stack of code is not the thing that's going to be your moat anymore. It's this experience and joy of the user afterwards of what you've created.
And I feel like a lot of the job now is just like verifying and making sure that this is good quality functionality. It's something I feel like we're gonna have a problem with in the space in the near future, because I feel like that intuition comes from years of production experience... like being able to look at things and understand how the code all goes together.
I feel like we're gonna have a gap in juniors in the near future of like, people that don't know how to do that anymore because they haven't written production code.
But like, I feel like a lot of the job now is gonna be guardrail engineering, observability engineering, quality engineering, less so lines of code engineering. So I would absolutely love to hear what you guys are doing there.
Nikhil: Yeah, I mean, I would... I like how you broke it up into observability engineering and guardrail engineering.

I've just been calling it harness engineering for the last few months now, where it's just really building the systems that enable teams to use agentic coding to ship software.
I mean, when I think about it from a platform engineering standpoint, since, you know, a lot of our listeners right now are platform engineers in many ways, the job is moving into developing these systems that then allow your product teams to quickly ship software.

So how do you abstract away creating new guardrails or creating new validation checks? What are the new primitives that you build in this new world that development teams or product teams inside your organization can take advantage of? Because for a long time, there were just libraries and there were APIs, and arguably that stuff is being generated, at least in some settings, these days. But there's still a lot of things that are not being generated, which is, "How do you plug this in?"
Like, I'll give you an example. I'm sure you've tried your hand at vibe coding.
Cory: Oh, yeah.

Nikhil: Which is, you know, all the rage back in 2025. And now it... I feel like it's still going pretty strong.

Cory: Yeah, I feel like it is. I'm a person... I don't like typing, I get really bad tendinitis.
I have leaned fully into, like, Whisper tied into Claude Code, and I will either, like, talk my way through a spec and then just copy paste it in or I will just kind of talk through and just ramble about, like, what I want to build and then watch it materialize, which is absolutely magical.
Nikhil: 100%, right.

Cory: Yeah.

Nikhil: That's a pretty good workflow, you know, just using speech. I should give that a shot.
But if you think about the early days of vibe coding, when it was... not that long ago, it feels like long ago or like some time has passed, but maybe it was like a year ago or something... but when it was first crystallizing, it would generate everything. Right?
I mean, it wouldn't use something like React, it wouldn't use something like even Express. It would just write raw JavaScript or it would just write raw language without using any libraries, and it would still work. You know, like, it would still look like a little bespoke application. And then, like, if you try any vibe coding platform... you know, there's so many of these platforms these days... or even if you just use Claude, what it does now is it just uses React, it uses Tailwind, it uses Next.js, right?
So it's almost injecting all of these building blocks into the vibe coding thing and creating an app that kind of looks pretty robust and pretty modern. And, you know, it doesn't have as many failure modes as an app that was perhaps vibe coded a year ago.
And those building blocks are off-the-shelf building blocks. But when I think about platform engineering, I've been encouraging my teams to start producing those building blocks.
Like, what are the unique building blocks that a vibe coding harness can pull from and insert into applications as your product teams are building these applications? Right. So maybe it's a bespoke component library. Maybe it's a set of skills, you know, that have just... it's like a markdown file. A lot of engineering now has just come down to being able to clearly articulate yourself in English, which goes into, you know, a markdown file.
And what are those skills that you can produce as a platform engineering organization that vibe coding harnesses can use to pull in those dependencies as building blocks?
I'd be curious to get your take on that.
Cory: Yeah, honestly, one of the things that's changed... I hate to admit this, since I've been a developer for such a long time, but one thing that's really changed about the way that I develop: I've always been a fan of test-driven development. I've been a TDD engineer for the longest time. And for a good portion of, like, my vibe coding era, I would still write all of my tests. And I still do this to some extent today. And I would use my tests as my prompts. It's like, "Hey, this is what I want to exist. Make this so." Right? And it's very good when you do that.
Now the thing that I've started running into is it takes a lot of time to write tests. It's always taken a lot of time to write tests. And like, now that I have this speed on the development side, I'm like, "I want that speed on my prompting side too." Like, I'm spending a lot of time prompting, code gets generated. I'm like, "How do I optimize this side now that that side's maxed out?"
And so now I've started to just talk through my tests, which is weird, but I still do TDD with my LLMs. Like, I will talk through the ideas of the tests, I will watch it generate them. They all fail. That's great. That's what I want, right?
And then we iterate through it, right? And I'll usually open my first PR with just, like, failing tests. So it's like I can see the code. It happened in a worktree, right? I can see what the PR is going to change in the test suite.
Nikhil: Right.

Cory: I can acknowledge that is what I intended. I'll put a comment or something if I wanted to switch.
But the thing that's fundamentally changed about my development is I'm very good at documenting things now, which I have just never been good at. And I think a lot of engineers, like, they don't like... engineers like to write code.
You still see so many engineers that don't like writing tests, like, people are like, allergic to it. It's like, "Oh, I hate writing tests." That to me has always been like the joyous part. It's like, "Oh, I've expressed my intent." Yeah, it's like I've expressed my intent and it works. This is exactly what I'm looking for.
But I've never been great at docs. It's like, "Hey, it works. Get this fucker out the door."
And now I spend so much time on architecture decision records. Why is this the way it is? What impact from customers or our production infrastructure has caused us to think about solving this problem this way? And then it's like locked down in code. Right?
And that's one of those things. It's like, I feel like many teams that really leaned into things like ADRs and deep, detailed specs on how your V2 API works, the constraints and things you put into it... those are things that really established teams, with plenty of time to do so, actually do. And then other teams don't. Right?
Now, that is again, part of my job, and what's cool about this is it helps the LLMs quite a bit. And the humans, turns out they love documentation that's accurate too. Right?
And it's just like, that has always been like such a hard part to do, especially in startups where you're just constantly running and running and running, like moving on to the next MVP that a customer wants. It's so hard to go back and document the last thing.
But, like, again, like, that's the job now. It's this, like, quality engineering, is kind of what I see. It's just like the code itself is just... I don't care, like, if it rewrote it in Python tomorrow, I do not care, so long as my quality engineering on the outside can tell you this is how the API works and I can guarantee it works. Doesn't matter to me what's inside the box.
I'd love to get your take on this because I've heard a few people start to say this... I think it was the CEO of Chainguard recently wrote a blog post about like open source is dead, it just doesn't know it yet... and I feel like I don't necessarily believe that, but I've got this gut feeling that it might be true.
And so, for an example, I would never say, like, "Hey, let's spin up a new React." It's good, there's a really good ecosystem, but at the same time there's a lot of vulnerabilities that are getting snuck in through, like, a dependency of a dependency of a dependency. While at the same time, like in our own products, we get to things where we're like, "Hey, we're adding SCIM integration," and it's just like, that's important to us, so we will actually take the standard and we will generate our own library that we can just own. It's so critical... it's an external system managing permissions. I want to own all of that code. It's not something I ever would have wanted to own in a world two or three years ago, because I've got to focus on my product. But now it's like, this thing is really important to us and our security, so I want to understand how it works and I want to make sure that we have full control over its stack of dependencies.
So yeah, I don't know, man, it just feels like a lot of it is the good engineering principles that we've always had but never had the time for... that's the work now. And that's something that will benefit the LLMs, and it's definitely going to benefit the person that replaces me, right? Or the person that's joining my team that I don't have time to go through everything with.
It's like, "Hey, here's an entire markdown directory of decision records of how we came to the architecture that we have and the tools that we have, and which libraries we do and don't use, and how we decide on when or when not to use a library versus just generating and owning something ourselves."
Host read ad: Ops teams, you're probably used to doing all the heavy lifting when it comes to infrastructure as code: wrangling root modules, CI/CD scripts, and Terraform, just to keep things moving along. What if your developers could just diagram what they want and you still got all the control and visibility you need?

That's exactly what Massdriver does. Ops teams upload your trusted infrastructure as code modules to our registry. Your developers? They don't have to touch Terraform, build root modules, or even copy a single line of CI/CD scripts. They just diagram their cloud infrastructure. Massdriver pulls the modules and deploys exactly what's on their canvas. The result?
It's still managed as code, but with complete audit trails, rollbacks, preview environments and cost controls. You'll see exactly who's using what, where and what resources they're producing, all without the chaos. Stop doing twice the work.
Start making Infrastructure as Code simpler with Massdriver. Learn more at Massdriver.cloud.
Cory: Like, how are you guys seeing that? Like, you guys have a huge product, you've got a big team. Like, you guys have customers that are very concerned about security. You guys are very concerned about security. Like, how are you seeing open source in your own organization? Is it still something where you just go, "Oh, that Python library, it's popular, I'm just going to grab it"? Or do you think through, like, the supply chain of those libraries?

Nikhil: Yeah, that's such a good point. Right?

I mean, let me first talk about TDD and, you know, how much I love TDD. My first job was at ThoughtWorks, and ThoughtWorks was a big proponent of test-driven development. And, you know, it was just like the greats like Martin Fowler and, you know, a whole... some industry stalwarts who were part of that movement. So that's really how I cut my teeth. And it's the same thing that we see today, right? It's the spec first, right?
It's being really thoughtful about what it is that you want to build and being able to express the constraints of the product in a tight, deterministic way. I mean, that's a test, right? And you write the test first, it fails, and then you write the code that fulfills that, and just that and nothing more. And now that test is, you know, either a spoken record in your case or a detailed spec markdown file that describes exactly how the system is supposed to behave.
Which can be used for two things, right? One is using it to generate the code that implements that behavior, and then, more importantly, the code that validates that behavior. Right? And you can also think about splitting your tests in the same way that you split training data sets when you're training models, which is, you know, you have 80% and then you withhold 20% and use that for validation, to make sure that your model is not overfitting on the 80%. Right?

So you could also split tests up in that way, where you have a certain part of the spec which is hidden from the LLM when it's generating the code, and it's only used for validation to make sure the implementation matches the spec. Right? So there's a pretty interesting way to structure a smaller spec which is not handed to the implementation side, which is a pretty cool hack that we've used.
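The held-out split Nikhil is describing can be sketched in a few lines. This is an illustrative sketch only: the function names, the 80/20 ratio, and the toy "spec" (f(n) == n²) are all made up for the example, not Cribl's tooling.

```python
import random

def split_spec_tests(tests, holdout_fraction=0.2, seed=42):
    """Partition a test suite: 'visible' tests go to the LLM along with the spec;
    'held_out' tests are withheld and only used to validate the generated code."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = tests[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]

def accept(candidate_fn, held_out):
    """Accept generated code only if it passes tests it never saw."""
    return all(candidate_fn(inp) == expected for inp, expected in held_out)

tests = [(n, n * n) for n in range(10)]  # toy spec: f(n) == n squared
visible, held_out = split_spec_tests(tests)
print(len(visible), len(held_out))        # -> 8 2
print(accept(lambda n: n * n, held_out))  # -> True
```

The analogy to train/validation splits is direct: code that passes the visible tests but fails the held-out ones has "overfit" to the part of the spec it was shown.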
When it comes to the supply chain attacks that we have seen, you know, in the last few months... I mean, they've been rising at a pretty exponential rate. And so, yeah, I do get that point. You know, open source is unfortunately in a tricky, tricky spot. And my take is that it's going to evolve from open source implementations into open source standards.
So there's going to be a lot more standardization where you pull the standard from, you know, some sort of a GitHub or an open source repository. And the standard again has validations and tests and all those sort of things. And then the implementation, the actual production code that runs in the hot path in your system is generated by an AI that you have or your team of engineers.
And then the open source is literally just a spec and the set of tests that tell you that your implementation is watertight and that way you can defend your supply chain from any, you know... because unfortunately the reality is untrusted code execution is the biggest threat everywhere right now, especially with vibe coding.
Be that a supply chain infiltration or, you know, be that a vibe coding surface in your product where you're allowing customers to extend the product by, you know, giving this like, "Oh, enter a prompt here, now generate something." It's all untrusted code. And I feel like... my take is that the way to rein in that untrusted code is by tests and by specs.
One interesting thing: you know, I was talking to somebody who works at a large defense tech company that we all know of, and one of the things that he was talking about... you know, he works on frontier AI in their world... is that they have a really sophisticated real-world simulation. So it's a real-world simulation that's built on, like, a game engine that allows them to really test how things work in a versatile setting without really having to engineer anything. I mean, that's a really cool way to validate software as well. And again, in defense tech, you're producing real-world things, so your simulation is really a physics engine. And the laws of physics are, you know, pretty well known and well written down for a long time now. So it's almost easier to build a physics engine.
But what does it mean to build a physics engine for a platform engineering team within their own domain? You know, let's say you're a logistics company or, you know, if you're an observability company. Like, if I had to like break down some laws of physics in the observability space, you know, when you have network flow logs that are going from one place to another, there's a source IP address, there's a destination IP address, and then there is the size of the packet, right?
Cory: Yep.

Nikhil: You don't break that law. No one ever breaks that law. Right? Like, no networking equipment, firewall, or router is ever going to break that law. They're always going to log that. Right? It's never going to be incorrect.

And then figuring out those laws that govern the environment for your own domain, and creating this simulation realm where you can run your software and validate it to the point of a mathematical proof, almost. So like, "Oh, if it works here, we know it's going to work in production."
That might be a pretty interesting path for platform engineering teams to, you know, think about and sort of pull the thread on a little bit.
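As a toy illustration of the flow-log "law" Nikhil describes, a domain invariant can be encoded as a deterministic check that every record in a pipeline's output must satisfy before it ships. The field names (`src_ip`, `dst_ip`, `bytes`) are invented for the example, not any real schema:

```python
import ipaddress

# Invariant from the conversation: every network flow record carries a valid
# source IP, a valid destination IP, and a non-negative packet size.
def flow_log_invariant(record: dict) -> bool:
    try:
        ipaddress.ip_address(record["src_ip"])
        ipaddress.ip_address(record["dst_ip"])
    except (KeyError, ValueError):
        return False
    return isinstance(record.get("bytes"), int) and record["bytes"] >= 0

def validate_pipeline_output(records: list[dict]) -> list[int]:
    """Return the indexes of records that break the domain's 'laws of physics'."""
    return [i for i, r in enumerate(records) if not flow_log_invariant(r)]

records = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "bytes": 1500},
    {"src_ip": "not-an-ip", "dst_ip": "10.0.0.2", "bytes": 40},  # violates the law
    {"src_ip": "10.0.0.3", "dst_ip": "10.0.0.4", "bytes": -1},   # so does this
]
print(validate_pipeline_output(records))  # -> [1, 2]
```

A "simulation realm" in this sense is just a battery of such invariants run against generated pipelines before anything reaches production data.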
Cory:Yeah.
One of the things, kind of going back to what you were saying earlier about the different, smaller tools that you put together. When I talk to my LLM to do my tests... and for anybody who doesn't like TDD, you're going to recoil when you hear what I'm about to say that I do, but I swear it's the thing that helps us move so quick... I speak to it in plain English because I speak to it in like BDD, like Gherkin, like, "Given this, when this."
And our tests... our test suite has skewed that direction. And the thing that's interesting about it is we now have these very factual statements, right? It's like, "When there is a Kubernetes deployment"... it's a very English phrase that we use around the org all the time, a sentence that even the sales guy understands. And so the interesting thing is it's so easy to reason about now, because it's a plain English sentence and it ties to a tiny function that does that exact thing, right?
And so now when I'm talking to my LLM, a lot of the tests it doesn't even have to write, because I already have stuff tied to each of the givens, whens, and thens, right? So I get to speak in plain English. And even our sales guy can write like, "Hey, you know, I was talking to the customer and they said they need this."
It's like, "Oh, yeah, well, you can talk to this thing in plain English and it will start building features for you." And the crazy part is like seventy, eighty, ninety percent of the test suite at any point in time is already written because we have all these common phrases.
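The "common phrases" mechanic described here can be sketched roughly as a registry that maps plain-English steps to tiny functions, so a new scenario mostly reuses steps that already exist. This is a hypothetical toy, not any team's actual harness; the phrases and step names are invented:

```python
# A sketch of the "common phrases" idea: each Given/When/Then phrase
# is registered once against a tiny function, so new plain-English
# scenarios mostly reuse steps that already exist.
STEPS = {}

def step(phrase):
    """Register a step function under its plain-English phrase."""
    def register(fn):
        STEPS[phrase] = fn
        return fn
    return register

@step("there is a Kubernetes deployment")
def given_deployment(ctx):
    ctx["deployment"] = {"replicas": 3}

@step("the deployment is scaled to 5 replicas")
def when_scaled(ctx):
    ctx["deployment"]["replicas"] = 5

@step("the deployment has 5 replicas")
def then_replicas(ctx):
    assert ctx["deployment"]["replicas"] == 5

def run_scenario(phrases):
    """Execute a scenario written as a list of plain-English phrases."""
    ctx = {}
    for phrase in phrases:
        STEPS[phrase](ctx)  # a KeyError here means a step still needs writing
    return ctx

run_scenario([
    "there is a Kubernetes deployment",
    "the deployment is scaled to 5 replicas",
    "the deployment has 5 replicas",
])
```

In practice teams often reach for a BDD library (Cucumber, pytest-bdd, and the like) rather than rolling this by hand, but the core trick is the same: the phrase is the contract, and only unmatched phrases need new code.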
So it's coming to the point where it's like, "Hey, let's talk about doing some flow log stuff." It's like, oh, we just have to add a rule in plain English and a little function, and it'll write that function for us, and it's so easy to reason about. And I see these people where it's like, "Oh, I've got a 10,000 line PR."
It's like, I've got 10,000 three line PRs. And it's very easy to reason about where almost anybody in the org can look at it and go, "Oh, we've added a new rule that does this." I can see the code that it produces for the test. And like, now there's some logic that's going to fulfill that, right?
I don't know, like, that feels like a really good flow. Again, I know that people that don't like TDD are like, "Ugh, cucumber. Like, I don't want Cucumber and Gherkin back in here." But it's so easy to develop this way, I cannot imagine going back.
And like, folks might be hearing like, "Oh, you're not developing anymore though." It's like, "No, I'm doing more architecture and thought about software than I ever have."
Nikhil:Right?
Cory:Like the last part of the TDD, right? It's red, green, refactor. Most people don't get to the refactor part. Like, how do you get your PM to give you time to refactor your code? It's like, it's part of the job. Now it's the job, right? It's like, I see stuff come in and I'm like, "Oh, this all looks like shit."
Like, now I get to think about like, what does a grander, better architecture look like for customers on the API or even internal developers using stuff. Like, how do I build something that works well for them? And it's like I get to focus a lot more time getting rid of debt than I ever have in any product I've ever worked on. It's really, really interesting.
So, coming back to platform engineers, I'd love to get your thoughts on this because I think the most underappreciated part of platform engineering - it's often invisible, right? Even a good DevOps team, good Sysadmins, like when it's working, nobody notices you, right?
Nobody's like running around like high fiving this admin and they're like, "That auto scaler really does its job well." It's like when it doesn't, that's when people recognize you. Right?
And so in this new era of, you know, a lot of code being generated by agents and vibe coded by end users, how does that change what we're responsible for as platform engineers? And stability? And I know it's kind of two phases - is the platform engineering team vibe coding stuff, and then are the engineers on top of the platform vibe coding stuff - but I would love to know how you see the world through the lens of a platform engineer, where we're just getting things that may be a lot less stable at first, but it's unstable and coming through at a faster rate... which we can now apply quality to and get stable.
Nikhil:First off, I mean, with BDD and Cucumber, I mean, you're bringing back some fun memories. Ruby on Rails. Man, that stuff is so well rooted in that... it still is so well rooted in that ecosystem. TypeScript has taken over the world these days, but I do miss using that. I remember the syntax and those sort of things used to be a little bit unwieldy with Cucumber, but that stuff got figured out pretty quickly.
And now, I mean, it's just, you know, using natural language to express yourself and not even bothering with those, you know, those, those syntactical nuances and quirks. And really, I mean, that becomes your spec... the guardrail in which your software has to operate.
And I feel when it comes to platform engineers, you're 100% right. It's pretty asymmetrical.
Like when the system is working well, when you're doing a fantastic job, that means nobody's noticing that the system is working well.
And with vibe coding and with AI, that's exactly the thing that platform engineering has to uphold, right? Like can it sustain a ten x or even hundred x or even a thousand x more software running on the platform? So just scaling it up in a reliable way and really supporting these new coding paradigms that are emerging now. Which is, you know... coding factories is like my favorite one, right? I mean, where you now have... or development teams are moving toward a world where instead of writing code...
So, okay, I mean, if you had to back up, right, before twenty twenty-one, developers were writing code character by character, right? One keystroke at a time. And then came AI-based IDE autocomplete, right? Where you start typing something and the AI was like, "Oh cool, maybe these are the parameters in the function that you wanted." And then came prompt-driven agentic coding, right? Where tools like Cursor and Claude Code became popular, where instead of starting in the IDE, you now start on the right hand panel and, you know, it's like, "Hey, this is what I wanted to do." And then it proposes a diff that you then accept or reject.
And now I see more and more teams going toward a world where they're codifying their practices and, you know, the thing they produce in a set of markdown files and then creating a system where you just insert the spec from one end and it produces a fully functional pull request from the other side, right?
Cory:Yeah.
Nikhil:Taking into account like all the quirks of your system and your enterprise and your company and all the coding standards and everything else. And I feel like that entire harness again, that factory harness, which is called the factory pattern, right? I mean, it's different than the factory pattern that I used to use when writing code way back when, but now it's just like, okay, cool, like you have this, you have this agentic harness that can one shot an entire spec into a fully functional PR.
And that's going to be the next thing that platform engineering has to tackle, right? Like maintaining these factories and maintaining the assembly lines. And not only maintaining, but innovating on that and you know, making them faster and you know, creating systems where these pull requests when they get produced... because guess what happens when you have a factory producing 1,000 pull requests an hour, right? I mean, that's a lot of code. Where is that going to be deployed? How's that going to be validated?
So scaling up not only the production online infrastructure, but also everything around it is like, well, let's create systems that can deploy PRs and validate PRs at agent speed, not at human speed. Right? So like a thousand PRs an hour should be deployed, should be validated. You know, like some sort of a simulation environment.
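One way to picture "validating PRs at agent speed" is as a fixed set of automated gates that every generated PR must clear before it can merge. The sketch below is a toy illustration; the gate names, the `PullRequest` shape, and the pass/fail logic are all invented stand-ins for real CI, linting, and simulation systems:

```python
# A toy sketch of an agent-speed validation pipeline: every generated
# PR flows through a fixed set of automated gates before it can merge.
from dataclasses import dataclass, field

@dataclass
class PullRequest:
    title: str
    passed: list = field(default_factory=list)
    failed: list = field(default_factory=list)

def gate_tests(pr):      return True                 # stand-in for the unit-test suite
def gate_lint(pr):       return True                 # stand-in for style checks
def gate_simulation(pr): return "flow" in pr.title   # stand-in for the simulation env

GATES = [("tests", gate_tests), ("lint", gate_lint), ("simulation", gate_simulation)]

def validate(pr: PullRequest) -> bool:
    """Run a PR through every gate, record outcomes, merge only on all-pass."""
    for name, gate in GATES:
        (pr.passed if gate(pr) else pr.failed).append(name)
    return not pr.failed

pr = PullRequest("add flow log parser")
print(validate(pr), pr.passed, pr.failed)
```

The point of the factory framing is that this loop has to be cheap and fully automatic: at a thousand PRs an hour, any gate that needs a human in the loop becomes the bottleneck.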
I would expect that's the next frontier that platform engineering has to, you know, conquer. I mean, there's also. Sorry, go on.
Cory:No, no, I'm just like, that is. I mean, you're just. Yeah, my brain's going...
It's funny because like so much of platform engineering has always been about like the golden paths to production, right? I can get you some cloud resources easily, or I can get your Docker Container onto a K8s cluster, or your functions into a lambda.
Like different teams got different paths. But like now it's like if every team... you got a front end team, you got a backend team, you got a checkout team, you got a payments team... if all of these teams are kind of inventing their own little, you know, guardrails and harnesses everywhere. Like, it's a lot of rework, right?
I think you're right. I think you'll probably see a lot of that come into the platform. And that's interesting because think about... I mean, we're already to the point on my team where it's like, the number of PRs that we produce today is stressful. I was talking to my dad recently. He was an air traffic controller in Guantanamo. So like a pretty intense job. Like way back in the day, not like recently.
Nikhil:Sure. Yeah.
Cory:I was just like, "I feel like today developing is so much more exhausting than it was a year ago." And I was like, "And it feels weird because I don't write software anymore." I was like, "I'm just reading and processing." And he's like, "That's the scope, dude." And I was like, "What do you mean, the scope?" And the scope is what they call like the little radar thing that they watch. He's like, "That's what the entire job of ATC is like, you do not get to use the flow state part of your brain whatsoever. You're constantly analyzing and like using the executive function side of your brain." And it's literally more exhausting because you don't tend to use that side of your brain a lot. You're using the creative side of your brain, you're thinking through things, right? You're having these deep flow states versus like, "A plane is landing, a plane is landing. Two planes are near each other."
And that's what we're doing with these PRs. It's like, "These two PRs conflict with main, like, what does that mean?"
I'm shifting context constantly and he's like, "The way they solve this in ATC is every two hours you have to take a break." Because your brain like after about two hours, like it just... the executive side just plummets, right? And with all of these changes... it's like the change rate is insane.
A ton of PRs are going through and, you know, to have people overlooking all of those is hard. It's hard on people, right? And so, you know, as we think about, "How do we get value out of AI for a business?"... the business can't take all that value, because it's going to exhaust humans, right? Humans have got to get some value out of it too. Do our hours change? Do we work less? Do we start to see companies offer this as a benefit? Saying, "Hey, engineers work for two hours a day and then they have off until after lunch and they work for two more hours." Right. I think the job's really going to change a lot.
But you know, at the same time, with all those PRs coming through, it's like starting to... again, building blocks and guardrails. Maybe every single PR, now that you have more time, maybe every single PR does go through k6 and do load testing on it. Maybe every single one does land in a preview environment that has a full suite of browser tests go across it, that you've never had the time to do before because you were kind of slaving away over every single save to a database, right? And that quality work is the thing that every PM wished you had time to do but you never did, because they're focusing on revenue, right?
Like that's the job now. And so it will be interesting to see more of that start to land on platform engineers' plates.
Nikhil:That's such a good comparison with ATC. I mean, like that's exactly how I feel when you... it's the shift away from being able to do deep generative work into like more reactive work.
When we talk about, well, AI is going to, you know, ten x or hundred x a team's productivity, ultimately the team is accountable for that ten x output, which means it's going to... from first principles... is going to come at ten x more cognitive load on that team. Because I don't see us as an industry getting to a point where we are okay with shifting that accountability to AI, right?
The team is still... the human team is still going to be accountable for whatever the team produces, right? Whether they use AI or not. So there is a pretty deep cognitive burden, I feel, when it comes to platform engineering. I mean, just inventing tools and techniques that simplify that burden... that reactive burden... and making, you know, that review process easier.
One of the ideas that we have talked about at Cribl is not all code in your code base is equal. You know, there is certain code that's important and there's certain code that's maybe not as important. So really splitting it up, or, you know, drawing rings around your code: "Oh, this is the ring 0, the kernel." In Cribl's case that would be the core stream processing engine that can process or transform, you know, hundreds of terabytes of data in an hour. That's the kernel, and then the ring around that is maybe like the control plane for the kernel, right?
So what sort of stuff are you putting in? Optimizations, those sort of things, then kind of like expanding that. And then at the very edge you have smaller things like some settings panel hidden in the UI, UX somewhere. So like splitting that between kernel land and user land, kind of how the Linux system does it and then treating different grades of PRs. So (a) like a PR should not span across multiple rings, it should be in its own ring. And then (b) treating those differently. Right?
I mean, if it's a PR that introduces a bug, but the bug is caught and it's easy to remedy... you know, you can push out a fix in ten minutes and there would be no impact... that's different from messing something up in the core. There you'll probably want more human scrutiny.
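The ring idea can be made concrete as a small policy function: map file paths to rings, then enforce that (a) a PR stays within one ring and (b) inner rings demand human review. The path prefixes, ring numbers, and policy strings below are all invented for illustration:

```python
# A sketch of the "rings" idea: classify files into rings, then decide
# the review policy per PR. Prefixes and ring assignments are invented.
RINGS = {
    "engine/": 0,       # ring 0: the stream-processing kernel
    "control/": 1,      # ring 1: control plane around the kernel
    "ui/settings/": 2,  # outer ring: low-blast-radius UI code
}

def ring_of(path: str) -> int:
    for prefix, ring in RINGS.items():
        if path.startswith(prefix):
            return ring
    return 2  # default unknown paths to the outermost ring

def review_policy(changed_files):
    """Reject ring-spanning PRs; require humans for inner rings."""
    rings = {ring_of(p) for p in changed_files}
    if len(rings) > 1:
        return "reject: PR spans rings " + str(sorted(rings))
    return "human review" if min(rings) < 2 else "auto-merge on green CI"

print(review_policy(["engine/parser.c"]))        # inner ring: human review
print(review_policy(["ui/settings/panel.tsx"]))  # outer ring: auto-merge
```

The rejection rule is what keeps the triage cheap: if a PR can only ever live in one ring, the reviewer's attention budget is decided by a path lookup instead of by reading the diff.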
Cory:Yeah, I'm going to give you another Thoughtworks throwback that I feel like you're saying here, but you're not using the words - it's domain driven design. Like you're talking about like the core domain, right? Like that was always the thing.
It's like it's so hard. And so many code bases where it's just like not a monolithic repo, but like a monolithic code base where it's just like... you know, in Rails you open the app folder and there's just seven thousand models in there, right? But like starting to organize. And again it's... this is quality engineering.
This is the stuff we all wished we had time to do in two thousand fifteen and never did. And now it's like thinking about the different parts of your business that are critical. That core domain, that stream transformer... that's always going to get human eyes. I don't know, maybe the user settings page is like, "Eh, we don't give a shit. Let's just let it merge, let the testing go. Let's focus on making sure that the secure parts of the app are secure, the core hot paths are good. And the domains that are not the core domain, maybe they don't need as much attention in guardrails, and we focus on the part that does."
And it's easier to reason about stuff in that world too, right? Because it's like everything's scoped down to the little circle that you're working in.
Whoah, baby.
Nikhil:Yeah, yeah, yeah. It's changing fast. The world is changing fast.
Software engineering, platform engineering, our entire industry is at the tip of the spear when it comes to AI transformation. Since code is such a... I mean coding is like the first breakout case of generative AI, right?
I mean, it's the first breakout success. I don't think anything else is quite that breakout even. The next best thing is probably self driving cars, and those are in like a handful of cities, right? Whereas agentic coding is all around. Every software engineer, no matter where you are in the world, is using some form of agentic coding. At this point we get to decide, you know... or we get to be the guinea pigs of how this is going to play out.
Cory:Hey, and we don't... you don't have to be the guinea pig.
This is the thing that's exciting to me about this that I was like, I feel like all these like AI CEOs, like I feel like they are fucking terrible. They're great at raising money, but they're terrible at marketing.
Like coming out and being like, "Fifty percent of you are going to lose your jobs in the next three years." It's like, "Hey, you know why people throw Molotov cocktails at you? Like, it's because you aren't making people happy with your words."
But it's... at the same time, it's like we've had a few of these... like, I don't want to say revolutions... but these, like, big changes in, like, software engineering over the past, you know, twenty, thirty years. And we're in the next one right now, right?
And I feel like this is the time where it's like, you can get a lot of free labor from these subsidized tokens to make your code bases really good. And then you can think a lot more about what does the future of engineering look like for our industry and what does it look like for our orgs, right? And that to me is exciting. It's like, it's still engineering work. It's headier stuff.
I think it's more interesting problems to solve than save user to database. Like, you've done that at your last seven jobs in different languages. We're at an exciting time.
And I think most people that are afraid of it are afraid of it because of the words that these CEOs are saying, right? And it's like, I think they're saying that because they want to raise money again.
I need more people because I'm making more code, right? It's like, I need more people, right? So I don't know. It's wild times. We can be guinea pigs, but I think a lot of us can also be computer scientists now. Don't be the guinea pig. Be that, right? Be the guinea pig scientist. I don't know. Like, let's figure out how this new world looks because there's a lot of stuff to figure out.
I think CI is going to look a lot different in the next few years. People are starting to talk about... like, I know a lot of people are pointing to GitHub and being like, "GitHub's falling over." It's like, "Dude, GitHub is getting way more commits and repos than they've ever had in their lives. They have a lot of stability to deal with from just the onslaught of data being thrown their way." You can blame it on AI all you want, but they're handling more writes than most of us, right?
So it's like the world, our engineering stuff that we have kind of clung onto since the first DevOps talk from Flickr, like CI/CD and Git and all this stuff... It's like, this stuff is likely going to change and it's us who gets to make the decision as to what that looks like.
It's us that has to get familiar with how these systems work to be able to make those decisions. And I feel like if you're sitting around like afraid of LLMs and AI right now, like those decisions are going to be made for you, which sucks.
Nikhil:A hundred percent. Right. I mean, everyone has got to be thinking about how do they become an owner of whatever their domain is. Right?
I mean just start acting like an owner of the problem that you have, of the problem that you're trying to solve, and then, you know, use every AI agentic software tool out there to help you conquer that and just really deliver that ROI. Really.
So if you're an engineer... if you're a front end engineer, you know, that looks different for you; if you're a platform engineer, that looks different for you. But ultimately it's just showing relentless ownership of the problem that you're trying to solve - what are users looking for, what are your customers paying you money for - and then going and really delivering on that front.
Which is a far better place to be in than being a guinea pig.
Cory:Yeah. Awesome. Well, I really appreciate you coming on the show today. It was super fun.
Where can people follow your work and learn more about what you and your team at Cribl are building?
Nikhil:Yeah. So you know, if you follow me on LinkedIn. Follow me on Twitter. We're pretty active on LinkedIn where we post, you know, all the cool stuff that we are doing at Cribl.
Just this week we released a bunch of benchmarks on LinkedIn where we compared privacy sensitive data scanning models that we build in-house, that we trained in-house, and how they stack up with OpenAI's new privacy filter. So OpenAI released a 1.5-billion parameter open source model on Hugging Face. It's awesome, definitely go check it out.
It can scan and redact sensitive information from any data that you throw at it. And at Cribl, we have our own in-house built models that are way faster than OpenAI's models. Much more accurate as well. So our F1 scores are... which is a score that really tells you about precision and recall when it comes to AI systems, really how accurate a model is... we're able to benchmark our models at more than ninety percent whereas OpenAI's model was about fifty-seven something percent.
And then our model is also a hundred and thirty-three x faster than OpenAI's model because what we have been able to do is we have been able to compress much higher accuracy in a really small model. So you're talking about a hundred million parameter model that can run on a lot of logs, a lot of telemetry data, observability data.
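For listeners who haven't met F1 before, it can be made concrete in a few lines: F1 is the harmonic mean of precision and recall. The counts in this sketch are invented to show the arithmetic, not Cribl's or OpenAI's actual benchmark numbers:

```python
# F1 is the harmonic mean of precision and recall. The toy counts
# below are invented to illustrate the math, not real benchmark data.
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    precision = true_pos / (true_pos + false_pos)  # of flagged items, how many were real
    recall = true_pos / (true_pos + false_neg)     # of real items, how many were flagged
    return 2 * precision * recall / (precision + recall)

# A scanner that catches 90 of 100 real SSNs and raises 8 false alarms:
print(round(f1_score(true_pos=90, false_pos=8, false_neg=10), 3))  # 0.909
```

Because it's a harmonic mean, F1 punishes imbalance: a scanner that flags everything (perfect recall, terrible precision) still scores poorly, which is why it's a common single-number summary for redaction models.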
Because guess what? With agentic AI, LLMs are everywhere. And people are sticking all kinds of things into LLMs these days, right? Personally identifiable information, health records, Social Security numbers, phone numbers. And one of the biggest challenges that IT teams and security teams have to conquer is how do you scrub all of that information from access logs, from audit logs... a really pressing problem across the industry.
So I'm really happy that we are leading the charge on that and developing these models that enable you to scrub sensitive information so it doesn't land in places that it shouldn't. You don't get one of those weird letters in your mail saying that, "Oh, there's been a data breach." Or, "You're signing up for one year free of some monitoring..."
Cory:You know what happens when you have six Equifax monitorings that all overlap from all your breaches? Nothing. Your data's still out there.
Nikhil:It's all out there. Right, exactly.
Cory:That is funny.
We actually... we just added a hook to our stack that's essentially... I don't want to say... kind of like an allow list of what we can put into OTel. You know, we lean into OTel pretty heavily. Folks that have listened to the show before know that I used to do a lot of what I call OTel driven development.
I don't do as much of it anymore. But we lean into OTel pretty heavily. We actually don't emit a single log from our server, we just... everything goes into OTel.
And I'll tell you what, man, LLMs love to put the entire record into OTel attributes if they can. And so we have rules on the types of data that can go in. Even stuff that's not... we don't necessarily deal with PII, but we deal with other things... you know, passwords, secrets, credentials for databases, all sorts of stuff like that. And yeah, so it's just like, "Dude, that is important." You have a big team... a hundred, two hundred, three hundred engineers... you are going to start leaking stuff into logs that probably should not be there, for sure.
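The allow-list hook Cory describes can be sketched as a simple filter over span attributes before export. This is a hand-rolled illustration; the attribute names and the function shape are invented, not the real OpenTelemetry SDK processor API:

```python
# A sketch of the attribute allow-list idea: before a span's attributes
# are exported, redact anything not explicitly permitted. The key names
# and the hook shape are illustrative, not a real OTel SDK interface.
ALLOWED_ATTRS = {"http.method", "http.status_code", "db.system", "user.id_hash"}

def scrub_attributes(attrs: dict) -> dict:
    """Keep only allow-listed attribute values; redact everything else."""
    return {
        k: (v if k in ALLOWED_ATTRS else "[REDACTED]")
        for k, v in attrs.items()
    }

span_attrs = {
    "http.method": "POST",
    "db.password": "hunter2",         # the kind of thing an LLM happily logs
    "user.record": "{...full row...}",
}
print(scrub_attributes(span_attrs))
```

An allow list is the conservative default here: a deny list has to anticipate every secret-shaped key an LLM might invent, while an allow list fails closed when generated code starts attaching whole records to spans.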
That is awesome. Is that model a part of Cribl or is that open sourced as well?
Nikhil:It's right now part of Cribl. It's Cribl Guard, which is a product we released a few months ago and that's how you access that model. And we're going to release a paper on arXiv in the next week or so that will enable you to reproduce that sort of a model on your site as well. So that part will be published.
Cory:Awesome.
We'll put a link in the show notes to Cribl Guard and then if... yeah, when that paper's published we'd love to put that in the show notes as well for folks.
Nikhil:Awesome. Yeah, we'd love that.
Cory:Yeah. Awesome. Well, thanks so much for coming on the show today. Everybody listening, I'll see you next time.
