Will Wilson, CEO and co-founder of Antithesis, joins Ryan and Stack Overflow senior director of engineering Ben Matthews on the podcast to discuss deterministic simulation testing, the pitfalls of chaos testing in an AI-driven world, and how testing can help developers deal with technical debt.
Antithesis is an autonomous testing platform that finds bugs in your software with perfect reproducibility.
Connect with Will Wilson on LinkedIn.
Congrats to user Hannes Neukermans, whose question How can I do tag wrapping in Visual Studio Code? won them a Stellar Question badge.
Our 2025 Developer Survey is live! We want to know what your developer life is like!
[Intro music]
RYAN DONOVAN: The 15th annual Stack Overflow Developer Survey is live and we wanna hear from you. Every voice matters, so share your experience and help us tell the world about the technologies and trends that matter to the developer community. Take the 2025 Developer Survey now. The link will be in the show notes.
Well, hello everyone and welcome to the Stack Overflow podcast, a place to talk all things software and technology. I am Ryan Donovan, your humble host, and I've got a very special co-host today, Ben Matthews, Senior Director of Engineering here at Stack Overflow. Hey, Ben, welcome to the show.
BEN MATTHEWS: Hello. Thanks for having me.
RYAN DONOVAN: We have a fun guest I know you're interested to talk to. I'm very interested to talk about testing and AI and simulation, a whole basket of things that should be interesting to the listeners. Will Wilson, CEO and co-founder of Antithesis. So welcome to the show, Will.
WILL WILSON: Thanks so much, man. Good to be here.
RYAN DONOVAN: So, top of the show, we'd like to ask our guests how you got into software and technology.
WILL WILSON: Oh, yeah. No, this is actually sort of a fun story because I took a very roundabout path. I did not study computer science in college at all. I'd screwed around with computers a bunch as a kid and knew how to program, and so on, just from playing with computers. But I ran away from tech in college because I looked around and I was like, man, this computer science stuff seems cool. Too bad it's all over already. This was back in the early 2000s and Google had already been founded and Facebook had just been founded. And I was like, it doesn't seem like anybody's gonna ever do anything interesting again. So that was obviously very stupid. So I ran away from tech.
I studied the most impractical kind of math that you can possibly study. And then after graduating, I got into scientific research and I did a bunch of that and I sort of bounced through a few different dead-end jobs, and I had one very important moment happen to me at one of those jobs. I was assigned some awful menial drudgery task. I don't remember what it was, even. And I was like, well, this is gonna take a while. So I wrote a little Python script that did the task in 45 minutes, right? And it went like crunch, crunch, crunch and like the whole thing. And I took it to my boss and I said, here you go. And he like looked at me in horror and I was like, “What's wrong?” And he's like, “That's supposed to be what you're gonna be doing for the next three months.”
And so then I was like, oh, huh, that's interesting. My ability to write this like crappy Python script is actually my highest value talent, maybe I should lean into that. And so I basically kind of went back to school, right? I went on the internet and I took a bunch of online classes in programming and so on, and wrote my own ray tracer and wrote my own compiler for a toy language and just tried to like sort of learn computer science myself while I was out on paternity leave and then also late at night. Then eventually, I kind of bluffed my way into a job. And then from that, I like bluffed my way into my next job. And this actually ties into the story of why I fell in love with testing and why I think testing is so cool. So the second one of those jobs I bluffed my way into was at a very, very cool startup called FoundationDB, which made a very sophisticated fault-tolerant distributed database. And if you know anything about distributed databases, you know that these are very difficult systems to get right. Right?
And they hired me like, haha, I made my way through the interview and like, I had no idea how to write a database, right? Or like how to write a distributed system or how to write any, you know, really any code at all. Like I was not a professional engineer at that point, but I snuck in, right? And so think about that. Ben, think about having somebody on your team like that, right, who's like gotten in the door, and you don't realize they're actually a charlatan, and now they're like committing code. This is like your worst nightmare, right?
BEN MATTHEWS: I know. A lot of people have that dream of showing up to an exam they haven't studied for. That's my nightmare.
RYAN DONOVAN: (laughs)
WILL WILSON: So I'm writing code and this like should have been a disaster. I feel like at most companies this would've been a disaster, but it was not a disaster because FoundationDB was the pioneer of this incredibly sophisticated testing technique called deterministic simulation testing, where basically you take your real system and you run it through like thousands or millions of different scenarios of like all possible orderings of events and how they could play out and what could happen, and, and it's, you know, all sort of autonomously generated in the background.
And I wrote lots of bugs in my first few months at that company, and zero of them ever made it to production and zero of them ever caused any harm. And that was like one of the things that made me realize that like powerful testing is a crazy superpower for organizations. It means you can hire an extremely unqualified person. It means you can take a risk on a person, right?
RYAN DONOVAN: Right.
WILL WILSON: And like they can't do that much damage while they're coming up to speed, while they're learning. And it just means that like your whole team can move faster.
RYAN DONOVAN: Yeah.
WILL WILSON: Because you can take more risks with people and with projects.
RYAN DONOVAN: I mean, in the age of vibe coding, that's gonna be even more valuable.
WILL WILSON: Right.
RYAN DONOVAN: So can you tell us a little bit about this deterministic simulation, and how do LLMs or AI fit into that?
WILL WILSON: Let me tell you about the first part first.
RYAN DONOVAN: Sure.
WILL WILSON: Deterministic simulation is a testing technique that was, as far as we know, invented at FoundationDB, but now has spread way beyond that and has many practitioners throughout our industry. The basic idea is you want to take your system and you want to convert it into a form where it can be run completely deterministically, right? Like, what does that mean? Why isn't your system already deterministic?
RYAN DONOVAN: (laughs)
WILL WILSON: Well, if it gets input from the user, it's not deterministic. Right?
RYAN DONOVAN: Right.
WILL WILSON: If it checks the clock, it's not deterministic. If it reads files off disk, it's probably not deterministic. But then there's like trickier stuff. If you've got multiple microservices and they're talking over networks, it's not deterministic: a packet could take a larger or a smaller amount of time to get from one machine to another depending on when you ran the test.
If you've got a bunch of threads, it's not deterministic; the OS can decide what order to schedule those in. Real-world software, real big software, anything bigger than the scale of a single function or module, is almost always non-deterministic.
RYAN DONOVAN: Right.
WILL WILSON: So deterministic simulation testing is basically saying, let's suppose we could solve all those problems somehow. Like let's– we're not going to think too hard about how we're gonna do it, but let's suppose we could. Well, what would that enable us to do? Well, it would be amazing because you could then take all kinds of crazy situations, right? You could ask questions like, hey, when machine A talks to machine B, if I like drop this network connection exactly 32 milliseconds into that process, what happens? If the result is something I don't like, I can run exactly that situation over and over and over again, even if it's some incredibly rare bug and always perfectly repro it, right? And then, the next step beyond that is like, oh, well now I can search over the space of all possible moments at which I could have dropped that packet and see if any of them cause a problem and if that one did, then I can just replay that one.
So, if you've heard of chaos testing, deterministic simulation is like the way better version of chaos testing. It's better in many ways. It's better because it gives you extremely actionable things to like look into and debug when it finds a problem. It's better because it's not happening in production, so it's not affecting your real customers. It's better because you can speed up the simulation and you can do it with a lot more parallelism and find bugs faster. So it's basically like the next evolution beyond chaos testing.
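To make that search concrete, here's a minimal Python sketch of the idea: a toy two-node exchange where every source of "randomness" flows from a seed, so any drop time that causes a failure replays exactly. The retry invariant and its bug are invented for illustration; this is a sketch of the technique, not Antithesis's implementation.

```python
import random

def run_scenario(drop_at_ms, seed):
    """Deterministically simulate node A streaming 100 messages to node B.

    All 'randomness' (here, per-message latency) comes from one seeded
    generator, so the same (drop_at_ms, seed) pair always replays the
    exact same run.
    """
    rng = random.Random(seed)
    clock_ms = 0
    delivered = []
    for msg_id in range(100):
        clock_ms += rng.randint(1, 10)   # simulated network latency
        if clock_ms >= drop_at_ms:       # fault injection: connection dies here
            break
        delivered.append(msg_id)
    # Invented invariant: the (buggy) retry layer is supposed to get at
    # least half the batch through before the drop.
    return len(delivered) >= 50

# Search the space of all possible drop moments for one seed...
failures = [t for t in range(1, 1001) if not run_scenario(t, seed=42)]

# ...and replay the first failing moment perfectly, as many times as needed.
if failures:
    assert run_scenario(failures[0], seed=42) is False
```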
And what does it have to do with LLMs? Well, I think the main thing is just that: LLMs are great. They can like bang out huge amounts of code very fast and they can like write some unit tests, or whatever, but like, let's be real, like unit tests don't actually catch most production issues. They're okay for like, do I have like an extremely obvious regression in this one little function or whatever, but like the gold standard for testing is like end-to-end tests that force your whole system to work together and see if it actually works. I think LLM-generated code is generally not very testable when it comes to that and generally has a lot of problems in it. And so I think deterministic simulation is probably like one good approach to trying to validate and feel better about all the stuff that you vibe coded.
BEN MATTHEWS: I think that makes a lot of sense, especially around unit testing: it's really only finding things that you knew to look for to begin with when you wrote those tests. But what I'd be really interested in is what you said about creating a fully deterministic simulated environment, which sometimes would feel counterintuitive coming from chaos testing, because it's so self-contained. Could you walk me through, for a company that really wants to try this and take this route, what the steps to creating that fully deterministic environment would be? What does that look like?
WILL WILSON: So the road forks here. Do you want me to talk about how you do this without Antithesis or how you do it with Antithesis, because they're very different stories?
BEN MATTHEWS: I would love to hear how Antithesis does it.
WILL WILSON: A lot of the value that Antithesis provides is that it makes this process like stupid simple to adopt, where previously it was very, very hard to adopt. For instance, the thing I mentioned about threads, right: in the old days, if you wanted to do deterministic simulation testing and you wanted to have more than a single thread's worth of concurrency, you had to build some crazy user space scheduler, right? You had to somehow get the Linux kernel out of the picture so that you could have total control over what order the threads run in and where they interleave. That's crazy. Like most people don't wanna do that, man. And not to mention, you have to do this not just with your software, but with all of your dependencies. It's like crazy stuff.
There are people who've done it, actually, a number of our customers I can mention. Right? Like TigerBeetle is another database company. They did this very successfully and have done a really great job with it. You know, a bunch of people in the cryptocurrency world have done this because the stakes are so high there. But this was definitely a technique for fanatics, right, when you had to do all that yourself. What Antithesis did, like our approach, is: why don't we just solve that whole set of problems once and for all, for everybody? So what we did was we wrote a special kind of hypervisor, like something that runs a virtual machine. But the difference is our hypervisor emulates a deterministic computer. So there's nothing that you, or the Linux kernel, or anything in the user space, or any of your dependencies can possibly do to break out of this deterministic world and make it non-deterministic.
And what that means is you can just take your software as is, pretty much unmodified, and take your OS, pretty much unmodified, and just drop it in there and it becomes deterministic. Right? And anytime you try to access a random number from the hardware random number generator, you're gonna get the same random number, unless we decided that we wanted to reach in and give you a different random number to see if something different would then happen. Maybe we'd find a bug that way. Right? So anything your software tries to do, it will always get the same answer if we're running it with the same random seed.
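As a rough illustration of that sealed, deterministic world (a toy sketch, not how the actual hypervisor works), here's a Python stand-in where the clock and the "hardware" random number generator both answer from one seeded source:

```python
import random

class SimulatedWorld:
    """Toy deterministic machine: every source of entropy the program can
    observe is answered from a single seeded generator."""

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self._clock = 0.0

    def time(self):
        # Replaces checking the real clock: simulated time advances by a
        # reproducible pseudo-random amount.
        self._clock += self._rng.expovariate(1000)
        return self._clock

    def random_bytes(self, n):
        # Replaces the hardware random number generator.
        return bytes(self._rng.randrange(256) for _ in range(n))

# Same seed, same universe: two runs observe identical values.
a, b = SimulatedWorld(7), SimulatedWorld(7)
assert a.random_bytes(8) == b.random_bytes(8)
assert a.time() == b.time()

# A different seed gives a different, but equally reproducible, history.
c = SimulatedWorld(8)
assert c.random_bytes(8) == SimulatedWorld(8).random_bytes(8)
```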
So now the only thing left, basically, that a customer has to do, or an interested party has to do, is make it so that their software can basically run disconnected, because obviously if you're calling out to some third-party web service out on the internet, that is non-deterministic; you might get different answers depending on what it does. You know, there's sort of two basic approaches to that, right? Like one is, if it's like a Postgres database, just run the database in there too. Right? We can get very, very large virtual machines now, and all the dependencies that you can deploy, you just throw them in there too. And now we're also testing them and testing if they have some crazy race condition, which is great.
But the other approach is mock it or stub it out and we've built our own good mocks for the most common things, like we have basically an entire fake AWS that we can run in there with you. But there's a lot of good like open source things as well that we leverage for this stuff. So, you know, if it's on the list of things where we already have coverage, then you're good. And otherwise, you might have to build a mock. That's still like a pain. Right? It's like a friction point for adoption, but it's a lot better than like, I have to figure out how to turn my microservices into a deterministic monolith, or whatever.
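In miniature, the "stub it out" option might look something like this; the payment API below is invented for the example, and real mocks (like the fake AWS Will mentions) are far more involved:

```python
class FakePaymentService:
    """In-memory stand-in for a hypothetical third-party payment API, so the
    whole test can run disconnected and deterministically."""

    def __init__(self):
        self.charges = []

    def charge(self, account, cents):
        if cents <= 0:
            # Mimic the real API's error behavior, not just its happy path.
            return {"ok": False, "error": "invalid_amount"}
        self.charges.append((account, cents))
        return {"ok": True, "id": len(self.charges)}

def checkout(payment_service, cart_total_cents):
    """Code under test: it depends only on the injected service, so swapping
    in the fake removes the last source of non-determinism."""
    return payment_service.charge("acct_123", cart_total_cents)

fake = FakePaymentService()
assert checkout(fake, 500)["ok"]
assert not checkout(fake, -1)["ok"]   # failure paths are testable offline too
```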
BEN MATTHEWS: That was gonna be my question: with distributed systems being as popular as they are, the ability to simplify that process would be a sort of superpower of this.
WILL WILSON: Yeah, well that's sort of like most of our customers, actually. So most of our customers are people who have a big complicated collection of distributed systems– got some client, and they've got some API server, and they've got some database backend, and they've got a bunch of microservices doing whatever, and they wanna know: if I make a change to one piece of this, is it gonna break everything? And is it gonna break everything in some really weird random situation, like when I do a failover, or the transaction's getting rolled back, or when I'm in the middle of upgrading my database? These questions are traditionally very, very, very hard to answer, very hard to test. And so by sort of taking this whole thing, this whole complicated pile of stuff, and turning it into a deterministic pure function, and then having computers sit there and think of thousands or millions of ways to screw with it, that just gives people tremendous confidence that things are gonna go okay when they do this in prod. They've gotten really good results.
RYAN DONOVAN: I assume that if you have one of those flaky bugs that shows up in random situations, you can run the deterministic system with different seeds or whatever. Right?
WILL WILSON: Yeah, exactly. You know, that's honestly one of the superpowers, right? Like you have a flaky bug, it happens like one in a million times. Bad news: you work at Facebook, so one in a million is like every day. Or, like, maybe every second. So how do you debug this thing? Right? Like sometimes you add some debugging code and that literally makes the bug go away or become asymptomatic or whatever. I've definitely had bugs like that. What's cool about a simulation is it's also a form of time machine–
RYAN DONOVAN: Right.
WILL WILSON: Effectively, because you can rerun the simulation and just pause it before you get to the failure.
RYAN DONOVAN: Yeah.
WILL WILSON: You can back up. Right? So now I can do this like counterfactual style of debugging. I can be like, make that bug happen again, but a second before it happens, I want you to change this config parameter. Does it still happen? Like, oh, it does. That's interesting. Okay, well two seconds before it happens, I want you to do this other thing. Like, does it still happen? Oh no, it doesn't happen. Okay. Or you can go back and be like, hey, one second before the bug, I want you to turn on packet logging. Right? Or I want you to get a core dump of this process.
Having a time machine is actually very, very, very nice for debugging these quasi-production issues. Like I would love it if I had a time machine when things– real things happen in prod, right? It would simplify my life a lot. So yeah, it's pretty cool. It's pretty high productivity. Like, when we were at FoundationDB, usually when we did run into a bug which our simulator didn't catch, which was rare, but it could happen, especially if the simulator was wrong about what the OS might do in a certain situation, you know, what we would do is we'd first go and try and fix the simulator so that the bug reproduced there, and then we'd fix the bug in simulation, because that was just so much faster than trying to fix it directly in a prod reproduction.
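Here's a toy Python version of that counterfactual loop, with an invented flaky operation: because each run is a pure function of its seed and config, "go back and change something" is just another function call.

```python
import random

def run(seed, retry_limit):
    """Invented flaky operation: hits transient failures, retries, and may
    give up. Fully determined by (seed, retry_limit)."""
    rng = random.Random(seed)
    attempts = 0
    while rng.random() < 0.7:            # simulated transient failure
        attempts += 1
        if attempts > retry_limit:
            return "bug: gave up"        # the rare failure we're chasing
    return "ok"

# Find a seed that triggers the bug...
seed = next(s for s in range(10_000) if run(s, retry_limit=3) != "ok")

# ...replay it perfectly, then ask counterfactual questions about that run.
assert run(seed, retry_limit=3) == "bug: gave up"
print(run(seed, retry_limit=10))         # same history, tweaked config: fixed?
```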
BEN MATTHEWS: That's really cool, the places that people get to focus on: maybe the creativity is the "I want to change this config value here" part, the exploratory part of it, while it seems like Antithesis is doing a lot of the stuff that might be very long and painful, like the three months of work your boss originally wanted you to do. That's being solved. And there tends to be a push in the industry around AI of "take all the boring, tedious stuff, solve that for me, and I want human beings to focus on the creative part of it all." Would you say Antithesis is leaning into that paradigm, or changing that paradigm?
WILL WILSON: Yeah, 100%. Like at one point– I don't know– this is not like our current motto or whatever, but at one point we had a marketing tagline that was like “Replacing the worst 50% of your job.”
BEN MATTHEWS: Oh, I love that.
WILL WILSON: Because like every software engineer spends about half their time debugging, like stupid crap. And it's like, what if we just got rid of that and let you do the part of your job that was actually fun? I think we're a little bit opposed to the side of the AI coding world that's saying humans are obsolete.
There's like– there's some vendors in this space who are just like, yeah, we're gonna replace all the engineers. I think that's incredibly premature, even if it ever happens; it is like definitely not happening this year. I'm sorry.
RYAN DONOVAN: Right.
WILL WILSON: And I think that, you know, our approach here is not to try and replace humans, but to try and augment them with tools that make them vastly more productive and effective, and let them build cool things that they couldn't have done before and let them get back to the fun parts of their jobs. That's sort of the vision.
RYAN DONOVAN: I wonder how the, you know, the two things play together because you have this very deterministic simulation and then you have these non-deterministic gen AI things. So how do they work together?
WILL WILSON: They actually work together pretty well. So, here's the thing about gen AI. It's awesome in any situation where either it's fine if it's totally wrong, because sometimes it will be, or where it's really, really cheap and easy to check. I think software development is not quite there because telling if a program is buggy is actually quite hard, and if a software program is wrong, it's often catastrophic.
The interesting thing though, is software testing specifically, I think is a place that has these qualities, right? If I ask a gen AI system to write some, like calling code, you know, you're the developer of a library, let's say, and I wanna test your library, and I'm like, hey, ChatGPT, write some code that calls this library. Maybe it'll do it right, maybe you'll do it totally wrong. That's actually fine because as the developer of the library, you probably wanna test both of those cases. Right?
RYAN DONOVAN: (laughs)
WILL WILSON: You wanna test the case where the user's using it right, and you wanna test the case where the user's using it wrong.
BEN MATTHEWS: And so then the hallucination becomes a feature.
WILL WILSON: Yeah. Exactly, right! And so we basically turn this deficiency into a virtue. And so for that reason, like we're kind of bullish about using these things for testing purposes. Even when they're stupidly wrong, it's actually good.
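A small sketch of that point, using an invented library function: generated calls that "get it wrong" still earn their keep, because the library has to reject them cleanly rather than crash.

```python
def parse_port(s):
    """Hypothetical library function under test."""
    n = int(s)                    # may raise ValueError; that's the contract
    if not 0 <= n <= 65535:
        raise ValueError("port out of range")
    return n

# Imagine an LLM emitted these calls, some right and some wrong:
generated_args = ["8080", "443", "-1", "not a port", "99999"]

for arg in generated_args:
    try:
        print(arg, "->", parse_port(arg))    # valid usage must succeed
    except ValueError as e:
        print(arg, "-> rejected:", e)        # invalid usage must fail cleanly
```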
RYAN DONOVAN: They can find new ways to break systems. I love it. So having it on a hypervisor, how big of a compute can you simulate?
WILL WILSON: So today, your software and all of its dependencies need to fit within the memory of a single computer, and that sounds like a bit of a limitation. It's actually not as bad of a limitation as I initially thought it would be, because I was not up to date on just how fricking big computers have gotten these days. You can rent a computer from Amazon with– no, you can rent a computer from Google with like tens of terabytes of memory. I mean, it's crazy. Amazon's actually pretty close as well. Like they're not quite as big, but they're very big.
So that's like not as bad of a limitation as it sounds like, but if you get into like, you know, I have thousands of microservices…if we had Amazon themselves as a customer, right, and we wanted to like simulate all of AWS in a single simulation, I think that would probably run into scalability problems. It would probably also not be the most efficient way to test that system though. Right? You get better performance if you just take a slightly smaller self-contained piece and like test it in isolation. You do still wanna sometimes test everything together because you wanna make sure that nothing crept into the corners of your model where things talk to each other. You know, on the drawing board, we do have plans for being able to distribute the simulation itself across multiple hypervisor instances on different physical hosts and have them emulating in lockstep, but we've never actually had a customer come to us with something too big to fit on one computer. So that's purely design space right now, like we haven't done it yet just 'cause we haven't had to.
RYAN DONOVAN: Yeah, that was actually gonna be my next question: can the simulations talk to each other?
WILL WILSON: Yeah, yeah, yeah. You totally can. Because I mean, basically, so this is actually how a lot of multiplayer games work. This is actually one of the places where my co-founder Dave got the idea. He used to be a game programmer back in the day, and like the way that a lot of traditional multiplayer game servers worked, it was basically a deterministic simulation running on multiple computers so that if there was some lag or if there was some partition, some disconnection or whatever, the like disconnected clients would just keep running the game logic and they would basically get to the same place as the other peers. And then when they were able to resynchronize or reconnect, the drift would not have been very big, right, because like they're essentially computing the exact same function.
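A toy lockstep sketch of that multiplayer idea: peers that apply the same deterministic game logic to the same command stream compute identical state, so a lagging peer can replay what it missed with zero drift. The one-dimensional "game" here is invented.

```python
def step(state, cmd):
    """Pure, deterministic game logic: the whole game state is a position."""
    return state + {"left": -1, "right": +1}.get(cmd, 0)

commands = ["right", "right", "left", "right"]

peer_a = 0
for cmd in commands:                 # peer A stays connected the whole time
    peer_a = step(peer_a, cmd)

peer_b = 0
for cmd in commands[:2]:             # peer B gets partitioned partway through...
    peer_b = step(peer_b, cmd)
for cmd in commands[2:]:             # ...then replays the commands it missed
    peer_b = step(peer_b, cmd)

assert peer_a == peer_b              # no drift after resynchronizing
```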
BEN MATTHEWS: Well, I'm thinking of two kind of distinct personalities that would approach this, people that want this sort of artificial way of helping with testing: those that are starting a new application and want to do this from the ground up, and those that might be inheriting a huge enterprise application with a bunch of buggy tests, a lot of flaky things. What would the journey be like for each of those trying to make this part of their day to day?
WILL WILSON: For group one, this is, I think, very, very attractive because you can essentially– like when you first start writing code, you don't have any bugs yet. Right? Everything's great. Like definitionally because your thing doesn't do anything yet. And so–
BEN MATTHEWS: Yeah, we should have stopped there (laughs).
WILL WILSON: Right. So, but if you have a magic box that tells you if you've introduced a bug and you start with something with no bugs, you have a very easy way of like keeping things good, which is just like, never allow it to say you have any bugs, right? So you write some code and it's like, yep, good. And you write some more code and it's like, yep, good. And then you write some more code and it's like, eh, I found a bug, and you just roll it back. Right? And now you have no bugs. And like you try again.
And that sounds crazy, but it's actually insanely high productivity because if you just never allow a single bug to make it into your program, you are not like getting distracted fighting production fires and you're not like trying to like go root cause some like complicated repro of something that somebody brought to you without the right information. It's just like you kind of instantly know with each change you make whether you have a bug and you can always just like roll it back right there in your editor or whatever. So it's very high productivity. It can help you find issues when you're designing stuff like in the first place. We have some people who– some customers who've done that.
A guy named Carl Sverre actually just wrote a blog post about this. He's an early guinea pig customer of ours. He was building a new Postgres-based SQL synchronization system from scratch, and he basically used Antithesis from day one to do this, and, you know, he wrote about how he sort of lived this way. Right? And basically didn't allow bugs to happen. You know, that would be like, that's nice, but like most people don't live in that world. Right?
Most of our customers are definitely in the second group you said. They're maintainers of some giant enterprise thing that's like full of flaky tests and full of bugs and really complicated and nobody knows how it works, and there's been multiple generations of people coming in and inheriting it. Right? That's the world that 99% of people live in. I think the good news is we can help those people a lot too.
So, it just requires a little bit more thought about how you do this, because the basic problem is you're gonna turn on the system in Antithesis and it's gonna be like, you have 73,522 bugs. Right? And you're gonna be like, well, screw that. So basically what we recommend is you don't panic and you don't freak out. You do not need to solve all these bugs right now. You had all these bugs yesterday as well. Like, you were still alive yesterday. You're still alive today.
BEN MATTHEWS: Don’t shoot the messenger.
WILL WILSON: Right? Take a deep breath. First question: Are there any bugs that are currently on fire in production that we have found in this simulation? If yes, let's go look at those and maybe we can help you debug those and get those under control. And now you can breathe a little bit and like your life has already improved a little bit. Okay. Take the very highest priority ones, which, you know, we're gonna interpret as like some customers actively screaming about them right now, and like, let's just fix those. And that's probably not very many of the bugs that we found. It's probably like a small, small subset. Now you've like gotten rid of the super high priority ones, but you have this like giant remaining backlog of stuff and like nobody wants to go look at it.
That's also okay because what we can do is we can just pretend that we are living in the first world of the person who started from scratch with this system. And the way we do that is we say all of these past bugs, we're gonna keep tracking them. We're– because maybe they're gonna happen in production tomorrow, and then we're gonna wanna go back and repro them fast and fix them or whatever. But we're not gonna like go work on them proactively.
What we're gonna do is we're gonna first try and make it so that no new bugs are ever introduced. This is a relatively new feature for us: You can now configure Antithesis, so that it basically only tells you about new bugs and that is what we recommend for people who are dealing with a giant mountain of technical debt. Because the thing is, like I said before, solving a brand new bug is really fast. Going back and root causing and understanding and analyzing an old bug is really slow. So if we can just stop the bleeding, right, like stop new bugs from coming in, we will not allow the number of bugs to go up.
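One plausible way to picture that "new bugs only" mode (a sketch of the idea, not Antithesis's actual feature): fingerprint every finding on its stable fields and suppress anything already in a stored baseline.

```python
import hashlib
import json

def fingerprint(bug):
    """Stable identity for a bug report: hash the fields that identify it,
    not the ones that vary run to run (timestamps, seeds, logs)."""
    key = json.dumps({"assertion": bug["assertion"], "site": bug["site"]},
                     sort_keys=True)
    return hashlib.sha256(key.encode()).hexdigest()

def new_bugs(found, baseline):
    """Report only findings whose fingerprint isn't already known."""
    return [b for b in found if fingerprint(b) not in baseline]

# Yesterday's mountain of technical debt becomes the baseline...
old = [{"assertion": "no deadlock", "site": "queue.c:88"}]
baseline = {fingerprint(b) for b in old}

# ...so today, only a genuine regression raises an alarm.
today = old + [{"assertion": "no data loss", "site": "wal.c:14"}]
print(new_bugs(today, baseline))   # -> only the wal.c finding
```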
Basically, that's a really good starting point. And then opportunistically, as the team gets time to breathe, you can sort of start going back through the list. Look and see if any of them are issues that you really wanna fix. Like, one challenge, honestly, with selling this is it sounds a little bit too good to be true, right?
Like, you know, like engineers are like pretty cynical people. Like I'm a pretty cynical person. It's like people are like, come on man, like this doesn't really work. And like it does really work and, you know, it's not magic, right? Like it definitely has problems and it, you know, it's like not the perfect tool for every job, but like, I actually do think that it makes people's lives a lot better. Like if you're in the like already drowning in technical debt situation, you need to approach it in a methodical way.
RYAN DONOVAN: Alright, everyone, it is that time of the show again where we shout out somebody who came on to Stack Overflow and dropped a little knowledge, shared a little curiosity, improved the site. Today we're shouting out the winner of a Stellar Question badge. Congrats to Hannes Neukermans for asking “How can I do tag wrapping in Visual Studio Code?” If you too are curious about that, we'll include that in the show notes. I am Ryan Donovan. I edit the blog and host the podcast here at Stack Overflow. If you liked what you heard, disliked what you heard, want to test whether the email address exists, you can email me at podcast@stackoverflow.com, and if you wanna find me on socials, you can look me up on LinkedIn.
BEN MATTHEWS: I'm Ben Matthews, the Director of Engineering at Stack Overflow. If you wanna reach out, you can find me on Bluesky and LinkedIn at Ben Matthews.
WILL WILSON: I'm Will Wilson, co-founder and CEO of Antithesis. Feel free to email me at Will.Wilson@antithesis.com. You can also find me on LinkedIn and you can find our company on Twitter.
RYAN DONOVAN: All right, gates are open. Thank you everybody for listening, and we'll talk to you next time.
(Outro music)