Ryan is joined by Jeremy Edberg, CEO of DBOS, and Qian Li, co-founder of DBOS, to discuss durable execution and its use cases, its implementation using technologies like PostgreSQL, and its applications in machine learning pipelines and AI systems for reliability, debugging, and observability.
DBOS Transact is a lightweight, open-source library that makes durable execution simple so you no longer need to worry about manually coding retries and recovery procedures.
Connect with Jeremy on LinkedIn.
Connect with Qian on LinkedIn.
Shoutout to Stack Overflow user Vanita L., whose answer to "What does the Swift 'mutating' keyword mean?" earned them a Lifeboat badge.
[Intro music]
RYAN DONOVAN: Hello everyone and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I am your humble host, Ryan Donovan, and today we're gonna be talking about durable execution– how it can help your ML pipelines, your failure-prone RAG systems. I have two great guests today, Jeremy Edberg, CEO of DBOS, and Qian Li, who is one of the co-founders and, apparently, the brains behind the whole operation. So welcome to the show, you two.
JEREMY EDBERG: Thanks for having us.
RYAN DONOVAN: So top of the show, we'd like to get a little sense of how you got to where you are. Can you give us a little flyover of how you got into software and technology?
JEREMY EDBERG: So, I've been in software and technology for longer than I care to admit. I started my career by dropping out of college to join a startup, which was supposed to guarantee retirement after four years, and that was in 1999. So as we all know, that didn't work out. And I've been all over the place since. I've worked in security at eBay and PayPal. I was the first employee at Reddit, the first SRE at Netflix, and the first person to use AWS Lambda outside of Amazon. I've done a few other things as well. My journey has been long and winding, but it led me here to DBOS, where I met Peter and Qian, who had been building this for a couple of years before I joined. I was so impressed with them that I had to join the company.
QIAN LI: So, as for me, before DBOS, I was a PhD student at Stanford, where I did research on DBOS. The company is actually built on three years of joint research between Stanford and MIT.
RYAN DONOVAN: Talking about durable execution, I think I have a sense of what it is, but can you give us a definition for the folks at home?
QIAN LI: The easiest way to understand durable execution is checkpointing your application. So think about what a durable workflow is: it's just a sequence of operations. Take a checkout operation. When you click the checkout button on any e-commerce site, you'll probably first reserve the inventory, subtracting inventory by one by updating the database. Then it will redirect you to an external payment system like Stripe or PayPal, and after that you may want to send a confirmation email to the user. Durable execution makes sure that every step is executed and its result is persisted, so that you never lose your payment process, you're never double-charged, and you always receive what you ordered.
RYAN DONOVAN: So it's sort of an auto save system, right? No longer going in hardcore mode.
JEREMY EDBERG: Yeah, exactly. Checkpointing in video games is the best analogy.
RYAN DONOVAN: Yeah. This is something I've seen and talked about. There's obviously certain calls in a network system that are too important to fail. Right? And then retrying those ones becomes a problem too because you don't want them to be double counted. How do you manage that tension?
QIAN LI: Yeah, so the core technology is that we use the database to store your execution state, combined with idempotency. For example, every time we start a workflow, we store a database record saying this workflow has started. Then before executing each step, we check the database to see if this step has executed before. If it has, we skip the step and just use the recorded output. And after each step's execution, we checkpoint the output in the database. So essentially, by looking up the database and checkpointing your state to it, we're able to guarantee exactly-once execution: at-least-once plus idempotency is exactly-once.
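The check-before-execute pattern Qian describes can be sketched in a few lines. This is an illustration of the idea only, not the DBOS implementation: an in-memory SQLite table stands in for Postgres, and the function and table names are hypothetical.

```python
import json
import sqlite3

# A checkpoint table standing in for Postgres: one row per completed step.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE steps (workflow_id TEXT, step_name TEXT, output TEXT,"
    " PRIMARY KEY (workflow_id, step_name))"
)

def run_step(workflow_id, step_name, fn):
    # If this step already ran, skip it and replay the recorded output.
    row = db.execute(
        "SELECT output FROM steps WHERE workflow_id=? AND step_name=?",
        (workflow_id, step_name),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    result = fn()  # execute the step for the first time
    # Checkpoint the output so a retry or recovery never re-runs the step.
    db.execute(
        "INSERT INTO steps VALUES (?, ?, ?)",
        (workflow_id, step_name, json.dumps(result)),
    )
    db.commit()
    return result

calls = []
def charge_card():
    calls.append("charge")
    return "charged"

# Running the "same" workflow twice charges the card only once:
# the second run finds the checkpoint and reuses the recorded output.
first = run_step("order-42", "charge_card", charge_card)
second = run_step("order-42", "charge_card", charge_card)
```

This is how at-least-once retries plus idempotent step lookup add up to exactly-once effects: the retry happens, but the side effect does not repeat.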
RYAN DONOVAN: Yeah. With that save state, I wonder how much information you have to save to get it to work right. Is it a full memory dump, or are there shortcuts?
JEREMY EDBERG: The beauty of it is that that was the crux of the research, right: what do you actually have to save? It's actually very lightweight. We barely add any overhead; we've clocked it at around a couple percent or less to the database. We're not checkpointing full memory dumps. We're checkpointing just the key inputs and outputs that are necessary to replay, or to know that you've successfully run the function through.
RYAN DONOVAN: You know, you've built off of three years of research, you said, between MIT and Stanford. Obviously, that's a deep, deep well. What are the sort of questions you were answering with that research and what were the sort of false starts or revelations you had?
QIAN LI: Yeah, so the core idea is always the same: how do we leverage database technologies to make your application more reliable, more observable, and more debuggable? So I think that's the same theme. And from the research we basically concluded that what people really want is a better, easier-to-use interface to help them program reliable systems.
RYAN DONOVAN: And this was based on research from Mike Stonebraker, right? We've heard his name come up before on the blog and the podcast. He's sort of the king of database research. Do you use a custom database on the backend or are you using something else?
QIAN LI: It’s just Postgres.
RYAN DONOVAN: It's just Postgres.
JEREMY EDBERG: Straight up, regular Postgres, any Postgres compatible database. We've tested it against a few, but yeah, totally standard vanilla Postgres, no special add-ons or anything.
RYAN DONOVAN: Is that something people can customize?
JEREMY EDBERG: Well, I mean, they can customize their Postgres however they'd like. You know, add in a vector database or whatever it is you wanna do to your Postgres, as long as you continue to support the basic transactions of Postgres. That's all we really need.
RYAN DONOVAN: We've talked on this program about the power of Postgres. It's number one in our developer survey here at Stack Overflow, and some people have said it's the default database for Gen AI. What about Postgres is key to the DBOS backend?
QIAN LI: Yeah, so we basically leverage Postgres transactions to guarantee atomicity, consistency, isolation, and durability. And beyond that, we also leverage Postgres features like LISTEN/NOTIFY to quickly– to improve performance, like if there are changes in the tables, we can quickly get a notification to our application so we know what to read. But in general, DBOS is built on SQL queries to the database, so we don't need to modify the database to use DBOS.
RYAN DONOVAN: Like I said, this is something I've been hearing about recently. We've talked to other companies doing this and it seems like a pretty recent phenomenon. What is the driver for durable execution adoption?
JEREMY EDBERG: We've always had a cost to downtime, right? Now, though, it's getting much more important because of AI, for a few reasons. One, AI is non-deterministic. It's inherently unreliable because you can't even get the same answer twice, and it's unreliable because it's so new; sometimes you just don't get an answer, or it cuts off in the middle. There are lots of things that can go wrong with the AI itself. With AI pipelines, we need to clean a ton of data and get it in there. And with AI code generation, we're now generating code that sometimes the developer doesn't know how to grok. So reliability is even more key. All of these things coming together are making reliability and durability really important. Interesting side note: we recently tested that you can actually one-shot generate reliable code using our open-source library and Claude, with a prompt that we provide.
RYAN DONOVAN: So with the sort of like mid AI failure, is there a different amount of information you have to store for an AI transaction, or is it, you know, you just treat it as an API?
QIAN LI: We treat AI as a very unreliable service (laughs).
RYAN DONOVAN: (laughs)
QIAN LI: I would say many existing apps don't have such save points, or they add them manually, so it's really hard to tell what's going on. What we've tested is that you can actually add DBOS to AI frameworks or AI tools, so that when an AI calls a specific function, say, I want to process this refund, I want to process this payment, you'll be able to trace what's going on. I think observability is important.
JEREMY EDBERG: Yeah, exactly. What's interesting is that a lot of the AI libraries today already include keeping the context and smartly resending such context on retries, so we can leverage that by adding durability to that.
RYAN DONOVAN: Does that mean saving whatever seed? Because if you resend the prompt right, you'll get something slightly different.
JEREMY EDBERG: Right, right. Which is why saving the returns of the functions, the outputs if you will, is really key, right? Because now we know what the AI said last time. We can replay it and know what changed. We can go back and say the first answer was better and replay forward from there. So with workflow management, you can say all the next steps come after that last response, not the new one, or vice versa.
RYAN DONOVAN: What are the, you know, in an ML pipeline, what are the sort of things that can go wrong that you all are there to help?
QIAN LI: It's actually a very important use case for DBOS, and many of our users are using DBOS to build ML pipelines. In a typical pipeline, the first step would be to scrape some data from the internet, like downloading PDFs or screenshotting websites. The second step would be, for example, to send that information, your images or PDFs, to an LLM to generate some vectors or summaries. And the third step is to either do more analysis or persist it into your document store, or even pgvector for vector storage. Any step can potentially fail.
And one of the biggest pain points is that those LLMs can be unstable. They can return failures, and they'll also rate-limit you, because LLMs are expensive; most of the APIs will say, don't call me more than five times per minute or so. So you want to process your documents efficiently, but you also want to stay within the limit, and you have to handle those errors when they happen.
So DBOS makes this really simple. First, to handle errors, you can configure exponential backoff retries for each step. We can automatically retry, say, up to five times, and each time we'll wait a certain period before retrying. Second, DBOS provides a queue primitive, where you can set concurrency limits and rate limits on the queue. This makes life much easier, because you can say, I will enqueue a thousand tasks, but only, say, five outstanding tasks can be processed at a time, and we'll make sure we don't call this API more than five times per minute. Those are the things that make it really simple to build an ML pipeline.
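The retry behavior described here, waiting an exponentially growing interval between attempts against a flaky LLM API, can be sketched as follows. This is a hand-rolled illustration of the pattern, not the DBOS API; the helper name and parameters are made up for the example.

```python
import time

def retry_with_backoff(fn, max_retries=5, base_delay=0.01):
    """Retry fn up to max_retries times, doubling the wait each attempt."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff

attempts = []
def flaky_llm_call():
    # Simulate an LLM API that rate-limits the first two calls.
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("rate limited")
    return "summary"

result = retry_with_backoff(flaky_llm_call)  # succeeds on the third attempt
```

In a durable-execution setting, each successful attempt's output would also be checkpointed, so a crash mid-pipeline resumes from the last completed step instead of restarting from scratch.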
JEREMY EDBERG: And an example from one of our customers is a few of them are doing what we call like fin AI, where they're scraping the internet for financial information and then trading on that information. And when you're doing that, there's a couple places where you don't wanna mess up, right? You want to make sure that you actually scraped all the information you intended to. You want to make sure that your AI actually read all that information today before you make a trading decision. And then when you make a trading decision, you want to make sure that that trade actually executed as you intended. And so in all of these places, they're using durable execution to make sure that they actually scraped everything, that they actually got an AI response, and that they actually executed the transaction.
RYAN DONOVAN: When I talk to other folks about durable execution, it's often part of a larger system. Is there a strength to having a tighter focus on durable execution?
JEREMY EDBERG: We believe that every piece of code should be written durably from the ground up. If that was the case, the internet, in general, would be far more reliable, and that's why we built Transact, right? Transact is all about DX, right? And that's why we give it away for free because we believe so strongly in building durable software from the ground up, that we want everyone to do it, whether they're working with us or not. We want them to at least be building it durably. And so you know, that I think is– that's where we are and that's why we think focusing on durability makes a ton of sense. And beyond that, we're getting this unique data set of inputs and outputs that we can do lots of stuff with. You can use it for your AI. The beauty of it is the way that we do it, it's all in this database. You can SQL query it yourself. You can do whatever you want with that input and output.
RYAN DONOVAN: I wanna dig into something you said there. Do all queries need to be durable? I mean, if you're just retrieving product information that can fail and be retried pretty much willy-nilly, right? Like there's not that much cost to it.
JEREMY EDBERG: It can, but if you do it durably, you don't have to write all that extra code about, oh, okay, I need to– did I get this right…did I need to retry? Right? That's all– you get that for free when you build durably.
RYAN DONOVAN: Okay. So it saves all the emergency backup code.
JEREMY EDBERG: Yeah, yeah. When we talk to people, they'll tell us 80% of their code is retries and what-happens-in-case-of-failure edge cases. Almost all of that goes away when you build durably, because you just say, did this succeed or not? And if not, do it again.
RYAN DONOVAN: So obviously, if you're saving information from– during execution, having save points, can you use that information for things like debugging or observability?
QIAN LI: Exactly. That's a great point about durable execution: because we store the output in the database, and we also store inputs and other status and errors in the database, it makes debugging really simple. As a matter of fact, we developed a time-travel debugger, where, say, you have some bug in production; we have the full trace of what happened, like here's the input, here's the output of every step, and you'll be able to just connect to your database and use our debugger to walk through, step by step, what happened in your workflow. That makes debugging really easy.
JEREMY EDBERG: Yeah. Like you can actually replay things that actually happen, right? So when I was doing reliability forever, one of the biggest issues we had was when something went wrong, you had to hopefully have a trace of it, and if you didn't, you had to set up logging and hope that it happened again. And now you don't need that hope, it's there.
You know, you have that input and output, you can debug it, the real thing that actually happened.
RYAN DONOVAN: It's interesting, I actually wrote an article a while back on time travel and programming languages and saw the sort of debugging use case for that. So that's interesting to sort of get a snapshot into the past through the debugging and through the durable execution.
JEREMY EDBERG: It's honestly one of the favorite tools that we've built of mine.
RYAN DONOVAN: Can people customize when the save points happen and what information goes into them?
QIAN LI: Yeah, definitely. That's one thing we've been working on to make really simple and easy to use for developers. At the core of the durable execution library, we provide decorators, currently in Python and TypeScript. You can just say this function is a workflow, so you decorate it as a workflow, and then within the workflow function you can decorate each step function as a step. That way, we're able to control the boundaries of your steps and workflows. Typically, for example, if you don't decorate a step, then we don't need to snapshot it. So we never need to do a memory dump or record everything in the database, which would add too much overhead.
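The decorator interface Qian describes might look something like the sketch below. The decorator names and the in-memory checkpoint store are illustrative, not the real DBOS decorators or schema; the point is only that marking functions as steps is what determines what gets checkpointed.

```python
import functools

# In-memory stand-in for the Postgres checkpoint table.
checkpoints = {}  # step name -> recorded output

def step(fn):
    """Mark a function as a step: its output is checkpointed and replayed."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if fn.__name__ in checkpoints:
            return checkpoints[fn.__name__]  # already ran: replay the record
        result = fn(*args, **kwargs)
        checkpoints[fn.__name__] = result    # checkpoint the output
        return result
    return wrapper

def workflow(fn):
    """Mark a function as a workflow (a real system would record its status)."""
    return fn

@step
def fetch_data():
    return [1, 2, 3]

@workflow
def pipeline():
    # Undecorated logic like this sum is never snapshotted; only the
    # decorated step's output is, which keeps the overhead small.
    return sum(fetch_data())

first = pipeline()                  # runs fetch_data, checkpoints [1, 2, 3]
checkpoints["fetch_data"] = [10]    # simulate recovery with a recorded output
second = pipeline()                 # replays the checkpoint, skips the step
```

This also shows the trade-off Jeremy mentions next: where you place the `@step` decorators sets both the checkpoint frequency and the granularity of replay.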
JEREMY EDBERG: So it's totally up to the developer to choose their trade off of how often they checkpoint versus the granularity on their replay.
RYAN DONOVAN: Those decorators are in TypeScript. Do you do custom transpiling to JavaScript with those or is that handled automatically?
QIAN LI: So we use the normal TSC compiler and we leverage decorators provided by JavaScript.
JEREMY EDBERG: Decorators are first class already, so.
RYAN DONOVAN: So what are you looking forward to in the future of durable execution? What's exciting about what's to come?
JEREMY EDBERG: What's really exciting to me is the way that it's tying in with AI. By using durable execution, you're creating a very unique data set of inputs and outputs that are important to you as the developer and that can be fed into an AI to do a lot of really interesting things. It could be fed into AI SREs, it could be fed into your AI coding tools to train those tools to do better coding, it could be fed into your security tools. That data can be fed into a lot of different systems to create really interesting, unique insights to you, and you don't have to do anything special to do it because you're already building your software this way.
RYAN DONOVAN: I guess I didn't think about that, where it's like you can give them the progress of your program. Have you seen any interesting use cases that surprised you?
QIAN LI: For AI use cases, broadly, there are two kinds. One is the ML pipeline, as I described earlier, and the other is AI agents, where AI will generate a lot of code and also drive the program, essentially. And I think that's why durable execution should really be integrated into your program, because with AI there won't be any static workflows. You can't just map out a static DAG: step 1, 2, 3. AI will decide, based on the LLM's response, what the next step will be. So with DBOS and with this checkpointing, you'll be able to track exactly what happened. And, actually, we do have customers using this.
RYAN DONOVAN: So it becomes less of just like retry management to sort of like a deeper understanding of your program as a whole.
JEREMY EDBERG: You can do a lot of interesting things with traces, because you can see your program execution, and tracing is something we provide in our commercial product. Those can be really interesting to see: oh, this step ran here, it took this long, this one overlapped with that one. All that stuff. So that's some really interesting insight you can get out of that database, and we provide visualizations for it.
RYAN DONOVAN: Is it full traces you're seeing, or is there sampling involved?
JEREMY EDBERG: Since we are checkpointing everything in the database, they're full traces. So everything is gonna be in there because everything's being checkpointed.
RYAN DONOVAN: That's actually an interesting thing we've talked about with companies before is, you know, you have an open-source software. How do you build the business on top of that open source software?
QIAN LI: So with the open-source software, the benefit is that you can run it anywhere: on your local machine, on your Kubernetes cluster, in the cloud. It's all good. The real pain point is operations. Normally in production, you'll have more than one machine serving your workflows and your requests, and in a large production use case you can have thousands of machines, and any machine can potentially fail. On average, every hour you'll have some machines fail. So the question is how to handle those failures. Of course, if you use Kubernetes, it can automatically restart your pod, but you still need to handle your application logic.
Say you're in the middle of a payment process: how do you recover it? It's always a pain point; people sometimes have to handle it manually. We think the benefit of DBOS as a commercial product is that we provide distributed recovery. You connect your processes to DBOS Conductor, and we'll monitor which machine is down. If one machine goes down, we'll automatically load-balance the pending requests from that machine to other healthy executors.
JEREMY EDBERG: Building a company on top of open source is hard, but what we're doing is leveraging our expertise and providing that to you commercially.
[Outro music]
RYAN DONOVAN: All right. Well, thank you very much everyone. We're at the end of the show. We are going to shout out somebody who came onto Stack Overflow, dropped a little knowledge, helped out the community. Today we're shouting out the winner of a lifeboat badge. Somebody who found a question that had a score of -3 or less, dropped an answer that got a score of 20 or more. Today we are shouting out Vanita L. for answering, “What does the Swift ‘mutating’ keyword mean?” If you're curious, we'll put it in the show notes. I am Ryan Donovan. I host the podcast, edit the blog here at Stack Overflow. If you wanna reach out to us with comments, suggestions, complaints, you can email us at podcast@stackoverflow.com and if you wanna reach out to me directly, you can find me on LinkedIn.
JEREMY EDBERG: I'm Jeremy Edberg. I'm the CEO of DBOS. You can find me as @Jedberg pretty much anywhere on the internet, particularly on Reddit, Hacker News, and Stack Overflow, as it turns out. And of course, please check us out at DBOS.dev. We do offer a lot of extra stuff on top of the open-source library, but honestly, we truly believe that building software durably is so important that we want everybody to use Transact. That's the main thing.
QIAN LI: I'm Qian Li, co-founder of DBOS. You can find me on my homepage, QianLi.dev, and please check out DBOS.dev as well.
RYAN DONOVAN: All right, thank you very much everyone, and we'll talk to you next time.
[Outro music]