The Stack Overflow Podcast

An open-source development paradigm

Episode Summary

Temporal is an open-source project focused on durable execution and workflow orchestration. Cofounder and CTO Maxim Fateev tells Ben and Ryan about the challenges of building a cloud service based on an open-source project and how Temporal is helping teams simplify their code and build more features more quickly.

Episode Notes

Temporal is an open-source implementation of durable execution, a development paradigm that preserves complete application state so that, upon host or software failure, it can seamlessly migrate execution to another machine. Learn how it works or dive into the docs.

Temporal’s SaaS offering is Temporal Cloud.

Replay is a three-day conference focused on durable execution. Replay 2024 is September 18-20 in Seattle, Washington, USA. Get your early bird tickets or submit a talk proposal!

Connect with Maxim on LinkedIn.

User Honda hoda earned a Famous Question badge for SQLSTATE[01000]: Warning: 1265 Data truncated for column.

Episode Transcription

[intro music plays]

Ryan Donovan Monday Dev helps R&D teams manage every aspect of their software development lifecycle on a single platform– sprints, bugs, product roadmaps, you name it. It integrates with Jira, GitHub, GitLab, and Slack. Speed up your product delivery today, see it for yourself at monday.com/stackoverflow.

Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I am Ben Popper, Director of Content here at Stack Overflow, joined as I often am by my colleague and compatriot, Ryan Donovan, Editor of our blog. Ryan, we have done a bunch with this company, Temporal. It started out as a blog that was pitched to us. I think that was almost pre-launch for the company or right as they were getting started, right? 

RD Well, one of our regular writers back then got hired by Temporal. 

BP Ah, right. We did that, and then we did a podcast with them that was about microservices and figuring out what is the trade-off between monolith and microservices as the pendulum is swinging one way or the other. And so today we are lucky to have Maxim Fateev, the CTO at Temporal, back on the show. Maxim, welcome to the Stack Overflow Podcast. 

Maxim Fateev Thanks for having me. We certainly made a lot of progress in the last 12-18 months. We've got around 1,200 customers since our launch 18 months ago, and around 120 of those customers are what we call ‘strategic customers,’ which are large enterprises.

BP So before we dive in a little bit more, for folks who haven't listened to the other episode, can you just give them a little background on yourself, how you got into the world of software and development, and how you ended up focusing in on this particular area that you're working on now as CTO at Temporal?

MF So I've been working in large companies all my life. I worked eight and a half years at Amazon, I was also at Google and Microsoft, and before starting the company I spent four years at Uber. I was an individual contributor all my life– I was a principal engineer at Amazon and then staff and senior staff engineer at Uber. And I worked on practically asynchronous Pub/Sub systems and orchestration and workflow engines all my life. At Amazon, I was tech lead for the Pub/Sub platform which ran the whole of Amazon for a long time. Later that solution was adopted as the backend for SQS, the Simple Queue Service. And then I was tech lead for Amazon Simple Workflow Service, where most of the high-level ideas behind Temporal were created. And then at Uber, again as individual contributors, we created this project called Cadence, and within three years it got pretty popular within Uber, and because it was an open source project, it became popular outside of Uber as well. We started the company four and a half years ago. Since then, it's been a pretty smooth ride, because it's easy to start a startup when you have product-market fit before you start your company. 

RD Like you said, it's pretty amazing that you got such pickup in such a short time. Why do you think you had the product market fit so well ahead of time? 

MF Because we are solving a real problem, and it's interesting because every engineer or senior engineer who has had to build any backend application knows this problem exists. The problem is that you need to practically manage state across multiple request-replies. Practically anything you do of any real value is a multistep process, and this process can have complex state management. And there are so many ways to do that, but until now, until we invented this kind of new paradigm, it was practically impossible to do in a generic way. We introduced this new paradigm which we call ‘durable execution,’ and the idea is extremely simple. The idea is that you write code, and there is a runtime which preserves the full state of your code execution, all the time. By full state I mean every blocking call, every thread stack, all the variables. And then if a process crashes, we reconstruct the same program execution on a different machine in exactly the same state. So from the engineer's point of view, the process didn't crash. The process keeps running as if nothing happened. And that is very powerful, because you can have a function which runs for a year, for example. So if you do something like a subscription, you can write a loop and say, “Sleep 30 days, charge, send email,” and run it 12 times for a year. And that will be one function execution. All state, all variables will always be durable– that's why it's called durable execution. And all failure conditions will be taken care of by the runtime. This abstraction allows you to push a lot of complexity into the infrastructure layer, because right now, if you think about it, we have a lot of infrastructure, but all of it is very, very leaky. You're like, “Oh, I need to do request-reply.” Fine, you do request-reply. But now I need to guarantee that my request completes. Now you need to forget about request-reply and start doing async applications. You need to do event-driven systems, you need to do all sorts of other things, databases and so on. And here we say, “No, if you want to do request-reply, it doesn't matter if it takes 10 milliseconds or 10 months. It's the same request-reply. You're blocked on the same API call and all your state is preserved.” Super powerful, and it allows you to practically just focus on your business logic and push all the complexity of a large-scale distributed event-driven system into the infrastructure level.
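
To make the subscription example concrete, here is a minimal sketch of what that loop might look like with Temporal's Java SDK (the language Maxim mentions later in the episode). The BillingActivities interface and its method names are hypothetical, invented purely for illustration; the durable timer and the activity stub are the real SDK primitives.

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

// Hypothetical activity interface; the method names are assumptions for illustration.
@ActivityInterface
interface BillingActivities {
  void charge(String customerId);
  void sendEmail(String customerId);
}

@WorkflowInterface
interface SubscriptionWorkflow {
  @WorkflowMethod
  void run(String customerId);
}

class SubscriptionWorkflowImpl implements SubscriptionWorkflow {
  private final BillingActivities billing =
      Workflow.newActivityStub(
          BillingActivities.class,
          ActivityOptions.newBuilder()
              .setStartToCloseTimeout(Duration.ofMinutes(5))
              .build());

  @Override
  public void run(String customerId) {
    // A single function execution that spans roughly a year: the runtime persists the
    // loop counter and the position inside the sleep, so a host crash simply resumes here.
    for (int month = 0; month < 12; month++) {
      Workflow.sleep(Duration.ofDays(30)); // durable timer, survives process restarts
      billing.charge(customerId);
      billing.sendEmail(customerId);
    }
  }
}
```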

BP And so it's interesting that you mentioned this was a problem you encountered and worked on within a company, created a solution that was open source within that company, and then stepped outside to sort of replicate this product market fit. Was there an issue of IP there? How did you go about recreating the same solution that you had done internally, externally? 

MF We didn't recreate it, we actually forked it. It was an open source project from the beginning under MIT license, which practically allows you to do these things. 

BP Gotcha. 

MF We forked the project. So the code base which we actually run with now is not 4.5 years old. It is over seven years old, almost eight years old because we started that at Uber. 

BP This is a strategy that more engineers need to employ. Build your startup inside of another company and convince them to open source it, and then once you've realized how well it's working, then leave and continue working on it. That's pretty brilliant. 

RD I think Uber was actually pretty good about that. We've talked to other companies that were built off of Uber open source projects. I think Chronosphere is one of them. 

MF Uber might be a different company now, but back then Uber was extremely friendly to open source development. But at the same time, I don't think Uber lost anything, because first, I would never have joined Uber unless they promised I could work on an open source project. I had a very good offer at Amazon, and realistically I lost a lot of money joining Uber versus Amazon, because Amazon stock grew multiple times while I was at Uber. But I joined Uber because they let me work on open source projects. And at the same time, my personal opinion is that any infrastructure-level complex project, if it's company-specific, will die one day. There is no way around it. I've witnessed that at Amazon– so many cool projects which were 10 years ahead of practically the whole industry, but they never got open sourced, and they ended up being deprecated 10-15 years later because open source analogs appeared. The Pub/Sub system we created at Amazon almost 20 years ago was super powerful. It was replicated storage, and Kafka wasn't even conceived yet. We absolutely could have taken over the market if we had open sourced that and created a public project, but it's still inside of Amazon. I don't know if it's still used, but my point is that any infrastructure software should be open source. And if you are building infrastructure inside of your company– especially things like databases, queuing systems, workflow engines, state management, anything like that– and you're building it custom, you're doing it for the fun of your developers. As a long-term strategy for your company, it's a losing proposition, because at some point this team will build it and leave, and the company will have to deal with that legacy. The only real path is open sourcing it. 

RD I think when we spoke last time, it was just the open source project. What's the cloud version? Are you hosting all of the state management?

MF So the way Temporal works in general– how do you use it? It's a library, and we support six official languages, and we're planning to probably add Ruby later. You just link that library to your application. So if it's Java, it's a normal Java dependency– you use either Maven or Gradle to include the dependency, then you compile your application and you run your application. You either use the open source cluster or you use the cloud; we never run your code. Your code runs inside of your infrastructure. And then there is a backend cluster service which keeps the state and does a lot of other things. Practically, it's a large-scale asynchronous event-driven system with durable timers, with flow control, with state management and event sourcing, but it's all hidden from you. All you need is a connection to the backend cluster, which again, you can run yourself on top of an existing database– which we call ‘self-hosted’– or you can just connect your code to the cloud, and you don't need to change a line of your code besides the connection string. You just outsource the management of the backend cluster to us. And it's a consumption-based service, so it means that you don't pay for capacity, you don't pay for a cluster. It's serverless. You practically just pay for what you use– you pay as you go for the number of actions you execute on our platform. That's why it's usually cheaper to run on our cloud than self-hosted.
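
As a rough sketch of that wiring in Java, assuming the io.temporal:temporal-sdk dependency is already on the classpath via Maven or Gradle: a worker process links the library, registers workflow and activity implementations (here, the hypothetical subscription example sketched above), and polls a task queue. The task queue name and the BillingActivitiesImpl class are placeholders; pointing at Temporal Cloud instead of a local self-hosted cluster is essentially a different connection configuration (endpoint, namespace, mTLS).

```java
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;

public class WorkerApp {
  public static void main(String[] args) {
    // Connect to a locally running self-hosted cluster; swapping in Temporal Cloud
    // is mostly a different target endpoint, namespace, and mTLS configuration.
    WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
    WorkflowClient client = WorkflowClient.newInstance(service);

    // The worker runs your code inside your own infrastructure and polls a task queue.
    WorkerFactory factory = WorkerFactory.newInstance(client);
    Worker worker = factory.newWorker("subscription-task-queue"); // placeholder queue name
    worker.registerWorkflowImplementationTypes(SubscriptionWorkflowImpl.class);
    worker.registerActivitiesImplementations(new BillingActivitiesImpl()); // hypothetical impl
    factory.start();
  }
}
```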

RD I'm sure you've thought of this, but I imagine there are some organizations that may have been a little wary about hosting the full application execution log on an external server. What kind of security measures do you have to make everybody comfortable with that?

MF It's interesting. It probably wasn't very thought through at the beginning, but it ended up as what we have now. It is what it is. Temporal doesn't need to look into the data because it's passthrough– logically, it's more like a queue for your user data. What that means is that your application code runs inside of your cluster, inside of your VPC, and everything it sends to the Temporal backend can be encrypted using your keys and your encryption algorithm. So even if the Temporal backend were hacked, nobody could really see anything, because everything is encrypted using client-side encryption. And I feel like there are three parts there. The first one is that we don't run your code– you run your code. Second, everything is encrypted by you. And the third one is that Temporal SDKs only connect out to the Temporal backend server; there is no connection back, so you don't need to open any holes in your VPC. All you need, in all practical situations for large corporations, is a single outgoing connection to the PrivateLink endpoint of the Temporal service. It's amazing, but we pass security reviews at very, very large companies very well, because again, we are not a database. We don’t need to actually look into your data or understand your data. 
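
A rough sketch of the client-side encryption idea, assuming the Java SDK's PayloadCodec extension point: payloads are encrypted with your own keys before they leave your VPC and decrypted on the way back, so the backend only ever sees ciphertext. The encryptWithCustomerKey and decryptWithCustomerKey helpers are placeholders standing in for your own KMS or crypto code, not real SDK methods.

```java
import io.temporal.api.common.v1.Payload;
import io.temporal.payload.codec.PayloadCodec;
import java.util.List;
import java.util.stream.Collectors;

// Sketch only: encrypts outgoing payloads and decrypts incoming ones with keys you control.
public class CustomerKeyCodec implements PayloadCodec {
  @Override
  public List<Payload> encode(List<Payload> payloads) {
    return payloads.stream().map(this::encryptWithCustomerKey).collect(Collectors.toList());
  }

  @Override
  public List<Payload> decode(List<Payload> payloads) {
    return payloads.stream().map(this::decryptWithCustomerKey).collect(Collectors.toList());
  }

  // Placeholders for your own key management and cipher of choice.
  private Payload encryptWithCustomerKey(Payload p) { /* your keys, your algorithm */ return p; }
  private Payload decryptWithCustomerKey(Payload p) { return p; }
}
```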

BP The last few times we had conversations, it was focused a lot on this idea of distributed systems working with companies that increasingly have a lot of microservices. I know one of the things that was brought up when we were discussing this episode was that now you're trying to improve the reliability of AI applications as well. So what's going on in this new market and how is Temporal adjusting to that? 

MF So think about it. Any AI application has two parts. There's the part of creating your AI model– whatever it is, a lot of data massaging and training and execution on large clusters– and you need a control plane for that. And the nice thing about Temporal is that it's probably the best technology to create control planes for any backend. HashiCorp, for example, built their cloud around Temporal and durable execution, because you need to provision resources reliably. And there are a lot of AI companies– we probably have over a hundred companies with ‘.ai’ domains as customers already. I'm pretty sure not all of them do real AI, but you can see the trend. The reality is that for all these pipelines, all this training, the end-to-end lifecycle, Temporal is very good at managing the lifecycle of entities over a long time. You can think about it as durable actors. So this kind of AI backend is a very common scenario for us. Another interesting emerging area is AI agents. Because think about it– what is an AI agent on the surface? It's practically that the AI tells you what to do and what to query, which APIs to call. It calls a bunch of APIs, gets information, and then tells you what to do next. It's kind of a very smart rule engine: “Okay, here's my state. Give me actions to run. Given these actions, this is my new state. Give me the next actions to run.” It's all good, but how do you execute this loop reliably in the presence of failures, and when the interaction can be long-running? Because, yes, a lot of AI agents are just, “Okay, two seconds, call five things,” but what if you want to do a real thing which takes time, like, “Send email, wait for reply, do something”? These interactions are very well modeled with Temporal, because durable execution allows you to write these interaction flows very, very easily and guarantees high scale and very high reliability.
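
Here is a hedged sketch of that loop as a durable workflow in Java. Every type and method name here (AgentActivities, planNextActions, executeAction) is hypothetical, purely to illustrate the "state in, actions out" cycle Maxim describes; the point is that each model call and tool call is an activity whose result is recorded, so the loop survives crashes and can run for as long as the interaction takes.

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;
import java.util.List;

// Hypothetical activities: one asks the model what to do next, one executes a tool/API call.
@ActivityInterface
interface AgentActivities {
  List<String> planNextActions(String state);
  String executeAction(String action);
}

@WorkflowInterface
interface AgentWorkflow {
  @WorkflowMethod
  String run(String goal);
}

class AgentWorkflowImpl implements AgentWorkflow {
  private final AgentActivities agent =
      Workflow.newActivityStub(
          AgentActivities.class,
          ActivityOptions.newBuilder().setStartToCloseTimeout(Duration.ofMinutes(2)).build());

  @Override
  public String run(String goal) {
    String state = goal;
    // The loop may finish in seconds or run for months (e.g. "send email, wait, follow up").
    // Completed model and tool calls are recorded, so a crash resumes rather than repeats.
    while (true) {
      List<String> actions = agent.planNextActions(state);
      if (actions.isEmpty()) {
        return state; // the model decided the task is done
      }
      StringBuilder observations = new StringBuilder();
      for (String action : actions) {
        observations.append(agent.executeAction(action)).append('\n');
      }
      state = observations.toString();
    }
  }
}
```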

RD I wonder, with that, how do you maintain and replicate that state with AI, when AI is notoriously nondeterministic? With an agentic workflow, if it gets interrupted in the middle, do they have to go through the AI calls over again, or is there some sort of storage of the response?

MF Okay, there are a couple of things. Durable execution by itself, if a call completed and is recorded, will never repeat it. It doesn't matter if it's AI or whatever. So the whole idea is that if you need to execute a sequence of actions– A, B, C, D– and the process crashes while you're waiting on C, A and B will not be re-executed and the C request will not be re-issued. You will just be waiting for the C result again. At the same time, Temporal is not a big data technology. We don't expect people to pump gigabytes of data through Temporal itself. It is a control plane. So it is very common that practically any large application ends up using some database and Temporal together. We are not competing with databases– we are replacing a lot of scenarios where a database is abused. You don't want to use a database for application state, your transactional state, but the state of your model, the state of your tokens, the state of the interaction, you absolutely can store in a database, and Temporal will be able to help you execute over that. Most state will probably be in a database, but Temporal will still help you orchestrate it. 
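
To make that replay behavior concrete, here is a tiny sketch of the A-B-C-D sequence as a workflow, again using the Java SDK; the StepActivities interface and step names are hypothetical. On replay after a crash, the results of A and B come straight from the recorded history, and the pending C call is simply awaited again.

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

// Hypothetical step activities named after Maxim's A, B, C, D example.
@ActivityInterface
interface StepActivities {
  String stepA();
  String stepB(String a);
  String stepC(String b);
  String stepD(String c);
}

@WorkflowInterface
interface PipelineWorkflow {
  @WorkflowMethod
  String run();
}

class PipelineWorkflowImpl implements PipelineWorkflow {
  private final StepActivities steps =
      Workflow.newActivityStub(
          StepActivities.class,
          ActivityOptions.newBuilder().setStartToCloseTimeout(Duration.ofHours(1)).build());

  @Override
  public String run() {
    String a = steps.stepA();
    String b = steps.stepB(a);
    // If the worker crashes while waiting on C, replay pulls A's and B's results from
    // history and resumes waiting for C; nothing completed is executed twice.
    String c = steps.stepC(b);
    return steps.stepD(c);
  }
}
```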

BP In the pitch, one of the things I noticed was that Snap was an adopter of your technology. They have 414 million daily active users, and when they have something big like the Super Bowl or New Year's Eve, I'm sure they have quite a few concurrent users. Can you talk to us about what it's like to support millions or tens of millions or hundreds of millions of concurrent users for a service like that? 

MF Now it's certainly much easier, but Snap was our first very large customer, and it certainly took us some time to be able to accommodate their scale. That's the main differentiator of our cloud– because obviously we are MIT licensed, and we are not planning to change our license to BSL or whatever. The main reason we believe we can do that is because for our cloud, we practically built our own database engine, which is optimized for practically just one database call that works with Temporal very efficiently. It would be insane to create a general-purpose database engine, but we created this database engine which is extremely scalable because it works for a single, very specific practical API call. That's why we can run at probably 100x larger scale than you would be able to run the open source right now, with much better latency and performance characteristics and more reliability. And that's why we're able to run these workloads right now without much strain– it's mostly just about capacity management at this point. There are not really a lot of infrastructure changes for that. But yes, we certainly need to plan for some of those events, because sometimes New Year's Eve or the Super Bowl can be interesting, and sometimes unexpected things happen. So we always have spare capacity and are able to accommodate those. We are a multi-tenant service, so we can pack our customers into what we call cells, and it allows us to manage capacity very efficiently and provide reasonable, competitive prices.

RD Going from an open source product to a cloud-hosted service, what was the biggest technical challenge y'all faced there? 

MF Well, multiple things. Obviously just the infrastructure itself, because my co-founder and I were always focused on the actual software, and we always ran this inside of a large organization like Amazon or Uber or Microsoft, where there is always existing infrastructure which is very specific to the company. So we had to hire a pretty strong team of infrastructure engineers to help us run it on AWS– and now we are adding GCP and will add other clouds in the future. So that was the first part, just how we mapped it to the actual AWS infrastructure. And then creating a control plane. The good news is that we have the best tool to create control planes, which is called Temporal. If you think about it, what is our control plane? It's just a bunch of these durable execution processes– everything you do is durable and guaranteed to execute, because it is just a bunch of what we call ‘workflows.’ That made our life much simpler. And then certainly a lot of the usual things about quotas, about provisioning, and so on. It's just a lot of things. One thing we realized is that to have a real cloud service, first, it should be fully automated. You cannot have manual operations there. We run across 15 AWS regions in a reliable manner, across thousands of customers– without automation, it doesn't work. And then the other part is that just the table stakes are a lot of stuff: “Oh, we need billing. We need metering. We need this, we need that. We need security, and we have like 15 ways to integrate with every company's security and IAM.” Just the sheer size of that was hard, and we're still working on a lot of it. Now our cloud offering is pretty solid, but we're just adding more and more features. 

BP Maxim, let me ask, is there anything in particular that you feel that you want to discuss that we missed or anything you want to hit on before we end?

MF I just want to reemphasize again, just look up what durable execution is. Just learn it. It's a new paradigm. Most people have never heard about that. So every time you say, “Oh, I need to build an event driven system,” or, “I need to use a workflow engine,” or, “I need to do whatever state management,” just learn about that. You don't have to use it, but just learn about that. An awesome way to learn about that is we have a conference in September. I think it's the 18th to 20th. Just join. It's called Replay, so search for Temporal Replay. And if you just look at the speakers we had last year and we will have this year, you'll be amazed. And I think even if you don't know anything about the technology, it will be super exciting to join. And by the way, if you're using Temporal, we still have a call for proposals open until the end of May, so you can still present as well.

[music plays]

BP All right, everybody. It is that time of the show. I want to shout out someone who came on Stack Overflow and helped to share a little knowledge or spread their curiosity. This is a question, but you wouldn't know it from the way it's phrased: “SQLSTATE[01000]: Warning: 1265 Data truncated for column.” That's an error message, and somebody has kindly come along and helped them solve it, and 17,000 other people had that same error message, it seems. If the mods are listening: please rephrase that as a question. As always, I am Ben Popper. You can find me on X @BenPopper. If you've been listening recently, you may have heard that we've had some listeners on the show as guests to discuss great topics, so email us at podcast@stackoverflow.com with suggestions for a topic, or if you want to come on– we'd love to hear from you. If you enjoyed today's episode, the nicest thing you can do for us is to leave a rating and a review, because it really helps the podcast. 

RD My name is Ryan Donovan. I edit the blog here at Stack Overflow. You can find it at stackoverflow.blog. And if you want to reach out to me with your latest technical challenges, gripes, whatever, you can find me on X @RThorDonovan. 

MF I'm Maxim Fateev, CTO of Temporal. Our website is temporal.io, and you can reach me through our Slack– just go to our website, join the public Slack of our open source project, and I'm there all the time.

BP All right. Welcoming contributions or chats in the Slack. Everybody, thanks for listening, and we will talk to you soon.

[outro music plays]