The Stack Overflow Podcast

Run your microservices in no-fail mode

Episode Summary

The home team sits down with Maxim Fateev, CEO and cofounder of Temporal Technologies, and Dominik Tornow, Principal Engineer at Temporal, to talk all things microservices. What are the tradeoffs in moving from monolith to microservices, and is the pendulum swinging back toward bigger, less complex architectures? Plus, why is state so hard to nail down?

Episode Notes

Temporal Technologies is a scalable open-source platform for developers to build and run reliable cloud applications.

ICYMI, here’s a post we wrote with Ryland Goldstein, Head of Product at Temporal, discussing how software engineering has shifted from a monolithic to a microservices model—thereby introducing a whole new set of challenges for software engineers.

Maxim, who grew up in Russia, is renowned in the microservices world. He spent decades architecting mission-critical systems at MSFT, Amazon, and Uber, where he designed Cadence and spun it out into Temporal. Netflix, Descript, Instacart, Datadog, Snap, and plenty more are all betting their critical systems on Temporal’s OSS technology, so Maxim has a dedicated following in the dev community.

Dominik’s father is a nuclear physicist, so Dominik had early access to computers growing up in Germany. His professional path led him from SAP in Germany to SAP in Palo Alto, then to Cisco, and finally to Temporal.

Replay, Temporal’s inaugural developer experience conference, is happening IRL from August 25-26, 2022 in Seattle. Check it out!

Connect with Maxim on LinkedIn or Twitter.

Connect with Dominik on LinkedIn, Twitter, or Medium.

Today’s Lifeboat badge goes to user Thanos for their answer to How to wrap text without regard to space and hyphen. (This makes up for the Snap, right?)

Episode Transcription

Dominik Tornow So the developer experience you get is, you get to write a function or a method, depending on the language you're writing in, and you can write this function as if failure isn't even a possibility, so the function code is failure-oblivious. And then, Temporal will execute that function code as if failure on an application level does not even exist. So again, the Temporal workflow also mitigates adverse effects on a platform level and makes it entirely invisible on the application level.

[intro music plays]

Ben Popper The UiPath 2022.4 release brings automation for all. Learn new skills and focus on critical thinking for value added work. Welcome Robots on Mac, semantic automation through clipboard AI, and a new attended framework. You can get started for free with UiPath Automation Cloud at account.uipath.com.

BP Hello, everybody. Welcome back to the Stack Overflow Podcast. I am your host, Ben Popper, joined as I often am by my wonderful colleagues and collaborators, Ryan Donovan and Cassidy Williams. How's it going, y'all?

Cassidy Williams Hello!

Ryan Donovan Oh, pretty good. How are you doing?

BP Yeah, pretty good. So we have some folks joining us today from Temporal. We've also had some colleagues of theirs, I believe, work for us on the blog in the past. Ryan, I sent this email around. You said it seemed like interesting technology. What piqued your curiosity here?

RD I always like systems and this seems like it's a really interesting service backend. It's a little deep in the backend to where I'm not sure I understand it completely, but I'm excited to hear more.

BP Gotcha. All right. Well, let's get into the weeds then. I'd like to bring on our guests, Maxim and Dominik. Hello to you both.

DT Hello!

Maxim Fateev Thanks for having us.

BP Hi! So Maxim, let's start with you. Can you tell us a little bit about your journey into the world of software and some of what you did that brought you to working at the company you do now?

MF My journey started probably when I was in my high school back in Russia. And you didn't have home computers back then, but my school was an advanced physics and mathematics school. We did practice in one of the research facilities. We had this IBM 360 mainframe clone by Soviets but they used the IBM 360 operating system. So that was my first experience programming. After that, I went to Moscow State University in the physics department. I did nonlinear optics. And then I ended up in Brazil in '95. I studied computer science there, so my computer science degree is from Brazil, Rio de Janeiro. And then I came to the US and worked for a startup and ended up at Amazon in 2002, and worked for Amazon for eight and a half years and witnessed Amazon kind of grow from practically a monolith, having a single monolithic application. You could compile the whole website and make Amazon on your desktop back then. And then it became kind of broken into multiple thousands of microservices. That was quite a journey. And I ended up in infrastructure well before AWS existed, and I was tech lead for the kind of pub/sub part of the ecosystem. So I was tech lead for the distributed storage for the pub/sub and later it was adopted by the AWS Simple Queue Service. And then later we realized that using the right abstraction to link services together, and build this kind of complex backend system, and we started the Amazon Simple Workflow Service as a part of the [unclear audio]. And I was tech lead for the public release of AWS Simple Workflow. And then I ended up leaving Amazon and joining Google. I worked for years for Google and then later joined Uber because Uber opened an office in Seattle. And at Uber, I worked on this technology which we are talking about today.

BP Dominik, let's do a quick flyover for you and then maybe I'll leave some room for Ryan and Cassidy to ask some questions based on the wealth of experience you've had.

DT So I grew up in Germany and my journey also started in my teens. My dad is a nuclear physicist and had access to computers early on. So then I started dabbling in Pascal and Basic, eventually Visual Basic. Then that motivated me to study software engineering. I joined the Hasso Plattner Institute in Brandenburg, Germany, and Hasso Plattner was the founder of SAP. So then I joined SAP and shortly after I relocated from SAP in Germany to SAP in Palo Alto, California here in the United States. And I was with SAP for about 10 years. And you could argue I basically professionally grew up together with SAP and the cloud. I never touched SAP's core systems though, so I know nothing about R/3 or ABAP. And then after about 10 years at SAP I joined Cisco for two years. When I joined Cisco I didn't know anything about networks, and when I left Cisco I still didn't know anything about networks. So if anybody has some enlightening podcast about that, I'm all ears. I imagined myself staying with Cisco for many years to come, but then I came across Temporal and was just absolutely fascinated by the technology. Temporal reached out to me so it was like, "Okay, that's fate." And yeah, after that brief getting to know each other, then I joined Temporal.

CW Dang, you both have worked at such large, big places so it makes sense that you have a lot of ideas of architecture and how it probably should be done.

RD Yeah, we had a post talking about some of your technology. One of our writers joined your team, Ryland Goldstein, and he wrote this post: "The Macro Problem with Microservices." So can you start talking about what are the issues you run into when you start having a big microservices architecture?

MF I think the main problem with microservices is that while they solve a set of problems associated with monolithic applications, they introduce a different set of problems. A lot of effort was put in the last 20 years into fixing some of those problems, mostly around deployment and operations. So this way we've got virtual machines, we've got Docker, we've got Kubernetes, but they didn't solve the core problem which is that now you don't have a single database behind the application, and every service kind of owns its own data and there is no transactionality between them and you need to stitch together all these little pieces and every application becomes kind of like this construct where you have to kind of take all these pieces and put them together. And developers actually have a worse experience right now with microservices because they need to deal with all this complexity. And that is exactly where Temporal comes in. The other part is that developers right now have become practically distributed systems developers. It's a very different skill set and it's complicated and they practically need to deal with all this complexity with distributed systems. And Temporal eliminates a huge part of that complexity. It allows you to focus on your business logic, your code. It's a product for developers, they write code. We just make your code fault-tolerant and robust and take care of a lot of failure states which you would need to do otherwise manually. That is kind of the main value proposition.

RD I remember my one experience working with a microservices architecture. There were so many ways that things could fail because all of these services were talking to each other, changing states of a single transaction going through the system. And a lot of tooling grew up around that to just find out what happened in some failure states. So how does Temporal actually reduce that hassle of failures and finding failure states?

DT So, I like to draw from the analogy of database systems. If you look at a database system, for the last let's say 45 years, database developers are able to enjoy a fantastic developer experience, because a database developer can literally write code as if a failure doesn't even exist. So for example, adverse effects like crash failures or concurrent access are entirely shielded from the developer. So the database systems do that by exposing a core abstraction, right? It's a core abstraction of database transactions. And then what is a database transaction? Well, a database transaction is a sequence of steps. And a step is either a read or a write. But it's not just any sequence of steps, it's what we call a failure-oblivious or failure-agnostic sequence of steps. And it is failure-oblivious on two different dimensions. It is failure-oblivious in its definition. That is in the code. For example, SQL statements, they don't talk about failure. You do not see special handling instructions for a crash failure or concurrent access. And they are also failure-oblivious in their execution. So a database transaction either executes observably equivalent to exactly once or observably equivalent to not at all. So every adverse effect is handled on a platform level. It is entirely invisible on an application level. And with Temporal, you can argue that a Temporal workflow is to distributed systems what a transaction is to a database. So it mitigates adverse effects on a platform level, making it entirely invisible on an application level. And similar to a database transaction, a Temporal workflow is a sequence of steps, and also that sequence of steps is failure-oblivious or failure-agnostic. So the developer experience you get is, you get to write a function, or a method depending on the language you're writing in, and you can write this function as if failure isn't even a possibility. 
So the function code is failure-oblivious, and then Temporal will execute that function code as if failure on an application level does not even exist. So again, the Temporal workflow also mitigates adverse effects on a platform level and makes it entirely invisible on the application level.
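The database half of that analogy can be tried directly. The following is a minimal sketch using Python's built-in sqlite3 module; the `accounts` table, the `transfer` helper, and the `crash` flag are hypothetical names made up for illustration, and the "crash" is only a raised exception rather than a real process failure. The SQL statements say nothing about failure, yet the interrupted transfer is observably equivalent to not having run at all.

```python
import sqlite3


def transfer(conn, src, dst, amount, crash=False):
    """Move `amount` between two accounts inside one transaction."""
    # `with conn` commits if the block succeeds and rolls back if an
    # exception escapes, so the two UPDATEs take effect together or not at all.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        if crash:
            # Assumption: an exception stands in for a crash between the writes.
            raise RuntimeError("simulated crash between the two writes")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

try:
    transfer(conn, "alice", "bob", 30, crash=True)
except RuntimeError:
    pass
after_crash = balances(conn)    # the half-finished transfer left no trace

transfer(conn, "alice", "bob", 30)
after_success = balances(conn)  # the completed transfer applied both writes
```

The caller never writes compensation logic for the half-finished transfer; the rollback happens at the platform level, which is the property the workflow analogy borrows.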

MF Just to make it a little bit more concrete, you just can write code, which is something like sleep 30 days, send email, and in a loop. And then you get back your subscription workflow, because this code cannot fail and we take care of that code execution in the presence of all failure scenarios, and it's not linked to a specific process, you can code things like sleep, and all your state, including local variables, everything is preserved. Imagine doing that without such a system. Like you can have it do the loop, just write it in Java [unclear audio] and sleep for 30 days and send email. And send email also can have an associated retry policy and everything so it means that if those systems are down for 10 hours, it still will execute that function and the actual code will be blocked for 10 hours on that call. So that is kind of a concrete example. And you can have millions of those because you can have clients with millions of customers so you can have millions of those running in parallel executing these functions.
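The shape of that subscription loop can be sketched in plain Python. This is not Temporal's SDK: `RetryPolicy`, `FakeClock`, `call_with_retry`, and `flaky_send_email` are hypothetical names invented for the example, the clock is simulated so the 30-day sleeps finish instantly, and nothing here survives a real process crash. It only shows how the business logic stays a simple loop while retries are handled outside it.

```python
from dataclasses import dataclass


@dataclass
class RetryPolicy:
    """Hypothetical retry settings: attempts and exponential backoff."""
    max_attempts: int = 5
    initial_interval_s: float = 1.0
    backoff: float = 2.0


class FakeClock:
    """Stands in for a durable timer so the example finishes instantly."""

    def __init__(self):
        self.now_s = 0.0

    def sleep(self, seconds):
        self.now_s += seconds


def call_with_retry(fn, policy, clock):
    """Call fn() until it succeeds or the policy runs out of attempts."""
    delay = policy.initial_interval_s
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == policy.max_attempts:
                raise
            clock.sleep(delay)       # wait before the next attempt
            delay *= policy.backoff  # back off exponentially


def subscription(customer, months, send_email, clock, policy):
    """The business logic: wait 30 days, send an email, repeat."""
    for month in range(1, months + 1):
        clock.sleep(30 * 24 * 3600)
        call_with_retry(lambda: send_email(customer, month), policy, clock)


# Hypothetical email service that is down for its first two calls.
sent = []
outage = {"remaining_failures": 2}


def flaky_send_email(customer, month):
    if outage["remaining_failures"] > 0:
        outage["remaining_failures"] -= 1
        raise ConnectionError("email service unavailable")
    sent.append((customer, month))


clock = FakeClock()
subscription("customer-1", 3, flaky_send_email, clock, RetryPolicy())
```

The `subscription` function itself contains no error handling. What a durable execution platform adds on top of a sketch like this is that the loop, its timers, and its local variables outlive the process that started them.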

BP I want to dive a little bit into some of the experiences at big versus small companies, but Cassidy, do you have any questions or things you want to opine on the technical side of that?

CW I think it's a very interesting way to strip out a lot of unnecessary things by thinking in this model. And I'm more reflecting on that than anything. It's an interesting approach.

DT If you don't mind me adding one more. So my most crisp mental model of what a Temporal workflow is is a function but with additional execution guarantees. The execution guarantee is that your function will execute observably equivalent to exactly once, which is actually equivalent to a function execution where the possibility of failure is just removed. That's my most crisp definition of a Temporal workflow.
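One way to picture "observably equivalent to exactly once" is a function whose completed steps are recorded and replayed rather than re-executed after a crash. The toy below is an assumption made for illustration, not a description of how Temporal is implemented: `Journal`, `run_to_completion`, `Crash`, and the order steps are all hypothetical, and the journal lives in memory where a real system would persist it.

```python
class Crash(Exception):
    """Simulates the process dying partway through the function."""


class Journal:
    """Records each completed step so that a re-run can skip it."""

    def __init__(self):
        self.results = []
        self.cursor = 0

    def begin_run(self):
        self.cursor = 0

    def step(self, action):
        if self.cursor < len(self.results):
            result = self.results[self.cursor]  # replay the recorded result
        else:
            result = action()                   # first time: really run the step
            self.results.append(result)
        self.cursor += 1
        return result


def run_to_completion(fn, journal, max_restarts=10):
    """Re-run fn from the top after each crash until it returns."""
    for _ in range(max_restarts + 1):
        journal.begin_run()
        try:
            return fn(journal)
        except Crash:
            continue
    raise RuntimeError("too many restarts")


side_effects = []
crashes_left = {"n": 2}


def charge():
    side_effects.append("charge")
    return "receipt-1"


def ship(receipt):
    side_effects.append(f"ship({receipt})")
    return "shipped"


def maybe_crash():
    if crashes_left["n"] > 0:
        crashes_left["n"] -= 1
        raise Crash()


def order(journal):
    """Failure-oblivious business logic: charge, then ship."""
    receipt = journal.step(charge)
    maybe_crash()  # the process dies here on the first two runs
    return journal.step(lambda: ship(receipt))


journal = Journal()
result = run_to_completion(order, journal)
```

Although `order` starts three times, the customer is charged once and the order ships once. The toy has a gap that a real platform has to close: a crash that lands between running a step and recording it would cause that step to run again.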

CW That's so powerful. The concept of failure being removed, it feels like something that's impossible but it's cool to hear. That level of guarantee kind of blows my mind a little bit.

RD The system's failures are going to happen, but to remove the sort of effect of failure, it almost feels like magic.

DT Our entire job as engineers is to build reliable systems from unreliable components. And that is true for all of engineering, but of course also for software engineering. However, Temporal actually sits in the middle of that just like databases sit in the middle of that with their transactions. And Temporal composes unreliable components into reliable components so that you can build reliable systems from reliable components which is much easier. If the possibility of failure is removed, many problems become basically trivial to solve.

BP So I just want to understand in case I missed it. Maxim, if you can, can you talk a little bit about what the need that was perceived inside of Uber was, the motion that began there in terms of the technology, and then why you decided to step out and sort of spin that out and try to do it independently?

MF So Uber, at least while I was there, put a lot of thought into the availability area. Because obviously with Uber, it's pretty bad if you cannot just press a button and get a car. So availability was something which was very, very important. So Uber engineers put an immense amount of effort into the reliability and availability. But what happened is that they did it the same way as any other engineers out there do that, they do it themselves. They have these existing underlying low-level components, they have databases, they have queues. They have cron jobs, they have all sorts of leader election systems, routers, and so on. So they compose those systems from all these components and it was a huge exercise. Given our experience with Simple Workflow Service at Amazon, me and my co-founder Samar who worked with me on that project as well at Amazon, we kind of realized that this brings a lot of applicability for that type of system at Uber. And so we built it, and it was a small project, it wasn't even funded initially, it was more like a prototype. But then when our management realized the potential of that and we found a couple of internal customers, the project got funded fully and we started to get a pretty significant adoption. By the time I left Uber, it was over a hundred use cases within three years running on that. And after I left its adoption is everywhere. Almost every big system at Uber uses that. I think they publicly, for example, said about the payment system. It was like a bottoms-up kind of movement. And our company is always like that. We always go in bottoms-up. Most adoptions in the big companies come from a single team or single developer doing hackathons, just doing PoC and finding out that it solves a problem they had.

RD That's interesting that the adoption is all bottom-up. You know, they're the ones experiencing these problems, they're the ones having to engineer around these problems, right? Unless it gets really bad, a CTO is not going to see this very often.

MF Actually, it's sometimes not true. In a lot of companies, again, it's bottoms up because of the engineers who are adopting that. But we had quite a success talking to CTOs and other technical leaders, if they're technical. So in some companies the CTO is not a technical person. But if he's a technical person who was an engineer at some point they almost immediately get the value. After they get the value they actually say, "Okay. I have a thousand problems in my company this would solve," and they get pretty excited and they usually fund a PoC and then engineers get involved. So we kind of see two adoption paths, one coming from architects and like technical leaders in the company understanding the value of that, and from just application teams solving the actual problem they have.

BP Can you talk a little bit, both of you, about what your experience has been like working at a startup? I don't know if that's still how you think about Temporal. But trying to build your own company, something that starts smaller and how it's been growing, versus some of the very large organizations you worked at, Amazon. And Dominik the same for you.

MF I don't think we are an average startup in the sense that certainly it was much easier than most startups out there. I talk to a lot of founders and I find that their journeys are much harder than average. The reason is that, there is this saying, I don't remember who said it, that startups are very, very hard until you find the product market fit. And after you find the product market fit, things become easy and this is how you figure out that you've got product market fit. Like things just click in place. Temporal was an open source project from the beginning. We started the company practically having product market fit already approved because there were a lot of companies using us in production by the time we started the company. So we never had to actually go around and look for that magical product market fit, we already had it. So most of our company existence is just execution. So the most complexity was around building the team, like building the company, hiring the right people, that was the hardest part. Otherwise from the product point of view, we have a pretty good idea of what we are doing and we are just still executing the original plan. We obviously did a lot of adjustments, but our original plan always was that we want to be an open source company. We have open source projects. We want every engineer to know about it and have it in their toolbox. Obviously we cannot ask for anyone to use it all the time, but at least look at that if you have a problem and then just know what it can provide to you. And experience shows that in most cases people choose to use it just because it saves them a lot of time and effort. It makes their life much easier in production and operations. And for us, it was mostly a journey about building the company. I always was an engineer and never was a manager. My co-founder as well, he always was an engineer and never was a manager. 
So for us, just the whole company building process was certainly a very interesting journey. We are approaching 80 people right now and our engineering is extremely strong.

CW I think that that team size is kind of ideal too. When you're at that somewhat early stage, right before really taking off and stuff, everyone still kind of knows each other and it's a really fun time in a startup I think when you're just under a hundred people and you're executing and building a lot of cool stuff.

RD Maxim, I'm curious if as CEO you get to still do any engineering.

MF I prohibited myself from coding. I'd love to code, but I was coding for the first two years of the company and the first year I was coding a lot. But now we have such a strong team that I find that me spending time actually coding is detrimental to the company's success. I still participate very closely in a lot of design and architectural discussions. I probably follow almost all high-level architectural decisions. I need less and less of that because we have very strong technical leaders in the company, and for a lot of those I need to just understand high-level what's going on, I don't need to go into the details anymore, which I think is very exciting. And I spend more and more time looking at other parts of the company, like go to market and so on. But we always will be a technology company and I always want to be involved in the technical decisions.

[music plays]

BP All right, everybody. Appreciate you tuning in. Thanks for listening. It's that time of the show. I'm going to shout out the winner of a lifeboat badge who came on Stack Overflow, saved a question from the dustbin of obscurity and shared some knowledge with the community. "How to wrap text without regard to space and hyphen." Thank you to Thanos. Doing some good this time, Thanos.

RD It was a snap, really.

BP Yeah, just snap your fingers. I am Ben Popper, I'm the Director of Content here at Stack Overflow. You can always find me on Twitter @BenPopper. Email us with questions and suggestions at podcast@stackoverflow.com. We'll shout you out. And if you like the show, leave us a rating and a review. It really helps.

RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. You can find me on Twitter @RThorDonovan. And if you have a great idea for a blog post and want to write it, please email me at pitches@stackoverflow.com.

CW I'm Cassidy Williams. I do developer experience at Remote and at OSS Capital. You can find me at @Cassidoo on most things.

MF I am Maxim Fateev. I'm co-founder and CEO of Temporal. You can find more about our company on our website at temporal.io. You can also find me on Twitter @MFateev. And please join our conference at the end of August, 25th and 26th in Seattle.

DT Hi, I'm Dominik. I'm a principal engineer at Temporal. You can find me at temporal.io/slack. You can reach out to me directly. You can find me on Twitter as well @DominikTornow. And I will be at the Temporal conference August 25th and 26th in Seattle in real life at the Temporal conference Replay.

BP In real life. IRL. Very good. All right, everybody. Thanks for listening. And we will talk to you again soon.

CW Bye!

[outro music plays]