The Stack Overflow Podcast

Mastering microservices with a former Uber and Netflix architect

Episode Summary

Ryan welcomes Jeu George, cofounder and CEO of Orkes, to the show for a conversation about microservices orchestration. They talk through the evolution of microservices, the role of orchestration tools, and the importance of reliability in distributed systems. Their discussion also touches on the transition from open-source solutions to managed services, integration opportunities for AI agents, and the future of microservices in cloud computing.

Episode Notes

Orkes is a developer-first enterprise workflow orchestration platform. Explore the developer edition or dive into the docs.

Before cofounding Orkes, Jeu was an architect at Uber and Netflix. Find him on LinkedIn.

Shoutout to Stack Overflow user Alex Stiff, whose answer to Bash - Sort a list of strings earned them a Lifeboat badge.

Episode Transcription

[intro music plays]

Ryan Donovan Hello everyone and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I am Ryan Donovan, and today we're gonna be talking about microservices orchestration. We're gonna be figuring out, you know, what it is, when you need it, and how it differs from some of the other platform engineering things you may have heard about. My guest today is Jeu George, who is the CEO and Co-founder of Orkes. Welcome to the show. 

Jeu George Hey, thanks Ryan for having me here. Excited to be on the show. 

RD At the top of the show, we'd like to find out a little bit about our guests. How did you get into software and technology? 

JG It was a long time back. I was a mechanical engineer in my past life, right? During my master's, I got interested in, you know, doing coding work, primarily working on projects for [inaudible] and stuff like that, when I was doing my internship at the space organization there, the NASA of India. That's what got me introduced and interested in this, and, you know, there was no looking back since then. Later on, I came to the US to do my master's, again in computer science, and then worked at Microsoft for a long time before I moved to Netflix, which is where Orkes' long history with Netflix begins as well. And that's where it all started.

RD Yeah. So let's talk about microservices orchestration. You know, we've heard of container orchestration and other sorts of things like that. What is microservices orchestration and how does it differ from a service mesh? 

JG Yeah, so just going back into a little of the history of how this area really evolved, right? The product Conductor came out of Netflix, and it had a lot to do with that. But if you wind the clock back a bit, Netflix was also the first company of its size to operate completely in the cloud. And this was way before cloud was even a thing to do, right? Very few companies thought about operating in the cloud, and even if they did, it was a small set of services that they would try out, and stuff like that. And if you look at it, pretty much all of the cloud computing principles that you see today originated from Netflix, things that we take for granted today. Auto scaling started at Netflix and made its way into the AWS tool set one or two years later, for example. But what the cloud made simple was for developers to go and, you know, spin up small services really quickly. And that led to one interesting thing, the whole microservices explosion, and Netflix engineering kind of pioneered how microservices need to be developed and made it fast. The tooling also kind of exploded during that time. But that led to one interesting problem, and the problem was very soon there were more microservices than people in the company. And that's actually true even today, right? And you would see that half the services were built to call into the other services. And those services were generally where the high-level logic would lie. Sometimes it's business logic, sometimes, you know, someone's using this for doing DevOps, so it would be platform-based logic. But that's where all that calling back and forth would lie. Now, microservices were built with one purpose in mind, saying, hey, you do this one thing and do this really well. And if you do that really well, it can be really decoupled. But the way people started to do this was to build out services to call other services, and that means it's not fully decoupled: you're calling service A and then calling into service B. And then there's the whole question of what happens if a call fails, like who's responsible for taking it to completion? All of that stuff. We at Netflix just hit the problem much earlier than others because we were early adopters, and that kind of led to the creation of something like Conductor. And then you talked a little bit about container orchestration. Container orchestration is a layer kind of below that, you know, spinning up containers, and there are a few companies that do that. But this is at an application level, right? If you think of the tech stack of all these companies, at the bottom layer there is the cloud, today ruled by all the big cloud providers -

RD Right 

JG A layer above that is where people think about building applications, and that's the topmost layer. That's the layer where the companies want to spend most of their time. But you need to have these robust, solid platforms on which you can build these modular applications, and a product like Conductor and, you know, the whole orchestration thing was created at Netflix with this in mind, and that's how it has evolved. So that's the real big space that we play in. Again, it's super hard to get this right when you do it yourself. And we are also seeing that big transition: developers, or companies, don't think about building data centers anymore, right? The same transition is happening where people don't think about building these platforms anymore, and instead go with tools like Conductor, which started with Netflix engineering and is one of the most robust, widely used platforms in the world today. That's the journey that we're seeing today.

RD Yeah. You talk about, you know, spinning up containers as needed. Is that how the sort of orchestration works? Do you spin up new microservices as you need them? Are you acting as a sort of traffic cop at some point? 

JG No. So, yeah, you also touched a little bit on the service mesh piece as well, right? Yes, Conductor can do, you know, spinning up of containers, and that's one use case that people use it for. But that's generally not the complete space that we are in. Think of it this way: companies build services, developers build services, and those services come in, you know, written in different programming languages, and they're spun up, right? And this is the layer that basically glues all of those things together. Like saying, hey, start with service one, call that, get the output, take that, wire that into service two, and so on and so forth. And this is where the whole reliability piece also comes into play. You don't want this thing to fail, so once it starts, you wanna make sure it's taken to completion. So there's the whole big reliability aspect of it. It's slightly different from the container orchestration thing that you're talking about.
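
To make that gluing layer concrete, here is a minimal, hypothetical sketch of what an orchestrator does: chain service calls, wire one service's output into the next service's input, and retry each step so the flow reaches completion. The function names and retry logic are illustrative, not the actual Conductor API.

```python
import time

# Hypothetical sketch of an orchestrator's "glue" layer: chain independent
# services, wire outputs to inputs, and drive the flow to completion.
# This is NOT the Conductor API, just an illustration.

def call_service_one(order_id):
    # In reality this would be an HTTP/gRPC call to an independent service.
    return {"user_id": f"user-for-{order_id}"}

def call_service_two(user_id):
    return {"status": "notified", "user_id": user_id}

def run_task(task_fn, task_input, max_retries=3, backoff_seconds=1):
    """Run one step, retrying with backoff so the flow reaches completion."""
    for attempt in range(1, max_retries + 1):
        try:
            return task_fn(**task_input)
        except Exception:
            if attempt == max_retries:
                raise  # the orchestrator would mark the workflow failed here
            time.sleep(backoff_seconds * attempt)

def run_workflow(order_id):
    # Step 1: call service one, then wire its output into service two.
    out1 = run_task(call_service_one, {"order_id": order_id})
    out2 = run_task(call_service_two, {"user_id": out1["user_id"]})
    return out2

print(run_workflow("order-42"))
```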

RD Okay. Yeah. You said this was born out of Netflix, and like you said, it's famously very friendly to microservices. And I think you also worked at Uber, is that right?

JG Yep. 

RD That's another company that is famous for having a lot of microservices. I think we've had two or three companies spun out of open source microservices management tools at Uber. Is this sort of microservices orchestration something where, if you have X number of microservices, suddenly you need this? Or is it something that can apply to a smaller company?

JG Yeah, it can apply to smaller companies, right? Think about it this way: why would people start in the cloud today? There's obviously the cost and capital of building these data centers, and AWS has kind of really solved it for them. Another parallel you could draw: no one thinks about building databases anymore, right? They look at what data needs they have and what kind of database they need, like SQL or a key-value store, and then they pick the right database of [inaudible] and start with that. The journey before the transition shift that's happening today was people building homegrown stuff, just like you mentioned: 'Hey, let me build that thing myself.' And most people don't realize they're building that, right? Because, hey, I'm building a service to call into these five other things. And then over time what happens is that that area becomes complex. That is also where you need the highest amount of reliability, and getting that platform right is an extremely hard thing to do. A few years back, yes, you are right, everyone would build this in-house. And slowly the shift is happening. People realize that, hey, to build this distributed application, you just need something like this. So that transition shift has already happened, and I think it's also rapidly evolving. And the single biggest thing, if you ask competition-wise, and you talked a little bit about Uber and I can go into some details on that, is homegrown solutions moving over to platforms like this. That's the biggest transition shift that we are seeing.

RD Yeah. I know with a lot of the innovators, especially now, it ends up being a sort of open source solution. They end up developing the first one of its kind. And, you know, Netflix Conductor was open source. I know they're deprecating it, I think, right? 

JG No. So the founders of the company were also the founders of the open source project; the CTO of the company started Conductor. You know, when we were there, I was the first user of Conductor with an actual use case, and then it rapidly spread to the rest of the company. Once we founded Orkes, and because we are the founders, Netflix and Orkes got together to shepherd the open source and build up the community. Back then the whole Conductor project was under the Netflix open source umbrella. So it was on the Netflix OSS list of things, and usually, as the popularity of these projects goes up, they graduate from there and move to something else. For example, there's a bunch of projects that started at Netflix, great open source projects, that have since moved from the Netflix OSS umbrella to the Apache Foundation. What we did here was we got Netflix to archive that project, and then we moved the whole thing to the Conductor OSS Foundation. So now there are a lot more companies also involved in this, because it's moved to a foundation that's external to Netflix, and it's really taken off since then.

RD Right. 

JG And as I said, the other thing is, Netflix's Conductor usage within the company itself has gone up five x in the last year or so. So it's rapidly evolving within Netflix as well.

RD So it's still open source software. We've talked to a few companies building businesses on open source software. Do you find that, with the hosted services and extras you provide, it's a difficult sell? Or are companies like, yes, absolutely, handle the rest of it for me?

JG Yeah. So I think the way people think about using open source and monetizing open source, both from the buyer's and the seller's side, has changed over time. Back in the day, when you look at the big early popular ones like Linux, companies like Red Hat, their whole thing was, hey, it's a free thing, but I will provide support. So a lot of open source projects during that time started with this support model. And then as the cloud evolved, a lot of these things moved slowly from, yes, I will support this thing if you are hosting it in-house or in data centers, to saying, hey, yes, here's the open source, it's a great way for you to test things out, but I can host this thing for you in the cloud, so you don't even have to take care of that or have a bunch of developers in the company managing those clusters. A good example of that is something like MySQL. Who wants to host MySQL themselves? Yes, you can. You can download MySQL, it's open source, and now there are variants of that, like MariaDB, for example. You can host it and do everything yourself, but with the cloud, you can go with something like AWS. And you get the whole thing around, you know, data replication, backups, what happens if things break down, how do you recover. It's all managed by the cloud, and it's something that is so important but is already taken care of. So the open source model has slowly evolved from 'I'll host this myself and get support' to 'I will pick a great provider and I'll just connect to that and use it.' And then over time, there are features that developers value, but there's also product value that enterprises need when they're running it at scale, or when they need compliance, or they need to run it in their own cloud. Think about security, governance, auditing, features that are important for enterprises. That division has also happened. So this is where we are. To your main question about concerns around doing this versus that: one thing we do is make it simple for enterprises. We make everything in Orkes completely backward compatible with open source. So if ever somebody wanted to move back, they can just migrate; it's a drop-in replacement. Obviously the enterprise features that we added on top will not be available, but the flows themselves will be completely backward compatible. So that's the confidence that we need to give to enterprises to start with. But I think that boat has sailed, right? People don't think about that anymore. Yeah.

RD Right. Yeah. I mean, especially if somebody's doing this at an enterprise level, there are so many, like you said, compliance, governance, SSO features where it's like, I don't wanna do that. Somebody else knows how this works. Let them sort of take the heat.

JG Yeah, you're right. I think there's that whole thing of 'hey, I can host it myself,' but there's all this operational stuff, right? And on top of that, there's also the lost opportunity cost. For example, there are features that enterprises need. They're there in the enterprise version, which is already backward compatible with open source. So should I have a spinoff team doing this thing and developing on top of it? Or, if I take the enterprise version, I get a continuous set of features coming. For example, the AI stuff that we launched, our enterprise customers just get access to that. They don't have to think about building the whole AI agent workflow and agent platform on top of it.

RD Yeah. I wanna go back to something you said earlier about the reliability of certain transactions. With microservices, you know, some of them will fail. For some of those, it's okay if you fail, you retry, whatever. If you retry twice, that's fine. But for things like payments, you need it to go once and only once. How do you manage that need for reliability, but also have it only execute once?

JG Yeah. So I think there's two questions there. One is the general reliability of platforms, and then how do you make sure, in a use case like payments, that you do this once and only once. In the journey that we have seen over the last few years, when people come from homegrown solutions, or from, I think you talked about the project at Uber called Cadence and the companies built on top of that, or old-school companies like Camunda who do BPMN, when they come over to products like Conductor and Orkes, the number one thing that they look for is reliability. So we built this thing with reliability in mind. That was the first-class tenet that we focused on. Every feature was built with reliability in mind. But to go back to your example of 'do this thing, but do it only once': we support saga patterns, for example. Back in the day, everything used to run in one box, and you would just write this whole thing into a database as a transaction. It passes or fails, so you know exactly what happened.

RD Right. 

JG But right now, things are generally built in the cloud, things are distributed. And when this whole distributed nature happens, transactions are now distributed over a bunch of different services. So what we do is twofold. One is, when things fail, we let them recover and retry and stuff like that. But there may be pieces of the application where you wanna make sure, for example, that the actual act of moving the money only happens once, right?

RD Right. 

JG There may be things like, you know, a notification, where it's okay once in a while if it happens more than once. But when things fail, we also give hooks for people to unwind that distributed transaction. And the user knows best what needs to be unwound, because there's business logic in that. So the user can write what that unwinding should look like. In case they tried and it failed, and they want to unwind that transaction, they can go unwind it. That's how it's built. You can actually put configuration in your flow saying, 'Hey, this one should only happen once and only once,' for example. And with saga patterns and the capabilities the platform provides, you can build your own flows and take care of all the things that you mentioned.
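
Here is a minimal sketch of the saga pattern described above, where each step pairs an action with a user-supplied compensation that unwinds it on failure. All names are hypothetical stand-ins, not Orkes or Conductor code.

```python
# Hypothetical saga sketch: each step pairs an action with a compensation
# that "unwinds" it. The user supplies the compensations because only they
# know the business logic.

def debit_account(ctx):
    ctx["debited"] = True

def undo_debit(ctx):
    ctx["debited"] = False  # refund the money

def send_notification(ctx):
    ctx["notified"] = True  # okay if this occasionally runs more than once

def no_op(ctx):
    pass

SAGA = [
    (debit_account, undo_debit),  # must effectively happen once
    (send_notification, no_op),   # duplicate delivery is tolerable
]

def run_saga(ctx):
    completed = []
    try:
        for action, compensation in SAGA:
            action(ctx)
            completed.append(compensation)
    except Exception:
        # Unwind the distributed transaction in reverse order.
        for compensation in reversed(completed):
            compensation(ctx)
        raise

ctx = {}
run_saga(ctx)
print(ctx)  # {'debited': True, 'notified': True}
```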

RD Hmm. Okay. So they control the actual flow of the failure recovery. 

JG Yeah. So two things. One is the whole retry mechanism and concurrency, which comes out of the box. You can configure how often you want to retry, what type of retry, what the gap between retries is. The retry logic, concurrency limits, rate limiters, those are all things that our platform natively supports. Then, when you do all of those things and it still fails, you may wanna unwind the transaction and not retry right away; you can fail it and come back and retry later. That's useful when you know that one component, one service in that whole flow, is down. There was probably a bug in it, and when it comes back up, you want to retry that flow. That capability already exists. But there are cases where you just don't wanna do this, and so when those things fail, you can configure your whole retry flow. You can put your custom logic in there saying how you want to retry this.
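
As a rough illustration of that kind of declarative per-task policy, here is what a task-level configuration might look like. The field names are loosely modeled on Conductor's task definition schema as I recall it, so treat them as approximate rather than authoritative.

```python
# Illustrative task-level policy, loosely modeled on Conductor task
# definitions (retryCount, retryLogic, retryDelaySeconds, rate limits).
# Field names are approximate, for illustration only.

payment_task_def = {
    "name": "charge_card",
    "retryCount": 3,                      # how often to retry
    "retryLogic": "EXPONENTIAL_BACKOFF",  # what type of retry
    "retryDelaySeconds": 5,               # gap between retries
    "concurrentExecLimit": 100,           # concurrency cap
    "rateLimitPerFrequency": 50,          # rate limiter
    "rateLimitFrequencyInSeconds": 1,
    "timeoutSeconds": 30,
}
```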

RD Right. Yeah. Do you have protections in there? Say it fails, and let's assume that data center's down right now, so they retry in a different data center, but then the original goes through. Do you have protections so people aren't stepping on their own toes?

JG Yeah, so for the thing that you mentioned, there are a few variations of that. One is, when things fall down, how do you need to handle it? There are [inaudible] of this when you operate in the cloud, like multi-AZ, availability zones, or multi-region, and then within multi-region there's active-passive and active-active. Our deployment allows for all of this. And there are pieces of the services, those individual blocks, which developers write code for, and they continue to write code for those. They know best which are the things you need to do only once, and which are the things where it's okay to do them again. So the idempotency factor needs to be taken care of by the developer, in the piece of code that they're writing. But that said, the signals on where you need idempotency, things like the idempotency key, get passed on from here. So with all of this, it becomes a very powerful tool where you can create idempotency where needed. And you also have the context on where the idempotency is needed, or which flow it is actually being called from, via the key that's getting passed.
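
Here is a minimal sketch of the worker-side idempotency just described, assuming the orchestrator passes an idempotency key with each task. The storage and function names are made up for illustration; a real worker would use a durable store rather than an in-memory dict.

```python
# Hypothetical worker-side idempotency sketch. The orchestrator passes an
# idempotency key with the task; the worker uses it to make "move the money"
# safe to retry. The dict stands in for a durable database.

processed = {}  # idempotency_key -> result

def move_money(idempotency_key, amount):
    if idempotency_key in processed:
        # A retry (or a duplicate from a failed-over data center) arrived:
        # return the original result instead of charging twice.
        return processed[idempotency_key]
    result = {"charged": amount, "key": idempotency_key}
    processed[idempotency_key] = result
    return result

first = move_money("wf-123-task-charge", 100)
duplicate = move_money("wf-123-task-charge", 100)
assert first == duplicate  # the second call is a safe no-op
```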

RD Cool. Okay. And I know these microservices are super distributed; there's a lot going on at once. What do you have in place to manage the scale concerns of that?

JG Yeah, so the actual act of doing the orchestration, that's where the power of our platform comes in, and that's why people buy us, or use the open source. It's extremely, highly reliable, and it was built at Netflix scale, and that's probably one of the most scalable software companies out there; with the amount of paid subscribers, I think it's the largest one in the world. So it was built with that in mind. The actual scale and reliability of the orchestration platform itself, that's our power. But there are also the services that it's trying to orchestrate. Those services also need to be provisioned for the scale that's coming, because those services are being called into when the actual flow is run. So there's two things. One is, if things fail, what kind of reliability patterns do you have in place? There are a few things that we do, like circuit breakers. If the service is going to fail anyway, going and bombarding it with additional calls is just going to create the same problem. You can solve that with a few capabilities that we have in the platform: circuit breakers, rate limiters, and the retry mechanisms we already have, where you can back off, but back off exponentially, so that you give time for those services to recover. But if the services are set up for scale, we also give auto-scaling signals to those services through APIs. So let's say you're getting a thousand transactions per second, or 10,000 transactions per second, super high volume; maybe it's a payment transaction service and Prime Day has come up, and suddenly there's a big burst of traffic coming in. Those services are set up for scale, but you also need signals on how to auto scale and when to auto scale. So we provide APIs for people who say, 'Hey, there's a bunch of calls coming in. I'm not provisioned yet for that scale, but I know the amount of scale that I need in order to consume that in a reasonable amount of time.' You get signals and you can set up auto-scaling policies accordingly. And that's how the whole scaling mechanism works.
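
For the circuit breaker idea, here is a minimal, generic sketch (not Orkes code): after a few consecutive failures the breaker opens and short-circuits calls until a cooldown passes, so the struggling service gets time to recover instead of being bombarded.

```python
import time

# Minimal circuit-breaker sketch, for illustration only.

class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_seconds=30):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooldown_seconds:
                raise RuntimeError("circuit open: not calling the service")
            self.opened_at = None  # cooldown over, let one call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # trip the breaker
            raise
        self.failures = 0  # success resets the count
        return result

def flaky():
    raise TimeoutError("service down")

breaker = CircuitBreaker()
for _ in range(3):
    try:
        breaker.call(flaky)
    except TimeoutError:
        pass
# The next call is short-circuited instead of hitting the dead service:
try:
    breaker.call(flaky)
except RuntimeError as e:
    print(e)  # circuit open: not calling the service
```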

RD And, you know, everybody wants to talk about AI agents these days. With AI agents, everything is becoming an API; microservices sort of talk to each other through APIs, but soon those will all be exposed to the public. Is there any additional work you're planning to support the potential of microservices being exposed to agents? Or is that just part of the package as it is?

JG Yeah, so this is also a rapidly evolving space. If you look at something like Conductor and Orkes, the company that we have built, it started off early on at Netflix as a pure-play workflow engine, primarily built for asynchronous flows, long-running flows. That was the nature of the beast when this was built; it was primarily used for backend applications. Quickly it moved from a workflow engine to an application platform. The initial use cases were long running, maybe even short running, but generally asynchronous in nature, primarily built for the backend. That evolution continued, and then we moved into the real-time API orchestration space. How do you operate at extremely large scale with low latency? Especially with payment transactions, credit card transactions, when the consumer is sitting on the other side and they swipe the credit card and you wanna do this with very, very low latency. And then people think about this and say, I wanna construct my business process in this. So the core platform and the capabilities that we built on top moved from an application platform into real-time orchestration, for example the API gateways that came out of that, and also things like business process orchestration. In the last couple of years, one big evolution that's happened is the capability of the platform to go and build agents and agentic workflows. And now when you think about agents, everyone has used ChatGPT today, right? It's wonderful when it works, but it also hallucinates, and you know -

RD Sure

JG And when it doesn't work, you have to go back and forth and try to do a bunch of things to make it right. To account for that, we have the whole prompt engineering studio that we built, and you can go and test out these prompts and make sure that they're reliable before you go and use them. But that said, when companies think about building agents in their enterprises, a few things come into play. One is, what kind of models do I want exposed to my organization? The interface to those AI models today is through prompts, and that's where the data leakage happens. So how can I tighten that? Things like the prompt engineering studio tell you exactly what data leaks, or what data is exposed, via those prompts, and then you build security and governance on top of that. And the other piece is that there's limited AI talent today. So you want to do it in such a way that an engineer with zero AI background can come and use this almost like an API, right?

RD Right. 

JG So that was our initial set of things that we did. But as people are building AI agents using this in their enterprises, a few more things come up. How can I run this in my cloud? Because I don't want my data to leak into those things. That's something that we provide out of the box, and we have had that from day one. Can I connect to my internal APIs? Can I connect to my internal knowledge base? That's a powerful thing that people need, because that way the data stays there, but agents also get access to it, which makes them more powerful. And the last thing is, how do you make the agents extremely reliable? A few things come into play there. Why do people use agents? When people build agents, it could be customer service agents, it could be travel booking agents, it could be someone doing fraud detection in real time, payment processing, or a DevOps agent. When you do this, there are pieces of the flow that are extremely deterministic: sending an email, sending a notification. That's part of the whole agent flow, for example, but you don't need to ask an agent to do this if you already have an API for it. So that's the whole deterministic aspect of it. But there are things which LLMs are extremely, extremely good at. So first, the platform gives you the ability to pick and choose what you want. But at the end of the day, when it doesn't work, you also want to put guardrails in place. You let the AI do all of this magic and then add the guardrails at the very end, and the guardrails can come in the form of APIs that you can connect and test, and that can be the last part of the agent flow. Or humans can come into play, if you still want a human to go and just take a look at it, one final check, before you let it through. So we have given enterprises the path to go and build completely autonomous agents, and we give them the capabilities to do that. That's how the platform is rapidly evolving.
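
Here is a minimal sketch of the agent flow just described, combining a deterministic step (an email API), an LLM step, an end-of-flow guardrail, and an optional human check. Every function is a made-up stand-in, not an Orkes API.

```python
# Hypothetical agentic-flow sketch: deterministic steps stay deterministic,
# the LLM handles what it is good at, and a guardrail plus an optional
# human check run at the very end. All names are illustrative.

def llm_draft_reply(ticket_text):
    # Stand-in for an LLM call; this is where hallucination risk lives.
    return f"Suggested reply for: {ticket_text}"

def guardrail_check(draft):
    # Deterministic guardrail at the end of the flow, e.g. a policy API.
    return "refund everyone" not in draft.lower()

def send_email(draft):
    print(f"EMAIL SENT: {draft}")  # deterministic step: just call the API

def run_support_agent(ticket_text, require_human=True):
    draft = llm_draft_reply(ticket_text)
    if not guardrail_check(draft):
        raise ValueError("guardrail rejected the draft")
    if require_human:
        # Human-in-the-loop: one final check before letting it through.
        if input(f"Approve reply? [y/N] {draft!r} ") != "y":
            return "held for human review"
    send_email(draft)
    return "sent"

# Fully autonomous variant: skip the human check.
print(run_support_agent("Where is my order?", require_human=False))
```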

RD Okay. 

JG One thing that differentiates Orkes from the rest of the competition, if you look at it, is that what we are really replacing, or what people are coming away from, is homegrown solutions. And when people move away from homegrown solutions, there are competing products that people look at, Conductor being the most stable and most used in the world today in that segment, but there are other competitors. What we do is give the choice to developers to go and build stuff the way they want to, in the language of their choice. Whether you wanna write your flows in code or configuration or UI, and that's where we really differentiate too, we give the choice to the developer to do what they wanna do, how they wanna do it. We give them the choice of bringing in whatever language they choose. There are companies who are Java heavy, some people build stuff in Golang, some people are still stuck in the C++ world, and maybe there's a need for doing that, and some are still using mainframes. So we give the choice; we are language agnostic that way. And you also wanna give the deployment choice to companies. You want to use an Orkes cloud account and just connect to that? Yes, there's a way to do that. But if they wanna bring their own cloud or run it on-prem, that works too. So that's where we differentiate: give the choice to the developers, give the choice to the enterprise. But the value people come to us for most is reliability.

RD Yeah. So almost homegrown, but a little more reliable. 

JG Yes, exactly.

RD All right, everyone. Thank you very much for listening today. We are at the end of the show, where we shout out somebody who came on Stack Overflow, dropped a little knowledge, and earned a badge. Today we are shouting out a Lifeboat badge winner: somebody who came onto a question that had a score of negative three or less and dropped an answer that got a score of 20 or more. Today we're shouting out Alex Stiff for answering 'Bash - Sort a list of strings.' So if you've been looking to sort your strings in Bash, we have an answer. I'm Ryan Donovan. I edit the blog and host the podcast here at Stack Overflow. If you wanna send us any comments, feedback, concerns, or new topics, you can reach us at podcast@stackoverflow.com. And if you wanna reach out to me directly, you can find me on LinkedIn.

JG Ryan, thanks for having me on the show. I'm Jeu, the Founder and CEO at Orkes. You can find our company's information at orkes.io. You can also follow me on LinkedIn, Jeu George, Founder of Orkes, and you can also follow me on Twitter.

RD All right. Thank you very much for listening, everyone, and we'll talk to you next time.

[outro music plays]