The Stack Overflow Podcast

At scale, anything that could fail definitely will

Episode Summary

On today’s episode we chat with Pradeep Vincent, Senior Vice President and Chief Technical Architect for Oracle Cloud Infrastructure, or OCI for short. He shares experiences from his time as an engineer at IBM and what it was like to be a senior engineer working on AWS during the early years of its development as a commercial product.

Episode Notes

Pradeep talks about building at global scale and preparing for inevitable system failures. He talks about extra layers of security, including viewing your own VMs as untrustworthy. And he lays out where he thinks the world of cloud computing is headed as GenAI becomes a bigger piece of many companies' tech stacks.

You can find Pradeep on LinkedIn. He also writes a blog and hosts a podcast over at Oracle First Principles.

Congrats to Stack Overflow user shantanu, who earned a Great Question badge for asking: 

Which shell I am using in mac?

Over 100,000 people have benefited from your curiosity.

Episode Transcription

[intro music plays]

Ben Popper Hey, there. It's Ben Popper. Listeners of this show are obviously interested in software, so I wanted to recommend checking out the Web3 with A16z Crypto Podcast. Brought to you by venture capital firm Andreessen Horowitz, which first coined the phrase “software is eating the world,” this podcast is your definitive resource on the future of the internet, from the latest trends in research to insights from top developers, scientists, and creators. This show is about more than crypto and Web3. It's for coders seeking more ownership of their work, for business leaders trying to prepare for the future today, and for others trying to understand the next innovations and tech trends. Be sure to find and subscribe to Web3 with A16z Crypto wherever you get your podcasts.

BP Hello, everybody. Welcome back to the Stack Overflow Podcast. I'm your host, Ben Popper, back after a restorative sabbatical, and I'm here with my co-host, the Editor of our blog, Ryan Donovan. Ryan, you're about to head out on a sabbatical of your own. You're passing the torch. 

Ryan Donovan That's right. I've been hosting for a little bit and now you're back, dreaming of lobsters. I'm off to dream of lobsters myself. 

BP That's right. I've challenged Ryan to catch a bigger lobster than I did on my sabbatical. But Ryan, you booked today's episode so why don't you set us up? Who are we going to be chatting with and what's the topic of discussion? 

RD So today we're talking to Pradeep Vincent, SVP and Chief Technical Architect of Oracle Cloud Infrastructure. Today we're going to be talking about large cloud platforms, how to incentivize high-quality engineering while still maintaining developer velocity, and what the AI era means for the cloud. 

BP Pradeep, just give the audience a little bit of a flyover. How did you first get into the world of software and technology, and what brought you to the role you're at today? 

Pradeep Vincent Well, Ben, Ryan, thanks for having me. So what got me into software engineering? Well, it probably goes back to my high school days in the 1990s, I guess. I was fascinated, I think, by some video games, which were fairly rudimentary back in the day, but they certainly caught my attention. I wanted to play video games, but then more importantly, I actually wanted to create some, and that was really the genesis. This was the early 1990s with whatever PCs I could get hold of. 

BP Some MS-DOS games? 

PV They were DOS games, yes. And it was essentially manipulating pixels directly as opposed to fancy high-level libraries and so on and so forth. Fairly rudimentary, but it was really fun. 

BP Were you yourself a software engineer or individual contributor before moving into roles of management and now leadership? 

PV Yeah, I got my college degree in Computer Science, a master's in Computer Science, and I got into software engineering. I've worked on a variety of types of software myself. I used to do firmware, microcode-type software for a variety of, let's call them mid-range mainframes at IBM, and storage controllers. Some of them were on servers, some of them were on a PCI card, microcontrollers. And then I joined Amazon in 2005, back when it was just a book company. And even though it was a book company, I was actually doing low-level software, kernel, hypervisor, Xen hypervisor, those kinds of things back in the day. But then this whole new thing called AWS came up in 2006. I don't think too many people had heard about AWS back in the day, but it started in two different parts of Amazon and I got hooked into one of them. I was in EC2 for a while. Early on, I was a hands-on contributor, but I moved to more of a high-level engineering leadership role, and then moved to different parts of AWS compute, networking, and storage. Eventually I decided I wanted to do something fresh and new, and Oracle was essentially looking to enter the market, if you will, in a big way. Oracle back in the day was known as a database and application company, and I really felt that there was a big opportunity for somebody that really, truly understands enterprise customers and enterprise markets to disrupt the cloud industry, as it were. So a few of my colleagues and I, from both AWS as well as Azure, decided to take a bet. We moved to Oracle; I myself moved in 2014. We launched our cloud in 2017 as a full-fledged product and here we are. So seven years later, I think we have a proper cloud, we are growing well, and we have pretty good scale as well. 

BP As was mentioned at the beginning of the call, you're right up there behind some of the huge giants and creating a name for yourselves. It's interesting that you brought some folks with you. You had a little dream team of people who wanted to take this adventure? 

PV We certainly did. I think it was an interesting time overall, and we had each other to help and support each other back in the day. We also had a common notion of engineering culture around building highly available distributed systems. That's essentially the foundation of any large cloud provider, if you will, and I think that common set of understanding and common set of principles we brought in from prior experience really helped. 

RD It's not every day we get to talk to the chief technical architect of a large cloud platform. What does that sort of infrastructure and architecture look like? And like Ben was saying, you brought over a dream team that had worked on other cloud architectures. What lessons did you take from those cloud architectures and what did you change? 

PV I think there's a whole bunch that we learned and kept, and then a whole bunch we learned that we needed to change as well. A lot of it is fairly common across the key large-scale cloud providers. The way we deal with scale, the way we deal with failures, the way we deal with trade-offs of availability and performance and cost, those are some of the fundamental core challenges that cut across engineering. And this is really true for any other major cloud provider as well, but everyone has a different way to deal with it. They have different trade-off points or optimization points. Let me give a few examples. Failure– a big part of how we handle failure is essentially architecting for resilience with respect to many of the common components. We kind of assume across the board that the underlying infrastructure is going to fail. It will fail at scale. Anything that could fail will definitely fail. So as we increase the number of regions and the number of data centers, it's just a mathematical certainty that something is going to happen. So we essentially architect for that. And we also have a core principle around creating bulkheads, if you will, where we minimize the blast radius. The notion of a region is very strong, and it's actually drilled deep down into every software stack; you kind of take it for granted. But if you look broadly, certainly in OCI, you will not find availability challenges that span multiple regions. And that's very, very important for us because our message to our customers is, “Hey, if you want DR, true DR, just go deploy in multiple regions and have DR across them and you're not going to see failures.” And I think we've kept that true for a long period of time. It goes beyond different regions being different data centers and whatnot, which is all basic; it really goes to the way we manage software as well. And so that's very, very foundational, fundamental. The other thing that I would call out are some of the places where we actually changed or looked at things differently. One of them is security. So when we started at Oracle, we looked at our customers and asked ourselves, why is it that 90-something percent of on-prem workloads haven't moved to the cloud yet? This was essentially our thought process back in the day before starting OCI. We looked at it and we came to the conclusion that trust was a huge part of it. Enterprise customers had their own data center, they had a degree of control, and that came with a certain amount of trust. When they go to a multi-tenanted cloud provider, they don't have the visibility, they don't have the control, so the trust was missing. It was a nervous transition for them. They're happy to do that for things that are not core to them or central to their operations, if you will, maybe analytics, certain types of workloads, but the things they consider core they want to kind of keep close to their chest. So we took a look at that as, “Hey, how do we actually architect a cloud that earns their trust?” And a lot of what we do stems from that. Right from the beginning, we took a very different security approach in terms of how we do multi-tenancy in the cloud. We called it Gen 2 architecture, where we essentially took the posture that we need to have very strong layers of defense from a security standpoint. As a concrete example, when we run VMs, when we actually host VMs in the cloud in OCI, we view the VM layer itself as untrusted.
Now, customer VMs are untrusted, that's obviously true, but the VM layer, which is honestly owned and operated by OCI, we view that internally as untrusted as well. And as a result, we have yet another layer below that to protect the core of our infrastructure, and that essentially gives us that layer of trust. Similarly, we view the NIC, the network interface card that customers see, as untrusted. And then we have yet another layer; in some contexts we call it a control computer, in others a network virtualization device, but it's essentially a thing that sits outside of the customer's server and isolates the servers from our core infrastructure. What that does is, and essentially this goes back to trust, we take a more rigorous view of multi-tenancy wherein we really don't want to trust any of that, and we want to protect our core infrastructure from it. Another way we address the trust part is our distributed cloud strategy. That's a marketing brand, but under it, if you look closely, we have Dedicated Region and Alloy, and the underlying theme there is essentially to take OCI regions as close as possible to the customer, put them where they want, and in some cases that's in their own data center. In some cases, we actually put it in an Azure data center because customers may have existing Azure workloads that need to work with OCI services. And one of the reasons for that is trust, in the sense that when we put an OCI region in a customer's data center, they essentially have more control, and that leads to more trust. And in some cases, customers, particularly Alloy customers, want to have more of a say in terms of how and when software deployments go out. They have more control over that, and that again earns trust. So all of these are building blocks that we believe are fundamental to earning the trust of enterprise customers so that they feel comfortable moving their key workloads to the cloud. 
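
To make that layering a bit more concrete, here is a minimal, hypothetical Python sketch of the idea; the class and field names are invented for illustration and are not OCI's actual components or APIs. The point is simply that traffic from an untrusted host never reaches core infrastructure except through an off-box device that enforces policy.

# Hypothetical sketch: treat the host and its NIC as untrusted, and force all
# traffic through an off-box virtualization device before it reaches anything.
from dataclasses import dataclass

@dataclass
class Packet:
    tenant_id: str
    dest: str  # e.g. "10.0.3.7" or "core-control-plane"

class VirtualizationDevice:
    """Sits outside the customer's server; the only path toward the core network."""
    CORE_PREFIX = "core-"

    def __init__(self, allowed_dests_by_tenant):
        self.allowed = allowed_dests_by_tenant

    def forward(self, pkt: Packet) -> bool:
        # Untrusted hosts never get to address core infrastructure directly.
        if pkt.dest.startswith(self.CORE_PREFIX):
            return False
        # Otherwise, only destinations in the tenant's own virtual network pass.
        return pkt.dest in self.allowed.get(pkt.tenant_id, set())

device = VirtualizationDevice({"tenant-a": {"10.0.3.7"}})
print(device.forward(Packet("tenant-a", "10.0.3.7")))            # True
print(device.forward(Packet("tenant-a", "core-control-plane")))  # False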

RD That's interesting. I worked on a cloud security book about seven or eight years ago and the multi-tenancy thing was definitely a big concern. There was talk of various attacks that could break the multi-tenancy wall. There were crazy exploits like Rowhammer which could mess with memory directly.

PV Yeah, side-channel attacks. 

RD Side-channel attacks. Do you think those are still valid attacks in this age of infrastructure as code with Docker containers everywhere? 

PV I believe they are valid attacks in the sense that I don't believe the hardware world has fundamentally solved it. I think we have come a long way in the last 10 years if I look at the silicon industry and the architectural improvements they've been making around multi-tenancy and so on and so forth, but I don't believe we've eliminated the risk. So what that means is that, as customers and as cloud providers, it's our job and responsibility to provide further hardening and layers of protection so that customers are comfortable putting their workloads in. So I think, to some extent, at a chip level, when one particular customer can control fine-grained aspects of what's going on in the chip, then giving the option wherein that chip itself is single-tenanted– not the entire cloud, maybe not the entire server, but the chip is single-tenanted– actually adds quite a bit of trust. So that's one particular approach. The other approach is that you just don't give customers the type of access where they can control what's going on in the execution. That essentially provides a layer of abstraction on compute, where we have enough confidence in that interface that they may not even know what hardware it's running on. It's an abstract thing; we're going to schedule it wherever we want. That type of abstraction is another way to approach it. So to sum it up, I would say yes, it is a valid attack. I do believe that side-channel attacks can happen. I don't believe as an industry we've eliminated that, so I think we do need to continue to have protections against that in various forms. 
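
One rough way to picture the single-tenant option is a placement rule that never co-locates two tenants on the same physical host, so the silicon underneath is never shared. This is a hypothetical Python sketch; the function and field names are invented and do not correspond to any OCI API.

# Hypothetical sketch: enforce single tenancy at the host level so the chip
# underneath is never shared between different customers.
def place_instance(tenant_id, hosts, cores_needed=4):
    """hosts: list of dicts like {"id": "h2", "tenant": None, "free_cores": 32}."""
    for host in hosts:
        # Only consider hosts that are empty or already dedicated to this tenant.
        if host["tenant"] in (None, tenant_id) and host["free_cores"] >= cores_needed:
            host["tenant"] = tenant_id        # host becomes dedicated on first use
            host["free_cores"] -= cores_needed
            return host["id"]
    raise RuntimeError("no single-tenant capacity available")

hosts = [
    {"id": "h1", "tenant": "tenant-b", "free_cores": 28},
    {"id": "h2", "tenant": None, "free_cores": 32},
]
print(place_instance("tenant-a", hosts))  # "h2" -- never co-located with tenant-b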

RD I think I would go back to just how you plan out an architecture of this size. You talked about regions. How do you make region failover and swapping seamless? Servers fail all the time. How do you handle that? 

PV So let's talk about regions first. The way we think about multiple regions is very different from how we think about failures within a region. The general principle is that within a region, we want failures to be as seamless as possible. So we essentially build constructs that cut across what are essentially failure domains, and I'll define what that is in a minute, but then have layers that span across these failure domains so that failover can happen automatically pretty quickly, usually in a matter of seconds, and in most cases it should be invisible to the application or the customer. The failure domains typically are data centers. They're distinct data centers that have distinct power, cooling, energy sources, generators, generator backups, those kinds of things. And in addition to that, they also have a distinct core network. But then in order to build higher level seamless layers, or convenience layers if you will, we also have a common network that connects these multiple failure domains. And then we have higher level services, what we internally call regional services versus AD-local services. AD stands for availability domain, which loosely corresponds to one or more data centers; across availability domains you are guaranteed to have distinct data centers with distinct power, cooling, whatnot. For the higher level services, the regional services if you will, let's take the load balancer service as an example. When a customer uses a load balancer, internally we have multiple load balancer instances, and it's not one to one, it's not like an on-prem load balancer; it's more of a fungible fleet that we manage and we do failovers across it, and the instances are spread across multiple of these failure domains, or availability domains if you will. We have software architecture that detects failures and moves the IP addresses over so that customers that use the load balancer don't see any problems. The same goes for our autonomous database service, where we have a database in one AD and another database in a different AD, and there's a seamless failover. As an application that's using the database, it just fails over. There's nothing to be done. We take care of all the internal movement and the failover and so on and so forth. Now the architecture of this has its own implications. As an example of an engineering trade-off: we want to fail over as quickly as possible, but at the same time, when you have, let's say, a transient failure, there can be a mass migration of something, maybe IPs, whatnot. It can trigger massive chaos somewhere in the system. It could be a bunch of BGP resets. It could be a bunch of storage that starts moving. It could be a bunch of containers, or various control plane elements trying to do reconciliation. While we want to give customers the best experience possible, we also want to minimize the customer impact of these mass migrations or mass moves, if you will. So we have various throttling mechanisms or slowdown mechanisms wherein, hey, move it fast, but if everything wants to move, maybe we should slow down. We have a variety of those types of approaches to deal with failures. And generally, it's a tough problem to solve. I actually wish we could move IPs in a matter of microseconds. It'd be possible, except we'd be churning a lot and actually injecting other problems. 
So we kind of figure out the right set of trade-offs there, and sometimes we actually do failovers differently based on the blast radius. And then you asked about how we deal with multiple availability domains, or multiple regions rather. There I think our approach is a little different. Our goal is to essentially have software that's as independent as possible, period. That's drilled across all our service teams, that's drilled through the way we do change management, so that regions don't ever fail at the same time. We do have some services that act as glue across them, identity being one example. You can essentially have identity rules and identity domains populated in one region and they'll automatically propagate to other regions, with some delays. It's not instantaneous. Similarly, we have object store buckets that are propagated out to other regions so that customers don't have to worry about replicating the data on their own. Similarly for database, we have an autonomous database service that provides cross-region replication that is, again, asynchronous, but it's taken care of automatically. But at the same time, to get the degree of independence that we want, there's always a trade-off between how much of a glue layer we can have across the regions to make things seamless and smooth versus how much independence and isolation we're going to maintain. And as an example of something we try not to share: we have a backbone that connects our regions, but we have independent points of presence where we actually connect to the Internet, if you will. An implication of that is that when you fail over an application across regions, typically you want to do a DNS-based migration as opposed to an IP flip. And that's a trade-off. It comes with a bunch of trade-offs, and we do that deliberately because we want to drive home that degree of isolation across the regions. So essentially it's a trade-off between how much isolation we preserve across regions versus how much seamless, convenient integration we can offer the customers. 
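
As a rough illustration of the "move fast, but slow down under mass movement" idea, here is a hypothetical Python sketch of a failover loop that throttles itself when too much of the fleet wants to move at once. The names and thresholds are invented; a real system would reprogram routes, BGP, or DNS rather than append to a list.

# Hypothetical sketch: migrate virtual IPs quickly in the common case, but cap
# how much of the fleet can move per round so a transient event doesn't trigger
# mass churn (BGP resets, reconciliation storms, and so on).
import time

def fail_over(pending_moves, fleet_size, max_fraction_per_round=0.05, pause_s=1.0):
    """pending_moves: list of (vip, healthy_target) pairs waiting to migrate."""
    moved = []
    budget = max(1, int(fleet_size * max_fraction_per_round))  # per-round throttle
    while pending_moves:
        batch, pending_moves = pending_moves[:budget], pending_moves[budget:]
        for vip, target in batch:
            moved.append((vip, target))   # in reality: reprogram routes / announce
        if pending_moves:                 # large event: spread the rest out in time
            time.sleep(pause_s)
    return moved

moves = [(f"10.0.0.{i}", "fd2") for i in range(1, 11)]
print(len(fail_over(moves, fleet_size=100)))  # 10 moves, at most 5 per round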

BP You started out at Amazon, you mentioned, when it was just a book company. You saw the rise of AWS and of this entire world of cloud computing, struck out on your own to try to create from scratch a new platform for folks. How has the last year or two been for you? What are you responding to in terms of big trends, whether that's just the continued growth of cloud or the continued explosion of data, or more specifically, the shift in focus in the industry towards machine learning and AI services?

PV I think security, the amount of data, and rich analytics on that data have been a trend for the last, I would say, 10 years. It's been on a constant rise and it's a trend that we saw when we started OCI. The AI boom, on the other hand, is not a thing that we quite saw. I don't think most people saw it, and in the last few years it's taken the industry in a very different trajectory. So it's very exciting for us. I actually fundamentally believe, and I see, that the cloud industry is essentially, to a large extent, enabling the growth in AI, in the sense that if you look at generative AI models, perhaps compared to other types of AI, the way training and inference are done requires massive amounts of infrastructure. That requires the wherewithal and the engineering expertise to deal with that infrastructure, whether it's networking, power, cooling, construction, those types of things, and the operations of those, as well as the CapEx-intensive nature of it. I think the cloud providers fundamentally, including OCI, are used to dealing with those CapEx-intensive cycles, if you will. And I think we are essentially providing the platform by which the innovators, the startups that do model innovations, can come in and innovate really fast without having to build out their own capacity. So they essentially get a lot of capacity into their hands very, very quickly to do their training runs and inference. I think that's very powerful, and frankly, the way I look at it, it's been a key enabler for the generative AI trend that we are seeing. I think it's a huge opportunity. It is also a big challenge from an engineering standpoint. There are a lot of things that are actually quite different about modern AI workloads. One of the big things I would call out here is that the way we think about availability, change management, and scale is actually quite different with generative AI. It may not strike you at first. It looks like, “Hey, there's a bunch of GPUs instead of CPUs. What's the difference?” Well, it's actually quite different. As an example, these are large clusters, very large clusters that have tens of thousands of GPUs, but they're still clusters. Nobody says it's a distributed system, and that's a big difference. And it is truly a cluster, because when we have hiccups somewhere in the cluster, the entire training run actually slows down. It's bottlenecked by the slowest GPUs, and that's a big deal. And if you look back at the cloud, we actually evolved out of that. Back in the day, 30 years ago, we had tightly clustered systems and we learned our way out of them toward more of a distributed system, where we essentially assume everything is going to fail all the time and we have looser coupling or looser dependencies across components. But now we have kind of turned back the clock in some ways, and we are dealing with availability requirements that are quite different. Internally we are changing our architecture, engineering practices, and operational practices to deal with that. As an example, when we traditionally do change management, when we deploy a change, we do a very staggered change. We deploy to a small number of servers, watch, make sure everything is good. We have probes, we have canaries, all that stuff. But in large clusters, generally you don't want to do that. You don't want to drip, drip, drip deploy something. 
What customers generally prefer is that you don't actually touch anything at all and then wait for some window and deploy in a big way, which is the exact antipattern of what we traditionally do, but that's exactly what works in this world where there's a big impact to change. In a variety of ways, I think the engineering looks different. It's really exciting, so it's a huge opportunity for all of us and I really look forward to all the things that AI can do in terms of productivity gain. But at the same time, I think it brings in a bunch of engineering challenges that cloud providers need to go handle for our customers. 
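
A toy sketch of the two rollout styles being contrasted here, in Python; everything in it (the host lists, the health check, the window) is invented for illustration. The first function drips a change across a fleet with checks between batches; the second touches nothing until an agreed window and then deploys everywhere at once, so a training cluster is only disturbed once.

import datetime

def staggered_rollout(hosts, apply_change, batch_size=2):
    """Classic cloud style: deploy in small batches and watch health in between."""
    for i in range(0, len(hosts), batch_size):
        batch = hosts[i:i + batch_size]
        for h in batch:
            apply_change(h)
        assert all_healthy(batch), "halt the rollout on the first bad signal"

def window_rollout(hosts, apply_change, window_start):
    """Cluster style: change nothing until the agreed window, then do it all at once."""
    if datetime.datetime.now() < window_start:
        return False                     # outside the window: no changes at all
    for h in hosts:
        apply_change(h)
    return True

def all_healthy(batch):                  # stand-in for probes and canaries
    return True

staggered_rollout(["h1", "h2", "h3", "h4"], lambda h: print("patched", h))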

RD You mentioned the tight clustering of GPUs. Is there a technical reason for that, or is that just habit in how we built them?

PV I think the way the generative AI training algorithms work requires it to be that way. They schedule jobs, shift model weights across the GPUs, and do a sync, if you will, of the model weights across the various workers. It requires the GPUs to run in lockstep, and I think it exists for good reasons. Having said that, I do believe, and I've already seen the trend, that the degree of tight coupling is loosening. We now have mechanisms within the algorithms where we can take nodes out, essentially set them aside, do some churn, and then admit them back. But nevertheless, you still get to a point where, if some of the nodes slow down unexpectedly, they do have negative implications for the training run and its output, if you will. I don't believe this will remain the status quo. I do believe in the next five years we'll get to a point where we have less of a dependency on the tight nature of the workloads, but I think it's going to go in phases. The applications are going to come a little bit and the infrastructure also needs to move a bit, so we need to meet them halfway, if you will, and that's what we are doing. We are not saying, “Hey, your applications, you need to come and meet us.” That's not going to work, that's not practical. So we are trying to do what we need to do on our side. I do expect the applications to loosen their grip as well and be more accommodating of the transient failures that you generally see in the cloud.
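
A tiny, illustrative Python calculation of why a synchronous training step is bottlenecked by the slowest GPU; the numbers are made up. Each step has to wait for every worker before the weights can be synced, so one straggler drags the whole cluster down.

# Toy model: a synchronous step finishes only when the slowest worker finishes,
# then the gradient/weight sync runs, so one straggler slows the entire cluster.
def step_time_ms(per_worker_compute_ms, sync_ms=30):
    return max(per_worker_compute_ms) + sync_ms

healthy = [100, 102, 98, 101]          # ms per step on four healthy GPUs
with_straggler = [100, 102, 98, 400]   # one GPU hiccups

print(step_time_ms(healthy))           # 132
print(step_time_ms(with_straggler))    # 430 -- the whole run is ~3x slower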

[music plays]

BP All right, everybody. It is that time of the show. We want to shout out a Stack Overflow user who came on and shared a little knowledge or maybe some curiosity. A Great Question badge was awarded to Shantanu for, “What shell am I using in Mac?” Asked seven years ago, viewed 100,000 times. Lots of people with this question, and now lots of folks with the answer. So Shantanu, congrats on your badge. As always, I am Ben Popper, Director of Content here at Stack Overflow. Find me on Twitter @BenPopper. Email us with questions or suggestions for the show: podcast@stackoverflow.com. And if you liked what you heard today, leave us a rating and a review. 

RD I am Ryan Donovan. I edit the blog here at Stack Overflow. If you want to reach out to me, you can find me on LinkedIn. 

PV Well, my name is Pradeep Vincent, SVP and Chief Technical Architect for OCI at Oracle. You can find me on LinkedIn. I'm easy to find; my name is fairly unique. If you want to learn more about the cool engineering we do in OCI, we actually have a blog, both a written blog and a video blog series online. It's called OCI First Principles. You can just Google for it and you'll find it. 

BP Sweet, we'll put a link to it in the show notes. All right, everybody. Thanks for listening, and we will talk to you soon.

[outro music plays]