The Stack Overflow Podcast

Understanding SRE

Episode Summary

Vladyslav Ukis, Head of R&D at Siemens Healthineers and an expert in site reliability engineering (SRE), joins Ben and Ryan to talk about the relationship between SRE and DevOps, balancing SRE principles with organizational structure, and how he thinks GenAI will impact his field.

Episode Notes

Vlad is Head of Research and Development at Siemens Healthineers, the healthcare arm of tech conglomerate Siemens. He wrote about SRE on our blog here.

His book, Establishing SRE Foundations: A Step-by-Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations, is available now. 

Site reliability engineering (SRE) applies a software engineering approach to IT operations and infrastructure, with the goal of building scalable, reliable systems capable of handling constant updates from dev teams. SRE is closely related to DevOps.

ICYMI, we talked with Chef cofounder Adam Jacob about how he’s creating a new-and-improved approach to infrastructure automation. Listen to that conversation here.

Connect with Vlad on LinkedIn, where you can also read snippets of his book on SRE.

Lifeboat badge winner Abbas Galiyakotwala’s answer to How do I split a comma-separated string? filled a void of ignorance with a little extra knowledge.

Episode Transcription

[intro music plays]

Ben Popper Good morning everyone, or whatever time it is for you. Welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I am Ben Popper, Director of Content over here at Stack Overflow, joined as I often am by my colleague and collaborator, Ryan Donovan, Editor of our blog, maestro of our newsletter, and the impresario behind today's episode. Ryan, you invited a guest who you worked with on a blog post. Take us away.

Ryan Donovan Well Vlad Ukis, our guest today, wrote a blog post about setting up team topologies for SRE, what SRE looks like within organizations. And he wrote a book on all things SRE, so we're going to talk about site reliability engineering.

BP Well Vlad, welcome to the program.

Vlad Ukis Thank you very much for having me. I’m looking forward to the conversation on SRE. 

BP So for folks who are listening, just give them a quick background about yourself. How did you get into the world of software and technology, and what led you to sort of focus in on SRE? 

VU So I started tinkering with computers many decades ago, I would say. The interest started back in school and then I studied computer science first in Ukraine where I'm from, and then in Germany where we moved to, and then also in the UK where I did my PhD, so lots of university backgrounds in computer science. And then my professional life was at Siemens Healthineers, which is the healthcare arm of Siemens. And there we were working on the enterprise architecture when I started and then started slowly exploring the cloud domain. First of all, the private clouds, what would it mean to put systems in the cloud for the healthcare domain, and then at some point we also started exploring public cloud, and this is the product suite where I'm working right now and have been working for the last several years. So that's called Siemens Healthineers Teamplay Digital Health Platform, which is a cloud platform for the entire company to put digital services on and offer them as a service to the hospitals around the world. And if you offer software as a service, then you are also obliged to operate the software, and that then brings you into the world of software operations. And once you are there, you then try figuring out how to operate software in the best way possible, and that's what brought us to SRE.

RD So give a little baseline definition here. When we're talking about SRE, what are we talking about and how is it related to DevOps which I think is a bigger, broader thing? 

VU SRE is first of all an abbreviation for site reliability engineering, which is a discipline within computer science which was conceived at Google, because Google, several decades ago, had a problem that, on the one hand it was growing very fast, but on the other hand, it also required linear growth in the number of engineers in order to run Google. And that couldn't continue that way, that linear growth of the number of engineers running Google and the actual growth trajectory of Google. So they started experimenting with approaches of how to do operations in a software inspired way instead of the, back then, traditional way of running software using administrators who are manually doing things with servers and so on. So they came up with a set of practices which they then wrote up in several books, which are very well known in the operations community, which are called the original SRE books from Google. So this is a discipline in order to run services reliably at scale. And if you compare SRE with DevOps, then you are totally right in saying that DevOps is a more overarching philosophy of running product development. So with DevOps you've got a philosophy of short feedback cycles where you are able to deploy to production frequently, you are able to measure the effectiveness of what you deployed on the users, and that way you are able to operate fast under the tight guidance of user feedback. And within that realm, you also need to operate the services reliably as well. And although DevOps tells from the philosophical standpoint that you need to do this, it doesn’t give you very concrete tools and organizational methodology in order to do so. So if you look on the internet you'll find really great videos that compare DevOps and SRE, and they tell you that SRE is a concrete implementation of the DevOps philosophy. So in computer science terms, SRE implements the DevOps interface, so to speak. 
And this is what you find when you dig into the methodology, so it provides you with a way to operate services reliably at scale and it satisfies the DevOps philosophy. So that's how I would position those two things. 
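
The "SRE implements the DevOps interface" remark can be read quite literally. Below is a minimal Python sketch of that idea. The two method names are picked for illustration and are not taken from the episode; the point is only that the abstract class says what has to happen and the concrete class says how.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The philosophy: states what must be done, not how to do it."""

    @abstractmethod
    def measure_everything(self) -> str: ...

    @abstractmethod
    def accept_failure_as_normal(self) -> str: ...


class SRE(DevOps):
    """One concrete way to satisfy the philosophy (illustrative wording)."""

    def measure_everything(self) -> str:
        return "define service-level indicators and track them against SLOs"

    def accept_failure_as_normal(self) -> str:
        return "spend an error budget instead of chasing 100% availability"


practice = SRE()
print(isinstance(practice, DevOps))
print(practice.measure_everything())
```

Trying to instantiate `DevOps` directly raises a `TypeError`, which mirrors the point made above: the philosophy on its own cannot be "run" until something implements it.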

BP But is there any dev in SRE? Are you mostly just handling and maintaining and operating things once they've been put out into the world, or does SRE sometimes also involve that first part? In DevOps you're trying to do both– help folks through the development side and then also make sure operations run smoothly. In SRE, do you have both of those or just the latter half? 

VU So although SRE focuses on running the services reliably at scale, in order to do so you actually have to start way before that. So you need to build reliability into the services as you develop them and as you conceptualize them, as you incept them from the beginning, so that in the end they actually run reliably. And then of course you've got certain procedures in order to measure their reliability, in order to see whether their reliability in production actually gets fulfilled as you envisioned. But in order to be effective with putting out a reliable product, you need to start way before the service is in production. 

RD The post you wrote for us was about what the SRE program looks like in a team topology, and I think what was interesting is that you don't necessarily need separate SRE people. Sometimes the devs do it themselves. Can you talk about what those topologies are? If you don't have an actual SRE person, what does that mean for doing an SRE program? 

VU Yes, so when you do an SRE program then you will involve several roles and you will necessarily need to bring onto the same table product development, product operations, and product management. So those three parties, they need to come together in order to set up an SRE program that permeates the entire organization so that the organization is aligned enough on operational concerns. Now in that realm, you need to, at some point if you want to professionalize this, also think about a suitable organizational structure for the whole thing. And here there are many options, and as we described in the blog post, based on my research from my book, I identified nine options to do so. So we will not go through all of them right now, but by and large, there are three big areas there to consider. So you've got a development organization, and you can put everything that's related to SRE into the development organization. That's one option. Then you have got the operations organization. Another option would be to put everything related to SRE into the operations organization. And another third option would be that you set up an additional third organization like what was done at Google, and that organization is called SRE organization and you concentrate the activities there. And there are of course things done in between where there is shared responsibility between those organizations and so on. So there will be a big decision whether you want to set up an entire SRE organization or not, but in any case, I think you'll have a development organization and operations organization. And depending on how you distribute the responsibilities, interestingly, you will also then inspire different identities with the people who run the services. 
So if you put all the activities into the development organization, then it’s likely that the people will identify themselves with the products they run a lot, because they are part of the development organization that lives and breathes those products. If you put the SREs into the operations organization, then it will be more oriented towards incidents and kind of numbers of incidents and we want to reduce the numbers of incidents and so on, those kinds of classical ops things. And that will be regardless of the products you're on, so you’re more focused on the pure context-free ops stuff. And then if you put the SRE activities gravitationally more into the dedicated SRE organization, then I think those things that are related to SRE will come to the forefront. For example, you would be more about, “Okay, so how do we really properly define service-level objectives so that they really reflect the reliability experience by the users so that when the service-level objectives are fulfilled, we know transitively that the users are happy, and if the service-level objectives are broken then you know that you've actually violated user experience in a visible way.” So those kinds of SRE core things and principles will come to the core more in the SRE organization. And then depending on how you set it up, in general it can be said that the more you put developers on call for their services, the more incentives they will have to implement the reliability into the services from the beginning. So this is coming back to the question that you asked a couple minutes ago of is it only about really operating this stuff towards the end, or is it also more about involving the SRE thinking at the beginning of the product lifecycle? And the more you put the actual developers who implement reliability into doing operations work, the more they will actually do the reliability implementation and reliability thinking in the design process and development process of the services. 

BP So is that your sort of platonic ideal? It sounds like there's a lot of different topologies, a lot of different ways you could slice this, and at different organizations there might be different permutations of this. Your ideal is to make sure that the folks who are building understand that they're going to have some responsibility at the end also for maintaining and operations. That way they're going to bake it in and they're going to be with this code through its whole lifecycle, from the ideation out to the delivery, and hopefully that will lead to the best result. Because in my experience, SRE has been more like some of the earlier ones you mentioned where somebody builds it, they throw it over the wall, the other people are responsible for the runbook and putting out the fires. And those two things are two different disciplines, not necessarily connected.

VU Right. And if you follow the original SRE books from Google, then you'll see that although they've got an entire SRE organization that is running the services, actually it never starts with the full SRE support from the SRE organization. So you actually need to convince the SRE organization within Google to help your product, and before they do this, you are responsible for running your service yourself. So that means you are totally in the “you build it, you run it” philosophy. And once you've later enlisted support by the SRE organization, if your services then fall below certain service-level objectives that you have agreed between the development organization and the SRE organization, then the SRE organization will return the services to you and then you are back to “you build it, you run it.” So that means that, as the development organization, you always have skin in the game of running services, even in the original SRE literature. And I would recommend doing so. As you suggest, I think the developers need to have some operational responsibility, and the extent to which they have got the operational responsibility, that's negotiable, that's organization by organization, that can be discussed.
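
The hand-back rule described here can be sketched in a few lines of Python. This is an illustration under assumptions, not Google's actual process: the `Service` fields, the single availability number, and the function name `operator_after_review` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    slo_target: float     # availability agreed between dev and SRE, e.g. 0.999
    measured: float       # availability observed over the review window
    sre_supported: bool   # True if the SRE organization currently runs it


def operator_after_review(svc: Service) -> str:
    """Return which organization runs the service after an SLO review."""
    if svc.sre_supported and svc.measured < svc.slo_target:
        # The agreed SLO was missed: the service goes back to its developers
        # ("you build it, you run it").
        return "development"
    return "sre" if svc.sre_supported else "development"


checkout = Service("checkout", slo_target=0.999, measured=0.995, sre_supported=True)
print(operator_after_review(checkout))
```

In this sketch a service that has never been onboarded stays with development regardless of its numbers, which matches the point that SRE support has to be earned first.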

RD Yeah. It's interesting that you talk about wanting to get the SRE voice at the beginning in the development lifecycle. It feels like it's part of a lot of disciplines trying to shift left: security, privacy, and SRE too. Do you think that there is a sort of stereotypical pitfall that happens when you don't have SRE voices in the development cycle?

VU Oh yes, absolutely. I think the development teams will learn it the hard way easily then. That'll inevitably happen, but actually, interestingly, that will only inevitably happen if the product is showing signs of success. So if you deploy something to production and nobody uses it, whether you have SRE or not kind of doesn't matter that much. Also if that product doesn't get a lot of traffic, you actually might not run into lots of issues from the beginning because if something doesn't work and there is not a lot of traffic, you can still kind of fix it manually and so on. If, on the other hand, the product is starting to take off, that means there are more and more and more users. And with that, your availability requirements are getting harder and harder and harder to fulfill and therefore your time to recovery then must be shorter and shorter and shorter. This is where you need proper processes. This is where you will not survive without SRE for long.
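
The link between a tighter availability requirement and a shorter time to recovery is plain arithmetic, and a small Python sketch makes it concrete. The targets and the 30-day window below are example values chosen here, not figures from the conversation.

```python
from datetime import timedelta


def allowed_downtime(availability_target: float, window: timedelta) -> timedelta:
    """Downtime a service may accumulate in `window` and still meet its target."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    return window * (1.0 - availability_target)


month = timedelta(days=30)
for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} over 30 days -> {allowed_downtime(target, month)}")
```

Each extra nine cuts the budget by a factor of ten: about 7.2 hours at 99%, about 43 minutes at 99.9%, and a little over 4 minutes at 99.99%. With only a few minutes available per month, manual fixes stop being an option.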

RD So if people want less bugs in production, they should have less users then, right?

VU That's one way to put it, but then you can start questioning how long you're going to survive with a product like that.

BP Forget product market fit, we're talking errors down to zero. So it sounds like you've worked in this area for a long time and you're going back to these original Google documents as well as writing your own book on it. What are some of the best practices that have emerged recently or some of the technologies and frameworks that have emerged recently that you think are the most powerful in this area that are enabling folks to evolve what SRE can be and to improve outcomes? 

VU To my great surprise, actually there are lots of companies out there who are saying that they are practicing SRE, but they haven't defined service-level objectives. So this comes out of several surveys on the internet and so on. And that surprised me. Why? Because if you look at the SRE principles, then one of the central principles there is to manage by service-level objectives. So if there are no service-level objectives, then how can you manage? Then you’ve sort of picked bits and pieces out of SRE methodology and you say that you are running the SRE way, but actually the core essence is still missing there. So therefore, I'd say one of the core practices that's on the rise now is to actually define and adopt and manage by service-level objectives. I think this is absolutely key because without this, you cannot really quantify reliability. Without this, you cannot have your SRE infrastructure provide you with data– short-term data, long-term data about what's reliable to which extent, what's not reliable to which extent, and for which time period, and so on, and therefore you cannot present that data in aggregation to your product management in order to influence prioritization of reliability stuff where there is least reliability at the moment and so on. But I think adopting service-level objectives and really leaning into that idea and really bringing folks to the same table in order to get that done, I think this is really core and it's surprising to me that not all companies are doing this who are talking about SRE. I think this is absolutely essential.
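
To make "quantifying reliability" concrete, here is a minimal Python sketch of a service-level indicator and the error budget that follows from an SLO. It assumes a simple request-based indicator (good events divided by total events); the class, function names, and traffic numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str
    target: float  # fraction of good events required, e.g. 0.995


def sli(good_events: int, total_events: int) -> float:
    """Service-level indicator: share of events that met user expectations."""
    if total_events == 0:
        return 1.0  # no traffic, so nothing was violated
    return good_events / total_events


def error_budget_remaining(slo: SLO, good_events: int, total_events: int) -> float:
    """Fraction of the error budget left; a negative value means the SLO is broken."""
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


availability = SLO("report-upload availability", target=0.995)
good, total = 199_400, 200_000  # made-up counts for one review window
print(f"SLI: {sli(good, total):.4f}")
print(f"budget left: {error_budget_remaining(availability, good, total):.0%}")
```

With these made-up counts, 600 requests failed against an allowance of roughly 1,000, so about 40% of the budget remains. Aggregated per service and per time window, numbers like these are the kind of data that can be put in front of product management.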

RD I think when you talk about the service-level objectives, it almost sounds like a good SRE program is basically an in-house web hosting, hosting the sort of online presence. Is that fair to say or is there something more to it? 

VU When you say good web hosting, what do you mean? Which aspects do you specifically have in mind? 

RD They take care of all of the sort of infrastructure and all of the stuff behind the code running in production. Ops usually handles code in production and then a sort of separate SRE team makes sure that it's got the uptime, it's got the response time. 

VU Right. So I would suggest that a successful SRE program will first of all establish a joint understanding of the reliability objectives that we've got as an organization and then as a set of services owned by a particular team. So each team will have an understanding of their reliability goal, number one. So then number two, what will happen is that there will be transparent tracking of those goals, whether the services are fulfilling the goals or service-level objectives or not, and then there will be a rather continuous dialogue within each team and also at the higher level within the organization whether we are fulfilling the reliability objectives, the SLOs that we set for ourselves, and whether despite fulfilling them, we are still getting customer complaints– or the other way around, we are not getting customer complaints, but we're still violating our defined SLOs, which means that the SLOs have been defined too tightly and we can actually relax them without violating the user experience. So that continuous dialogue in the organization around reliability will happen. And because the dialogue happens at different levels of the organization, so at the team level for the services they own, and then at the management level for aggregate services that are offered externally, you then really have got an alignment on operational concerns throughout the organization and how the workload will be distributed. So between the development organization, operations organization, and possibly also the SRE organization, that's different enterprise by enterprise. 
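
The continuous dialogue described here compares two signals: whether the SLO was met and whether users complained. A small Python sketch of that comparison follows. The four labels and the complaint threshold are assumptions made for illustration; a real review would weigh more context than two inputs.

```python
from enum import Enum


class Finding(Enum):
    ALIGNED = "SLO met and users are content"
    TOO_LOOSE = "SLO met but users complain: tighten it or add an indicator"
    TOO_TIGHT = "SLO broken but users are content: consider relaxing it"
    UNRELIABLE = "SLO broken and users complain: prioritize reliability work"


def review(slo_met: bool, complaints: int, complaint_threshold: int = 0) -> Finding:
    """Classify one service for the periodic reliability review."""
    users_unhappy = complaints > complaint_threshold
    if slo_met:
        return Finding.TOO_LOOSE if users_unhappy else Finding.ALIGNED
    return Finding.UNRELIABLE if users_unhappy else Finding.TOO_TIGHT


print(review(slo_met=False, complaints=0).value)
```

The two mismatched cases are the interesting ones: they indicate that the SLO itself needs adjusting, not only the service.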

RD We like speculating about the future, how do you think SRE is going to change in the future? Do you think new technologies like the generative AI are going to affect how SRE operates?

BP Yeah, I think that's a great question, Ryan. We did a piece about self-healing code and I've seen lots of examples of folks asking different AIs now to write something for them and then to debug it or to look at an error they received and help them resolve it. What do you think is in store for us when it comes to the combination of generative AI and SRE?

VU Where I clearly see a value add is in things like onboarding new developers onto the existing infrastructure, for example. So imagine you've got a new developer joining the organization and in the organization they have got an SRE setup. That means the developers go on call to some extent, and you need to bring the developer up to speed so that they understand the operational part of the story well enough in order to be effective at going on call. So then they need to understand what the infrastructure is in order to get to the logs, in order to get to see who is currently on call and for what, they need to be able to write queries deep enough in order to see what's going on with the services, and imagine you can do all this by chatting with a knowledgeable ops bot. I think you will be productive much faster than today. And imagine you would be able to even run simulations with that ops bot and say, “Okay, so now simulate that we’ve got a priority one outage with that particular service. So then what do I do first? So then where are the logs for that service? Who's currently on call for that service? Is that me or someone else?” So then you'll be able to find that out. So basically, usually in that operations arena you've got lots of tools and you need to know where to go and so on. Imagine all that would be taken care of for you by an ops bot with whom you would be able to have a conversation as a preparation, and then also in the real world once an incident unfolds. I think there is a lot of potential there for sure. And then another thing is of course inspecting the code for reliability. So there have been several attempts to understand whether the code that's been written has got enough resilience, for example, implemented for a certain load in production and so on. And imagine that could be done in near real time as you write the code. Then you've got sort of the operational copilot, which I haven't heard of yet, but I think could be imagined. 
So imagine that operational copilot then is thinking along with you as you write code and suggesting, “Okay, so maybe here you are calling a particular service, and I'm seeing that that service actually violated its SLOs for the last half a year. Therefore, in order to be extra careful here, why don't you put a circuit breaker in there, because this is a call where I'm seeing from the experience that you might run into trouble,” and things like that. So this is where I see potential and I'd like to see progress, and I think that would elevate the industry a great deal. 
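
For readers who have not met the pattern, here is a minimal circuit breaker sketched in Python. It is a simplified, single-threaded illustration and not a production library: the class name, the default threshold and cool-down, and the injectable `clock` parameter are all choices made for this example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after      # cool-down in seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("dependency recently failed; failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap the risky dependency, for example `breaker.call(lambda: fetch_report(patient_id))`, where `fetch_report` stands in for whatever downstream service has been missing its SLOs. Once the threshold is reached, further calls fail immediately instead of piling load onto a struggling service.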

BP Very cool.

[music plays]

BP All right, everybody. It is that time of the show. Let's shout out someone who came on Stack Overflow and shared a little knowledge to help the community. Awarded July 20th to Abbas, “How do I split a comma-separated string?” If you've ever wanted to know the answer, Abbas has it for you, and earned himself a Lifeboat Badge and helped over 68,000 people. So thanks, Abbas. I'm Ben Popper. I'm the Director of Content here at Stack Overflow. You can always find me on Twitter, which I guess is now called X. You can find me there if you want to send me a DM, it's a useful way to get in touch with me. Or you can email us, podcast@stackoverflow.com. Or you can leave us a rating and a review if you like the show, because that really helps.

RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. You can find it at stackoverflow.blog. And if you want to reach out to me, you can find me on Twitter @RThorDonovan. 

VU I'm Vlad, and I work at Siemens Healthineers for the Teamplay Digital Health Platform, which is the cloud platform for digital services by the company. And I recently spent a lot of time with site reliability engineering in order to make the platform more reliable. And I can be found on LinkedIn. If you just look for Vlad Ukis then you'll be able to find me easily. And on LinkedIn, I actually regularly publish summaries of my book chapters, so if you don't want to read the entire book, then you'll be able to find just chapter summaries on LinkedIn.

BP Great. All right, everybody. Thanks for listening and we will talk to you soon.

[outro music plays]