The Stack Overflow Podcast

Dev, meet Ops. Ops, meet Dev.

Episode Summary

On today's episode we chat with Tom Limoncelli, a site reliability engineering manager at Stack Overflow. Tom talks about his time at places like Bell Labs and Google, how he creates runbooks, and the secret to building a healthy relationship between developers and operations.

Episode Notes

You can check out more of Tom's work and some of his books on his website, Everything SysAdmin

Tom also wrote a great blog post for our site that explains his method for crafting a positive feedback loop between Dev and Ops using real-time documentation.

You can find Tom on Twitter and check out his books on Sys Admin and Cloud System Administration.

Episode Transcription

Tom Limoncelli It's like telling someone that their baby's ugly, but sometimes you have to do that. But you do it in a concerned caring way that says, "We're gonna sit with you for a month, maybe two months, we're gonna help you change the way you do testing. And you know what? Afterwards--"

Paul Ford Put some makeup on the baby! We'll save it, we'll save the baby. It'll be fine.


Ben Popper Hello, everybody. Welcome to the Stack Overflow Podcast, whatever time zone or region of the world you're in. We're here to chat about software and technology, what it's like to be a coder, the joys of programming, and all things in between. I'm Ben Popper, Director of Content here at Stack Overflow. And I'm joined by my wonderful cohost, Paul Ford. Good morning, Paul.

PF Good morning, Ben! Here we are! Here we are ready for some more technology.

BP Exactly. Who are you? And what do you do? Let's kick off the show.

PF Ah, we're doing it? We're gonna do it? Is this a real podcast now?

BP This is a real podcast.

PF My name is Paul Ford. I'm the cofounder of a software consultancy called Postlight. And I like to think about software deep, deep down just everyday all day. It's weird actually. It's confusing, upsetting to my family. 

BP Sits in the chair with a ReMarkable 2, reading old Linux notes.

PF It's actually that bad. I was reading the LISP 1.5 manual the other day. However, you know, my children love it. I read it to them and it helps them get to sleep. Alright, so this is not about me. And, I mean, granted, this show is my form of therapy. But this is not about me.

BP So when I say Paul, when I say SRE, what, what does that make you think? What's the first thing that jumps to mind?

PF Well, you know, when I first heard it, I'm like, what is that all about? And I just thought it had something to do with SERPs, which are search engine result pages. But in the semantic, you know, cognitive space, I wasn't too far off, because it is a role very associated with Google. But I believe, I believe, I believe it stands for Site Reliability Engineer.

BP I think that's right. Today, we have an expert in the area, Tom Limoncelli, who is the SRE manager at Stack Overflow. Tom, welcome to the show.

TL Thank you very much.

PF Wait, did I get it right? Every time I do this. I'm like, oh! [Ben laughs]

TL Yes, you did!

PF So that's, we've started off really well. We're doing good.

BP We're on a roll.

PF Mmhmm.

BP Tom, let's say you met somebody who was non technical at a dinner party and you had to explain your job. How do you explain it to people?

TL Okay, so SRE stands for, as we said, Site Reliability Engineer or Engineering, depending on the context. And the reason I like that name is that it focuses the mission on reliability of the website, because reliability encompasses a lot of different things. A lot of people think, oh, that means when the website goes down, you rush to bring it up. But that would be like a doctor being concerned only with patients that are dead. Doctors work on patients when they're sick, in the hope of preventing death, right? So we're all about detecting when the patient's sick and preventing death. And when the website does go down, you know, obviously, that's gonna happen. But we want that to be the exception. So what makes websites unreliable? You know, it could be faulty hardware or technology problems, but it could be bugs, right? So part of the SRE job is how do we roll out new software quickly and efficiently so that bugs get fixed, right? If you fix a bug, you have to deploy it, right? Otherwise, it's not so fun. So SRE deals with capacity planning, because if you run out of capacity, that's a reliability problem. It deals with bug fixing, security, and all the things that you think about operations people doing.

PF Alright, so Tom, give us the fun version of your LinkedIn. I mean, there's a lot here: you were at a giant internet company, you've written things. Give us the path. Did you start as a developer? Did you start in SRE? How'd you get here?

TL I have a weird origin story. I actually have a degree in computer science, which is--

PF That is weird

TL Rare in operations and in the industry. But I got that before, I graduated college in '91. And you know, object-oriented programming was this new weird thing that was casually mentioned to us senior year. Even though my degree was in computer science, I really desired being in operations. And what made my career work well is I was the operations guy that knew how to code, and that always set me ahead. Even when I was doing pure network operations, like, you know, routers and switches kind of stuff, I would code up ways of configuring our routers. So I've been at a lot of companies. I was at a CAD company right out of college, but then I was at Bell Labs for seven years. And I like to say I was sysadmin to the stars, like a lot of the people that wrote my textbooks were my users. You know, Dennis Ritchie, you know, Rob Pike, you know, all these people. And then I worked at a small security startup with Bill Cheswick, called Lumeta, and then I was at Google for seven years. And then I've been at Stack Overflow for eight years.

PF Alright, very sensible path. So I mean, the answer is simple. Just go get a job at Bell Labs. And here you are.

TL Yeah, I don't think I'm the smartest person. But a lot of smarts rubbed off on me just by being around smart people at Bell Labs, and I'm so grateful for that. And that actually led me to writing the books that I wrote. Also, the trick to writing a good book is to find coauthors that are smarter than you. A+ in that category. Chris and Strata have been, you know, my writing partners on some of my biggest book projects. So yeah, those are the secrets to my career.

PF Let's imagine. Okay, so obviously, you're meeting with developers, lead engineers, people like that. What are those meetings like? Because they've come in, and they're like, "I wrote my API, it's pretty good. It's gonna scale. Tom, what do you need?"

TL Exactly. So we ask questions like, "What are you going to do when this fails?" And they might say, "Well, this isn't going to fail." Well, no, the underlying hardware could fail. Everything fails eventually. And, sadly, it's kind of a downer, because in SRE, you have to always be thinking about the millions of ways that things can fail, and expressing that to developers in a non-insulting way. Because it's not their fault.

PF Sometimes, sometimes, it's their fault. Sometimes the developers, okay, well, okay, please keep, finish your thought.

TL You know, there was a time where we asked, "What if this adjacent service goes down?" They said, "Well, there's no relationship between the two." Okay, that's fine. So next week, we're going to take that other thing down intentionally, and make sure that your system is resilient to that. And they said, "We're not concerned with that at all, because we know they're totally unrelated." Well, we took the adjacent service down, and their service did go down. It turns out there was an issue. And we didn't do that to say neener, neener, you're wrong. We did that because we say, hey, isn't this great? Every time we find a bug we celebrate, because if we can find it now and fix it, that's so much better than finding it at 4am.
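A minimal sketch of the kind of intentional outage test Tom describes here, in Python. Everything in it is hypothetical: `AdjacentService`, `OurService`, and `survives_outage` are made-up names, and a real test would take down a real dependency in a controlled window rather than flip a flag. It only illustrates the idea of verifying a "these are unrelated" claim instead of trusting it.

```python
class AdjacentService:
    """Stand-in for a dependency the team believes is unrelated (hypothetical)."""

    def __init__(self) -> None:
        self.up = True

    def fetch(self) -> str:
        if not self.up:
            raise ConnectionError("adjacent service is down")
        return "fresh data"


class OurService:
    """Stand-in for the service under test (hypothetical)."""

    def __init__(self, dependency: AdjacentService) -> None:
        self.dependency = dependency

    def handle_request(self) -> str:
        try:
            return self.dependency.fetch()
        except ConnectionError:
            # Degrade gracefully instead of failing the whole request.
            return "cached data"


def survives_outage(service: OurService, dependency: AdjacentService) -> bool:
    """Take the dependency down on purpose and check that requests still succeed."""
    dependency.up = False
    try:
        service.handle_request()
        return True
    except Exception:
        return False
    finally:
        dependency.up = True  # always restore the dependency afterwards


dep = AdjacentService()
print(survives_outage(OurService(dep), dep))
```

If `handle_request` did not catch the `ConnectionError`, the check would report a failure during the planned test rather than during a 4am page, which is the outcome Tom says the team celebrates.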

PF Every service on the web loves all the other services, and they form a beautiful tight-knit community. And it is true, you take one away, and the whole thing just, you know, everybody thinks it's very redundant. It's not. Let me ask you to contrast SRE with QA, right? Because it feels like it encompasses QA, is connected to QA, and heads more towards a kind of DevOps. Like, where do you live in that ecosystem?

TL Sure. I think the difference between QA and SRE, in the kind of testing that SRE does, is this: in QA, I believe every time QA finds a bug, you should celebrate, because you've prevented a problem from reaching production. When SRE finds a problem, generally, that means you found something that was missed in QA. So sometimes, the solution to an operations issue is the SRE needs to change a procedure or make their code a little better, whatever. But often, the work of the SRE means shifting left, going deeper into the pipeline and sitting down and saying, hey, QA, we've noticed this problem, could you add this to your tests? Or going even deeper into the pipeline and talking to the developers and saying, you know, your testing methodology isn't very good. And you find polite ways to say that. At previous companies, I've seen situations where the SRE team said, you know, of all the reliability problems we've had, a big chunk of them can be summarized as this development team just doesn't have good testing methodology in general. And so what they did is, for a month, they moved and sat with the developers and rebuilt their testing infrastructure. Which was kind of scary at first, because it's like telling someone that their baby's ugly, but sometimes you have to do that. But you do it in a concerned, caring way that says, we're gonna sit with you for a month, maybe two months, we're gonna help you change the way you do testing.

PF Put some makeup on the baby! We'll save it, save the baby, it'll be fine.

TL And afterwards, you're going to be so much happier. Because it's not just reducing operations pain. When you reduce operations pain, you reduce everyone's, right? And the ideal SRE paradigm is you talk about having a shared responsibility for reliability, or a shared responsibility for whatever your metric is: sales, reliability, uptime.

PF You know, what's interesting to me as I'm listening: the olden days of advanced web development would be, okay, I'm an engineer. And maybe I'm even building an API. We're not that yield. I'm building a nice web API, it talks REST, and you're going to be able to put your name in here. And then QA comes along and says, I tried to upload an image and it accepted it. So unless you want my name to be an image, boy, you got a problem. Oh, thanks, QA. And then it would go out into the world and everything would still melt. Right, and so now we need someone to kind of see the bigger picture, the meta picture, and interrogate the whole process. Because that's what it sounds like you're doing to me, which is sort of like, we have to deal with really broad unknowns. Because we're dealing with large cloud systems, we're dealing with the behaviors of vendor systems, our systems, APIs, platforms, whatever. And we need a more holistic way of approaching this rather than "we ran the tests, everything passes."

TL Yes. And a lot of that knowledge is stuff you could only gain from having been there. So an anti pattern actually would be, everyone learns from their mistakes. So you'd say no, learning from your mistakes is a good thing. Actually, I look at that almost as an anti pattern. Because what I would rather see is I learned from mistakes, and share that knowledge all over the place. So everyone gets smarter. And now I'm preventing mistakes.

BP Yeah, this brings us to a great blog post that you wrote this week, Tom. So, you know, you talked a little bit before about the difficulty sometimes of telling a developer, hey, you know, I think we might have a problem here, or you might need better testing, or are you sure you don't need this service, and then kind of showing them where an error is. So the blog post is talking about how to create a positive feedback loop between ops and development, and how to create a runbook. For people who don't know, what is a runbook? And what is the function of that in the SRE world?

TL Sure, a runbook is basically the instructions for how to handle a given situation. So you know, in our monitoring system, we might have, let's say, 200 different alerts. And for each of those alerts, you need a runbook that says, you know, if you get this alert, do the following things to bring the system back to normal. And you could also have runbooks for other things. And the term comes from, I believe, the accounting world, where they have runbooks that are, you know, how operational things will be dealt with. But the problem with runbooks is how do you find balance? Like, should a runbook be 1000 pages detailing, you know, every possible situation? Or should it be, you know, super short? You don't want it too short. You don't want it too long, because if it's too long, you've wasted time writing documentation that wasn't needed. And if it's too short, then your operations people don't know what to do in a given situation, right? The article is fundamentally about how do you find that balance. And my strategy there is, I don't want to be the manager that says all runbooks should be five pages. I mean, that's like, you know, you're back in high school, and writing assignments have to be, you know, five pages long. What if I can say it in four pages, right? There is no one-size-fits-all solution. What I'd rather do is create the environment where there are certain dynamics that lead us to runbooks being short when they need to be short and long when they need to be long.

BP I had argued that this should be the title of the blog post. The note in the runbook was: This should never happen, if it does, call the developers. Yeah, why not just make every note that that's no problem, nothing should go wrong. And when it does just call the people who wrote it?

TL Well, there's a smidgen of truth there. Because I believe that your runbook, first of all, should be created by the person responsible for the service, which usually is a developer, but often it's an operations person that creates a new alert rule and has to write a runbook that says, when that alert happens, here's, you know, the 15 steps or whatever. But err on the side of being brief. Here's the three things that we think you should do, and then augment the runbook over time.

PF Let me ask a real basic question. Because what I have found is it's very, very hard to get people to read and pay attention to things. How do you create documents that people actually use, apply and internalize? Because God, I wish someone would tell me.

TL Well, so the easy answer is, in the operations world, if you get an alert, that alert should include a URL that links to, on my team, our documentation repo, which is a Stack Overflow instance.

PF I mean, it better be! 

TL Well, okay. So to be honest, I have to say it is, unless it is related to how you would bring the website up--

PF No, it shouldn't be yes. Of course, absolutely. If you can't access your documentation, because Stack is down, right? That is a really bad scene. Yes, smart, good disaster plans.

BP Those are written on stone tablets, that Tom keeps locked in a sub basement.

PF And then and then when the site goes down, he just shows up with them in each hand.
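One way to picture Tom's point that every alert should carry a link to its documentation is the short Python sketch below. It is an illustration under assumptions, not how Stack Overflow's monitoring actually works: the `Alert` class, the `format_page` function, the alert name, and the `docs.example.com` URL are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str         # hypothetical alert rule name
    summary: str      # one-line description shown to the on-call engineer
    runbook_url: str  # link to the runbook; the URL scheme used below is made up


def format_page(alert: Alert) -> str:
    """Render the text of a page so the runbook link is always included."""
    if not alert.runbook_url:
        # Refuse to page without a runbook link, since the link is the point.
        raise ValueError(f"alert {alert.name!r} has no runbook URL")
    return f"[{alert.name}] {alert.summary}\nRunbook: {alert.runbook_url}"


alert = Alert(
    name="disk_space_low",
    summary="Less than 10% free on /var",
    runbook_url="https://docs.example.com/runbooks/disk_space_low",
)
print(format_page(alert))
```

As Tom notes, the runbooks for bringing the site itself back up would need to live somewhere that does not depend on the site being up.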

TL Well, the problem I'm trying to prevent, by the way, is this. Think about it from the operations point of view, or from the people point of view. If you're in operations, you're on call, and you get paged, and maybe you're junior and you're inexperienced and you're very nervous, like, oh my god, what am I going to do here? And so you want the documentation to be very complete. You want to be told--

PF It's a recipe, right? You want that sort of like okay, put a cup of flour in there. And it's all gonna be fine, right?

TL But think about it from the developer point of view, the person who wrote the document, a developer or another SRE. No one likes writing documentation. Let's admit it. No one likes that. So they want to write as little as possible. And operations wants it to be as detailed as possible. So how do we find this balance? And my suggestion is, you want to use something that's editable. And, oh, sorry, I forgot one more aspect of this, which is, when the operations people don't know what to do, they escalate it to developers, right? Well, developers hate to be woken up at 4am. They want as few escalations as possible.

PF They hate being woken up at 10am! [Tom & Ben laugh]

TL So the way we do this is we construct the runbook. First of all, we construct it as a bullet list, mostly because people hate writing documentation, but for some reason, they're okay with writing a bullet list. Okay, whatever. It's a Jedi mind trick that gets people to write documentation. And then the last bullet in the bullet list is: if you got this far, and the problem hasn't been fixed, escalate to the developers. So now the operations people, who might fear escalating too much, know they've been given permission. It says right there in print, at this point, you can escalate it. And if developers feel that they're being escalated to too much, well, physician, heal thyself, right? If you're complaining, you can look at the last, you know, couple months of alerts, look at the escalations, and update those documents, you know, augment them so that you'll be escalated to less. And this is that feedback loop where the documents start short, but they grow to the right size. And if people feel they're being escalated to too much, they can fix it, you know, or if they love writing documentation, they can write it.
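The feedback loop Tom lays out can be sketched in a few lines of Python. This is a hedged illustration, not his team's tooling: `make_runbook`, `render`, `noisiest_runbooks`, the example steps, and the escalation log are all hypothetical. The sketch shows two things: every runbook is a bullet list whose last bullet grants permission to escalate, and a count of past escalations tells developers which runbooks are worth augmenting.

```python
from collections import Counter

ESCALATE = "If you got this far and the problem is not fixed, escalate to the developers."


def make_runbook(steps: list[str]) -> list[str]:
    """Return the runbook bullets, always ending with the escalation step."""
    return [*steps, ESCALATE]


def render(bullets: list[str]) -> str:
    """Render the bullets as a plain bullet list."""
    return "\n".join(f"- {b}" for b in bullets)


def noisiest_runbooks(escalated_alerts: list[str], top: int = 3) -> list[tuple[str, int]]:
    """Count escalations per alert so developers know which runbooks to augment."""
    return Counter(escalated_alerts).most_common(top)


runbook = make_runbook(["Check free disk space.", "Rotate the logs."])
print(render(runbook))

# Hypothetical escalation log from the last couple of months.
log = ["disk_space_low", "queue_backlog", "disk_space_low", "disk_space_low"]
print(noisiest_runbooks(log))
```

Starting with two or three bullets keeps the initial writing cost low, and the escalation count points at the runbooks that have earned more detail.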

BP And I liked one other thing you wrote, which is that if you're going to read the documentation, just open it in edit mode, right? Or in, you know, whatever your text editor or Markdown of choice is. That way, you know, you're sort of actively in that mental mode to fix it, if you see something.

TL Always be editing, always be improving. If you're the operations person that got paged, if you see a comma out of place, or something bigger, like, hey, these three bullets, I could combine them as one PowerShell statement. Cool. Do that. And so many times I've seen people say, oh, yeah, yeah, well, on Friday, I'll have time to do these updates. No, that Friday never comes, right? Do the edits live, even if it just means adding a note that says, I didn't understand this, I'll fill it in later. Because that is better than forgetting when Friday comes to do that.

PF I love it. Because nerds collectively, this includes me, we're so noodgy. Stack Overflow is literally a side effect of nerd noodginess. Like just, "I know how to do this." But when it comes to documenting their own stuff, right, and this is everybody, it's like people would rather update questions, or fix Wikipedia, than their own documentation. And that's all of us. Tell me a little bit about your team. Like, how does Stack do this? How big is SRE? What kind of things are you managing? Like, what does your world look like?

TL We actually have two SRE teams, because one handles the public properties that like, and And the other handles the private Question and Answer experiences, the enterprise product that we offer. And both teams are about four or five people each. And...

PF What keeps you up at night? Actually, that's what I want to know.

TL The allegory of what keeps me up at night is mostly how do I organize and structure our work so that people can be most productive? I'm always concerned. Like, right now, I feel like we could do Kanban so much better. And I feel that that would help the team. And so that's what I'm gonna be spending the next couple of weeks doing. Or how do we handle when we do get alerted? Right now, everyone jumps in to help, which is fantastic, and a great culture. But on the other side, wouldn't it be great if one person worked on it, and everyone else could just stay working on their projects? I think that would be more efficient. So in that light, what I'm saying is, people are good at making their own job efficient. As the SRE manager, what keeps me up at night is thinking about how I can create a structure where the team works more efficiently, like the examples that I just mentioned.

PF So what if someone wanted to move from developer to SRE, what advice would you give them?

TL First of all, I think that's a great move, because developers bring a certain kind of knowledge to the operational world that is in great need. So first of all, they understand deeply the architecture of the system, but they also bring their coding skills. So when we, you know, automate something, I've seen a lot of good automation in the SRE world, and I've seen a lot of great automation that really required the skill level of a full-time developer. In fact, I've also seen great success, not at Stack, though I'd love to introduce that here, with developers taking like a six-month rotation in SRE, and then going back to their development team at the end of the six months, bringing all that operational knowledge so that their future design decisions are influenced by what they learned from their rotation.

PF Alright, so that's it. Everyone, listen to this, go explore it, explore your career options.

BP Yeah, do your rotation. It's like your military service, you know, just step into the fire for a little bit and see what you learn. I like it.

TL Yeah. And there's some really good books, like there's a classic called Web Operations. It's by Jesse Robbins and John Allspaw. And the book came out a long time ago, but I think it's still relevant. If I can plug, I wrote a book called The Practice of Cloud System Administration, which is my take on it. Actually, the first half is more for developers, and the second half is more for operations, given the DevOps thing.

BP And, Tom, just one question, I guess. In the time that you've been here, the years you've been here, how have things changed? Like, what were things like when you joined? And what are things like now at Stack Overflow in terms of how we're set up, or the hardware or software? Like, for you working in this position, what are the things that have fundamentally shifted?

TL Well, eight years ago, Stack Overflow was a much smaller company, with a much smaller number of developers. I mean, the cloud happened in these eight years. And even though, you know, some of our services run in data centers, our newer stuff is in the cloud. So that's been a big change. We've had larger teams, so we've had to manage things differently. And I think we've also gone from an engineering focus to more of a product-led focus, which I think benefits the customers and communities significantly.

PF Well, that is, I think, for anyone listening, you know, you got it, you got the little primer, you know what it takes, you've got the two books to read. One was written by Tom, and you know, just go in and tell your developers how to run things. That's what you get to do as an SRE. That's what I like about it. That's what I'm gonna do. I'm gonna go back and tell some developers what they need to do. You know, look, I turned off that server. Now you better figure it out.

BP Well, Tom, thank you so much for coming on. And thank you for writing the blog post. We'll put it in the show notes. And yeah, we'll put in a few links to Stack Overflow for Teams, if people are interested in how they can do something similar with their documentation and their runs. Is there like a verb version of the runbook? No?

TL To run-be-vise something? [Paul laughs]

PF That's just, that's just, don't ruin it. So if somebody wanted to follow up, like learn a little more, find your books, get in touch with you, what would you suggest?

TL I have a website called Everything SysAdmin!

PF Hell yeah! How have I not? I should be on here every day. This is like, I'm looking at it now. It's my favorite website.

TL I haven't updated it recently.

PF You know what, that just makes it better. No, I'm serious. Like, I just like to see a website with some good, what look to be jQuery tabs. I'm just like, yep, I know where I am. And I like being here. That's what I'm gonna say.


BP There are no good lifeboats this week. But I'll read something from Server Fault. This question was asked not too long ago, and it's got 25,000 views: "debug1: read_passphrase: can't open /dev/tty: No such device or address" when trying to connect through SSH. This is a hot question, still on the network, still causing people problems, and it has an answer that is accepted. I'll share it in the show notes in case this has happened to you.

PF And then, you know, the one that I read the other day that was just delightful. On Super User: why are tar.xz files 15 times smaller when using Python's tar library compared to macOS tar? And it's just like a fun little lesson in how compression actually works, and how you need to think about things, so check that one out. If you're like me, and I just love a good compression story, can't help it!

BP Alright, everybody. I'm Ben Popper, Director of Content here at Stack Overflow. You can always find me on Twitter @BenPopper and you can always email us

PF I'm Paul Ford, friend of Stack Overflow. Check out my company, Postlight. We're hiring, hiring, hiring, and just check us out, we're nice.

BP And Tom, if people want to find you on the internet and you want to be found where should they look?

TL On Twitter, @yesthattom, but I don't do a lot of technical stuff there, mostly rants. And my website is Everything SysAdmin.

BP Alright everybody. Thanks for listening and we'll catch you soon!

PF Bye!