The Stack Overflow Podcast

Who owns this outage? Building intelligent, automated escalation chains

Episode Summary

If your organization is running code on a production server 24/7, you’re going to need a process to handle when that code, or the infrastructure it runs on, fails. No code is bug free, so failures will happen. That means that your SREs and developers are going to have to spend some time on call, ready to respond when the application breaks down. On this sponsored episode of the podcast, we talk to Eric Maxwell, a solution architect at xMatters, about automated, intelligent escalation chains.

Episode Notes

Maxwell took a winding road to get to where he is. After a computer engineering education, he held jobs as a field support engineer, product manager, SRE, and finally his current role as a solution architect, where he serves as something of an SRE for SREs, helping them solve incident management problems with the help of xMatters.

When he moved to the SRE role, Maxwell wanted to get back to doing technical work. It was a lateral move within his company, which was migrating an on-prem solution into the cloud. It’s a journey that plenty of companies are making now: breaking an application into microservices, running processes in containers, and using Kubernetes to orchestrate the whole thing. Non-production environments would go down and waste SRE time, making it harder to address problems in the production pipeline.

At the heart of their issues was the incident response process. They had several bottlenecks that prevented them from delivering value to their customers quickly. Incidents would send emails to the relevant engineers, sometimes 20 on a single email, which made it easy for any one engineer to ignore the problem—someone else has got this. They had a bad silo problem, where escalating to the right person across groups became an issue of its own. And of course, most of this was manual. Their MTTR—mean time to resolve—was lagging.

Maxwell moved over to xMatters because they managed to solve these problems through clever automation. Their product automates the scheduling and notification process so that the right person knows about the incident as soon as possible. At the core of this process was a different MTTR—mean time to respond. Once an engineer started working to resolve a problem, it was all down to runbooks and skill. But the lag between the initial incident and that start was the real slowdown.
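To make the difference between the two MTTRs concrete, here is a minimal Python sketch. The incident timestamps and all names in it are invented for illustration and do not come from xMatters or the episode; it only shows that time to respond is one slice of time to resolve.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (detected, first response, resolved).
INCIDENTS = [
    (datetime(2023, 1, 1, 2, 0), datetime(2023, 1, 1, 2, 30), datetime(2023, 1, 1, 3, 0)),
    (datetime(2023, 1, 2, 14, 0), datetime(2023, 1, 2, 14, 10), datetime(2023, 1, 2, 14, 40)),
]


def mean_minutes(deltas):
    """Average a series of timedeltas, expressed in minutes."""
    return mean(delta.total_seconds() / 60 for delta in deltas)


# Mean time to respond: detection until someone starts working on it.
mean_time_to_respond = mean_minutes(
    responded - detected for detected, responded, _ in INCIDENTS
)
# Mean time to resolve: detection until the incident is closed.
mean_time_to_resolve = mean_minutes(
    resolved - detected for detected, _, resolved in INCIDENTS
)
# Fraction of the total resolution time spent waiting for a response.
response_share = mean_time_to_respond / mean_time_to_resolve

print(f"respond: {mean_time_to_respond:.0f} min, resolve: {mean_time_to_resolve:.0f} min")
print(f"share of resolution time spent waiting: {response_share:.0%}")
```

With these made-up numbers, 20 of the 50 minutes go to waiting for someone to pick the incident up, which is the portion that faster notification can shrink.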

It’s not just the response from the first SRE on call. It’s the other escalations down the line (to data engineers, for example) that can eat away time. xMatters has worked hard to make escalation configuration easy. The product handles not only who’s responsible for specific services and metrics, but also who’s in the escalation chain from there. When the incident hits, the notifications go out through a series of configured channels; maybe it tries a chat program first, then email, then SMS.
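The channel-by-channel fallback described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not xMatters code: the `Channel` class, the wait times, and the `send` and `acknowledged` callbacks are all hypothetical stand-ins for a real notification system.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Channel:
    name: str
    wait_seconds: int  # how long to wait for an acknowledgement before moving on


# Hypothetical ordering: chat first, then email, then SMS.
CHANNELS = [Channel("chat", 120), Channel("email", 300), Channel("sms", 120)]


def notify_with_escalation(
    channels: List[Channel],
    send: Callable[[str], None],
    acknowledged: Callable[[str, int], bool],
) -> Optional[str]:
    """Try each channel in order; return the channel that got a response, if any."""
    for channel in channels:
        send(channel.name)
        if acknowledged(channel.name, channel.wait_seconds):
            return channel.name
    return None  # nobody answered on any channel; escalate to the next person


# Simulated engineer who only reacts to SMS.
sent: List[str] = []
answered_on = notify_with_escalation(
    CHANNELS,
    send=sent.append,
    acknowledged=lambda name, wait: name == "sms",
)
print(sent, answered_on)
```

Passing the send and acknowledge steps in as callbacks keeps the ordering logic separate from any particular chat, email, or SMS provider.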

The on-call process is often a source of dread, but automating the escalation process can take some of the sting out of it. Check out the episode to learn more.

Episode Transcription

Eric Maxwell And that's where xMatters comes in. Right? You have that escalation, I mean, there's the escalation of people, right? Because we want to target the right people, build that sense of responsibility. And not necessarily to build blame, right? We want to target someone so that they feel responsible. And if they need to escalate, then we give them the option to be able to escalate to someone else, because they're busy or that, you know, they're just not able to get to it, whatever that is, right? They can, you know, give them the tools they need. But that escalation even in the devices, right? So if it does notify me, I can have it set up with, okay, well, you know, send me an email and if I don't respond in a couple of minutes, then, you know, maybe send me a text message. And then if I don't respond to that, call me. So there's these multiple levels of escalation.

[intro music]

Ben Popper Hello everybody, welcome back to the Stack Overflow Podcast. I am Ben Popper, Director of Content here at Stack Overflow. And I'm joined as I often am by my co host, Ryan Donovan. Hi, Ryan.

Ryan Donovan Hey, Ben, how are you doing today?

BP I'm good. So today is a sponsored episode brought to us by the fine folks at xMatters, which is an Everbridge company. And we're going to be talking about SRE, planning, runbooks, and all the good things that come with responding to late night emergencies when the servers fall over, right?

RD Oh, that's fun.

BP Yeah, that's fun. So I'd like to welcome our guest today, Eric Maxwell, who is a Solution Architect at xMatters. Eric, welcome to the show.

EM Thanks, Ben. Glad to be here.

BP So let's start out just tell me a little bit about yourself, how you got into the world of software and technology.

EM So my career kind of took various paths. My background and education was in computer engineering, but I actually took kind of a different path. I went more into, as a field engineer, support, project manager. And then eventually I got into, I did a little development in there, more around automation around operations, things like that. Then I moved into product management. And then from there, I was at a company that we were moving into the cloud, we had got acquired. And then after that point, things started to move quickly. And I was in a development team, right, as a product manager. And as we were moving to the cloud, we had to take, we were starting to take more responsibility of that, especially production operations. And there was this need, and I moved into an SRE role and managing an SRE team that was responsible for our solution.

BP So that makes a lot of sense. There's sort of that moment of organizational change of digital transformation getting acquired and going to the cloud. But what would motivate someone to leave a cushy project management role to take on the stress of being an SRE? What kind of person do you have to be to want to do that?

EM I would say, I mean, I've been a very technical person, like I said, my background is in engineering. And I've always enjoyed it. And I mean, being a product manager, I like that. And I would say it was more on the technical side, I always wanted to, you know, always understanding of what was going on. And I had this opportunity to stay within the team on a solution I was already working on and moving back into that technical role. And it just sounded interesting to me and exciting. So I took that took that opportunity.

BP Yeah. So tell us, I guess, yeah, just a little bit about the years you spent hands on as an SRE. You know, how did you learn that role?

EM As I was saying, we were moving through this digital transformation, and our company and moving to the cloud, just accelerated--

BP On prem to cloud. Yeah.

EM Yeah, just accelerated that. And it brought many opportunities. And there was a lot of learning, right, from myself and the team just moving into this. We're not just the development team anymore. We're now developing and operating our solution. And we're all responsible for that. And how do we manage that? And how can we be efficient at it? The answer is a lot of learning how to automate and how to use the tools at our disposal, especially moving to the cloud with all the options that we had. Right? So it was a lot of learning and research in the beginning.

RD So when you move back to being an SRE, you know, mostly cloud based, right? What sort of things were you working with? What was your new toolset there?

EM Yeah. So I mean, originally we were in an on prem solution moving over, right? And originally, that's moving into just basically trying to migrate everything over. And that was just moving into cloud with VMs. Right, basically just kind of lift and shift as much as we could. But from there, we started to migrate. And we still had issues with--or not, I would say issues. There were things like deployments that still took an hour or whatever when we want to deploy a new version, and how do we bring that down? And moving to microservices, changing our architecture to a microservice architecture, moving to containers and Kubernetes and all those things, right. But then also with the management of that, how do we manage this operationally? You know, we're bringing down deployment times, but how do we bring down our incident resolution times, and not even just in production, but even in our non production environments? Because as we started to move more quickly, our non production environments became very important. Because if we slow those things down, that was slowing down our output, right, and our value to our customers. So how do we automate and reduce those bottlenecks that were causing us to, you know, slow down our incident resolution? And a lot of that was this, you know, sending emails and kind of the silos. Even though we're in this team, we still had silos. We had data engineers working on data pipelines, our backend engineers working on the APIs and things, and then our front end engineers. And how do we bring all of these together and find the right people and notify the right people when there is an issue and bring them in to resolve those issues quickly?

BP Yeah, you make a really interesting point there, which is that there may be as much value for a company in figuring out where the blockers and the bugs and the slowdowns in the pipeline are internally as externally. Obviously, if a customer is having a problem, that's a reason for everybody to jump up, and, you know, address it, or if a user is having a problem, but in terms of the value that the company might capture over the course of a year or five years, you know, if you can help your engineers get through testing, or Dev, all the way to prod, you know, much faster, because when an issue arises, they're able to see where it is or who should respond, that actually has a ton of value. It's not a place most people think about SRE traditionally, right?

EM Right. And I would say that's where we got really efficient in the production environments. We got really efficient at solving incidents and bringing in people. And people ask me, you know, on a day to day basis, where we focus on and most day to day, most of the issues we dealt with were non production environments. And I would say, as an SRE, we were still responsible for a lot of the operational pieces. Because I mean, we are running mirror environments of our production to run testing and all these things in, and if those environments go down, or they're having issues, then that's slowing down, like I said, bringing those things into production.

RD Even though they're not production environments, they still waste your time when they go down.

EM Exactly, they cost money, right? If you're stuck, you have people sitting waiting, that's just wasting money.

BP So the tools that you were using, you mentioned, you know, that can feel kind of siloed, you know, is sending an email to a person on one team, and maybe they start following it around trying to ping somebody on text or on a call or in a, you know, work chat, I guess that might be a natural transition is sort of like what you do at xMatters. Tell us a little about like moving from SRE to Solution Architect, and how now your role is kind of to help people who are in that SRE role you used to be in right?

EM So I mean, as far as you know, moving from that kind of, I would say, traditionally, it was always email, maybe some IM, we were using Slack, some teams were using, we started moving into MS Teams at the time as well.

BP Was it AOL IM?

EM Oh no, no, no. [Eric laughs] But, you know, it was still, I would say, still highly email or even walking down to someone's desk when someone needed something. But you know, those silos, like I said, we had these various, you know, we were one team, but we had various developers, engineers focused on certain areas of the product. And when certain things happened, you know, for example, if there was an issue in the data pipeline, the data engineers and things may get notified. And they knew about that, and part of that was, then there was really no automated way or automated process for everyone else to know. Like, we could notify everyone, but then everyone starts to get bombarded with notifications, and they start to ignore them. You know, the data engineers would get notified. But then the problem was, then they had to remember that, oh, well, we found this issue. And we need to let, you know, everyone else know, we need to let support know that we found this issue. And this is how long it's going to take. Because then customers start calling in if they notice things, and then it looks a lot better. You know, one is not to have a problem. But then it looks better if they call in and you say, well, we already know. And we're going to have this resolved in the next 10 minutes or 15 minutes or something, right?

BP So sort of like a smart chain of notifications and alerts and something that keeps everybody updated on what everybody else who needs to know does know?

EM Exactly. So automating that process and integrating not just with sending emails or sending SMS and text messages. And definitely xMatters does all of those things. But also integrating with your instant messaging tools, Slack, MS Teams, but even things like status pages, or even if you have proprietary type tools that you want to update and things like that we can integrate and add that into your process, automate those things, right.

RD I mean, that's where the dev workflow lives a lot of times is in that chat application.

EM Exactly. And on that note, I mean, that's taken off a lot as far as when it comes to IM, instant messaging, Slack, Teams. That's where communication is moving. I mean, it's been moving there for years, but it's now especially with a lot of people working remotely, it's become very, very important and people have figured out that email is not very efficient at those things because people kind of ignore it or don't get to it for a while.

BP Yes, Ryan and I are familiar with having to set aside time to work through your backlog of your inbox. But a Slack I usually get to in a day. A text message is what somebody sends me when I'm not responding on Slack.

EM Exactly. And that's where xMatters comes in. Right? Is you have that escalation. I mean, there's the escalation of people, one is targeting the right people, right? Because we want to target the right people build that sense of responsibility, and not necessarily to build blame, right? We want to target someone so they feel responsible. And if they need to escalate, then we give them the option to be able to escalate to someone else, because they're busy, or that, you know, they're just not able to get to it, whatever that is, right? They can, you know, give them the tools they need. But that escalation of even in the devices, right? So if it does notify me, I can have it set up with, okay, well, you know, send me an email, and if I don't respond in a couple of minutes, whatever, then, you know, maybe send me a text message. And then if I don't respond to that, call me. So there's these multiple levels of escalation.

BP Our CEO just sent out an email recently to the company, and it was about blameless accountability, which I think is sort of the concept you're describing here, you know, it helps people understand in a transparent way, you know, how responsibility flows. And you know, not to say, right, if something goes wrong, you're going to be the one to, you know, have to fall on your sword here. But just to say, like, this actually empowers you to know when you need to do the work. And when you can, you know, sort of ignore it and not have to let it take up your time.

EM And it goes to the fact that, and I know, we've all seen it, no matter your role, like if you send out an email or a notification to a group of people, everyone's gonna be like, Well, I'm busy and such as you know, somebody out of those 20 people--yeah, they'll take care of it. Right. But everybody thinks that, then nothing happens, right?

RD So now that you're sort of the SRE that helps other SREs, what are the sort of issues that people run into that you're solving for? And how do you measure how you solve a problem?

EM First, I'll say that most of the time, when I talk to a customer, they're still at that, you know, they may be in a transformation. And they're still trying to make their production kind of incident management processes more efficient, right? And I would say, that's probably where most people want to start, because that's the most impactful. And then we start to move into more of the development engineering processes and how we can help in this kind of non production. But right, so the most common measure is MTTR, which the traditional one is mean time to resolution, right. But there is kind of a subset of that, which is still MTTR, mean time to respond. Right, which is a subset of that resolution, right. And if we can reduce that mean time to respond, we're going to reduce the MTTR. And you'll be quite surprised in how many customers, if we can reduce that mean time to respond, that's actually a big bottleneck in a lot of cases that we see. Right? And it doesn't really matter if that's production, or even non production, that's, you know, that mean time to respond.

RD That's getting the text message out to the first person, right?

EM Right.

BP Getting somebody to actually start taking action, as opposed to the action itself is what takes up the time.

EM And that usually, and I'll say, even with production that starts with their MIM process, a major incident management process. So that's when there are highly, really, you know, critical, urgent issues that are happening in production. That's typically where customers want to start, right, because that's where they see the most bang for the buck when it comes to customer impacts. Right. And I was working with a customer recently, where that's what they dealt with, right? They had a process, and most customers have a process. The issue is, there's a lot of manual steps to that. And in this case, you know, someone maybe detects, it was a monitoring issue detected, that notified a group, and then they would put in an incident, and then they would request it to get, you know, upgraded to a major incident. And then a major incident manager would get, you know, the group would get a notification and someone would take ownership of it. And then they would go and pull up some spreadsheets they had on SharePoint, and look up who needs--this is the system that's having a problem or the service, and I need to find who I need to notify. And then they notify those people and they don't respond. So then they have to go look and see who the next person is to respond.

BP This is making my brain hurt.

EM Right? And then finally someone responds and it's like, Okay, now we've got to get everyone onto a bridge call and, you know, all these things to work on it, right? And so that's where we come in. One is automating those processes, making it very easy. You know, even if you still need human intervention, we can give one touch operation, one touch thing. So it's like, oh, we want to declare a major incident: one button kicks off your process. It starts to, you know, as far as we notify a major incident manager, in this case, we were notifying a manager, we were pulling from a pool that they had, right? And then we start automatically notifying and escalating to the people based on the incident and the service that was impacted, right? And we're handling that. Someone doesn't respond, well, we'll go on to the next person. So this major incident manager is not focused on, alright, let me go and wait, wasting time looking who it is I need to notify and then trying to remember, like, how much time has gone by, let me notify someone else. xMatters is handling all of that for you, automating that. But the other piece is in those notifications. I'll say that's kind of the next step: you build that escalation and build that responsibility, notifying and targeting the right people, but then also bringing in the information, as much information as possible without overloading them, that can help them resolve that issue. I mean, that might be querying systems before we go out, like maybe things like, you know, if it's an incident, do we have any other related incidents open for that service? Have there been any in the past? And have there been any changes in X amount of time that are related to this service, some deployment that's happened recently, or whatever, so that that's all there. There may be links to various things in other systems. So they're not having to go and figure out where, you know, where these logs are, wherever that is. It's all provided right there, in that notification, right up front. Makes it easy for them to find.
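As a rough illustration of the two ideas in that answer, escalating by service until someone responds and attaching context to the notification, here is a small Python sketch. It is not how xMatters is implemented; the service name, the escalation chain, the payload fields, and the `notify` callback are all assumptions made up for the example.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical mapping from a service to its ordered escalation chain.
ESCALATION_CHAINS: Dict[str, List[str]] = {
    "data-pipeline": ["data-oncall", "data-lead", "engineering-manager"],
}


def build_notification(
    service: str,
    summary: str,
    open_incidents: List[str],
    recent_changes: List[str],
    links: List[str],
) -> dict:
    """Bundle the context a responder needs so they don't have to go hunting for it."""
    return {
        "service": service,
        "summary": summary,
        "open_incidents": open_incidents,
        "recent_changes": recent_changes,
        "links": links,
    }


def escalate(
    service: str,
    notification: dict,
    notify: Callable[[str, dict], bool],
) -> Optional[str]:
    """Notify each responder in order; stop at the first one who takes ownership."""
    for responder in ESCALATION_CHAINS.get(service, []):
        if notify(responder, notification):
            return responder
    return None


# Simulated run: the first responder never answers, the second one does.
attempts: List[str] = []


def fake_notify(responder: str, notification: dict) -> bool:
    attempts.append(responder)
    return responder == "data-lead"


payload = build_notification(
    "data-pipeline",
    "Nightly load job is failing",
    open_incidents=[],
    recent_changes=["deploy of loader service two hours ago"],
    links=["https://example.com/logs/data-pipeline"],
)
owner = escalate("data-pipeline", payload, fake_notify)
print(attempts, owner)
```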

RD How about the scheduling part of this? I mean, I remember going out with a friend of mine, you know, 15 years ago, where he'd bring a laptop to the bar, because he had the Friday, Saturday 8-4 am shift.

BP What did that guy do? Who did he piss off?

RD Oh, you just got to do it. But it sounds like you have a pretty tight, complicated escalation chain. You know, I think scheduling your frontline guy is easy. It's the next folks that I think are the complications.

EM Yeah. I mean, as far as the escalation goes, right, and I'll say that's one thing that we've worked really hard on in our product, xMatters, to make that very easy to set up, right? And as far as scheduling goes, yes. So you have the escalation, but then the scheduling for that group. You know, within xMatters, we handle things like who's responsible. You set up your groups, you can tie those to services and things, right. But those groups hold information about who's on call, what's the escalation process. But when it comes to scheduling, you have the options of setting up your rotations and things like that. So xMatters automatically does that for you, you know, based on events or schedules, you know, however you want to do that. But then also, we make it very easy for someone to use, like, I'm going to be on vacation, or maybe I'm out sick. I can easily, right from the app, or if you need to go in the web interface, go and schedule that I'm going to be out, and you can designate someone that's going to replace you. And if not, it will just notify the next person, it will just basically make the next person in the escalation the primary if you're out, things like that. But yeah, so it makes it very easy.
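The rotation and absence handling described here can be sketched as follows. This is a simplified Python illustration, not the xMatters scheduling model: the weekly rotation, the names, and the `substitutes` mapping are hypothetical.

```python
from typing import Dict, List, Set


def rotation_order(members: List[str], week_index: int) -> List[str]:
    """Rotate the group so a different member is primary each week."""
    shift = week_index % len(members)
    return members[shift:] + members[:shift]


def effective_order(
    order: List[str],
    absent: Set[str],
    substitutes: Dict[str, str],
) -> List[str]:
    """Apply absences: use a designated substitute if there is one,
    otherwise drop the person so the next in line moves up."""
    result: List[str] = []
    for person in order:
        candidate = substitutes.get(person) if person in absent else person
        if candidate and candidate not in absent and candidate not in result:
            result.append(candidate)
    return result


group = ["ana", "ben", "chen"]
this_week = rotation_order(group, week_index=1)

# Ben is out sick with no replacement, so Chen becomes primary.
no_substitute = effective_order(this_week, absent={"ben"}, substitutes={})

# Ben is on vacation and has designated Dee to cover for him.
with_substitute = effective_order(this_week, absent={"ben"}, substitutes={"ben": "dee"})

print(this_week, no_substitute, with_substitute)
```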

BP You mentioned that, you know, not only does it try to have this intelligent sort of system of alerts and escalations. And as you just talked about scheduling for folks, but also to pass along relevant information. And as the scenario updates to sort of populate that information out to everybody. So in what ways does this take the place of a runbook? Or maybe integrate with folks who have like a runbook tool? And I guess, yeah, like, does it, you know, touch on other pipelines that may be running? You know, I think part of what this is all about is the move to the cloud and microservices and containers and not knowing what upstream or downstream is causing the issue. So maybe first talk a little about runbook, and then maybe bigger picture, like how it can understand, integrate, and then communicate about, you know, these various dependencies that you have in the cloud.

EM So one, yes, xMatters is not, it's not a runbook tool, we call it an orchestration tool. So we integrate with many different tools, and runbook tools being one of those, right? And I've worked with many of those, DevOps, Jenkins, you name it, and, you know, we could really integrate with it. So you can do things like automate: based on information coming in, we want xMatters to make a decision on, okay, we need to run this runbook, or maybe provide the user that's being notified with the option to do those things, right, to trigger a runbook, if you don't want it to automatically happen, right? Those integrations happen. We have a tool called Flow Designer, it's a visual low code tool. So we have a lot of built in what we call steps that are integrations to various tools. And basically, we have templates to start from, but you can easily just tie those things together. And they just pass information from one to the other, that you can use to build these automated flows and help build logic to make decisions on how you want your process to run. Right. So that can be, you know, just as basic as: I have a monitoring alert, and it's critical, so I want to go ahead and create an incident. Or it's actually low, it's a low urgency monitor alert, so I'm just going to send a notification and have someone just triage it and decide if an incident needs to be created. That's on the low end, and then, you know, we can move on up to runbooks and things like that, right?
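The decision logic in that example (critical alerts open an incident, low urgency alerts go to a person for triage, and a runbook is either run automatically or offered to the responder) might look like this in plain Python. Flow Designer itself is a visual tool, so this is only an analogy; the alert fields and action names are hypothetical.

```python
from typing import List


def plan_actions(alert: dict, auto_run_runbooks: bool = False) -> List[str]:
    """Decide what to do with a monitoring alert based on its severity."""
    actions: List[str] = []
    if alert.get("severity") == "critical":
        actions.append("create_incident")
    else:
        # Lower urgency: let a person triage and decide whether an incident is needed.
        actions.append("notify_for_triage")

    runbook = alert.get("runbook")
    if runbook:
        # Either run the runbook automatically or offer it as an option to the responder.
        verb = "run_runbook" if auto_run_runbooks else "offer_runbook"
        actions.append(f"{verb}:{runbook}")
    return actions


critical = plan_actions(
    {"severity": "critical", "runbook": "restart-loader"}, auto_run_runbooks=True
)
low = plan_actions({"severity": "low", "runbook": "restart-loader"})
print(critical, low)
```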

BP Where do you see this evolving in the next couple of years? And how do you think, you know, like a company like xMatters can be most helpful to clients? You know, when you talk about, like, sort of Intelligent Automation, does that involve some sort of, you know, AI or machine learning that's trying to predict what will happen based on data? And I guess, you know, thinking ahead, to sort of like as we talked about the increasing complexity of architecture for many services, you know, how will you be able to help folks as they build things that are increasingly sort of diffuse away from the monolith and towards the micro services?

EM Right, exactly. Things are getting more complex, micro services and containers and all these different things. Cloud, moving to cloud and all the options that you have. And you know, xMatters is constantly growing as a tool to handle those types of situations, right. As things get more complex, we build more and more features and things to help simplify and make those things easier to manage. But I think overall, your incident processes are, you know, as things get more complex, necessarily going to change. But we're here to help automate and to simplify that process for our customers, right. And then, you know, talking about AI and ML, those things are getting big. You know, we have lots of partners that we work with around that space, right, to help bridge that gap of the complexity and simplify those things. We have features that take in that information and help within xMatters to help simplify things within the automation process, incident or even engineering processes as well.

RD There you go, that's the SRE motto, automate all the things, right?

EM Yes, exactly. Exactly.

[music]

BP Alright, everybody. Well, thank you so much for listening. Eric, thanks for coming on. As I do at the end of every episode, I will shout out the winner of today's lifeboat badge. Thanks to Günter Zöchbauer for coming on and helping save a question from the dustbin of history: "Unable to locate the Android SDK." Okay, if you can't find it, we know where it is. So we'll find the Android SDK for you. I am Ben Popper. I'm the Director of Content here at Stack Overflow. You can always find me on Twitter @BenPopper. Email us podcast@StackOverflow. And yeah, if you like the show, leave us a rating and a review. Ryan, who are you? Where can you be found?

RD I'm Ryan Donovan. I'm a content marketer here at Stack Overflow. I edit the blog and the newsletter. You can find me on Twitter @RThorDonovan. And if you have a great idea for a blog post, email me at pitches@stackoverflow.com.

BP Eric, who are you? What do you do? Where can you be found online? And if folks want to learn more about xMatters, check it out or try it out, where should they go?

EM Eric Maxwell, I'm a Solutions Architect with Everbridge for the xMatters platform. You can always reach out to me at eric.maxwell@everbridge.com and go to xMatters.com and check out a free demo of the xMatters platform and start automating!

BP Alright everybody. Thanks for listening. We'll talk to you soon.

[outro music]