The Stack Overflow Podcast

Improving error monitoring with AI

Episode Summary

Tillman Elser, AI/ML lead at Sentry, joins Ryan for a conversation about improving error monitoring with AI and ML. They talk through the challenges of analyzing stack traces, the innovative use of embeddings to improve error grouping, the trial-and-error process of developing algorithms, and where Sentry’s AI capabilities are headed next.

Episode Notes

Sentry is application monitoring software. Explore the Sentry docs or get started in the sandbox.

Connect with Tillman on LinkedIn. You can also read his posts on the Sentry blog.

Listeners, how do you handle stack traces? How do you trace the root cause? Let us know at podcast@stackoverflow.com

Sentry user? The company would love to hear your feedback. Let them know what you think on Discord

Episode Transcription

[intro music plays]

Ryan Donovan Hello everyone, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I am Ryan Donovan, and today we're going to be talking about AI-related things, but a simpler, dare I say it, dumber use case than the large language models. My guest today is Tillman Elser, Engineering Manager for AI and ML at Sentry. Welcome to the show, Tillman. 

Tillman Elser Hey, it's great to be here. Super excited to talk about the stuff we're working on. 

RD Yeah, I'm excited too. I love the simplicity of this. But before we get into it, we want to get to know our guests at the top of the show. How did you get into software and technology?

TE Yeah, it's been a winding road. I majored in Computer Science and Economics and I was very torn between the two. I started out my career working at the Federal Reserve Board doing economic research, loved the coding and stats side of things, and was a little bit less interested in the economics side of things. From there, I had a very interesting role where I was actually forecasting the amount of interest in auctions for issuing treasury debt. It's an interesting process that a lot of people don't realize, but the way the government funds itself is through eBay-style competitive auctions where banks will literally say, “Hey, I want $8 billion of 30-year treasury bills and I want this interest rate.” What that means is that if we do a really bad job of calibrating supply and demand, let's say we try to issue way more debt than there's competitive demand for, then all of a sudden we're going to be paying a much higher interest rate, so it directly affects the interest costs to the government. So I was doing this huge project where I was forecasting essentially the competitive demand for these auctions, and I just loved the ML side of things but hated everything else about my job, including the government and all that stuff, so moving to tech was a pretty easy transition from there. I've been doing price forecasting, ad tech stuff, and now I'm at Sentry doing all the cool AI/ML stuff here. 

RD Yeah, you're doing AI/ML in terms of security, right?

TE Well, you can think of it that way. I mean, the way I would describe it is actually slightly different. So when you hear Sentry, it definitely sounds like we're talking about security, but the way I like to think about it is, first and foremost, Sentry is a tool for developers. So it's a tool for developers who are shipping code to production and they want to do so with confidence. They want to know that their code is working, and obviously, because nothing's perfect, when their code breaks, they want to have visibility into how it's broken. So that's really where Sentry is super valuable. You make your PR, you ship it to production, and then everything catches fire, and our job is to go in and kind of save the day and help you fix things. 

RD So I want to talk about the blog post you wrote that brought us here. You're applying embeddings to– is it stack traces? I love the kind of simplicity of that because anybody who's looked at a stack trace knows that it can be a winding road to sort of figure out what it is. Could you tell us about how that embedding works for stack traces and the sort of results that you get? 

TE Yeah, if you don't mind, maybe I'll even take a quick step back and just talk about the overall use case that we're talking about here. So Sentry is an error monitoring company. We are only useful to you as a developer if we're able to aggregate all the stuff that you're sending to us and present it back to you in a useful way. So let's imagine that in your software there's some API that's having a consistent error, and it's causing 40 million errors that you're sending to us. Sentry is useless if there are just 40 million different line items clogging up a table somewhere. The whole value prop of Sentry is that we're able to take all of these errors and aggregate them into this semantic concept of an issue. Like, here is one problem with your code, it's this one set of stack traces. Here's another problem with your code, it's this other set of stack traces. And so the idea here is, well, this seems like a very natural application of some of the recent advancements in embeddings models. We now have these super fancy BERT-architecture, transformer-based embeddings models, and what's really special about them is that they're able to extract the semantic meaning of large, complex strings like a full stack trace. And so that's the intuition: let's take this state-of-the-art embeddings model, let's build this representation of a stack trace that's really rich and actually captures all of the nuance that makes stack traces challenging to categorize, and now that we have this great representation, we can just plug it into a really dumb, simple approximate nearest neighbor algorithm to actually do the grouping. 
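
To make that pipeline concrete, here is a minimal sketch of embedding-based grouping. It is not Sentry's implementation: the model name, the similarity threshold, and the brute-force scan that stands in for a real approximate-nearest-neighbor index are all illustrative assumptions.

```python
# A minimal sketch of embedding-based grouping (not Sentry's implementation).
# Assumptions: the model name, the 0.90 threshold, and the brute-force scan,
# which stands in for a real approximate-nearest-neighbor index.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # any transformer encoder works here
SIMILARITY_THRESHOLD = 0.90                      # tuned empirically in a real system

groups: list[tuple[np.ndarray, list[str]]] = []  # (representative embedding, member traces)

def assign_to_group(stack_trace: str) -> int:
    """Attach the trace to the closest existing group, or start a new one."""
    emb = model.encode(stack_trace, normalize_embeddings=True)
    best_idx, best_sim = -1, -1.0
    for i, (rep, _) in enumerate(groups):
        sim = float(np.dot(emb, rep))            # cosine similarity (vectors are unit length)
        if sim > best_sim:
            best_idx, best_sim = i, sim
    if best_sim >= SIMILARITY_THRESHOLD:
        groups[best_idx][1].append(stack_trace)
        return best_idx
    groups.append((emb, [stack_trace]))          # nothing close enough: new issue
    return len(groups) - 1
```

In production, the linear scan would be swapped for an approximate nearest neighbor index so lookups stay fast as the number of issues grows.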

RD Yeah, the couple times I was asked to track down a 500 error using logs and traces, it's no fun. When you look at that trace, what are the sort of markers in there that go into building that embedding? 

TE Maybe let's take a step back and think about what goes into a stack trace. It's really a combination of a lot of different things getting pieced together. There's all of your application code, which is the stuff that you wrote that defines the business logic for how your software is processing data and doing whatever it is you're doing with your code. And then there's all of the system stuff, which is maybe lower-level stuff, maybe libraries that you're using. It's code that's being executed on some level, stuff that's happening, but it doesn't actually relate back to what you wrote and what you're trying to fix. So one of the first things that Sentry tries to do, and this goes back to your question of what's going into the stack trace, is categorize the frames in your stack trace. The intuition there is essentially that there's, generally speaking, less signal, less information in these system frames, the frames that are just telling you about what's going on in your computer, or whatever AWS or GCP server is running your code, than there is in your actual application logic. So we're basically first trying to filter that out, and that's the first step: filter out the stuff that we view as essentially noise through the lens of grouping. And then the second step is trying to represent what we're left with in a way that's cleaned up and useful for this embeddings model. There's a lot of cleanup that we do, getting rid of gobbledygook strings and then applying some business logic to handle recursion and endlessly nested exceptions. Because that was one of the learnings we had when we rolled this out: there are all sorts of crazy exceptions out there. You think of a stack trace as 10 frames that explain what's going on, and then the customer sends you 10 megabytes of random stuff. So first we try to filter out the stuff that doesn't matter. Then we try to represent the stuff in the most useful, cleaned-up way, and then finally we pass it to our embeddings model. 
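
A rough sketch of those two preprocessing steps, filtering system frames and cleaning up what remains, might look like the following. The in-app heuristic, the regex, and the recursion cap are assumptions for illustration, not Sentry's actual rules.

```python
# An illustrative sketch (not Sentry's actual pipeline) of the two steps described
# above: drop system/library frames, then scrub noisy tokens and cap runaway
# recursion before the text reaches the embeddings model. The in-app heuristic,
# the regex, and the repeat cap are assumptions.
import re

MAX_REPEATED_FRAMES = 3  # keep at most this many copies of a recursing frame
NOISY_TOKEN = re.compile(r"\b(0x[0-9a-fA-F]+|[0-9a-f]{8}-[0-9a-f-]{27,})\b")

def is_in_app(frame: dict) -> bool:
    """Crude stand-in for in-app classification: skip obvious library paths."""
    path = frame.get("filename", "")
    return not any(m in path for m in ("site-packages", "/usr/lib", "node_modules"))

def clean_stack_trace(frames: list[dict]) -> str:
    kept, last, repeats = [], None, 0
    for frame in frames:
        if not is_in_app(frame):                      # step 1: filter system frames
            continue
        line = f"{frame.get('filename')}:{frame.get('function')}"
        if line == last:
            repeats += 1
            if repeats >= MAX_REPEATED_FRAMES:        # step 2a: collapse recursion
                continue
        else:
            last, repeats = line, 0
        kept.append(NOISY_TOKEN.sub("<id>", line))    # step 2b: scrub hex/UUID noise
    return "\n".join(kept)
```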

RD You're talking about the nested exceptions. I'm having PTSD flashbacks to using any kind of Java. Those are the longest and the worst. And I was thinking also, are you just using that sort of auto-generated stack trace, or is there any sort of instrumentation to gather other data? 

TE Sentry does have a lot of instrumentation to collect relevant metadata, and that's the stuff that, if you were to go on our UI, you'd see as all this additional context. For the purpose of grouping specifically, we are essentially feeding in the raw stack trace. So we're basically taking the raw stack trace, we're cleaning it up, and that's what's going into the embeddings model. 

RD Does it depend on the particular programming languages you're using?

TE It does. This was a tricky thing for us to manage, actually. So maybe the way to think about that is to take a quick step back and just think about what the embeddings model is even doing in the first place. Do you mind if I ramble for a moment? I'm just going to give my take on the embeddings model. 

RD Of course, of course.

TE So if you go way back in time, the first really useful, popular representation of embeddings was this TF-IDF representation, which is Term Frequency-Inverse Document Frequency. Essentially, TF-IDF is just a way of taking a bag of words, which is your document, so in this case it'd be a stack trace, and highlighting the stuff that is common in document A but uncommon in the broader corpus. And the intuition there is that those are probably interesting things, because we're highlighting the stuff that's novel to whatever document we're looking at at this point in time. And that's great, but it completely misses all of the nuance and all of the things that you really need to be thinking of when you're trying to make a good representation of a stack trace, because it's not just the bag of words. It's the ordering of the words, it's the context of the words. One specific word being in one specific place is super, super important. And so going back to your question about languages, one of the most challenging things is that those rules obviously depend on the language. Every language has different rules that define where a really important thing is and how it's represented. So you can think of the model that we're using as essentially the first half of a large language model, because that's quite honestly what it is. We're basically taking a large language model all the way up until the point at which it makes a representation, an encoding or an embedding of the text, and then we're stopping there and using that as essentially a feature vector going into this system. And so in terms of how that relates back to different languages being sent to Sentry, maybe in Java, in the first line of a stack trace, there are certain things that are really important to keep an eye on. In Rust, it's totally different. In Go, it's totally different. We're using a language model that's been fine-tuned on all of these different languages, so it's actually able to understand that nuance. And the end result is that we have a single embeddings model that we're able to use for every single Sentry SDK that we have, so we don't have a separate embeddings model for each of our different languages, which is another thing that I'm obviously really happy about because it just simplifies the system a great deal. 
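
For contrast, here is roughly what the TF-IDF baseline Tillman describes looks like in code. It weights individual tokens and discards word order and context entirely, which is exactly the gap the transformer-based encoder closes. The example stack-trace strings are made up.

```python
# The TF-IDF baseline described above: a bag-of-words weighting that highlights
# tokens common in one trace but rare across the corpus. Word order and context
# are discarded entirely, which is the gap the transformer encoder closes.
# The example strings are made up.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "ValueError invalid literal for int in parse_order at orders/views.py",
    "ConnectionError timeout talking to payments service in charge_card",
    "ValueError invalid literal for int in parse_coupon at coupons/views.py",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)  # one sparse row of token weights per trace

# The highest-weighted tokens for the first trace are the ones most specific to it.
weights = matrix[0].toarray().ravel()
terms = vectorizer.get_feature_names_out()
print(sorted(zip(terms, weights), key=lambda t: -t[1])[:5])
```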

RD Yeah, and I know some systems will cross languages. They'll have wrappers around things. Are you able to follow the stack trace across those wrappers into other languages?

TE Yep, a hundred percent. And again, it kind of goes back to a concept from the original Transformer paper, if anyone has read it. It's called ‘Attention Is All You Need.’ That's something that's very central to how we are extracting information from these stack traces. Imagine you're looking at a big stack trace, a bunch of exceptions. The most important thing that we're doing is almost taking a magnifying glass and zooming in on the part of it that really matters and filtering out the rest. It's less, ‘are we able to combine all these different languages together?’ and it's more, ‘can we zoom in on the signal here, which is really telling you what is uniquely defining this exception, and can we filter out all the noise, which is all this other gobbledygook that you might have?’

RD Yeah, that is an interesting one, because again, with the Java exceptions, there is a lot of noise there, because they'll start with whatever the button click is and then you have to figure out the chain of returns that got you there. Did you have to apply any sort of personal thinking about stack traces to get to what the understandable piece of the stack trace was? 

TE So we did spend a lot of time on that, and it was a lot of trial and error, because our job in building this system is exactly what you said. Our job fundamentally is identifying the best way of representing the stack trace that will allow our language model to extract the necessary information as reliably and efficiently as possible. So there was quite a bit of trial and error that went into trying to identify that representation, and I'm happy to talk about it because it was kind of a fun process as well. 

RD I mean, I love the process of trial and error because there's a lot of rabbit holes and false starts, and I think those are almost more interesting and more informative than the successful solution. 

TE A hundred percent, yeah. 

RD So can you talk about some of the thinking, some of the process, some of the false starts and all that? 

TE Yeah. So we kind of had a fun Sentry meme emerge from this process. The origin of it was that we had this very fundamental problem, which is that we had this intuition: this fancy transformer architecture, this embeddings model, was going to be really good at grouping errors together. But if you think about how to actually validate that intuition, it becomes a very challenging question, because there's not a labeled dataset that we have. There isn't a golden truth of, ‘hey, here is what perfect issue grouping looks like. Here's what bad issue grouping looks like. Do we look like the perfect grouping or do we look like the bad grouping?’ In ML lingo, it's a completely unsupervised problem. We don't have any labeled data to work with. And so that was the real fundamental problem. If we talk about false starts and everything else, the real problem is that at the start of this project we didn't even know how to measure success. We do something, and there's not an eval that we can run that just says, “Oh, we made grouping 10 percent better.” And so the Sentry meme that emerged was born out of our work to build this labeled golden dataset that we could use for this exercise. What I ended up doing was I just invited a bunch of random engineers to a meeting, and the title of the meeting was ‘Escape Room.’ They got really excited because they thought it was going to be one of those escape room activities or virtual escape rooms or something, so everyone just showed up, and then we shut the door and told them they could escape the room when they had finished labeling a hundred items on a spreadsheet. Because what we did was we basically ran our old grouping algorithm and our new grouping algorithm for several thousand different issues that we as engineers had context for, because they were internal to Sentry. So now we have two datasets, or two possible grouping configurations. One is using our approach and the other is using the other approach. And we essentially just hand-labeled thousands of rows using engineers that actually had context and could literally manually review it and be like, “Yeah, these two things should be grouped together. These two things shouldn't be grouped together.” So it was a lot of upfront work and frustration, but the outcome was super, super valuable, because what it gave us was the validation dataset that we actually needed to be able to do the iteration we were just talking about. Now we have a very fast flywheel of iteration where we can test X, see how it does in the evals, okay, let's try Y, see how it does in the evals. And so that was an incredibly useful, what's the word, accelerator for this work. 
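
As a hedged illustration, a hand-labeled dataset like the one described can be turned into a simple eval by scoring how often each algorithm's same-issue decision matches the engineers' verdict. The field names and the agreement metric below are assumptions, not Sentry's internal tooling.

```python
# A hypothetical sketch of the eval loop that hand-labeled dataset enables: each
# row pairs two stack traces with the engineers' verdict, and we score how often
# each grouping algorithm agrees with that verdict. Field names and the metric
# are assumptions, not Sentry's internal tooling.
from dataclasses import dataclass

@dataclass
class LabeledPair:
    trace_a: str
    trace_b: str
    human_says_same: bool      # consensus of the "escape room" labelers
    old_algo_says_same: bool
    new_algo_says_same: bool

def agreement(pairs: list[LabeledPair]) -> tuple[float, float]:
    """Fraction of pairs where each algorithm matches the human label."""
    old = sum(p.old_algo_says_same == p.human_says_same for p in pairs)
    new = sum(p.new_algo_says_same == p.human_says_same for p in pairs)
    return old / len(pairs), new / len(pairs)

# Any tweak to preprocessing or thresholds can now be scored in seconds:
# old_score, new_score = agreement(labeled_pairs)
```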

RD So where did you get that database of stack traces? 

TE Yeah, so good question. This was all internal. So essentially the workflow that we followed was, obviously we use Sentry internally. We use our own error monitoring to dogfood our own product. And believe it or not, even our code has errors in it. And so essentially what we did is just take all of the errors that all of our back end systems have had over the past 90 days and imagine a spreadsheet where column one is the stack trace for an issue, column two would be the stack trace that our new grouping algorithm thinks is the same as stack trace one. So two different strings, but our grouping algorithm thinks they're the same, and then all the other columns are just all the individual engineers’ evaluations of whether or not they think those are the same thing, and that's how we kind of make this dataset. 
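
One simple way to collapse those per-engineer columns into a single consensus label per row is a majority vote; this is an illustration, not necessarily how Sentry aggregated the labels.

```python
# One simple way (an assumption, not necessarily what Sentry did) to collapse the
# per-engineer columns of that spreadsheet into a single consensus label per row.
def consensus(engineer_votes: list[bool]) -> bool:
    """True if a majority of engineers said the two traces are the same issue."""
    return sum(engineer_votes) > len(engineer_votes) / 2

print(consensus([True, True, False, True]))  # True: three of four say "same issue"
```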

RD And so you have this labeled dataset of stack traces from your own code. How do you then get to have that dataset and model applied to anybody's code? 

TE Yeah, that's the really hard thing, and there is a bit of a leap of faith there. So I'll give you my intuition for why we were comfortable rolling it out based on just our own code, and there are maybe two reasons there. The first is just a practical thing, which is that there is no way for us to build a dataset that captures all of our customers' code, for two reasons. One, we don't have context. We don't know what right grouping means from the engineers' perspective on these other projects. There's a reasonably subjective element to it, where different reasonable engineers might have different opinions about what good grouping looks like. So that's one problem. And then the other problem is even more practical, which is just a privacy thing. We don't have the permission to do that kind of work with our customers' data. So it's kind of a nonstarter. We have to use the dataset that we have. And so the question is, okay, do we think that dataset is going to generalize? Do we think what we are seeing is consistent with what everyone else is seeing? And the reason why we were reasonably comfortable saying yes to both of those questions is maybe two things. One, although, yes, this is all internal Sentry errors, luckily Sentry does have a number of different frameworks and languages that we use internally. We have Rust, we have Python, we have TypeScript, and we have all sorts of different projects. And so one valuable data point is basically just confirming that the results internally look consistent across all of those projects. We say, “Here is the threshold that we think makes sense for Python. Okay, let's see if it also makes sense for JavaScript. Okay, let's see if it also makes sense for Rust.” And based on that, we can be relatively confident that it's not going off the rails. And then the last thing that we did was a very, very gradual rollout. The rollout was a whole other beast, but essentially it involved a lot of upfront validation that the results made sense before we turned the system on in a way that would actually affect what was happening for customers. We did a backfill that let us see what would happen if we turned it on, and then you can look at that, quickly sanity check it, and make sure we're not doing anything crazy, and then from there we can turn it on. 
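
A hypothetical sketch of that cross-project consistency check: apply one shared similarity threshold to labeled pairs from each internal project and confirm precision stays consistent across languages. The threshold, data shape, and numbers below are made up.

```python
# A hypothetical sketch of the cross-project sanity check described above: apply
# one shared similarity threshold to labeled pairs from each internal project and
# confirm that the precision of "same issue" decisions stays consistent across
# languages. The threshold, data shape, and numbers are made up.
SHARED_THRESHOLD = 0.90

def precision_at_threshold(pairs: list[tuple[float, bool]]) -> float:
    """pairs: (cosine similarity, engineers said these are the same issue)."""
    flagged_same = [same for sim, same in pairs if sim >= SHARED_THRESHOLD]
    return sum(flagged_same) / len(flagged_same) if flagged_same else float("nan")

labeled_pairs_by_language = {
    "python": [(0.97, True), (0.92, True), (0.91, False), (0.55, False)],
    "typescript": [(0.95, True), (0.93, True), (0.60, False)],
    "rust": [(0.96, True), (0.94, True), (0.89, False)],
}

for language, pairs in labeled_pairs_by_language.items():
    print(language, round(precision_at_threshold(pairs), 2))
```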

RD Is there any need or possibility of having customers do the sort of escape room labeling, testing, and training that you've done? 

TE It would be really nice, and we did talk about that several times. I think the answer is yes. We never ended up doing it, mostly because we felt that the results were working quite well. Having said that, there's always room for improvement, so for a V2 or future work, I do think that would be a valuable exercise, and we did talk about that, giving people Sentry credits or t-shirts or gift cards or whatever in return for labeling some of their data. The only reason we didn't do it was basically because we felt that we had good results and we didn't need to delay launch any further. Having said that, absolutely a good idea and something I can see us doing if we revisit this model.

RD Yeah. I mean, I could see a need to maybe open source the model or something at some point to get those kinds of results or give credits, but if you're having good results, it's almost like, “Well, why mess with a good thing?”

TE Exactly. Yeah, that's very much our philosophy. Because one thing that's kind of interesting is the whole AI/ML function at Sentry is quite new, which is very exciting because it means we have a ton of low-hanging fruit. There's so many different things we can do to add a lot of value to our product. Our kind of take on this is we're very much after the kind of 80 percent solutions where we make something a lot better but we don't spend six months or a year making it perfect, and then we can just kind of move on.

RD So are there upcoming low-hanging fruit that you're excited to tackle that you can talk about? 

TE There's so much. There's no shortage. Maybe I'll just talk about two things that are top of mind for me that I'm really excited about. One is, going back to what I was saying before about the value prop of Sentry, well, again, Sentry is a useless platform if all we're doing is ingesting your data and it's kind of a glorified database. We need to be very opinionated. We need to do a very good job at filtering out all of the noise and really surfacing the things that matter. So again, just taking a quick step back, think of Sentry as this tool that's ingesting all of your data, and it's useful to you as a developer if we're able to take that data, organize it, clean it up, and show you what you need to care about. And if you just think about what that means, you can frame that as an ML problem where it's like, “We need to classify this as important or unimportant. We need to group these things together.” So what's really top of mind for me on the workflow side of things is what can we do to make Sentry more opinionated? What can we do to go from a platform that does a really good job of grouping the issues that you're sending us, which is what we do today, to a platform that not only does that but also tells you what you need to care about? That's where I'm really excited about where we can go in the future, where we're able to say, “Hey, this issue actually matters and here's why. This issue you can ignore and here's why.” So on the workflow side, that's what's top of mind. And I know we're talking about embeddings, but we also have some really interesting generative AI stuff that we're working on right now. Very briefly, the intuition there is that Sentry has this really unique perspective into what's happening in your system, because we have all this telemetry that you're sending us, and we also have access to your source code through the GitHub integration. So we're uniquely positioned to really understand what's going on with your application when it's having problems. We can see what's actually happening, what it's actually doing, and we can see what your code looks like. The way we're thinking about generative AI is building tools that put those two things together in a way that's super useful for you as a developer to debug your code. We just rolled out the open beta of a feature called ‘Autofix’ that does exactly that. You open up an issue, you hit a button, and we take all of the context that Sentry has and all of the codebase on your side, we put them all together, and we use that to figure out what we think is the actual root cause of the issue and then how you can fix it. 

RD Cool, and it automatically creates a PR or something?

TE Exactly. 

RD Are there security guardrails? Obviously you're giving access to the code and the system to an LLM. What are the guardrails for that? 

TE Yes. So I was about to make a joke, but I feel like it's a bad thing to make a joke. No, it's something we take very, very, very seriously. At Sentry, we have an extremely strong stance on privacy, which is that our fundamental thesis is that your data belongs to you. It doesn't belong to anyone else. And unless you give us explicit permission, we're not going to use your data, certainly not for training or fine-tuning a language model or anything like that. So from a privacy and security perspective, our solution is that we're basically moving all of our LLM inference in-house. We're a GCP shop, and GCP has done a really good job of building out LLM inference infrastructure that lives within your own VPC. It's not leaving the GCP subprocessor at any point in time. So the main thing that we're doing is restricting, as much as we can, all of our inference to this GCP world. And then the other thing that we obviously do is that any LLM provider, whether it's GCP or anyone else, is required to sign a contract with us saying they're not going to be looking at the customer data in any way. So the way I think about it, the analogy I have in my head, is that it's the same thing from a privacy perspective as us putting your data in Redis or Postgres or something else. It's just another computer that's processing your data. The exact same rules apply to it that would apply to any other system that we're using.

[music plays]

RD All right, everyone. It's that time of the show again. I would like to ask you all a question: How do you handle stack traces? How do you trace the root cause? If you have an answer for that and would like to share your thoughts, email me at podcast@stackoverflow.com and we may feature it in a future blog post. I am Ryan Donovan, Editor of the blog and host of the podcast here at Stack Overflow. If you liked what you heard today, you can leave a rating and review. And if you want to find me, you can find me on LinkedIn. 

TE And it's been great to be here. My name's Tillman Elser, Engineering Manager at Sentry. I’m not particularly active on socials. Feel free to message me on LinkedIn if you want to follow up on anything. And if you're a Sentry user, we'd love to hear your feedback on the new grouping that we've been rolling out. It's generally available, so if you use Sentry, it's on. And if you want to try out Autofix and share your feedback, we'd love to hear from you. We have a Discord actually where we're very involved in any customer communications around that. 

RD All right, we'll drop those in the show notes. And thank you very much, everyone, and we'll talk to you next time.

[outro music plays]