On this sponsored episode of the Stack Overflow Podcast, we talk with Greg Leffler of Splunk about the keys to instrumenting an observable system and how the OpenTelemetry standard makes observability easier, even if you aren’t using Splunk’s product.
The infrastructure that networked applications live on is getting more and more complicated. There was a time when you could serve an application from a single machine on premises. But now, with cloud computing offering painless scaling to meet your demand, your infrastructure becomes abstracted and not really something you have contact with directly. Compound that problem with architecture spread across dozens, even hundreds of microservices, replicated across multiple data centers in an ever-changing cloud, and tracking down the source of system failures becomes something like a murder mystery. Who shot our uptime in the foot?
A good observability system helps with that.
Observability is really an outgrowth of traditional monitoring. You expect that some service or system could break, so you keep an eye on it. But observability applies that monitoring to an entire system and gives you the ability to answer the unexpected questions that come up. It uses three principal ways of viewing system data: logs, traces, and metrics.
Metrics are a number and a timestamp that tell you particular details. Traces follow a request through a system. And logs are the causes and effects recorded from a system in motion. Splunk wants to add a fourth one—events—that would track specific user events and browser failures.
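As a rough illustration of the three signals described above, here is a minimal Python sketch. The class and field names (`Metric`, `Span`, `LogRecord`, `trace_id`, and so on) are assumptions chosen for this example; they are not the OpenTelemetry data model or any vendor's schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Metric:
    """A number plus a timestamp, with a name saying what was measured."""
    name: str
    value: float
    timestamp: float  # seconds since the epoch


@dataclass
class Span:
    """One hop of a request; spans sharing a trace_id make up a trace."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the first span of the request
    service: str
    start: float
    end: float


@dataclass
class LogRecord:
    """A timestamped message recorded while the system runs."""
    timestamp: float
    message: str
    trace_id: Optional[str] = None  # lets a log line be tied back to a request


def trace_duration(spans: List[Span]) -> float:
    """Time covered by all spans of one trace, earliest start to latest end."""
    return max(s.end for s in spans) - min(s.start for s in spans)


def logs_for_trace(logs: List[LogRecord], trace_id: str) -> List[LogRecord]:
    """Pick out the log records that belong to a given request."""
    return [record for record in logs if record.trace_id == trace_id]
```

Sharing a `trace_id` between spans and log records is what lets the signals be correlated when following one request through the system.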
Observing all that data first means you have to be able to track and extract that data by instrumenting your system to produce it. Greg and his colleagues at Splunk are huge fans of OpenTelemetry. It’s an open standard that can extract data for any observability platform. You instrument your application once and never have to worry about it again, even if you need to change your observability platform.
Why use an approach that makes it easy for a client to switch vendors? Leffler and Splunk argue that it’s not only better for customers, but for Splunk and the observability industry as a whole. If you’ve instrumented your system with a vendor locked solution, then you may not switch, you may just let your observability program fall by the wayside. That helps exactly no one.
As we’ve seen, people are moving to the cloud at an ever faster pace. That’s no surprise; it offers automatic scaling for arbitrary traffic volumes, high availability, and worry-free infrastructure failure recovery. But moving to the cloud can be expensive, and you have to do some work with your application to be able to see everything that’s going on inside it. Plenty of people just throw everything into the cloud and let the provider handle it, which is fine until they see the bill.
Observability based on an open standard makes it easier for everyone to build a more efficient and robust service in the cloud. Give the episode a listen and let us know what you think in the comments.
Greg Leffler The observability migration comes from, how on earth do I as an SRE figure out what the heck's going on? Right? If I get a page that my service is failing, well, is that my service's fault? Is that an upstream that started sending excessive traffic? Is that a downstream that fell over? And a lot of times, you can't internalize all that state, even if you are responsible for a whole part of the site, like the external dependencies could be things that you've never even heard of. Right? And so observability really helps you figure out where is the problem, and it's like solving the murder mystery of, you know, who shot our experience in the foot?
Ben Popper Hello, everybody, welcome back to another episode of the Stack Overflow Podcast, a place to talk about all things software and technology. Today, we have a very special episode. It is sponsored by the fine folks at Splunk. And we are going to be talking about observability in action, we're going to be talking about OpenTelemetry, and we're going to be talking about how all these skills can come in handy, even if you're not directly focused on observability but maybe touch it as part of your career, as we move into this brave new world of microservices and containers and just generally more complex implementations and architecture, even for some relatively small projects. I am joined as always (or as I often am) by my co-host Ryan Donovan. Ryan, welcome to the show.
Ryan Donovan Hey Ben, what's happening?
BP Today we have as our guest from Splunk, Greg Leffler, who is an observability practitioner, a practitioner of the art of observability. Greg, welcome to the show.
GL Thank you, Ben. I'm glad to be here.
BP So yeah, tell us a little bit about yourself. What's your background? And how did you end up at Splunk?
GL Yeah, it's an interesting story. I have a lot of traditional SRE, systems administration and NOC experience.
[bell sound plays]
RD NOC is the network operations center.
[bell sound plays]
GL I've always really been intrigued by complicated systems and difficult problems. And observability is one of those things that you know, there's infinite complexity in there. And so that's what drew me to Splunk. That's what drew me to this role. So I'm excited to talk more about observability. And why you need it and sort of the benefits to your career, like it definitely is not a flash in the pan.
BP Tell us a little bit about, yeah, how you would define observability. And where it sits in relation to maybe something more like a traditional SRE role.
GL It's interesting. There's a lot of people that want to own the definition of observability. And, you know, of course, Splunk has our own definition as well. The definition that I like to use is that it's really the ability to answer unexpected questions about a system, we'll say, right? Traditionally, that's going to be an application or something like that. But monitoring is what people think of when they think of looking at how a system performs, which is, I know this thing might break. So I monitor it, and then when it breaks, I do something. And observability is sort of extending that monitoring approach to every piece of data you can collect. So you're going to try to instrument everything coming into your platform, your end users, your applications, your infrastructure, synthetics, you know, any sort of thing that might conceivably be useful, you're going to instrument. And observability is the ability to turn that data into some sort of meaningful insight or some sort of next step. The real differentiator, I would say, is that with an observability approach, you can solve problems you didn't know you were going to have. As you said in the intro, like the infrastructure is so complicated now that anything, even a very simple thing, is going to have a lot of services, there's going to be a lot of ways data flows, you could have stuff on premises, stuff in multiple clouds, like being able to figure out what's going on is huge. And that, to me, is really what observability is about, right? It's answering what the heck is happening with everything.
BP Yeah, I mean, Greg, just tell me if I, you know, as a lay person who doesn't write much code, the way I think of the difference maybe is like SRE, you know, those are the people you call when something's gone wrong, they have a runbook. And they, you know, know how to fix it. Observability is more of the art of, we're not really sure what went wrong, or, you know, we have to locate sort of the problem so that we can fix it. Does that make sense?
GL Yeah, absolutely. I think, you know, SREs—good SREs of course—are very focused on uptime and customer experience. But the approach you use for that is evolving right into this observability approach.
RD Yeah, I mean, these systems put out so much data, trying to keep an eye on it is so intense these days. So I think like you said, a lot of people are trying to own observability, can you kind of give us a story of observability in action? Give a sense of what it looks like in practice?
GL Yeah. What really is interesting is that when you think of how we're starting to move software to the cloud, and everybody's doing this, right, moving stuff to the cloud can be expensive. And some of the benefits you claim to get from the cloud are, you know, you can automatically scale and you can handle arbitrary traffic volumes and you can handle availability zones falling over. All of that stuff is technically true. But you do need to do some work for your application, and you need to be able to see what's going on and how it's working. So one of the success stories we have at Splunk is a company called Quantum Metric, which is a continuous product design company. It's basically a platform for other websites to make changes in real time and to see what effect that has on customer spending. So if Quantum Metric's platform is broken, all of their customers are unhappy, and you know, it's a big pain. So the big advantage they got from adopting observability was cloud resource utilization, right? So they, like I think a lot of people, are just like, let's throw stuff in the cloud and let the cloud take care of it, which is fine until you start to get the bill. And one of the things that Quantum Metric noticed was, you know, they were spending $80,000 more than they needed to be, based off of, you know, our out of the box indicators of, hey, your capacity isn't planned as well as it could be. Right. With most of the cloud providers, reserved instances are a lot cheaper than spots. But if you don't know to ask that question of, hey, is there a cheaper way we could be doing this, like sure, you know, cloud providers will happily sell you their most expensive product on demand, because you didn't ask them for anything cheaper. So that's one thing that really helps with observability is just knowing what are some questions we can ask, right? And so a good platform will give you out of the box, what am I spending in the cloud? Or what do my RED and USE metrics look like? You know, those kinds of things.
RD A lot of observability comes from, you know, it's no longer one application running on one machine, you no longer have this one machine to be like, this is my resource spend, right? Everything's moving to the cloud, microservices, containers, and you can spin up containers, like, on the fly. So talk a little bit about how observability is evolving with these complexities.
GL Yeah, the whole SRE world has sort of been evolving from the pets versus cattle scenario.
[bell sound plays]
RD The pets versus cattle metaphor applies to servers. When you treat your servers like pets, every one is special and you want them to live forever. When you treat them like cattle, it's the herd that's the resource, so you have no problem killing off a server when you need to.
[bell sound plays]
GL When I first started, I was responsible for a server called PHSINF1. And that was my baby. And that's what I had to deal with. And as my career went on, I moved to LinkedIn, where I was suddenly responsible for about 800 servers to start with, and then you know, a lot more than that on the way out. [Greg laughs] So you definitely cannot keep track of everything that's going on. So with this cloud migration, microservice migration, to get the benefits of moving to the cloud, you kind of have to adopt this model, right? Like, you're moving things into a container system. Well, we can say Docker, or Podman, or whatever you're using, right? But you're moving into containers, you're splitting your app up into a bunch of microservices. And you're doing this because, hey, I can deploy more of just this one microservice that I need to scale unexpectedly. Whereas my big data thing can be an XML that, you know, I don't need to worry about dynamically scaling that kind of thing. So as infrastructure becomes more complicated, if you look at, say, the Amazon homepage, there was a talk at an SREcon a few years ago where they showed the Amazon web page calls something like 40 microservices just to render the stuff that's on the page. And then each of those services could call other services. And the whole thing gets really complicated. So the observability migration comes from, how on earth do I as an SRE figure out what the heck's going on? Right? If I get a page that my service is failing, well, is that my service's fault? Is that an upstream that started sending excessive traffic? Is that a downstream that fell over? And a lot of times, you can't internalize all that state, even if you are responsible for a whole part of the site, like the external dependencies could be things that you've never even heard of, right? And so observability really helps you figure out where is the problem. And it's like solving the murder mystery of who shot our experience in the foot. That's really what I think it's for, because of all the new changes in how things are deployed.
BP So we had a colleague of yours, Spiros Xanthos, on the podcast pretty recently. And I want to get into something he talked about, which was OpenTelemetry. But before we do that, I just wanted to ask, you know, you're sort of saying, you're the detective here trying to figure out what's going on, might be upstream, might be downstream, what's causing these issues? And he mentioned sort of three things, the logs, the traces, and the metrics, that were kind of the core tools that you would use to figure that kind of stuff out. So can you define just sort of generally for people what those are and how they play a role in this, and then maybe we could talk a little bit about OpenTelemetry and where it factors into this conversation.
BP I guess, you know, one of the questions I had there was, I know people feel the same way often about cloud vendors. And so they will sort of work with multiple vendors at the same time and sort of try to divide and conquer and split things up a little bit. Is that possible also in the observability world, especially as you said, you know, you kind of only need to architect it once and then your data can go wherever it fits best, or wherever you're getting, you know, the best service?
GL Yeah, absolutely. I think one of the biggest benefits of OpenTelemetry is that you can write the same data in multiple streams, right? So you can say, hey, we'll send this to Splunk. And we'll also send it to an in-house thing we're developing, and we'll also send it to Datadog or FTE or somebody else, right, like, you know, everybody has gotten to the point where they are willing to consume OpenTelemetry. [Greg laughs] Not everybody is at the point where they're willing to emit OpenTelemetry. Right. So that's definitely something possible. And it's a big advantage of using OpenTelemetry, is you retain that flexibility.
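The idea of writing the same data to multiple streams can be sketched in a few lines of plain Python. This is an illustrative fan-out only: `FanOut`, the record shape, and the two list-backed destinations are hypothetical stand-ins, not OpenTelemetry's API or any vendor's exporter.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Exporter = Callable[[Record], None]


class FanOut:
    """Send every record to each registered destination."""

    def __init__(self, exporters: List[Exporter]) -> None:
        self.exporters = list(exporters)

    def emit(self, record: Record) -> None:
        # The application emits once; the pipeline decides where copies go.
        for export in self.exporters:
            export(record)


# Hypothetical destinations: plain lists standing in for real backends.
vendor_backend: List[Record] = []
in_house_backend: List[Record] = []

pipeline = FanOut([vendor_backend.append, in_house_backend.append])
pipeline.emit({"name": "http.requests", "value": 1})
```

Because the application only calls `emit`, adding or dropping a destination is a change to the pipeline's exporter list rather than to the instrumented code, which is the flexibility being described.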
RD I think it's sometimes a tough business model to argue. We're gonna stop this vendor lock-in, we'll let you use other products.
GL Yeah, the luck that we had was that, you know, several of the people that came to Splunk through the various acquisitions we made to build out the observability products were very opinionated about the need for this to be open and did the work to convince everyone at Splunk that, yeah, open is better. And it's better for us, it's better for our customers, it's better for everyone, because, you know, being locked in doesn't help get observability out there in the world, right? If you're locked into one vendor, and you decide to break up with that vendor for whatever reason, then you're going to let the product fall by the wayside, you're not going to get those benefits, right. And everybody benefits from better performing applications, like you know, even as users, right? You benefit from the application being better.
BP Yeah, that makes a lot of sense. Kind of going back to what Ryan said, to be an agile and successful business in this day and age, you have to find a way to make open source work with your business model. Because otherwise you're not going to be offering customers the flexibility and you're not going to get that sort of scale and pace of innovation that you get when people, you know, from all over the place can contribute to the project. So I think that makes a lot of sense. It's in the business's best, you know, sort of self-interest if they can figure out the math to go in that direction, for sure.
RD I think everybody loves the story on the ground. Can we talk about some OpenTelemetry in action? I know when I've used Splunk, I think we had a different logging solution. So I'm curious about how the OpenTelemetry works in the full kind of metrics, logs and traces.
GL It's one thing that we are a little bit outside the lines on, is that OpenTelemetry logging is still in, I think, late beta, we're saying? It's going to be similar to metrics and traces, which are pretty mature on most of the supported platforms, right? If you are like most large companies, and your enterprise applications are in Java, which, you know, most of them are, instrumenting those for metrics and traces is super straightforward, right? We give you a jar, you throw it into your app's classpath, you change the launch command to reference that jar. That's it, you're done, right? Like, you don't have to go into your code and manually say, hey, I care about this, right? Like, we can figure out what's going on. And then the OpenTelemetry platform is super flexible, so that later on as part of the data pipeline, like you can say, okay, I actually don't care about this metric. I don't care about these trace events. And the other advantage is that now there's sort of this common vocabulary among everybody in observability about what you call things and how you refer to them. So when we do have logs at 1.0, right, like, what you put in the logs, and what that ends up showing up as in your observability system, will be consistent, right? So you won't have to worry about it, even if you switch to a different vendor, or even if you roll your own, like, you'll still be able to correlate things because there's now a standard where we've all agreed, like, hey, this is the way you should say this happened. And that makes a huge difference for logs, right? Like if you're troubleshooting something that you didn't write or aren't familiar with, like figuring out how the person who wrote it is specifically putting the stuff in the log can sometimes be a big chunk of the troubleshooting time. So you know, the standard language makes a big difference.
RD Everybody logs differently. We had a post from somebody else in the observability sphere a while back, Charity Majors, and this observability is becoming a big field in its own right. Do you think it's becoming a separate kind of professional class?
BP Yeah, that's an interesting question. Because I remember when we had Spiros on, he was saying one thing that interested him was seeing sort of the proliferation of online certificates for this. So you know, get certified in this kind of observability. And then you can, you know, walk right up and find yourself in a position to fill demand for a lot of, you know, pretty high paying jobs. So, yeah, talk to us a little bit about sort of, like, how it's evolving as its own professional class, maybe sort of the way DevOps is, you know, these days, and then maybe, I think, as you wrote for the blog, in what ways learning these skills can be helpful, even if you don't decide to, I guess, focus full time on observability?
GL Yeah, I think it's one of the things that is amusing to me about this whole industry is that it's very cyclical in nature, right? And, you know, we've gone from apps run on mainframes to apps are super distributed to oh, now we have the suite your infrastructure to now they're super distributed again, right? So I think I don't want to pontificate too much on how long observability will be like a unique thing. But right now it is, and even if it doesn't remain a unique career forever, like troubleshooting SRE, problem solving, is going to be and the skills that you develop, when you learn about observability are what are going to make the difference between you and other people trying to get into the software world in general, right? Not even just operations. You know, if you aren't full time working on observability, which even those SREs probably aren't, right, like you work on observability, to the extent that you needed to solve a problem or to help guide your architecture design. But like, you know, as an SRE and you're writing tools, you're planning, you're, you know, talking to your engineers about the correct data models, you know, things like that. So it's not necessarily anyone's completely full time focus. But knowing how to do troubleshooting well, is always going to be part of any software job, right? No matter whether you're writing code, or you're deploying it, or you're the SRE, you're always going to have to fix problems. And observability is really viewed best as that evolution of monitoring, right? We're never going to get rid of monitoring as you are doing observability. You have to monitor things, right? Like we say, instrument because I guess we didn't want to say monitor. But that's the same thing. [Greg laughs] You're doing the same concept. 
So it's really like strengthening your sword for troubleshooting is the biggest thing you're going to get out of setting up observability, is like you're getting better at troubleshooting, at finding those connections between things, being able to do that murder mystery investigation of what caused this thing to go on. One of the other things is just OpenTelemetry is cool, and it's new. And observability is sort of still cool and new. And a lot of the competing platforms to Splunk are trying to catch up with OpenTelemetry. But this whole concept of distributed tracing, of APM, of observability overall, is still pretty new to the rest of the world. So even if you aren't doing it as a full time role, you get to learn a new thing and play with new technology, which I had someone object to me using the word play, when we're talking about like these critical business applications, and okay, that's fair. But you know, on the other hand, as somebody that does this stuff, like part of it is playing with it, right, you get to see how it works while using it to solve a problem, but you're still solving the problem, and having fun is allowed. So there's no reason not to. So diving into something that is new, and that people haven't played out yet, is going to help you be more interested in it, you can establish a niche of expertise, right? People can say, you know, hey, so and so, what is distributed tracing? What does that mean? How do we use it? What do we do with it? Because, you know, if I'm doing my job, and everybody at Splunk and all of our competitors are doing our jobs, next year you're gonna say, oh, I need distributed tracing, and being ahead of the curve to say like, yeah, you know, we do need this, it's something that's going to help. And another thing that I think really gets overlooked a lot is the observability world sort of did come from the cloud native world as well.
RD Yeah, that makes sense. I mean, with cloud native, you don't have this server in front of you, you don't have the traditional, like, file structure running on VMs and hypervisors and such. And so you need a different way to see what's going on, not on the metal, but closer to the metal, right?
GL A lot of the concepts around observability, a lot of how data is structured, like both in the application perspective, and like in your own head, is based on the modern container microservice cloud model. So if you aren't there yet, or if you aren't as fully along there as you want to be, this is a great way to help you get ahead of that, right, like you're being able to pick up on how can we better get our apps on the cloud efficiently and effectively. And to make sure we're getting that value, right. And even the cloud is like, you know, we talked about the cloud migration is a foregone conclusion. But there's stuff beyond the cloud, too, right? Like, if you talk about serverless, if you talk about lambdas, you know, those sorts of things, like, yeah, that all kind of falls in the cloud umbrella because you're buying it from somebody else. But it's very different than the traditional cloud models where you have a VM or even have containers, right. So the observability, the documentation and sort of the general approach that places have to talk about observability explains some of those concepts and relies on you knowing those concepts.
BP Yeah, I guess I've never done this on a sponsored podcast before. But you want me to tell you to take your job more seriously, right? Like, don't play around so much, Greg. [Ryan laughs] This is not fun and games here, we have serious business processes relying on this.
RD I mean, I understand that I've definitely felt the thrill of tracking down the, you know, 500 server error through Splunk.
BP Do you have a favorite murder mystery that you solved? You know, if you're sitting at a bar with a few people who work in this area? Or you know, if you're talking to someone who's considering going into it? Or is coming in new at the company? Like do you have a few stories you'd like to tell about how you solved these kinds of mysteries?
GL I think the hallmark of a good operations person is that they have a good war story. So you know, of course. [Ryan & Greg laugh] It was actually one of the questions that we asked at LinkedIn when we hired SREs, was, you know, to tell us about the biggest thing you screwed up, because it really tells you a lot about a person, right? And this one, the one that I'll tell you, wasn't my fault. But it was a very interesting story anyway. But the deal was, you know, we had a service that was failing, and nobody could really tell why. Right? We figured out that there was a really sharp drop off in traffic, that was the alert we got, was like, hey, this normally gets 1000 queries a second, and now it's getting three. So something is wrong. And we pull up the logs and we see no space left on device, you're like, okay, well, that's pretty easy. Let's just go delete some files and figure out what's going on. So the problem is you run df, and you see that there's plenty of space. We're not out of space at all, right? Like there's so much space that it was ridiculous. And so it was very confusing. You're trying to figure out, why would it say that we don't have any space when we have plenty of space. And I was still relatively green in my career. And those of you listening who are not probably already know the answer to this, but we had released a new version of the code for this service. And part of that code was every time a transaction happened, it dropped a file on the disk. And these files were only a couple of bytes, right? Because it was some sort of transactional data that we wanted to store. But there is a limit to how many files you can put on a disk. And when you get to that point, the error message the Linux kernel helpfully gives you is no space left on device, even though that's not actually what's wrong. It's definitely not something we were monitoring, right?
Because I don't think anybody would have ever expected we would write a few billion files to the disk. And it happened over a couple of days or, you know, but I don't think anybody ever thought that would happen. So that's definitely one that's like, it was a weird problem. The fix was unintuitive. There was a lot of late night Stack Overflow searching for why does it say the disk is full when the disk isn't full?
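The failure in this story can be checked for directly. The sketch below assumes the per-disk file limit described is the filesystem's inode count (the speaker does not name it) and uses Python's standard `os.statvfs`, which is available on Unix-like systems. The function names and the report keys are choices made for this example. From a shell, `df -i` reports the same inode counts.

```python
import os
from typing import Dict, Optional


def percent_used(total: int, free: int) -> Optional[float]:
    """Percentage consumed, or None when the filesystem reports no total."""
    if total == 0:
        return None
    return 100.0 * (total - free) / total


def disk_report(path: str = "/") -> Dict[str, Optional[float]]:
    """Report both byte usage and inode usage for the filesystem at path."""
    st = os.statvfs(path)
    total_bytes = st.f_blocks * st.f_frsize
    free_bytes = st.f_bavail * st.f_frsize
    return {
        # What plain df shows: plenty of space can still be free here...
        "bytes_used_pct": percent_used(total_bytes, free_bytes),
        # ...while this one sits at 100 and writes fail with "No space left".
        "inodes_used_pct": percent_used(st.f_files, st.f_favail),
    }


if __name__ == "__main__":
    print(disk_report("/"))
```

Some filesystems report zero total inodes, which is why `percent_used` returns `None` rather than dividing by zero. Alerting on `inodes_used_pct` alongside byte usage would cover the case in the story, where many tiny files exhaust the file count long before the bytes run out.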
BP I'm glad we could help. I'm glad we could help. Putting that great war story in my back pocket for another day. And let me ask you, as we sort of wrap up here, for people who are listening to this who are interested, what are some recommendations of, you know, places to go try this stuff out or learn it? As you said, you know, part of the fun is playing with it. So what would you recommend people interested in observability and OpenTelemetry check out?
GL Well, I wouldn't be doing my job if I didn't say go to splunk.com and check out the observability demo. Go to your preferred vendor and check out their observability platform. Basically, everybody has either a free tier or a demo. You know, the kind of folks that listen to this podcast are gonna want to see it for themselves. So try that, right? If you don't have an infrastructure at home, if you don't have a home lab, you know, like, what are you doing here? No, I'm kidding. If you don't have a home lab, if you don't have an infrastructure at home, you can set up minikube, you can do things to, like, create your own little tiny universe, and just experience what you get out of an observability platform. Right. That's one thing that I always recommend first is, you learn the most from something you've used yourself. But of course, there's a lot of resources. You know, everybody's got YouTube videos talking about observability, including us and the CNCF. And, you know, all of our competitors as well, right, we produce lots of stuff on how to do observability, you know, the right way. And of course, the project website for OpenTelemetry, opentelemetry.io, is a great place to learn about OpenTelemetry specifically. Even though Splunk is a huge booster of OpenTelemetry, it is not a Splunk project, right, it is a CNCF project. And so they are responsible for that website and for the documentation and stuff. If you aren't super sure about like the whole observability universe and like what sort of things should you care about, and what sort of metrics matter, I would recommend googling the acronyms USE and RED, which are very, very common ways to evaluate the health of services. Google probably will try to claim ownership of those because they documented them in their book, right? So yeah, they're very, very common things that I would encourage you to look at if you're trying to figure out like, what are things I need to care about? The most common key metrics are going to be summed up in USE and RED.
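For readers new to the two acronyms mentioned here, they are commonly expanded as USE (utilization, saturation, errors, applied to resources) and RED (rate, errors, duration, applied to services). The Python sketch below computes a RED summary from a batch of request records. The `Request` shape, the choice of HTTP status 500 and above as the error test, and the use of the median for duration are assumptions made for this example.

```python
import statistics
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Request:
    duration_s: float  # how long the request took
    status: int        # HTTP-style status code


def red_summary(requests: List[Request], window_s: float) -> Dict[str, Optional[float]]:
    """Rate, Errors, Duration for the requests seen in one time window."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "median_duration_s": None}
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": len(requests) / window_s,      # Rate
        "error_ratio": errors / len(requests),       # Errors
        "median_duration_s": statistics.median(      # Duration
            r.duration_s for r in requests
        ),
    }


if __name__ == "__main__":
    sample = [Request(0.1, 200), Request(0.3, 200), Request(0.2, 500), Request(0.4, 200)]
    print(red_summary(sample, window_s=2.0))
```

A sudden fall in `rate_per_s`, like the drop from 1000 queries a second to three in the war story above, is the kind of change a RED dashboard is meant to surface.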
BP Alright, well, I certainly learned a lot. And I do think this is a fascinating topic, especially because, as Ryan and I do the podcast and the blog, we are getting an ever growing number of pitches and guests who are living in this cloud native world, and this world of microservices and containers. And, as you said, almost staying ahead of the curve is learning how to do the troubleshooting, right, that will like kind of naturally lead you in the direction to be learning the things that are coming around the corner. So that's a cool way of thinking about it. I am Ben Popper. I'm the Director of Content here at Stack Overflow. You can always find me on Twitter @BenPopper. You can always email us email@example.com.
RD I'm Ryan Donovan, I edit the blog and the newsletter here at Stack Overflow. I'm a ghost on Twitter @RTho Donovan. If you want to send us a great blog pitch, you can reach me at firstname.lastname@example.org
BP Greg, tell the people who you are, where you can be found on the internet if you want to be found, and yeah, where they should go to check out a little bit more about Splunk and OpenTelemetry.
GL So I'm Greg Leffler. I am an Observability Practitioner at Splunk. I do not have a Twitter presence, but you can follow me on LinkedIn. Probably the only Greg Leffler that has observability next to his name, so that's how you'll be able to find me. If you want to learn more about Splunk observability, we would encourage you to check us out at Splunk.com/O11Y. So it's, you know, a cover abbreviation, you know, like we use for accessibility and internationalization. So check us out there, and it was great to be here and to talk to everyone, and enjoy your troubleshooting time. It may seem hard at the time but it's fun when you're done!
BP Alright, thanks for listening everybody.