The Stack Overflow Podcast

Unpacking observability and OpenTelemetry with Spiros Xanthos of Splunk

Episode Summary

We chat with Spiros Xanthos, VP of Product Management for Observability & IT Ops products at Splunk. Spiros has founded three companies, including OmnitionHQ, which was acquired by Splunk, ezhome, and Log Insight, which was acquired by VMware. We chat about how observability has evolved over the years and the role it plays today in performance management, cybersecurity, and IT operations.

Episode Notes

You can read more about Spiros on his LinkedIn or Twitter.

There is some good backstory on his first company, Log Insight, here. A rundown of the acquisition that led to Spiros joining Splunk is here. There are also some interesting details in Splunk's blog on the deal, which calls out Omnition as "a stealth-mode SaaS company that is innovating in distributed tracing, improving monitoring across microservices applications."

If you enjoy the conversation and want to hear more, Spiros has done some interesting talks that are up on YouTube here.

Our lifeboat of the week goes to Willie Mentzel, who explains how to: Round Double to 1 decimal place in kotlin: from 0.044999 to 0.1.

 

Episode Transcription

Spiros Xanthos It will essentially show you a visualization of how your application behaves. So you can think of all these components, users, services, lambda functions, interacting with everything, like visually see how the requests flow through the application. And let's say if you have a set of errors that start somewhere in a lambda function in the database, what we'll show you is it looks like this is the source of your problems. It started happening, let's say five minutes ago. And this is the way it's been propagating all the way to your users, right?

[intro music] 

Ben Popper CockroachDB is the only book you'll ever love, because it's the only one you don't have to worry about. As a low-touch SQL database that automatically handles scale, operations and uptime, CockroachDB lets you focus on developing. Get your free cluster and a free T-shirt at cockroachlabs.com/stackoverflow.

BP Hello everybody! Welcome to the Stack Overflow Podcast. I'm Ben Popper, Director of Content here at Stack Overflow, joined as always by my wonderful co-host, Paul Ford, CEO and co-founder of Postlight.

Paul Ford Hey! Here we are, we're doing it again!

BP Doing it again, Paul. So I went to the dog park this week up here in the Hudson Valley. I met somebody from GitHub, they're working on a product, it's aimed at developers, gonna come on the podcast, it's gonna be my first human networking interaction and human to human podcast recording since the pandemic began.

PF It's all, it's all opening up. It's a shocker that someone at GitHub is working on a product for developers, but you know.

BP I know, it was weird. It was weird.

PF You gotta, just gotta roll with it. Got to see what happens. That sounds really good. You know, we have an exciting guest on here today. Because, Ben, you've used apps.

BP Absolutely. I'm apping all day every day. Roblox 24/7, me and the kids.

PF Me too. I'll tell you, something about apps that's really fascinating is that no one knows what the hell is happening in there.

BP Black box?

PF Terrifying, terrifying. It could be anything. And so a whole section of our tech economy that, frankly, on this podcast, we tend not to talk about, exists, just to tell you what the hell is happening inside of your stuff.

BP So this is a little bit different than maybe a guest we've had, like Tom Limoncelli, to talk about site reliability engineering. Like, SRE is making sure it's all running and working. And if it breaks, where they're not telling you, hey, X is happening, Y is happening, this is getting overheated.

PF But it's part of it. It's part of it. Yeah, yeah. So who do we have today?

BP Today as our guest, we have Spiros Xanthos, I hope I said that right. He's joining us from Splunk.

PF Splunk!

BP Yeah, Splunk, where he is the VP of Product Management. Welcome, Spiros.

SX Glad to be on. Thanks for having me. 

BP Yeah, I guess for people who don't know, let's step back for a second and define from a high level what is observability? Or observability and IT ops? And how is that unique from other disciplines?

PF Also, what is Splunk? I mean, we know the banner ads, but what is the company? So tell us, tell us Spiros!

SX Splunk is an enterprise software company, right? It started as a tool to collect machine data and analyze it. Although it started as, and still is, a very simple tool, it's extremely powerful, because it's really flexible and can collect any type of machine data, which you can then use to monitor for security, or investigate incidents, or for IT, or for developers, right? So Splunk has been around for a while. And we recently expanded into observability. Actually, the way I ended up at Splunk is because they acquired my company. So before Splunk, I started this company called Omnition, which is an observability company, which I'm going to define in a minute, right? And Splunk decided to acquire us to expand into this space, alongside five other companies, actually, in the last two years. So what is observability? Observability actually comes from control theory, it is not a software term, and it generally means the ability to kind of reason about and understand the state of a system by just looking at its outputs. Now, as you can imagine, when we're talking about applications or infrastructure on the cloud, we really, as developers and SREs, try to do the same thing. So all we have are the outputs of those applications, usually in the form of what we call telemetry: logs, metrics, traces. And by capturing all that data, we try to understand what the application is doing, right? So in the scenario that we have a failure, or a degradation of the service, we try to understand where this is coming from, so we can, say, prevent it and not have downtime.

BP Right, right, that makes a lot of sense. So just to back up, you said that term doesn't come from software. So that is more from the, you know, older world of hardware. It might be a flight engineer on a rocket ship or an airplane, and there are certain, you know, degrees of robustness and reliability and redundancy I need, and so this is a technique for making sure that if something's going wrong, if there's an engine failure, I can figure out why?

SX I guess it's more about understanding the state, the internal state, of the system by looking at the examples outside, right? That, of course, coupled with redundancy and, I guess, failover techniques, is probably how we've been flying planes all these years.

PF Is there more of a need for this now? Because, I mean, clouds are black boxes, right? Like, we don't know what's inside of a lot of our cloud services, and you're not allowed in. They've abstracted out quite a bit; they'll give you the logs and the reporting that they've decided, right? So is that why there's more demand for this kind of approach to understanding what's happening in your systems? Or has it always been there? And is this just a new approach or an old approach? Why does this matter now?

SX Yes, so maybe I'll give you a bit of history of monitoring tools for infrastructure, applications, all of the above, right? By the way, I have been working in this for a long time. I started my first company, which was a log analysis company, like Splunk, in 2007. So I've seen the evolution of the industry myself; of course, it predates me quite a bit as well. But let's say generally, with the introduction of virtualization in the early 2000s, we moved from physical hardware to, let's say, software running our infrastructure and applications. But still, most of that was actually rather simple, right? Maybe my application was a monolithic application running as a single service, connected to a database, and that was serving the needs of my customers, right? So the software that I had to use to monitor it, my monitoring software, was rather simple itself as well, right? All I had to do was understand what's happening within this single application. The underlying infrastructure was fairly static; VMs did not come and go as much. So everything was simple and straightforward. Now, as we've been moving to the cloud, and there is, of course, a whole journey that is happening, right, moving to the cloud maybe means that I take these applications, and instead of running them in my data center, I run them on the cloud, like AWS, Google Cloud, etc. But really, these days it means that I'm probably re-architecting those applications. Or maybe I'm building new applications that usually are built as distributed systems, what we call microservice-based applications. So instead of having a single application, I might have multiple services interacting with each other to serve the needs of my customers. And now, these applications run on ephemeral infrastructure, right?
We don't run on physical hardware or VMs anymore; we usually run on containers, which come and go dynamically, oftentimes multiple times a day or an hour, right? So you end up in this situation where instead of having a simple application, you have a whole collection of services working with each other, running on dynamic infrastructure. So things change very, very often, and they're fairly complex. This results in many benefits, of course, right? Dynamic, better velocity for developers; these systems can be developed independently and released multiple times a day. All desirable things for all of us, let's say. But when something goes wrong, it becomes like a murder mystery, right? It's very, very hard to troubleshoot problems in these kinds of environments. And in practice, thinking about the life of an SRE or a developer on call, when something goes wrong, multiple people have to jump into what we usually call a war room and try to figure out where the problem might be coming from, right? Fairly complex. So with the evolution, essentially, of the application and the cloud, the tools that we use to monitor these applications have to evolve at the same pace, let's say, or faster, because otherwise, you know, we cannot really monitor and maintain these applications. So observability, I should say, is the evolution of all these monitoring and troubleshooting tools in trying to keep up with the complexity of the applications and infrastructure, right? I mean, that's the high level; there are specifics that I can describe in more detail. But, you know, this is the high-level idea of why we need observability these days.

PF So actually, one of the ways I think you could help me and help our audience: everybody listening to this, I'm going to bet, understands, you know, the basics of logging and adding hooks to apps, and most people at this point talk to a service like Splunk, or to one of the analytics platforms, as they're doing anything. So what I'd love you to talk about a little bit is, you know, I always think more about front end and how you sort of instrument applications in order to understand how they're doing. And you're talking more about sort of instrumenting and understanding what's happening across a cloud platform. How do you differentiate those in your mind? How do you log, analyze, and make front ends observable versus cloud platforms? Or is that even a sensible question? You tell me!

SX It is. At the end of the day, the only reason, really, that we instrument and monitor applications is to serve our users better, right? Whoever the users might be: we might be talking about an enterprise application with a few users, or we might be talking about a consumer application, maybe with thousands or millions of users. At the end of the day we want to understand what's going on. How well are we serving them, right? Now, we might have tools like, let's say, Google Analytics, that mostly help us understand the behavior of those users. But when something goes wrong, we're completely out of luck, right? It's not like Google Analytics is going to tell you where the problem is coming from. It might help you understand the behavior of your users on your website. But really, that's when you're transitioning, let's say, from analytics for understanding the behavior to analytics for understanding the application itself, right? So that's where it all starts. In reality, in the past, as you said, maybe I had some logs. Something went wrong, I would look through logs, I would try to search and figure out, like, use grep, or maybe something like Splunk, to figure out what's happening. And maybe this was sufficient. Actually, these days, given the complexity I described, you really need to have multiple things in place to monitor the entire stack, right? Usually, a request starts from a browser or a mobile app, and it flows through a set of back end services that usually run on some cloud infrastructure; they probably interact with some cloud services provided by the cloud provider. So you have to be able to trace all of this end to end, right? And the telemetry we use to do that is usually logs, like we said. But probably we need metrics as well, metrics-based monitoring, because we need to do this in real time.
And tracing is kind of the new trend that I think completes observability. I mean, if you've heard the term APM, application performance monitoring, that's kind of the old-school way of doing it. The modern way of doing it is through distributed tracing, which can help you trace distributed systems.

BP So for people who don't know, can we just define those quickly? What are logs? And what is tracing? And you mentioned one in the middle. Just quickly, each one of those key ingredients.

SX Yeah, logs are unstructured data that, typically, developers add when they build an application so that they can later troubleshoot what's happening, right? It could be totally unstructured data, or data with maybe some structure, but generally speaking, fairly unstructured. Logs are emitted from the application, from infrastructure, from network devices, and have been the standard way of actually troubleshooting systems, let's say, for the last maybe 30 years or longer. A metric is this idea that instead of emitting totally unstructured data, I can emit a measure, let's say, right? Let's say a request arrives, I start measuring how long it took for me to respond, and I emit that metric. Now, as a single measure, it's not that useful. But when I start aggregating across all my requests, let's say, now I suddenly have a very, very good understanding of how quickly I reply, right? So that's kind of what we call a metric. And a trace, or distributed trace, is this idea that when a request arrives at my application, let's say at my front end, I assign a unique ID to it. And then as, let's say, that request flows through my back end services and third-party services, I actually propagate that unique ID. So in the end, I get a very structured log, right? A way to essentially understand exactly how that request was served, and all the systems that it had to traverse until it came back to the user, right? So this gives you a very good understanding, for every request, of how exactly it has been served. And again, it might be useful to trace one single request. But in reality, what is much more useful is when I start putting all of these together in aggregate, and I can see how my application behaves as a whole.
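
The metric and trace ideas described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the in-memory stores, and the service names are hypothetical and are not part of OpenTelemetry or any Splunk product.

```python
import statistics
import uuid

# Hypothetical in-memory stores; a real system would ship these to a backend.
LATENCIES_MS = []   # one measurement per request (the "metric")
SPANS = []          # one record per service hop (the "trace")


def record_span(trace_id, service, duration_ms):
    """Append one hop of a request, tagged with the shared trace ID."""
    SPANS.append({"trace_id": trace_id, "service": service, "duration_ms": duration_ms})


def handle_request(hops):
    """Simulate a request entering at the front end and visiting each service.

    `hops` is a list of (service_name, duration_ms) pairs, assumed for illustration.
    """
    trace_id = uuid.uuid4().hex                      # unique ID assigned at the front end
    for service, duration_ms in hops:
        record_span(trace_id, service, duration_ms)  # the same ID is propagated to each hop
    LATENCIES_MS.append(sum(duration for _, duration in hops))  # one measure per request
    return trace_id


def trace_for(trace_id):
    """Return the ordered list of services one request traversed."""
    return [span["service"] for span in SPANS if span["trace_id"] == trace_id]


first = handle_request([("frontend", 5), ("checkout", 20), ("database", 40)])
handle_request([("frontend", 5), ("checkout", 25), ("database", 90)])

print(trace_for(first))               # the path of one single request
print(statistics.mean(LATENCIES_MS))  # the aggregate view across all requests
```

A single latency or a single trace says little on its own; the value comes from aggregating many of them, which is the point made in the answer above.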

PF Let me take us in a different path. Because you told us earlier, you've been doing this for a while. While we're talking, I looked up your LinkedIn page, and you have been doing this for a while. And here's my question. How are you still in business? Weren't we supposed to have fixed this by now? Shouldn't it all be working?! My God! Why are we instrumenting everything in our cloud services to find out what's breaking all the time? At what point are you finally out of business?

SX I think never, because I think what has been happening is what I described, right? A step function increase in complexity, then a step function improvement, let's say, in the tooling that we use to actually deal with this complexity. And I think this is gonna continue. Now, I should say that we're talking about AIOps, let's say, right? Like, machine learning based troubleshooting and fixing of problems in systems like the ones we were discussing. But it's not a reality yet, actually, and might not be a reality for a while. Now, the reason for that is because, let's say, the signal to noise ratio in the type of data I'm describing, like telemetry, is pretty bad. So it's very, very difficult for a system to have, let's say, good enough accuracy that we can trust it to take action for us when it comes to our applications. Now, the one thing that has happened, though, in the last maybe three, four years, that is actually important, is that we actually have open standards that have emerged. One of them is OpenTelemetry, which we're quite involved in, actually, as a company, and my own startup was one of the co-creators of it. So if I may, I want to introduce OpenTelemetry, which is like...

PF Do. That's, that's great. I'd like to know more about it. Great.

SX It's a CNCF project. CNCF is the foundation that started Kubernetes and many other popular cloud native technologies these days. So OpenTelemetry is the second most popular project in CNCF in terms of activity, second only to Kubernetes. And what it tries to do is standardize the way we instrument applications and collect this data. By standardizing, I mean: let's agree on standards for how we describe this data and how we emit it, right? And then we have an open source implementation as well that can do that. So what this now enables is actually structuring the data at the source. So you can collect metrics, traces and logs, like I described earlier, but do it in a way that is vendor neutral, right? So it's not like there's a Splunk-specific way, right, or some other vendor-specific way. As a result, because we now have this data fully structured, and we can all understand what it means, and we can fully connect it, the signal to noise ratio actually improves dramatically, right? So not only can we have this agreed-upon standard that benefits the users, because they own their data, but I think we can actually build tools that provide much more effective analytics, exactly because the data is now structured and we can make more sense out of it.

BP So Spiros, you've been talking about this, and it makes me think, you know, performance, you know, avoiding issues that might frustrate you know, consumers and users of a service. But another big piece of this, I guess, is security and cybersecurity. So can you talk to me a little bit about how this kind of observability or monitoring plays into that world?

SX Yes. Again, in the past, security and IT, let's say, have been fairly different disciplines, right? You had the chief security officer and a bunch of analysts worrying about what we call SecOps, security operations. You had the CIO, maybe, and their team worried about IT ops. And developers actually were a completely separate team that built applications and threw them over the wall for somebody in IT to run and for somebody in SecOps, let's say, to make sure they're secure. But actually, as we move to the cloud, all of these disciplines come together. Definitely IT ops and developers: DevOps, right? So developers build their applications, and they have to run them as well, right? Be on call and all of that. But in reality, I think security is coming a lot closer to this as well, right? Because what happens is, you cannot really have a secure application unless you follow a similar kind of approach, like the one we're describing for observability, right? So you have to make sure that during build time, you're not introducing any dependency that might introduce a vulnerability into your system. And at runtime, you have to continuously monitor for incidents and for things that might go wrong from a security perspective. So really, the same kind of tooling that we use to, let's say, reason about the state of the system in terms of availability and performance is actually quite similar to what we use these days for security. So in some sense, these disciplines merge, and we have what we call DevSecOps, right, which is kind of the combination of all of the above.

BP But what about mldevsecops? I just want to squeeze one more in there. [Ben laughs]

PF No, don't do it Ben. It has to happen on its own. So you said something that is incredibly accurate, which is, developers tend to throw these over the wall, at which point the telemetry becomes somebody else's problem. That is no longer the case, like this is the world we live in, everybody is responsible, and we've also got security coming closer. So help people learn. Here I am today, I write my JavaScript front end code, and I write my web apps. And sometimes I do some orchestration of cloud services in AWS, or Google Cloud. And you know, I've got a lot of things over here in S3, and I've got a customer, I've got a bunch of lambda functions, I've glued that together, I've shipped my code. Now something's breaking, or now the client is asking for more information, and so on and so forth. Put me on a path, not just to like spackle some API calls in, but put me on a path where in the next year, I'm going to be doing this right, I'm going to be fully engaged with OpenTelemetry, I'm going to be thinking harder about security. And this is going to be part of my deploy, part of my build, part of the conversation and part of code review. So you know, you get a typical engineering team, and you get to talk to them and tell them what they need to do over the next year. What would you tell them? Where should they start?

SX So I guess I should say that I think a failure of our industry, and my failure, given how long I've been doing this as well, is that we haven't actually built tools that are easy enough for everyone to adopt, right? Now, you always have the expert developer, let's say, who knows the tooling deeply and can go troubleshoot the performance of an application when something goes wrong, right? But generally speaking, the tools we have built for monitoring and troubleshooting have not been easy for somebody to learn, let's say, right? So I think another thing that is happening, maybe alongside the observability trend, is that we've actually now started building tooling that is more approachable, right? If I'm a new user, the tool itself can probably describe to me what the application does. So I don't have to become an expert on the tool. I can just maybe intuitively understand what's going on. So anyway, I think that's happening. And I think that's generally going to make our life easier as developers.

PF Take the opportunity to pitch. Like, I'm sure you've got some things inside of Splunk, inside of your part of Splunk, that help people. Where should they look?

SX Sure. So actually, I would start by saying that if you are trying to implement observability, or better monitoring, I think OpenTelemetry is a great place for everybody to start as a way of instrumenting and emitting telemetry data from your application and infrastructure. It is pretty much supported by every vendor out there. It has many open source backends that also support it. So you can implement your own complete monitoring end to end using just open source, if that's what you decide to do. Or if you want to use a commercial vendor, you can still send the data there; one of them is Splunk Observability, right? So I would recommend it. This makes sense for every developer and even every executive, let's say, out there, because it gives them a lot more control and decouples them from a specific vendor implementation.

PF I think we should emphasize this for people. So OpenTelemetry is both the client and also the API server, right? So you could set up your own... is that correct? Like, I could set up my own Go API, and I could collect my own statistics that way, so I'm not dependent on any particular commercial vendor if I don't want to be?

SX Yeah, I'll give you the specifics. So OpenTelemetry is a set of standards, let's say, that define what we can collect and how it looks. And then there is an implementation. The implementation includes components that go inside your application to emit that telemetry. And it includes, let's say, a data collector as well that can collect and transmit that out. Now, OpenTelemetry stops there, right? It doesn't try to be an analysis tool. It just tries to be a data instrumentation and collection tool. But there are open source backends that accept this data today. Something like Prometheus, let's say, for metrics.
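
The division of labor described here (in-application instrumentation, a collector, and interchangeable backends) might be pictured roughly as follows. This sketch deliberately does not use the real OpenTelemetry API; the `Collector` class, the `instrument` function, and the record shape are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Collector:
    """Hypothetical stand-in for a data collector: buffers records, then exports them."""
    exporters: list = field(default_factory=list)
    buffer: list = field(default_factory=list)

    def receive(self, record):
        self.buffer.append(record)

    def flush(self):
        """Send every buffered record to every configured backend, then clear the buffer."""
        sent = len(self.buffer)
        for export in self.exporters:
            export(list(self.buffer))
        self.buffer.clear()
        return sent


def instrument(collector, service, name, value):
    """In-application component: emits one structured record in an agreed-upon shape."""
    collector.receive({"service": service, "name": name, "value": value})


# Two interchangeable backends (assumed for illustration). The application code
# above does not change when one backend is swapped for another.
open_source_backend = []
commercial_backend = []

collector = Collector(exporters=[open_source_backend.extend, commercial_backend.extend])
instrument(collector, "checkout", "request.duration_ms", 42)
instrument(collector, "checkout", "request.errors", 1)
flushed = collector.flush()

print(flushed, open_source_backend == commercial_backend)
```

The design point is that the application only knows the shared record shape, so the choice of analysis backend stays a configuration decision rather than a code change.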

PF So I asked, I asked you to pitch Splunk, and you did exactly the opposite, which is great! [Paul & Ben laugh]

SX I'll tell you actually what we do, right. 

BP Paul was like, stop, just tell me about the open source alternative. Sorry. Sorry.

PF I know that's true, I derailed. And now I'm trying to understand I love a good open standard, it makes me feel safe and in control, just like everyone else in the industry. So it's really good to learn about that like, right, that way, you're not completely locked in for the rest of your life if you want to use these tools. So back to Splunk.

SX Yes. So I'll tell you a bit about Splunk, right? So first of all, given that we're trying to democratize data instrumentation and collection via open source and open standards approaches, we put all our effort into the analytics, into what you do once you have this data, right? Generally speaking, for most companies, after a certain scale, it doesn't make sense to be in the observability business, right? Yes, you can do this with open source, or you can try to build your own tooling. But generally, it doesn't make sense, right? It's a hard problem, and it shouldn't be your priority, right? Ideally, you want to spend your time building, you know, something for your specific business. So what Splunk has done is build what we call the Observability Cloud, which is a combination of a bunch of, let's say, products that can collect metrics, traces and logs into a single application. That application can be used to monitor your infrastructure, your application, all the way to your end users, and even perform incident response, like ping you if you're on call, all of that in a single user interface. And I guess the value it provides is that it's fairly intuitive, to my earlier point, and also has a huge emphasis on analytics, right? Generally speaking, all of these tools are giving you a bunch of data. And the way you use the data is you state hypotheses based on your own knowledge of the system or application you're troubleshooting, and you then use the tool to kind of validate or invalidate your hypothesis, which is a very time-consuming and not pleasant approach. What we try to do is push maybe the industry a bit forward by trying to essentially connect the data a bit better. So you don't always have to state the hypothesis, but the tool tries to guide you to where the problem might be coming from. Right?

BP I see here on your site, you know, some stuff around cybersecurity, observability, IT operations, which is everything we just talked about. But it sounds like right now, you're getting to the place that Paul and I love, which is that once you can do any one thing, you can do everything. So it says everything: unlock the power of data. So you're saying, once you start doing all this monitoring and metrics and analysis, and you're looking across multiple customers, you're going to start giving people basically, you know, advice about how to optimize or transform or improve their system, their security, that's outside of the day-to-day operations.

SX Maybe I should say this, right? Like, I think actually, once observability tooling is properly implemented, observability tooling in the sense of having metrics, traces and logs fully connected, being able to collect all of it, what we call full fidelity, it kind of looks a bit like magic, at least compared to the prior generation of tools, right? Because effectively what that does is give you, in my mind, a mirror image of your application and infrastructure inside, let's say, a SaaS environment. And then you can actually start reasoning about your application by looking at this mirror image of it inside the tool, right? Which is very, very powerful, as opposed to, let's say, having very partial data and signals in the past and trying to connect the dots in your mind. So that's what we try to do, or at least that's what our ambition is as a company, right? How well we do it, I guess our users can probably say.

PF Let's make one up. I don't know, I've got a bunch of lambda functions, I've got some logging built in, I've got a front end, I've got a React Native app that I've deployed to 50,000 people, and something is broken. I don't know what. But suddenly, the data is not getting pulled out or something, something's not happening; a database call that I was expecting to happen isn't happening. Walk me through using these tools. The old way I would do it is I'd kind of look at things, I'd put a lot of print statements in, I'd look at my log files, I'd ask, you know, where are things breaking? How does that change? Walk me through what the process feels like now, if you're using this approach.

SX So yes, in the example you described, I probably have, as you said, some lambda functions, I probably have maybe some VMs running part of my application, and I have some users interacting with the system. Let's say I know that something is wrong, because I see that either some of our users are complaining, or I myself noticed that my requests, let's say, return errors, or are slower, higher-latency requests, right? So usually, what happens is, instead of having to sift through the logs to try to find an error or something like this, with something like Splunk Observability, it will essentially show you a visualization of how your application behaves. So you can think of all these components, users, services, lambda functions, interacting with everything, like visually see how the requests flow through the application. And let's say if you have a set of errors that start somewhere in a lambda function in the database, what we'll show you is it looks like this is the source of your problems. It started happening, let's say, five minutes ago. And this is the way it's been propagating all the way to your users, right? And now, let's say we have maybe helped you isolate the source of these problems to, say, a particular service or a lambda function. The next step users usually take is kind of interrogating that part of the application, right? For example, say, oh, it's coming from, let's say, service A, let me see where service A is running, right? Show me all the containers where it's running. Is every container actually erroring? Or is a particular part of my infrastructure causing it, right? Oh, maybe it's not that, maybe I need to start looking at the types of database calls I make, right? Is there a specific database call that's failing? So then it becomes this iterative, visual kind of process where essentially, the system guides me to find where the problem is coming from.
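
The drill-down described above (find where errors started, then ask whether every container of that service is failing) can be approximated with a small aggregation over span records. This is a rough sketch under stated assumptions: the record layout, service names, and container names are made up for illustration and do not reflect how Splunk Observability works internally.

```python
from collections import defaultdict

# Hypothetical span records: (timestamp_s, service, container, is_error)
spans = [
    (100, "frontend", "fe-1", False),
    (101, "service-a", "a-1", False),
    (102, "service-a", "a-2", True),
    (103, "frontend", "fe-1", True),
    (104, "service-a", "a-2", True),
    (105, "service-a", "a-1", False),
]


def error_rates(records, key_index):
    """Group records by the field at `key_index` and return {key: error_rate}."""
    totals, errors = defaultdict(int), defaultdict(int)
    for record in records:
        key = record[key_index]
        totals[key] += 1
        errors[key] += record[3]  # True counts as 1, False as 0
    return {key: errors[key] / totals[key] for key in totals}


def earliest_error(records):
    """Return the (timestamp, service) of the first error seen, or None."""
    failing = [(record[0], record[1]) for record in records if record[3]]
    return min(failing) if failing else None


# Step 1: which service did the errors start in?
by_service = error_rates(spans, key_index=1)
suspect = earliest_error(spans)[1]

# Step 2: within that service, is every container erroring, or only some?
by_container = error_rates([r for r in spans if r[1] == suspect], key_index=2)

print(suspect, by_service, by_container)
```

In this toy data the per-service error rates alone do not separate the two services; it is the timing of the first error, followed by the per-container breakdown, that narrows the problem down, which mirrors the iterative process in the answer.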

PF It's finding, it's aggregating the data so that patterns become more visible. And then I'm able to kind of get in there and start looking through those aggregations and understanding kind of, sort of where things are separating from what I would expect, it's like, I can see where things differentiate from the norm.

SX It's a perfect description, right? Because I might have, let's say, each one of these statements in my log, but then it's up to me to kind of aggregate in my head what's happening, right? What this tool is trying to do is put it all together. And in a way that is also intuitive, right? Because having the aggregated data by itself is not sufficient. It's a question of how do you also visualize and present it back so that it makes sense to the user? 
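
[Editor's note: As a rough sketch of what "aggregating so that patterns become visible" can mean, the Python below groups individual request records by container and flags the ones whose error rate stands out. The record layout, the container names, and the 0.5 threshold are assumptions made for illustration; a real system would compare against a learned baseline rather than a fixed cutoff.]

```python
from collections import defaultdict

# Hypothetical request records, one per log statement.
RECORDS = (
    [{"container": "c1", "error": False}] * 4
    + [{"container": "c2", "error": True}] * 3
    + [{"container": "c2", "error": False}]
    + [{"container": "c3", "error": True}]
    + [{"container": "c3", "error": False}] * 3
)


def error_rates(records, key):
    """Aggregate records by `key` and return the error rate per group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        errors[record[key]] += int(record["error"])
    return {group: errors[group] / totals[group] for group in totals}


def outliers(rates, threshold=0.5):
    """Return groups whose error rate is at or above the threshold."""
    return sorted(group for group, rate in rates.items() if rate >= threshold)


if __name__ == "__main__":
    rates = error_rates(RECORDS, "container")
    print(rates)
    print(outliers(rates))
```

[Changing the `key` argument to a different field, such as a database call name, would give the same kind of drill-down along another dimension.]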

BP Spiros, we talked about a lot of this on the software side. Is there also a hardware component to this? I mean, it gets so complex once we enter into the world of smart devices and connected devices. I know a lot of security threats these days come from things that in the past would not have been a vector. That could be your smart toaster, or your fridge, or the printer at the, you know, giant oil company that runs a pipeline. Do you also look at signals like that? Or are you purely on the software side?

SX I mean, it's software there as well, right? Because usually, all the devices run an operating system. And that's probably where we start. In our case, for our users, it's usually mobile devices that start the request. And that's probably a bit more relevant part of hardware that we're looking at. But yes, all of the above is relevant.

BP Yeah, 'cause I was just wondering, like, let's say, you know, we have a big service that is deployed to millions and is live, Fortnite or something like that. You know, they might be playing on a mobile device, they might also be playing it on an Xbox, in the near future they'll probably be able to play it on their refrigerator. And maybe, you know, it's something within one of those devices that's causing, you know, this buildup, this issue. Is there a way for you to be able to see that as well?

SX Correct. Correct, right. Let's say I have a consumer application that is served through mobile devices, right, through a mobile app. There's probably a lot of back end services that serve the application. But really, what I need to understand is if my users, let's say, are not getting the type of service they want, like either they face errors or it's slow. First of all, I need to troubleshoot the mobile device itself, understand what's going on there. And then I need to understand why; maybe my back end APIs are slow, right? And once I go to the API, then I need to drill through and understand, all the way to the backend databases, where the problem might be coming from, right. And in some sense, security has similar kinds of challenges in a distributed environment.

PF What's wild to me out of all this, where my brain keeps going is you can build a viable and good career in logging, right? Like, people don't talk about that very much. They do talk about security. You know, DevOps is more and more of an option, but like, you're right, there's no way this is going away. Aggregating and understanding this data and exploring these patterns in an interactive way so that you can figure out system complexity is going to be part of a tech career for the rest of everyone's life. So now I have to go think and reevaluate every decision I've made and decide if I should be investing more time understanding logging.

BP So Spiros, if a person is listening to this, and maybe they want to change careers, or they're still in school, they're thinking about this. What would you recommend to somebody who wants to learn more about this subject or maybe, you know, find a way into, as Paul said, a career in logging, they're looking to get into life logging?

SX I mean, it depends. Like if somebody is a developer, obviously, all of this is relevant, and you're going to be a better developer if you're, I guess, up to speed with what's happening in observability. But if you are a developer, you have many options anyway, right. In reality, if somebody is not as technical as to be able to write software, right, I think actually being an expert in logging or tools like Splunk can actually get you a very good, high paying career, because there is a huge shortage of professionals, let's say, who can do these types of things. And you can get a certification, you can actually get started on your own and learn it. So I do think actually, it's a great opportunity, not just for developers, who in some sense have great opportunities regardless, but even for people who might not have the ability to write software, but still can become a service or like IT admins.

PF You know, when you're in a band, it's always better, in my opinion, to play bass than to be the vocalist or guitarist, because like, everybody needs a bass player. There's never enough, right? If this band doesn't work out, you can go play bass in another band. And I think about this with my kids, because they're into computers, and I'm like, what am I going to advise them? And I'm like, well, Salesforce Apex programming is a good career, right? Like there's all these very specific things. And I've just added this kind of, you know, observability, log analysis and systems for large scale analysis. I'm absolutely convinced that this is the way of the future. And this is all I'm going to be thinking about as we go forward.

[music]

BP Alright, so Spiros, at the end of every episode, I shout out the winner of a lifeboat badge. That's somebody who came on to Stack Overflow and gave an answer to a question that had a score of negative three or less, and with their answer the question got up to a score of 20 or more. So awarded five hours ago to Willi Mentzel, "Round Double to 1 decimal place kotlin: from 0.044999 to 0.1". We can help you with that rounding if you need it. We'll put it in the show notes. I am Ben Popper. I'm the Director of Content here at Stack Overflow. You can always find me on Twitter @BenPopper. And you can always email us at podcast@stackoverflow.com with suggestions or questions. If you like the show, please do leave a rating and review, it really helps.

PF I'm Paul Ford. I am a friend of Stack Overflow. My company Postlight is a wonderful place to work. Check us out online and also online on the internet. And also on Stack. We're a nice listed company over there on Stack. 

BP Yeah, check out the job. Yeah, check out the company page. 

PF Yeah, we'd love you to do that. That's it. That's all you got to know about me.

BP Okay, Spiros. Who are you? And where can people find you on the internet if you want to be found?

SX So, I'm Spiros Xanthos, I'm VP of product management for observability and I work for Splunk. And we also have many career opportunities, you can check out our website. 

BP Alright, Spiros, well thanks so much for coming on. We're glad to have you and yeah, I'm gonna go down and check the logs before I start this fire tonight because... [Paul laughs] See what I did there, Paul? That was a country joke.

[outro music]