The Stack Overflow Podcast

When setting up monitoring, less data is better

Episode Summary

Computer scientist Jean Yang, founder and CEO of monitoring and observability platform Akita, tells the home team how her drive to improve developer tooling led her from academia to Silicon Valley.

Episode Notes

Akita is a monitoring and observability platform that watches API traffic live and automatically infers endpoint structure.

Jean, who comes from a family of computer scientists, earned a PhD from MIT and taught in the CS department at Carnegie Mellon University before founding Akita.

Read Jean’s post on the Stack Overflow blog: Monitoring debt builds up faster than software teams can pay it off.

Jean is on LinkedIn and Twitter.

Congrats are in order for Stellar Question badge winner legendary_rob for asking “Adding a favicon to a static HTML page.”

Episode Transcription

[intro music plays]

Ben Popper Overwhelmed by security alerts streaming in from multiple tools? The Cisco approach to extended detection and response reduces complexity to help security teams work smarter and faster. To learn more visit cisco.com/go/xdr.

BP Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I am Ben Popper, Director of Content here at Stack Overflow, joined as I often am by my colleague and collaborator, Ryan Donovan. Hey, Ryan. 

Ryan Donovan Hey, Ben. How’re you doing? 

BP Ryan, you are the editor supreme of the blog, and today we're going to be joined by someone who wrote a terrific blog post for us. Give the audience a quick intro here, set us up. What are we going to be chatting about today? 

RD So Jean wrote a blog post for us about monitoring debt and how a lot of organizations are sort of falling behind on their monitoring and not taking it seriously, and how a lot of monitoring tools are a little tough to get into.

BP Based on the number of clients we have who are advertising various monitoring services, and the number of discussions we've had about microservices running out of control, monitoring debt seems like a ripe topic. So without further ado, we want to welcome Jean Yang, who is the founder and CEO over at Akita Software, to the Stack Overflow Podcast. Hi, Jean. 

Jean Yang Hi. Thank you for having me. 

BP So for folks who are listening, just give us real quick, kind of a 10,000 foot flyover. How'd you get into the world of software and technology and how'd you end up founding and running your own company? 

JY Yeah, so I grew up programming. Both my parents are computer scientists/programmers. My whole family is– my uncle, my cousins, everybody. I was not sure that I would stay in technology after college. I considered doing many other things, but I took a couple of computer science classes in college and I continued to do it. And for me, one of the big reasons I was considering not being in tech was I thought the tooling was so bad. I said, “Well, why would I spend my life struggling against such subpar tooling?” And then junior year of college I took compilers, I took hardware, I took programming languages, and I realized that these tools are made by people, and people like me could fix them. So I went on to do my PhD in programming languages. At the time, there weren't a lot of developer tooling jobs out there. It was Microsoft, Google, or Green Hills compilers, or maybe the government. There wasn't the bevy of programming tools and developer tools companies that there is out there today when I finished. So I did my PhD to continue understanding how I could contribute to this field. When I was graduating, I again looked around at what I could be doing, and by then I had developed a goal of helping web developers build more easily: reduce boilerplate, help people focus on what actually matters about the functionality of their systems. And at the time, I again didn't see anything I wanted to do better, so I took a job as a tenure-track professor at Carnegie Mellon. A couple of years in, I figured out the thing that was more pressing to do than building more developer tools in academia, which was to help developers understand prod. Over the past decade, I had continued to run into software teams in industry that said, “All the things that you're doing at the app level sound well and good, but none of it matters until we sort out prod.” And so my lens as a programming languages researcher was that development and development tools are all about abstraction: lifting the abstraction barrier, giving developers higher-abstraction tools to do things. And so to me, prod was an abstraction problem. Everyone was living at the level of logs, metrics, and traces, and that wasn't covering all of the needs that people have.

RD Yeah, at that level you still need to sort of figure out what's going on there. You’ve got to do some digging. 

JY Right, exactly. And so to me it felt exactly like what assembly was 50 years ago. Everyone thought that assembly was how you had to program, that you needed that level of control, that programming was manipulating assembly. And then over the next decade, people embraced C. They embraced higher-level languages, they embraced a lot of other stuff. I believe the exact same thing is going to happen with logs, metrics, and traces and monitoring and observability, but how it happens is very much up to companies like us.

BP So you thought about not doing technology, but then you went to school and you realized you could be the change you want to see in the world. You got into the industry. Tell us a little bit about the creation of the company and what's been going on since then. 

JY Yeah, so in 2018 I took a leave from CMU with a vague notion that we’ve got to do something about prod. I didn't have a specific place that I wanted to start. My goal was just to start somewhere that people really, really need us and ultimately give people better abstractions for dealing with prod. It was a very abstract notion of what needed to happen. So I started just by going down my entire LinkedIn, calling everybody that I knew and asking, “Hey, what are your actual problems? Where are the gaps?” The initial conception of Akita was a security company, because that was a lot of my LinkedIn. A lot of my research was focused around security. That was a lot of my network at the time. We've since realized that we could go directly at being a monitoring and observability tool. That was where I had wanted to go, but I wasn't sure how to get there at first until we took that initial detour. But I moved from Pittsburgh out to the Bay Area, drove across the country, sold all my furniture, and just decided this is what I'm going to work on. And since then I've built a team of both very good technologists of the kind that I knew and worked with before, and very strong product people. Technological innovation doesn't really see the mainstream until you have the right user experience, the right developer experience to unlock it. We've seen it with ChatGPT. There's this nice chart of how long it takes for technical innovations to make it into the world, and it's often decades because you really need to iterate to the right UX. But I knew that if we wanted to have impact with technically deep solutions, getting the product and UX right was a very important part of it. So a lot of the journey has been building the right team to complement technologists with the right people to iterate on the right user experience.

RD Yeah. So let's get into the topic from the article. I know this grew out of a conversation with Charity Majors, another person in the observability space. Can you talk about what that sort of seed conversation was like? 

JY So Charity and I hosted a Know Your Observability Stack Twitter Spaces where we offered to have people step up to the mic and tell us about their needs, and we would advise them: “Okay, these are the tools we recommend.” And it came out of a conversation we had over a private one-on-one chat about how different companies have very different needs when it comes to their tooling in general, but especially monitoring and observability. And a lot of the messaging around monitoring and observability is, “This tool is the be-all end-all,” when in fact everyone layers a set of tools for their needs. Everyone, at their different stage of company needs, will pick up different sets of tools. And so this specific article came out of something Charity said, which is, “Jean, wouldn't you recommend this one specific tool as the starter tool?” And it was the tool I thought of as a starter tool. My company picked it up as our first tool. I recommend it to all of my founder friends as the first monitoring tool they should pick up. But what I realized in answering the question in that very moment was, actually, it's not the best starter tool, because what we've been seeing is that my founder friends are not picking it up first. Users of Akita are showing up to our Beta saying they're not picking up that tool first. And so with these two pieces of evidence, it seems like that's the thought-to-be starter tool, but there's some kind of gap. And that got me thinking that a lot of how Akita came to exist was conversations with developers, realizing there's this kind of monitoring gap. And someone in my Twitter network, Yvonne Lam, had used this term ‘monitoring debt’ in response to a thread I did a couple months ago about this. That term really stuck with me, because people talk a lot about tech debt, where you judiciously make decisions that leave your future self a lot of to-dos to clean up the system later. Monitoring debt isn't something that gets talked about as much. Everybody has a large amount of monitoring debt, so when I was first talking to these developer teams about, “What's your monitoring stack? What's your observability stack?” they would tell me their ideal. And whenever I asked questions like, “Well, are you in the middle of a migration? Are you in the middle of a configuration and integration?” they would say, “Oh yeah, we're this many weeks into it. We have this many weeks left.” When I caught up with them, they inevitably were not finished and not as far along as they thought they were, and no one was talking about it. And so everyone continued walking around saying, “Well, you start with this tool, then you graduate onto that tool,” when in reality, maybe a couple of people on the team have kind of set up the first tool, maybe a third person has set up the graduation tool, and it's really covering quite a small fraction of the system at the end of the day.

BP It's interesting to hear you talk about folks needing to cobble together different tools, and that the expected sort of starting point was not what you were finding among founder friends. In the time that you've been working there, can you highlight for us some of the key pain points that you feel like you've been able to solve when it comes to this debt, this feeling of being overwhelmed or unable to interconnect tools to clearly see where your problems are or where you're wasting time? What are some of the things you would tell the developers who are listening, “This is kind of what's been working for us,” and things folks might try?

JY There are three pain points that we've been working on. There are many more pain points. I won't talk about them because we're not working on them. 

BP There's so many pain points. Pick your favorite pain points. 

JY I'm happy to get into those later, but the first is initial integration pain. So my team uses a tool that's not our own. It's supposed to be an easy-to-integrate tool, and it is easier than everything else out there, but it took one of our most senior engineers five days to integrate it, because we had to upgrade other libraries and make updates to the rest of our system. He was not the original person to build those parts of the system, so on about day two, day three, I got a Slack message from him saying he'd lost the will to live, this was such a painful process. And this was years ago. We had three services at the time; there really wasn't very much to do. I could only imagine what it would look like on a system with much more surface area. And then we actually ended up turning off parts of the monitoring functionality because we weren't consuming the data. So the first pain point is integration; the second pain point is consumption of that data. And for us, using other tools, what we found is that if we don't know what thresholds to set for the specific needs that we have at the moment, we end up just turning off all of our alerts and not really consuming that data. That's not the desired situation. We have a lot of layered alerts, so something gets alerted on at some point. But I've often asked, “Hey, could we have had a more specific or earlier alert for some of this stuff?” and there's just quite a bit of work that goes into that. And so the pain points so far are: one, integration; two, data consumption; and then three, accessibility. Right now what I'm seeing on a lot of teams, including our own, is that there are people with experience in both the system itself and the tooling. They can look at graphs and eyeball them and say, “Oh, this is a good graph. This is a bad graph.” But let's say you're a new person on the team or you're a junior person without a lot of experience: what's good, what's bad? Some companies have really solid SRE playbooks: if you're on call, this is what you do. On my team, everybody is on call, all of the engineers at least. But what I noticed is that it does take a while to ramp the team up on “check this, check that,” and there's quite a lot of detail, and that's a secondary and orthogonal skillset to really developing the code itself. So what we've been working on at Akita is, first, integration pain. I would say we're very far along in driving that down. We've taken that from weeks or months to less than a day. There are a lot of users who are able to integrate us in under 30 minutes, sometimes under 15. We've been very happy with that. Then there's data consumption pain. A lot of the tools out there have taken the view of, “Hey, we're going to give you all the data. You can do whatever you want with it,” which is a great view for experts, but if you're not an expert you might want the opposite. So we've taken the very controversial view that you actually want to start with the least amount of information possible and build out from that. And so we automatically listen to API traffic and then we automatically try to infer as abstracted information as we can from that as a starting point. That, I will say, is an active work in progress. I was just on a user call this morning where they were like, “Hey, you have this data, but I want you to show it this way instead.” I'm on user calls all week talking to people about stuff like that. But that's the current goal we're working towards. And then that's also related to accessibility. Yeah, less data is easier.
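
As a toy illustration of that inference step, and not Akita's actual algorithm, here is a minimal Python sketch that generalizes observed API paths into endpoint templates by collapsing variable-looking segments, like numeric IDs and UUIDs, into parameters. All names and patterns here are illustrative assumptions.

import re
from collections import defaultdict

# Illustrative heuristics for path segments that look like parameters
# rather than fixed resource names. These are toy rules, not Akita's.
NUMBER = re.compile(r"^\d+$")
UUID = re.compile(r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$", re.I)

def generalize(path):
    # Collapse variable-looking segments into a {param} placeholder.
    segments = []
    for seg in path.strip("/").split("/"):
        if NUMBER.match(seg) or UUID.match(seg):
            segments.append("{param}")
        else:
            segments.append(seg)
    return "/" + "/".join(segments)

def infer_endpoints(observed_calls):
    # Group observed (method, path) pairs into endpoint templates,
    # counting how much traffic each template has seen.
    counts = defaultdict(int)
    for method, path in observed_calls:
        counts[(method, generalize(path))] += 1
    return counts

traffic = [
    ("GET", "/users/123"),
    ("GET", "/users/456"),
    ("POST", "/users/123/orders"),
]
for (method, template), n in infer_endpoints(traffic).items():
    print(method, template, n)
# GET /users/{param} 2
# POST /users/{param}/orders 1

Starting from inferred templates like these, rather than raw call logs, is one way to show a user less data first and let them drill down.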

BP Yeah, that makes a lot of sense. In the scenario of some sort of military command, you've got all this input coming in, you want to make the right decision, you need to separate the signal from the noise. Start from nothing, figure out what's most important, and then build from there. But do you have customers who come in who've already built up, exactly the title of your post, a lot of monitoring debt? How do you help them strip that away or hold their hand to allow them to make the jump to maybe thinking about resetting or restarting from a new baseline?

JY One of the things I wrote about in the post was that I believe that monitoring and observability, whenever possible, should be black box. And that's something that Charity and I have sparred about on Twitter, where Honeycomb is all about how everything should be fully traced and you should have full provenance on where your requests are coming from. And that's a great view. I love that Charity is pushing on that so I can push on the opposite, which is, “Hey, some people are just never going to be able to get to that, so what do we do for them?” And in an ideal world you have both kinds of tooling in your system. So an analogy that I like to make is: there's the full-speed car with a manual transmission, where you can go around every curve very fast, but sometimes you don't want or need that. Or if you're going around the streets of Amsterdam, maybe you need a scooter or a bike. There are just different modes of transportation that are appropriate for your needs at the time and your desired speed, that kind of thing. And so if you have a lot of monitoring debt, you probably have some amount of legacy code, or code that people on the team did not write last week, which is what I define as legacy.

BP Yes, you're talking to folks who work at Stack Overflow. We're aware, we're aware. 

JY Yeah. So if that's the case, then expecting someone to go in and annotate all that is not practical, and neither is expecting them to migrate that code over to a framework that maybe does automatic annotations. People have asked me, “Well, what do you think about these frameworks that do automatic annotations?” And I say, “That's great for everyone using those frameworks, but what's going to happen to everyone else? They're going to migrate? Migration's a real pain.” And so I think there's a growing need for tools that can give people something about their system without requiring anything. One thing I like to say is that a junior dev, a new grad, should be able to walk into a system and, within the same amount of time that they could call five APIs and get a whole app going, be able to monitor their system, know what's going wrong, and know who to talk to about something that's going wrong. Ops shouldn't have such a differentially high learning curve compared to development.

RD Talking about installing observability tooling on your own system, it sounded like the tool required so many dependencies and so much knowledge of the code that it exposed all the tech debt you already had and made it an immediate priority. Is that something you worked against, to reduce the dependencies and the knowledge of the code required?

JY Yeah, that's exactly step one of what we've been working on. What we're doing at Akita works without any code changes. That was really important to us, because our view was that if you have to change any code, then you have to do exactly what you said, Ryan: clean up tech debt in order to clean up your monitoring debt. You're already in debt. You already need help. You can use all the help that you can get. And so how Akita works is by using this technology called eBPF, or specifically BPF, the Berkeley Packet Filter. We watch API network traffic passively. All you need is a config change, and we're actually rolling out a Kubernetes injection method, so we'll be a much faster install on Kubernetes as of sometime next week when we roll this out. But it was really important to us to require no code changes for our basic amount of monitoring. That is not easy to build if you also have the other requirement of reducing the data, because you can imagine it's hard, but not impossible, to drop something in and just start showing everything about a system. What I believe has been the harder thing for us is, once we're watching all of the API traffic, not to just dump all of it on a user and give them 10,000 pages of every API call you've ever made anywhere in your system.
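
For a rough sense of the passive approach Jean describes, below is a minimal Python sketch that uses scapy with a classic BPF filter to watch plaintext HTTP traffic on a port, without touching the application being observed. It is a toy stand-in, not Akita's agent: the port is an assumption, it typically needs root privileges, and it cannot see TLS-encrypted traffic.

# A toy sketch of passive API monitoring with a BPF filter -- not
# Akita's agent. Requires scapy (pip install scapy), usually needs
# root, and only sees unencrypted HTTP.
from scapy.all import sniff, Raw

HTTP_METHODS = (b"GET", b"POST", b"PUT", b"DELETE", b"PATCH", b"HEAD")

def handle_packet(pkt):
    # Only inspect packets that carry an application-layer payload.
    if pkt.haslayer(Raw):
        payload = bytes(pkt[Raw].load)
        first_line = payload.split(b"\r\n", 1)[0]
        # Crude check for the start of an HTTP request line.
        if first_line.split(b" ", 1)[0] in HTTP_METHODS:
            print(first_line.decode(errors="replace"))

# The filter string is classic BPF, applied in the kernel, so the
# application under observation never has to change or restart.
sniff(filter="tcp port 8080", prn=handle_packet, store=False)

The point of the design is in that last line: the filtering happens on the wire, so the observed service needs no new libraries, annotations, or redeploys.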

RD Right. So I also wanted to kind of talk about the article that I first read of yours– making software for the 99% of developers. Seems to kind of tie into this. Do you think that the software that people make for developers is either overly complex in general, or is it the sort of resume-driven development of the neatest thing? 

JY I think there's a lot of good software development tools out there. But there is, I believe, a premium on taking the tools from your Amazon, your Google, your Netflix, and taking them out into the world. There's so many hot companies that are like, “Hey, we took the thing from Uber and now we're giving it to everyone else,” and that's great for a lot of things. When it comes to software development, if you think about it, there's not actually that many companies with those needs, most companies are never going to scale into those needs, and actually the needs are very different. So if you think about Uber-scale microservices and the tools that are coming out of there, Uber has tens of thousands, maybe hundreds of thousands by now, of microservices. Your run-of-the-mill company, even a very big, successful company, does not necessarily have that many microservices, so is that tool really relevant for a much smaller company? I think for the companies that are like, “We want to be a baby Uber. We're going to be at that scale one day,” sure, maybe it makes sense to invest that way, but my view is always to build for the scale you have now. You can always swap it out later. The tools are going to be quite different by the time you actually get to that scale. So if you put together a practical, non-biased view of the needs of companies out there today, combined with the advice to build for the scale you have and not prematurely optimize for scale you don't have – let's go back to the basics, premature optimization is the root of all evil – the takeaway should be, “Hey, look, most of these companies should actually be building very different kinds of tooling.” But there's a vacuum there for a couple of reasons. One is that companies of a Google, Amazon, Facebook, Netflix scale have the ability to build in-house tooling and then open source it, so that's a big reason why that tooling trickles down that way. That's not really fixable. You're not going to have the software shop down the street building their own tooling. Someone has to build that for them. But the other thing is that I see a lot of investment in, “Oh, hey, this company is taking Google's tools and giving them to the world,” and that to me is fixable. That's people giving other people money to trickle down this tooling, and those people could be giving different people money to build for a large, growing, and decently well-capitalized part of the world. And so that's where a lot of my thinking has gone: into how can this be different, how can we build for real needs? And I would like to say that by 99% developer I don't necessarily mean some small shop in Missouri that isn't talking to anybody else anywhere. Everything is interconnected now. I believe any startup-identifying company anywhere in the world has the potential to be very, very large. Any company that's not one of these big companies trickling down all of their tooling, I would put in this 99% developer camp. So I have gotten pushback like, “Why would you build for such a population of people?” And I'm like, “This is most people. This is most people out there. If you're not Amazon, you're a 99% dev.”

BP I feel like inserting the GIF here of Michael Richards drinking from the fire hose when it comes to AI and its impact on software development. Every day there are new announcements and new leaves to turn over. But Jean, you had a pretty interesting Twitter thread about actually taking this stuff to your team and trying it out when it comes to building, and what the impact was, so can you just share with folks a little bit of your experience there?

JY Yeah, I've been super excited about what GPT-3 and ChatGPT enable, and about what my team – I say my team, but it's really one guy on our team – released over the course of five days: this feature called ‘Ask Aki’. It's a conversational bot that tells people about their APIs. And so Aki is this little, very polite dog that can summarize, “This is what your API endpoint does.” He can automatically generate example payloads for you. He can generate tests for you to call that endpoint with. He can generate documentation. Very clever dog. I mean, it blew all of us away that you could build this. It was essentially prototyped in half a day, which was just crazy to me, but it's very much in line with why I'm excited about these AI tools. Before this, I had a few theses around AI, like: I think AI is going to allow teams to prototype faster. I think the conversational interface is really good because it lets you not take any specific response too seriously; it lets you iterate; it lets you do a lot of things. But after my team member did all this, I realized that this is bigger than I even thought for UI prototyping, because anything that needs a chat you can basically whip up with a good data set and a strong engineer with a good UI sense. I won't discount Paul's role in actually doing everything here.

BP There’s a human in the loop here, okay.

JY But previously Paul would've had to hire an ML team for months to prototype this. Now Paul can do it in a day. So yeah, it was really exciting to me that, like I said, for years we have been curating this data set of the API traffic that we listen to, processing it into, “Here are your endpoints and here's what's going on with them.” So we had a lot of metadata for ChatGPT to work with already. I think you can't just throw tons of raw log data at it without any curation and expect it to just do something. Giving it some kind of semi-structured data really helps, I believe. Watch ChatGPT-5 come out and just prove me wrong on this too. But so there's that. And then I think the use case has to be somewhat amenable to a chat-like interface. I do believe the chat is important, because otherwise people are expecting you to kind of get it right in one shot, and I think then people take the results a little too seriously, which is not good.
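
To make that pattern concrete, curated, semi-structured endpoint metadata handed to a chat model rather than raw logs, here is a hypothetical minimal Python sketch. It is not Akita's implementation: the endpoint data, prompt, and model name are illustrative assumptions, and it assumes the openai Python client with an OPENAI_API_KEY in the environment.

# A hypothetical sketch of an "Ask Aki"-style endpoint chat, not
# Akita's implementation. Assumes the openai client (pip install
# openai) and an OPENAI_API_KEY environment variable; the endpoint
# metadata and model name are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

# Curated, semi-structured metadata of the kind inferred from observed
# traffic -- much more useful to the model than raw log lines.
ENDPOINTS = [
    {"method": "GET", "path": "/users/{id}",
     "response_fields": ["id", "email", "created_at"]},
    {"method": "POST", "path": "/users/{id}/orders",
     "request_fields": ["sku", "quantity"]},
]

def ask(question):
    # Hand the model the endpoint metadata as context, then the question.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[
            {"role": "system",
             "content": "Answer questions about this API. Endpoints:\n"
                        + json.dumps(ENDPOINTS, indent=2)},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(ask("Generate an example payload for creating an order."))

Because the heavy lifting is in the curated metadata, the chat layer itself stays small, which is why a single engineer could prototype something like this in a day.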

BP Yeah, I think the experience I've had is similar to yours in that it's great for prototyping and iterating. Ryan's thesis is that you still need the humans around, not just for their aesthetic design or UX sense, but also when it comes to the requirements and ensuring that it works with the system you've built or the data you have, for example. Do you have any thoughts on that? Where does the rubber meet the road when it goes from how it's accelerating things to where seasoned developers are still necessary?

JY Yeah. Well, I would say that a seasoned developer was necessary to figure out how to feed the right data in the right format and what use case this really plugged into. I think that was actually really key. And so all of the insight of how to hook things up came from the developer, I would say. I guess an AI developer could maybe just iterate over the millions of ways to do this and find it, but that sounds very expensive to me at the moment. So I would say initial insight still comes from a seasoned developer, and then there are many limitations. Really, all of the interactions with Aki don't build on top of state, so if we actually wanted to parametrize future interactions with more state, that's something we would have to start building up ourselves. Maybe OpenAI will come out with an API for doing this, maybe not. There have also been questions around the long-term feasibility of business models building on top of this, because right now ChatGPT is heavily subsidized. I would say that ChatGPT is kind of the no-code of AI, and no-code is huge, but if you're going to build an app that you're going to make lots of money off of, probably it's not going to be no-code. I use Zapier as part of my metrics pipeline and my user tracking stuff – Zapier is a no-code pipeline – but my team codes because we’ve got to build other stuff. And so I would say there might always be a small part of a system that is like this, but you don't really expect your whole tech stack to be no-code for most substantial apps out there. Maybe one day, I don't know. Again, 20 years from now this could be a really dated podcast, but that's kind of my view on this.

RD Yeah. Well, in the comments of the article we were working on, you had an interesting premise that APIs are basically low-code. 

JY Yeah, I think APIs are. I believe in low-code more than I believe in no-code, and I think everyone is mostly programming in a low-code situation these days. They're calling APIs, they're calling other services. There's a lot of glue. Someone had to make the APIs, though, so the code inside the APIs is higher than low-code. But we are effectively living in a world where there's a lot more abstraction, and to me that's progress. So APIs are abstracting over tons of functionality. Most companies don't roll their own payments or their own SMS anymore unless they have super special needs. And that's great. I think that people shouldn't be reinventing the wheel every time. That's innovation. But I think what that means is that the innovation then happens at the next layer up, which is a natural progression. I'm not super concerned or particularly even– well, I am excited, but I'm not like, “Oh my God! The entire world has changed.” I think every abstraction changes the world and lifts us up, and ChatGPT is a big one for sure, but people are going to build on top of this now instead of what we had before.

BP Yeah. When you're coming from the perspective of, “Once upon a time, it was all Assembly,” you see the progression over time.

JY Yeah. Memory management, huge innovation, game changing.

[music plays]

BP All right, y'all. It is that time of the show. Let's shout out a Stack Overflow user who came on and helped ask a question or give an answer, share some knowledge and empower our community. Awarded yesterday to legendary_rob, a Stellar Question Badge: “How do I add a favicon to a static HTML page?” Well, you've helped over 1.8 million people with your question, Rob, so we really appreciate it. I am Ben Popper. I'm the Director of Content here at Stack Overflow. You can always find me on Twitter @BenPopper. If you have questions or suggestions for the podcast, shoot them over to podcast@stackoverflow.com. And if you like what you hear, leave us a rating and a review. It really helps. 

RD I'm Ryan Donovan. I edit the Stack Overflow Blog, which you can find at stackoverflow.blog. And you can find me on Twitter @RThorDonovan.

JY I'm Jean Yang of Akita Software. You can find me on Twitter @JeanQasaur. You can find Akita Software @AkitaSoftware on Twitter and at akitasoftware.com. Would love to get your feedback.

BP Very cool. All right, everybody. Thanks for listening and we will talk to you soon.

[outro music plays]