The Stack Overflow Podcast

Data, data everywhere and not a stop to think

Episode Summary

Ben and Ryan are joined by Nick Heudecker, Senior Director of Market Strategy and Competitive Intelligence at Cribl, to discuss the state of data and analytics. They cover GenAI, the role of incumbents vs. startups, challenges of data storage and security, data quality and ETL pipelines, measures of data quality for GenAI, and Cribl’s role in the data and observability space.

Episode Notes

Cribl is a data management platform. Check out their sandbox or explore their products.

Cribl Stream is their vendor-agnostic observability pipeline.

If you’re new to the term, the observability pipeline is a crucial component of the cloud-native world.

Connect with Nick on LinkedIn.

Chapters

00:00 Introduction and Background

03:23 The Data Landscape and Generative AI

06:08 Incumbents vs. Startups in the Data Space

07:46 Challenges of Data Storage and Exfiltration

09:38 Securing Large Warehouses of Data

12:21 Data Quality and ETL Pipelines

16:05 Measures of Data Quality for Gen AI

22:04 Cribl’s Role in the Data and Observability Space

26:20 The Pros and Cons of Richer Observability Monitoring

28:11 Closing Remarks and Shoutout

Episode Transcription

[intro music plays]

Ben Popper All right, everybody. Welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I am Ben Popper, Director of Content here, joined as I often am by my colleague and collaborator, Ryan Donovan, Editor of our blog, maestro of our newsletter. Ryan, you are an aficionado not just of the data warehouse, but the data lake house.

Ryan Donovan I mean, it's the new hotness.

BP If you don't have a private jet and a data lake house, you're doing it wrong. So today we're lucky to have Nick Heudecker, who is the Senior Director of Market Strategy and Competitive Intelligence over at Cribl and a long time analyst in the world of data and analytics over at Gartner on the program to chat with us about the state of data. We're going to be talking data lake houses, we're going to talk about vector databases and their role in the new world of Gen AI, and just sort of getting Nick’s take on the market with over a decade of experience looking at this. Plus, he'll get the chance to talk his book a little, tell us what Cribl is up to, things of that nature. So Nick, welcome to the Stack Overflow Podcast.

Nick Heudecker Thanks for having me.

BP So for folks who are listening, just give us real quick, what was your path into the world of software and technology and then how did you end up focusing on data? I know you worked yourself as a platform architect and an engineering manager in the world of mobile for a while and then transitioned to the research side and then began to focus more on data and analytics?

NH Yeah, it was a long path. Honestly, it started during a Navy enlistment. I was a cryptologist and that's when I really got my first taste into Solaris and Unix and I was able to parlay that into a job working as a contractor at Ford Motor Company. And that took me, oddly enough, more places around the world than my Navy enlistment did. I finished that job and then ended up joining a bunch of startups. I learned Java programming really well. I feel much better now, though, now that I'm not doing it. Long path, and then eventually I ended up at Gartner where I was an analyst covering data management. So I've always kind of had a data and analytics bend, always been interested in the data side of the equation. And now, let's see, I've been at Cribl for three and a half years. I left Gartner in October of 2020, joined Cribl in November of 2020, and it's just been a rocketship. The company has grown. We were fewer than 50 people when I started, now we're nearly 700. So it has been very rapid growth.

RD Well, it's nice to have a data expert on. We've been talking a lot about generative AI lately.

NH How could you not, right?

RD I mean, so is everybody. And it uses a lot of data, and the data, the way it's packaged has sort of shifted. So can you talk about what that data landscape looks like now, especially in regards to generative AI?

NH Well, there's some new stuff out there. Vector databases have certainly made a splash on the scene. But I think what you're seeing is what you always see when requirements change or desires change. You saw this a long time ago with things like XML databases. I remember in 1999/2000 there was this kind of surge of XML databases, which turned into a $0 billion market because the incumbent vendors added that capability to their own database products. You've already seen in, I think, March of last year, IBM has added vector database capabilities to DB2. Oracle announced it in September. So while there's always this kind of new splash of new products and companies out there to meet this very specific need– in the case of Gen AI, you see a lot of use of vector databases– the incumbents very quickly add those capabilities. We saw them do it with XML databases, object-oriented databases that kind of went away, and then you saw it with JSON, MongoDB. Hard to argue the success that company has had, very popular product, but a number of companies immediately added those capabilities to their incumbent platforms. So with databases, it's easy to kind of point at new technologies and new methods and just assume that it is game, set, match for everything that came before, but that's rarely the case. There's always room for incumbents to continue to innovate, but from the data landscape perspective, certainly vector databases, we see object storage, that's incredibly popular, especially with Cribl customers. Certainly data lake houses, I know you mentioned you're an aficionado of those. I see them as basically kind of low-rent data warehouses, and also data warehouses. All of these companies offering these products are adding the capabilities necessary to do Gen AI. And that's important because, well, that's where a lot of the data is, and that's what really makes a difference with Gen AI. It's not the algorithms, it's really the data.

BP I was listening to an NVIDIA presentation yesterday talking about the AI landscape and a word that came up a number of times was ‘exabytes,’ which kind of puts the fear into you when you're thinking about where you're going to be working and storing data. Another interesting component of this that I'd like your take on, as you mentioned, will incumbents just come to match this versus what can startups do? It does seem like there are folks, for example, Pinecone, who specialize in vector databases and have been growing quite quickly. Do you think that there's some differentiation in what they're offering or just their specialization and their ability to make a name for themselves?

NH I think that there's always going to be a few companies that hang on. There was a bunch of JSON databases, sorry to keep using that example, but it's the most kind of time recent. There's always going to be some of these companies that survive. Pinecone could potentially be one. They'll get some stickiness in the enterprise, they'll develop a community around their product and they'll hang on, but I still think there's a lot of room for the incumbents. One, they've got a lot of money. A lot of these companies are flowing cash like crazy every quarter and that gives them a significant warchest either to build their own capabilities or buy some of these companies as well, if there's a reason to do so.

RD So speaking of money, the data storage and exfiltration isn't always free, so if people are going to exabytes worth of data, are there changes in how these offerings are priced and how people are trying to get around paying tons of money for the exabytes of storage?

NH I think we're going to have to see new technologies for storage if we're talking about exabytes. I did a project right before I left Gartner around DNA-based data storage, and you can store a tremendous amount of data right now at very high cost, but you can still store a tremendous amount of data in a cubic milliliter of water, or not water, but DNA material, and it's incredibly resilient. And you can introduce enzymes into those environments and then process that data there without really burning a lot of electricity. So if you want to start talking about exabytes of data, you need to start thinking about maybe Microsoft's Project Silica or DNA-based options just to make this feasible at any level. Certainly there's a need to store a tremendous amount of data but you've got to have the storage media available. And so a lot of companies, if you start talking about exabytes of data, you could very quickly be priced out. Can you physically buy enough storage to keep all that data? Can you rack and stack those drives? There's just a lot of logistical issues and laws of physics you have to think about. The money aside, is it even physically possible for more than a handful of companies in the world to have that amount of storage on hand? It remains to be seen. Not everybody's CERN.

BP And also, obviously we're talking about generative AI here. We imagine a world one or two years in the future where it's pretty common to have a product where I enter text and it produces high quality video and then I want it to save that video for me. Now everyday I'm ginning up all of this extra data that I want somebody to hold in the cloud for me somewhere, maybe I'll pay for the storage or whatever it may be.

NH And not just storage, but also network. You’ve got to move all that video around, too. Some companies have solved this problem. I think Google's done a good job of it, but that becomes another challenge.

RD So I think another challenge is securing the data. People have very valuable data, even in terms of the logging data. I've heard of logging data getting exfiltrated and people hacking systems because of that. How are people securing these large, large warehouses of data?

NH I think it's piecemeal right now because they're dealing with just so much. Companies have become accustomed to securing their business data because it's tied to a business process. So it's like, “Okay, I understand what started this data collection process. I understand where that data is going and the workflow that got it there.” So it's a lot easier for companies to think about, “All right, my business data, I know how to secure and govern that at least 80-90% of the time.” But log data is coming in from all kinds of different sources, endpoints, servers, applications, containers, and a dozen other different sources in a lot of different formats. Some developer could roll out an update to an application and all of a sudden you're going from info logs to debug logs and you've got clear text passwords in there. And so I think it's piecemeal at best. This is a frequent use case that we see at Cribl where all the data passes through our observability pipeline product and we're able to go through and say, “All right, here's PII data, let's mask that.” And by having that one point of control between your sources and destinations, we can greatly advance that conversation. Most enterprises today, though, are not there and they're struggling, I think. And also because it's not directly tied to a business process, it's kind of like the exhaust of business processes, a lot of companies don't even know what they've got or don't have, so it's difficult for them to know what to secure and how to secure it.
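
[Editor's note: the masking step Nick describes, one point of control between sources and destinations that scrubs sensitive values, can be sketched roughly as below. This is a minimal illustration in plain Python, not Cribl's product or API. The regular expressions, the `mask_event` and `mask_stream` names, and the placeholder strings are all assumptions made for the example, and real log formats would need their own rules.]

```python
import re

# Hypothetical patterns for illustration only; real deployments need rules
# tuned to their own log formats and their own definition of sensitive data.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET_RE = re.compile(r"(?i)\b(password|passwd|token)=(\S+)")


def mask_event(line: str) -> str:
    """Return the log line with secret values and email addresses masked."""
    # Keep the key name so the event stays readable, but drop the value.
    line = SECRET_RE.sub(lambda m: f"{m.group(1)}=****", line)
    # Replace anything that looks like an email address with a placeholder.
    line = EMAIL_RE.sub("<email>", line)
    return line


def mask_stream(lines):
    """Mask every line as it passes from a source toward its destinations."""
    for line in lines:
        yield mask_event(line)
```

[For example, `mask_event("DEBUG login user=ann@example.com password=hunter2")` would come back with the address and the password value replaced, which is the sort of clear-text leak a switch from info to debug logging can introduce.]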

RD I just wonder if there's a sort of computational cost if you decide to encrypt everything, all your data at rest. Does that add blockers or overhead to your computation?

NH I don't have any experience there. I imagine that kind of the obvious layman's answer is, yes, there's certainly going to be overhead there, not just in data processing but also things like metadata management and how do you understand the data quality? Is that data accurate? Does it reflect a real world scenario? If it's all encrypted, it becomes very difficult to figure that out. Is this data even viable for a Gen AI use case? Every layer of protection you put in creates an additional layer of complexity when you try and get it out, and where do you find that balance?

BP I'm not sure this is an area you know about but I just wanted to ask, when you have proprietary data within a company and you're interested in leveraging it in the Gen AI space, is there a new kind of ETL pipeline, a way of looking at data quality, whether that's accuracy, freshness, relevance, or a process for doing that lift and shift and maybe some human annotation that you think folks are beginning to understand best practices and glom onto some of them as they work on this?

NH I think companies are starting from a much lower level of maturity than that question kind of gives them credit for. They're still working on the basics of data quality, and honestly, nobody gets out of bed excited to go make sure they have high quality data. It's janitorial work at best and no one wants to be responsible for it. So they're starting from a very low sense of maturity so they have a long way to go. You mentioned lift and shift, so are there new ETL pipelines that are coming out to do this? I'm not seeing anything particularly compelling. There's some startups out there that are just getting started trying to figure out product market fit, what their go-to-market might look like when they eventually find that and so on, so it's very early days for these kinds of AI ETL pipelines. But I think companies need to really deal with the blocking and tackling of classic data management. Data quality, metadata management, perhaps master data management, all of the things that we've been talking about for over a decade are still things that companies need to do. If you want to be successful with Gen AI, start with the basics, honestly. Catalog your data. Where does it come from? Who accesses it and why? How frequently does it get accessed? Apply the classic measures of data quality, completeness, consistency, validity, all that stuff. Also ensure that it also has a high level of business data quality as well instead of just the technical aspects of data quality. Does this data reflect the business world that I live in? And that's always changing. If your business isn't always changing, if your data is not always growing, you're basically going out of business. I'm pretty confident in saying that. So it's not something you can just kind of fire and forget. You need constant care and feeding of all of those disciplines. I don't know if that really answers your question, but it's kind of a curmudgeon-like response. But frankly, that's just what I've seen.
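
[Editor's note: the classic measures Nick lists (completeness, consistency, validity) are simple to compute once you decide what counts as missing or wrong. The sketch below is one rough way to express them in plain Python. The function names, the sample records, and the rules (an `@` check for emails, shipped-after-ordered for consistency) are hypothetical and chosen only for illustration. The business-level question of whether the data reflects the real world still needs a person.]

```python
def completeness(records, field):
    """Share of records where the field is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def validity(records, field, is_valid):
    """Share of non-missing values that pass a validation rule."""
    values = [r[field] for r in records if r.get(field) not in (None, "")]
    if not values:
        return 0.0
    return sum(1 for v in values if is_valid(v)) / len(values)


def consistency(records, rule):
    """Share of records that satisfy a cross-field rule."""
    if not records:
        return 0.0
    return sum(1 for r in records if rule(r)) / len(records)


# Hypothetical sample data: ISO date strings compare correctly as text.
orders = [
    {"id": 1, "email": "a@example.com", "ordered": "2024-01-02", "shipped": "2024-01-05"},
    {"id": 2, "email": "", "ordered": "2024-01-03", "shipped": "2024-01-01"},
    {"id": 3, "email": "not-an-email", "ordered": "2024-01-04", "shipped": "2024-01-06"},
    {"id": 4, "email": "d@example.com", "ordered": "2024-01-05", "shipped": "2024-01-07"},
]

email_completeness = completeness(orders, "email")
email_validity = validity(orders, "email", lambda v: "@" in v)
date_consistency = consistency(orders, lambda r: r["ordered"] <= r["shipped"])
```

[Tracking scores like these over time, rather than computing them once, is closer to the "constant care and feeding" Nick is describing.]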

RD Are there new measures of data quality that come about because of Gen AI, and are there ways to sort of distinctly say, “This is good data.” Can you universally say that this is good data?

NH I think it's been a forcing function for people to look at, “If I'm going to train on this data to make decisions or generate content, will it accurately respond to real world scenarios?” And so this has been a forcing function for companies to think, “Does my data reflect the real world? Am I training these models to act in a real world?” And I think that's something that, maybe not new measures, but it's a refined focus on these data aspects.

BP I do think there's an interesting question there, which is, what is the purpose of a Gen AI versus a search engine? A search engine's job– we want to bring the world's knowledge to you and help you find accurate and trustworthy sources. In a Gen AI, often what they're saying is that we want to help you create, we want to help you as a collaborator, as a thought starter, as a synthesizer of information. They have to then play politics and worry about bias in a way that's very different from search and overcorrecting for that may lead to as many headaches as under-correcting for that. But it is interesting to think about the open acknowledgement, which I think kind of kickstarted all of this, which was that no big company wanted to be the first one to dip their toe in this. It took OpenAI to say, “Here's this thing. It makes lots of mistakes. Hope you have fun with it,” and then everyone said, “Wow! This thing is great!” And then everybody else had to kind of play catch up. And so Ryan and I have been looking at a lot of different systems and they are evolving. Retrieval augmented generation is a super interesting one in the context of what you do. That's all about taking a subset of data, organizing it a certain way, thinking about embeddings and vector databases to avoid hallucinations, to make a Gen AI that's meant to be a knowledge chatbot more efficient in certain ways. And now we're getting these systems that have multiple kinds of AI. They have a Gen AI, but then they also have a planner or they have a critic or they have a fact-checker kind of built into them. And so I guess the question of accuracy, to your point, we're not mature enough even in knowing what the final use case is. There's not a lot of fully fleshed out business use cases. Klarna said they replaced their first line of call centers with AI. Well as far as I knew, most of those people were robots anyway. 
My experience with customer service is that most of the time I'm talking to a robot and a dial-by-number thing anyway.

NH I read Klarna's press release on that and it was basically a case study in the book, “How to Lie with Statistics.” They really shaped that data so it was like, “Okay, I'm going to make this look as positively as possible to reflect the benefits of AI,” and I found a lot of that to be highly suspect.

BP I think they're preparing maybe hoping to get into their IPO window and they wanted to put some good stuff out in the market.

NH Look, there's a lot of AI washing out there and I think there's always a presumption towards maturity. And this stuff has not been out that long. It's not the seventh inning. We're still in warm ups right now. The pitcher is on the mound, we haven't even had the National Anthem sung yet. It is still incredibly early. There's going to be a lot of regulation that comes out about what you can and can't do, data sources that you can and can't use. I'm sure the European Union is going to have some opinions followed by regulations about what data you can use from European customers and how that might vary. So we're really just getting started. It's far too early to pick winners or even identify what viable use cases might be. You mentioned OpenAI coming out and saying, “Hey, this thing is not perfect.” We have a mantra within Cribl that if you're not embarrassed by your first version, you shipped too late. You’ve got to get it out there. You’ve got to see what's going to happen, how people use it, and then adapt from there.

RD No test like reality, right?

NH Right.

BP Nick, I saw you had something up and it was mentioning where you work and some of the stats around Cribl and its growth. And you mentioned at the beginning of the episode how quickly it's been growing. So can you tell me sort of where it fits into the space of other companies that are playing around in the world of data and observability? What's the differentiator if someone were to come to you from a Snowflake or a Datadog or a Splunk or wherever?

NH So initially the company's vision was to make that observability data better. Our initial product was Cribl Stream which is an observability pipeline. We didn't come up with that term, a guy named Tyler Treat did. He's got a great blog post on it. And basically what that does is it sits between the sources of observability data– the things that are emitting logs, events, metrics, and traces– and it acts as a universal collector and receiver for all of that information, allowing you to route it to multiple destinations, enrich it in flight, filter out data that maybe you might want to redact or eliminate. Getting rid of white space, for instance, is incredibly valuable. You can also transform data. If you're familiar with Windows Classic events, they're a nightmare to deal with. They're incredibly bulky. If you can convert that into a more economical format, you can save a lot of money downstream. And you can also, once again, reduce that data if you've had things that may be events that you don't need or things that you want to suppress or sample. There's a lot of different ways you can manage that data as it goes on to its destination. That was where we started. Cribl Stream was our flagship product. We've since been expanding our product portfolio quite a bit. We're now calling Cribl ‘the data engine for IT and security.’ So when you think about Cribl, I want you to think about us in the same breath you might think about Snowflake or Databricks but for a very specific audience, and it's that IT and security audience that's always been underserved for its data needs. So we've expanded Cribl Stream to also include Cribl Edge, which is our unified agent to do data collection and routing directly from the end point. Good level of manageability for fleets and sub fleets there. We've also included Cribl Search. 
And to your earlier point that some of this data or a lot of this data has to be centralized in one place, well, when you're talking about petabytes or maybe even exabytes of data, you're not moving that, that's staying where it is. What Cribl Search does is it allows you to leave that data where it is and then search it in a federated way. So if you've got data in a thousand S3 buckets, and you've got data in Splunk, and you've got data in Snowflake, we can search all of those. Now, this is not SQL, because IT and security audiences do not know SQL. Not NoSQL, but they do not know it as a first class kind of language to interact with data. They know search. They grew up with Elastic, Splunk, PromQL, etc. So with one search interface, they can search against all of these different data sources, including Cribl Edge, and get their results back. And the last product that we've recently introduced, currently in beta and will be GA soon, is Cribl Lake, and this is a way to, out of the box, organize your data for whatever use cases you may have downstream, share it with other parties, and so on. So I think of this as a way to compose your own data infrastructure for all of this observability data that you're dealing with. You can put the pieces together that you need, send data where it needs to be delivered to, and in the context of Gen AI, we see a number of companies that use us to cleanse and refine data and label it before it goes into AI ops-based systems, for instance, that are probably some of the earliest AI companies out there in the enterprise.
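
[Editor's note: the general shape of an observability pipeline as Nick describes it (sit between sources and destinations, trim and sample events, enrich them in flight, and route them to more than one place) can be sketched in a few lines. This is a toy model in plain Python and does not represent Cribl Stream's configuration or API. The `run_pipeline` function, the `ROUTES` table, the destination names, and the every-Nth sampling rule are assumptions made for the example.]

```python
from collections import defaultdict


def run_pipeline(events, routes, sample_every=10):
    """Sample, trim, enrich, and route events; returns {destination: [events]}."""
    out = defaultdict(list)
    debug_seen = 0
    for event in events:
        level = event.get("level", "info")
        if level == "debug":
            debug_seen += 1
            # Keep only every Nth debug event (simple deterministic sampling).
            if debug_seen % sample_every != 0:
                continue
        slim = dict(event)
        slim["level"] = level
        # Collapse runs of whitespace to shrink the payload.
        slim["message"] = " ".join(str(event.get("message", "")).split())
        # Enrich in flight with a hypothetical tag.
        slim["pipeline"] = "demo"
        # An event may go to several destinations at once.
        for destination, wants in routes.items():
            if wants(slim):
                out[destination].append(slim)
    return dict(out)


# Hypothetical destinations: everything is archived, only problems are alerted on.
ROUTES = {
    "archive": lambda e: True,
    "alerting": lambda e: e["level"] in ("warn", "error"),
}
```

[Because routing is just a table of predicates here, pointing a copy of existing traffic at a new destination is a one-line change, which is the low-cost "kick the tires" flexibility that comes up later in the conversation.]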

RD Since you're in the observability space and we talked about logging a little bit earlier, do you think the move towards the sort of richer observability monitoring is a good thing or does it have drawbacks?

NH I think it's a good thing and I think it has drawbacks. Like any hyped term, the difference between the promises and the ROI are very different. And I will say this, you cannot buy observability, but it won't stop people from trying to sell it to you. And so I think there's been a lot of kind of religious style arguments of, “Oh, if you need observability, you just need traces and you don't need metrics and logs and events and so on.” And another company or party may say, “No, no, no. You don't need traces, you need logs and you need, et cetera.” You have to decide for you what you need, and often different groups within the same company need different sets or subsets of that data. So as long as you don't get caught up in the dogma of what observability is ‘supposed to be,’ it can provide massive value, but I think it's been a long time coming. I talk to economic buyers for solutions and they often don't understand what they're going to get out of an observability solution that they're not already getting out of a monitoring solution. So it's really incumbent on the buyer to understand what that potential ROI is going to look like, and keep your hand on your wallet because a lot of these investments may not pay off for you, but it is also important to experiment and see what may work over time. This is actually one of the advantages that we see by sitting with one of our core products, with Cribl Stream sitting between sources and destinations, companies can kick the tires on a lot of different products without having to commit to them. Because we can take existing data sources, route them to a new platform in the right format and make that testing very low cost. So having that flexibility and keeping your options open is super useful.

BP So as we wrap up here, I have to assume a former Navy man working in cryptography quite a long time ago and then spent some time thinking about the value of data and compute that you're holding a big bag of crypto, right? You were early, early? No, I'm just teasing.

NH Nope. And by the way, it was not that long ago. Actually it was, I'm just offended you said that. No, not a crypto fan. I will admit that, I will admit that.

BP A cryptographer, but not a crypto fan. That's okay. You’re among friends.

RD Also not a crypto fan.

NH Maybe I've got some inside knowledge that everybody else should take advantage of, but I'm not a crypto fan.

BP Ooh. Uh-oh. Okay, good to know.

[music plays]

BP All right, y'all. We are going to give a shout out to someone who came on Stack Overflow and spread a little knowledge or contributed some curiosity. How can I fix “ValueError: Expected 2D array, got 1D array instead”? Oh no, you don't want a 1D array in your scikit. This Be Shiva got a Lifeboat Badge yesterday for giving a great answer and has helped, let's see, over 90,000 people. So we appreciate it, Shiva, and congrats on your badge. As always, I am Ben Popper. You can find me in 2D on X. You can email us with questions or suggestions: podcast@stackoverflow.com. And if you like the show, leave us a rating and a review.

RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. You can find it at stackoverflow.blog. And if you want to reach out to me on X, you can find me @RThorDonovan.

NH Nick Heudecker. I'm the Senior Director of Market Strategy and Competitive Intelligence at Cribl. If you want to find me, tell me my opinions are wrong or right, hopefully right, you can find me on Twitter/X @NHeudecker. And if you can figure out how to spell my last name, that's part of the challenge. And if you'd like to know more about Cribl, go to Cribl.io or join the Slack community at Cribl.io/community. There's about 7,000 folks there happy to answer your questions.

BP Sweet. Thank you so much for listening. We will talk to you soon.

[outro music plays]