The Stack Overflow Podcast

Scaling systems to manage the data about the data

Episode Summary

On this episode, Ryan and Cassidy talk to Satish Jayanthi, CTO and co-founder of Coalesce, about the growth of metadata and how you can manage it, especially in systems using generative AI. They explore the importance of providing context and transparency to data, how metadata can be generated automatically, and the future of metadata, including knowledge graphs.

Episode Notes

Coalesce is a solution for transforming data at scale.

You can find Satish on LinkedIn.

We previously spoke to Satish for a Q&A on the blog: AI is only as good as the data: Q&A with Satish Jayanthi of Coalesce

We previously covered metadata on the blog: Metadata, not data, is what drags your database down

Congrats to Lifeboat winner nwinkler for saving this question with a great answer: Docker run hello-world not working


Episode Transcription

[intro music plays]

Ben Popper Maximize cloud efficiency with DoiT, an AWS Premier Partner. With over 2,000 AWS customer launches and more than 400 AWS certifications, DoiT helps you see, strengthen, and save on your AWS spend. Learn more at doit.com. DoiT– your cloud, simplified.

Ryan Donovan Welcome to the Stack Overflow Podcast, the place to talk all things software and technology. My name is Ryan Donovan, I'm the Editor of the blog here at Stack Overflow, and I'm joined today by occasional co-host, Cassidy Williams. How are you, Cassidy? 

Cassidy Williams I'm good. How are you today? 

RD Pretty good. Getting a little cooked out in the heat here, but besides that. 

CW Yeah. It's summer, it's fine. 

RD It’s summer. Well, we have an excellent guest today– Satish Jayanthi, who is the CTO and co-founder of Coalesce. We've had him on for a Q&A before, but this time we're going to hear his voice and talk to him live. So welcome to the show, Satish. 

Satish Jayanthi Thanks, Ryan. Thanks for having me. Hey, Cassidy. 

CW Hey, good to see you. 

RD So top of the show, we like to talk to our guests, find out how they got into software and technology. What's your origin story? 

SJ When I started my career, I was working as an application programmer. I spent a couple of years doing C and C++; that was my entry-level job at the time. And then when I started working in LA for a startup, I accidentally became a DBA. 

CW As one does. 

SJ So as people do in startups, they wear several hats, and one of the hats that I was wearing, in addition to maintaining all the servers and stuff, was being responsible for providing data to the business. When they came up with a question, I had to go write a query, get that data into Excel, and give it back. And the startup grew, and I found that I was not scaling with that type of request, so I started looking at how to do this in a much better way. And that's when I ran into data warehousing, picked up Kimball's book, and I've pretty much stayed in that space ever since. 

RD So it was a big journey into data once you accidentally got into it, but today I think we want to talk about metadata. So what is metadata, and how does it differ from data? 

SJ Metadata provides more details about the data; it provides more context about the data. When we take a picture, we're actually capturing some metadata behind the scenes. The picture is the actual data, and where it was taken, when it was taken, the location, the coordinates– all of that information is metadata. So that additional context about the data is metadata. There are several definitions, “data about data” and so on, but that's the basic idea there. 
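
As a concrete illustration of the photo example, here's a minimal sketch that reads the metadata embedded alongside an image's pixels using the Pillow library (the file name is hypothetical):

```python
from PIL import Image
from PIL.ExifTags import TAGS

# Open the image: the pixels themselves are the data.
img = Image.open("photo.jpg")  # hypothetical file name

# The EXIF block is the metadata captured behind the scenes:
# when the photo was taken, the camera model, GPS coordinates, etc.
exif = img.getexif()
for tag_id, value in exif.items():
    tag_name = TAGS.get(tag_id, tag_id)  # map numeric tag IDs to readable names
    print(f"{tag_name}: {value}")
```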

CW So “data about the data” I feel like is a good way to say it. 

SJ That's a famous definition a lot of people use to describe it. But again, it's the context. 

RD Well, I'm sure with Gen AI and large language models today, there's a lot of data around that– tagging, vectors, all these other things around data. What sorts of issues and capabilities are developing around metadata? 

SJ So we have built analytical systems for a while now, for several decades. We basically go and find out what the rules are, and we code them; whatever tools you use, at the end of the day, you're coding everything according to the specs. And as these systems grow bigger and bigger, you'll see that transparency is going to be a problem. People ask, “Hey, how did you get this metric? How did you calculate this?” And it becomes a really big challenge for somebody to go and understand how that was calculated. They have to sift through the code, they have to understand all of that. That's how we've done these systems. Now, imagine we are introducing what is essentially a black box into this flow, which is what an AI system is, in my opinion. With AI, nobody's coding the rules. You're basically giving AI a lot of data and having it learn from the patterns in the data. That is now sitting in the workflow of these systems, so you can see how we've basically made the transparency problem even worse. The problem space is much harder. So that's where I believe metadata plays a role in making these systems a bit more transparent. 

CW Well, that makes sense. And I suppose if it's more transparent too, it's more reliable overall. 

SJ Absolutely. That's one of the things. We all know that AI is a great technology, but people are finding out that these systems are not perfect. They hallucinate, they give you wrong answers, so definitely reliability is a big issue. And if you make these systems more transparent, it improves the trustworthiness of the models. It improves the accuracy. 

RD So can we use metadata to improve that trustworthiness, and how? 

SJ So going back to the data that goes into AI systems, the basic idea is that you feed these systems lots and lots of data, and they learn automatically from the patterns; that's how they understand how to reason based on the facts we've fed them. And when you prompt, it's going to look at those and spit out a token or whatever to respond to your prompt. The challenge is that it doesn't have additional context. It's almost like if you and I are playing a guessing game: if I don't say anything about what I have in my mind and ask you to guess it, it becomes very hard. But once I give you some additional context– “Hey, I'm thinking about an animal”– all of a sudden you're narrowing down the scope of what you're thinking. Metadata essentially works that way. Metadata can provide that additional context to these systems and improve the accuracy, improve the reliability, and so on. And there's also the preparation of the data– you have to feed high-quality data to train these models. How do you clean that data? How do you build that training data set? Metadata plays an important role there as well, not only in providing additional context to the AI models, but also in helping prepare the data that goes into them. 
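
To make the guessing-game analogy concrete, here's a toy sketch of how metadata might be injected as context into a prompt before it reaches a model. The table, column names, and descriptions are invented for illustration:

```python
# Hypothetical column-level metadata, e.g. pulled from a data catalog.
table_metadata = {
    "table": "orders",
    "columns": {
        "order_ts": "UTC timestamp when the order was placed",
        "amount_usd": "Order total in US dollars, after discounts",
        "region": "Two-letter sales region code (e.g. 'NA', 'EU')",
    },
}

def build_prompt(question: str, metadata: dict) -> str:
    """Prepend schema metadata to the user's question so the model has context."""
    lines = [f"Table: {metadata['table']}"]
    lines += [f"- {col}: {desc}" for col, desc in metadata["columns"].items()]
    context = "\n".join(lines)
    return f"Given this schema metadata:\n{context}\n\nQuestion: {question}"

print(build_prompt("What was total revenue by region last month?", table_metadata))
```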

CW It feels like a bit of a chicken-and-egg problem, where if you don't have metadata, you can use AI to help you make metadata, but you also potentially need the metadata you make with AI to give context to more AI.

SJ That’s correct.

RD So last time we talked, we talked about data quality. Are there the same issues with metadata? Is there poor-quality metadata? Is there such a thing as too much metadata? 

SJ There could be quality issues with metadata, obviously. You can explain something incorrectly, because metadata is describing the data. But a lot of the time this metadata is generated automatically behind the scenes, so nobody's inputting it manually– not every time, but that reduces the chance of having bad metadata in the first place. That quality issue is important even with metadata, but not as much as it is with the data itself. Because when you capture data, it's often coming from people entering it manually, wherever it's coming from. There's a lot of noise, there's a lot of potential for errors in that data. 

RD One of the things I've heard about too much metadata is that the metadata ends up being bigger than the data itself, and all your storage and transfer requirements are going to the metadata. You end up paying more for the metadata than the data itself. Is that actually an issue? 

SJ I have not come across that issue. Usually when we store metadata about, let's say, one terabyte of data, the metadata is not even one gig. But on the other hand, it's all about how much you're describing the data. You can go overboard and start saving useless metadata– that's possible– but typically that's not the case. Typically, you see metadata having a very small footprint, a very small storage requirement, and being very powerful because it's the summary. It's the description of the data. 

CW Cool. And so for people who are sold on this concept of, “Oh, I probably need to clean up my data. I probably need to add some more context and stuff,” what are the pitfalls they commonly tend to fall into if they're fresh to approaching it?

SJ So there are definitely quite a few challenges and pitfalls for organizations in using metadata correctly. There is the availability of metadata in the first place. First of all, do you have the metadata, and where is it? For example, if you have legacy systems and data silos, there may not be enough metadata there. You may not understand exactly what the system is doing in some cases. So there's a gap there– that's one. It also requires some skill to organize all of this metadata and leverage it. This is taxonomy and ontology and all of that, where basically you're organizing it in such a way that you can leverage it. But if you don't have the skill and you don't have the experience, that's going to be another challenge. Another one is collaboration. Because metadata is gathered from multiple systems, you need to collaborate on how you can pull all of this metadata into one place and leverage it, and that's a people challenge. There are several others I can think of, but these are some of the main ones: skill, the availability of metadata, and how you collaborate and organize all of this so you can leverage it. 

RD You mentioned automated metadata generation and that there are tools that can do this. So what does automated metadata generation look like, and how do you automatically create additional context around your data?

SJ Automated metadata basically means that while you're performing whatever you do to run your business or your workflow, the context is being captured for you behind the scenes. That's essentially the idea. When we take a picture, you're focused on taking the picture. You're not thinking about all the metadata that's going to be captured behind the scenes, and that's essentially the idea: we want to capture this important information without the user necessarily being aware of it or actively involved in storing it. Once you collect all this metadata, it obviously becomes very, very powerful. And you want systems that do this behind the scenes. 
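
One hedged sketch of what “capturing metadata behind the scenes” could look like in code: a decorator that records run metadata as a side effect of executing a workflow step, without the step's author doing anything extra. The step and field names here are invented:

```python
import time
from functools import wraps

captured_metadata = []  # in a real system this would feed a catalog or log store

def capture_metadata(step):
    """Record what ran, when, and how much data moved, without the caller's involvement."""
    @wraps(step)
    def wrapper(rows):
        started = time.time()
        result = step(rows)
        captured_metadata.append({
            "step": step.__name__,
            "started_at": started,
            "duration_s": time.time() - started,
            "rows_in": len(rows),
            "rows_out": len(result),
        })
        return result
    return wrapper

@capture_metadata
def drop_nulls(rows):
    # The author only writes business logic; metadata capture is automatic.
    return [r for r in rows if all(v is not None for v in r.values())]

drop_nulls([{"id": 1, "amount": 10}, {"id": 2, "amount": None}])
print(captured_metadata)
```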

RD It sounds like it's almost like logging or other kinds of observability metrics. 

SJ Absolutely. These data observability systems are pretty good at that. Data discovery systems do that, catalog systems do that. 

CW So in your work day to day, how do customers tend to approach you? Are they kind of just like, “Hello, we need help,” or do they have some ideas in mind? What does a typical flow look like to help people get their data labeled better and everything? 

SJ So at Coalesce, we focus on data transformations, helping customers solve this big bottleneck that people have. They land all this raw data into some database platform like Snowflake, for instance, and then their main goal is to take that raw data set, apply rules, clean it, validate it, and then prepare it for whatever use– one of those uses could be feeding AI systems or doing feature engineering, whatever they're doing with it– but the foundation has to be there. And that foundation is high-quality, transparent data with lineage, proper governance, all of that. If you have that level of quality in your data, then anything you build on top of it is going to carry that quality as well. So what we do at Coalesce is help our customers build that foundation the right way and in the fastest way. 

RD So you must do a lot of data transformation pipelines, ETL, and that sort of thing. We had somebody on recently who said that ETL pipelines might be a thing of the past– that you could be transforming data in place with larger and larger database structures. Do you think that's a reasonable thing to say, or is that not your experience? 

SJ So traditional ETL is a bit different, and then there's the ELT paradigm. ELT means you're doing the transformations within the database platform, and that's essentially how we do our transformations. The traditional ETL paradigm was, “Hey, pull all the data out of the source systems, prepare it in a separate place, and then load that curated data set into a target platform.” That is no longer feasible and also doesn't make sense, because these database systems are so powerful, and you're already invested in them, so you might as well use them to do that processing as well. So that's ELT. Now, will transformation go away fully? I don't think that's going to happen anytime soon. I believe we are a long way from that. There's always going to be some transformation needed, some preparation needed; otherwise, we're jumping ahead. This is what happened when AI systems came along. People said, “Just feed all the data to the AI system. It'll figure it out.” Well, sure– that's not how it works. There is an element of automation there that the AI system brings, but there's also a lot of risk if you're not preparing the data you're feeding it. Now, AI could be a tool in preparing this data, but it's not going to replace the entire thing. That's not how I see it. It's going to improve it. But remember, as our technology is getting better and we're getting more productive at solving these problems, our data sets are getting more complex. Data sets are getting huge, so your problem space is becoming much bigger. The number of sources we have to deal with today is a lot higher than what we were dealing with 10-15 years ago.
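
A minimal sketch of the ELT pattern Satish describes: the raw data has already been loaded into the database, and the transformation runs as SQL inside it rather than in a separate processing tier. SQLite stands in here for a warehouse like Snowflake, and the table names and SQL are illustrative; a real warehouse driver would follow the same shape:

```python
import sqlite3  # stand-in for a warehouse connection (e.g. Snowflake's Python connector)

conn = sqlite3.connect(":memory:")

# "E" and "L" have already happened: raw data is landed in the platform.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL, region TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 10.0, "na"), (2, None, "eu"), (3, 25.0, "na")],
)

# The "T" in ELT: the transformation is expressed as SQL and executed
# in-database, where the data already lives.
conn.execute("""
    CREATE TABLE clean_orders AS
    SELECT id, amount, UPPER(region) AS region
    FROM raw_orders
    WHERE amount IS NOT NULL
""")
print(conn.execute("SELECT * FROM clean_orders").fetchall())
```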

CW It feels like there's a big pendulum swing that keeps happening with data, where it starts to get too complex and people err on the side of simple, but then it gets too simple and we get more complex again. And I feel like, especially now with AI and with managing a lot of data, people aren't sure where they should be on that pendulum swing just yet. Do you have any thoughts on that? 

SJ Things are becoming easier in one way, which is AI helping with a lot of things, obviously making things easier, from writing emails all the way to doing anything computational. It does help, but there's also an element of complexity because of the nature of these systems, which I was referring to earlier as a black box. How do you train these systems? Now you've got to be more careful about what you're feeding them. So that's the complex part now. How do you train these? What if the system becomes so opinionated that it's not what we want? So there's a balance that's needed here, and it takes some time for people to implement these systems, understand the risks, understand how they work, and find that right sweet spot. I believe that's where we're getting to. ChatGPT came out last year and that was the peak of the excitement, and now expectations are becoming a little more realistic in terms of what these systems can do. They're still fantastic systems, but when you're talking about solving business problems, there's a lot more challenge there, a lot more accuracy that's needed. So that's where we are today, and I think people are trying to understand that nuance so they can incorporate these systems in a more practical and proper way.

CW It is one of those things where a lot of these LLM systems are very cool, but there's an accuracy element where, once you get beyond cool and trying to make a poem that sounds like a frog, you need to get to the point of, okay, how can my data actually be handled better? How can we save human time to offset costs? 

SJ That’s correct.

RD So like you said, these data sets keep growing and growing, and the metadata keeps growing and growing. What do you think the trends for the future are for metadata? What do you think is going to happen? Are there going to be new AI capabilities? Is somebody going to say, “This is the standardized metadata”? What's next? 

SJ So one key area, again, is the additional context that metadata provides. There are going to be improvements there. We're already seeing RAG systems– retrieval-augmented generation systems– where you have a vector database and provide additional context through it to the LLM. I believe semantic metadata is going to become very useful– how you define your business semantically– and then you combine that with something like knowledge graphs. Knowledge graphs represent your data in a more natural way using a graph, and you combine that with the semantic layer, which is the semantic definition of your business, and then provide these two as input to the LLM or any AI system. Now the accuracy will go up, way up, because the LLM has a lot of information at its disposal to come up with the response. 
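
To ground the RAG description, here's a toy sketch of retrieval-augmented generation: embed the question, retrieve the closest pieces of semantic metadata from a small in-memory “vector store,” and prepend them to the prompt. The embed() function is a crude stand-in for a real embedding model, and the semantic-layer snippets are invented:

```python
import math

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model: a crude bag-of-characters vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Tiny "vector store" of semantic-layer definitions (invented examples).
snippets = [
    "revenue: sum of amount_usd over completed orders",
    "churn_rate: share of customers with no orders in 90 days",
    "region: two-letter sales region code",
]
store = [(s, embed(s)) for s in snippets]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k snippets most similar to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [s for s, _ in ranked[:k]]

question = "How is revenue defined?"
context = "\n".join(retrieve(question))
prompt = f"Context:\n{context}\n\nQuestion: {question}"
print(prompt)  # this augmented prompt would then be sent to the LLM
```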

CW So if I'm coming to you just as someone who doesn't know anything about this, what is the one thing you would tell me to get started in this whole metadata data cleanup world?

SJ I would say, first of all, understand what metadata is and how it can be used, but also start small. That's one of the biggest things I would say. Start small and figure out how powerful metadata can be for things like automation. Coalesce automates a lot of things by leveraging the metadata that it captures. So it's super critical to understand how you can automate just by getting details about the data. You can discover problems with the data, and once you discover those problems, you can automate data quality rules with metadata, you can understand the schema, the definitions, and on and on. So understand the use cases of how metadata applies, see if you have metadata in your organization, and start small. 

CW Awesome. 

RD Good advice.

[music plays]

RD Well, it is that time of the show again. We like to shout out somebody who came on Stack Overflow and gave us a little knowledge. We're going to shout out Lifeboat Badge winner nwinkler, who came in and dropped an answer for “Docker run hello-world not working.” So if you can't even get hello-world working, this one's for you. 

CW You're in trouble. 

RD That's right. 

CW Or not, with this answer. 

RD Right. My name is Ryan Donovan. I am the Editor of the blog here at Stack Overflow. You can still find me on X/Twitter @RThorDonovan. And if you liked what you heard today, you can leave a rating and review. It really helps.

CW My name is Cassidy Williams. You can find me @Cassidoo on most things, or at my blog, cassidoo.co. 

SJ My name is Satish Jayanthi, co-founder and CTO at Coalesce. You can find me on LinkedIn.

RD Thank you very much, and we'll talk to you next time.

[outro music plays]