Ken Stott, Field CTO of API platform Hasura, tells Ryan about the data doom loop: the concept that organizations are spending lots of money on data systems without seeing improvements in data quality or efficiency. Their conversation touches on the challenges of data management, the impact of microservices, the importance of feedback loops in ensuring data quality, and how a data architecture that uses a supergraph would enhance data accessibility and quality.
Hasura is a GraphQL API platform. Get started exploring here.
Read Ken’s article on the data doom loop.
Find Ken on LinkedIn.
Shoutout to Stack Overflow user liquorvicar, who earned a Lifeboat badge with an exemplary answer to Checking value in an array inside one SQL query with WHERE clause.
[intro music plays]
Ryan Donovan The Stack Overflow community has questions, our CEO has answers. We're streaming a special AMA with Stack Overflow CEO Prashanth Chandrasekar on February 26th over on YouTube. He'll be talking about what's in store for the future of Stack, and you'll have the chance to ask him anything. Join us.
RD Hello everyone, and welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your humble host, Ryan Donovan, and today we're talking about doom– specifically a data doom loop, where you spend more money on your data systems, they get more and more complex, and you have to spend even more money until you're just paying all your money for data. We'll also ask whether a supergraph can help dig you out of that hole. My guest today is Kenneth Stott, Field CTO of Hasura, and he's going to tell us what this data doom loop is, why it happens, and what a supergraph is. So welcome to the show.
Kenneth Stott Well, thank you for having me, Ryan. Really excited to tell you a little bit about my theories on data and how you might be able to solve some of the issues.
RD All right, I love some theories on data. At the beginning of the show, we like to get to know our guests a little bit and how they got into software and technology. So what's your origin story?
KS Well, I actually did not start there. As a teenager I had done a lot of software development as sort of a fun activity, but I got a degree in business and finance and became a risk management consultant with Deloitte. And I got into this weird space where we did exotic derivatives valuation, and I could program, so I built all the models for that business, and then we did litigation support. Several years ago, there were issues where banks and customers were a little bit in disagreement around derivatives that had been sold to them.
RD That's a very delicate way of putting it, thank you.
KS We helped sort those things out. But because I'd done that, and because I'd been talking to a lot of people on Wall Street about technology and so forth, I ended up becoming head of tech for a bond sales and trading operation on Wall Street. That led me to a CTO role at Koch Industries, the big hundred-billion-dollar company, because they also did a lot of trading, and trading was sort of my thing. Then I went over to Enron, which is another infamous story, and I was CTO there as well. Eventually I landed at the last role I had before this role at Hasura: I was a Senior Data Architect at Bank of America for about a decade.
RD I mean, obviously there's some interesting companies in the background. Has the background in risk management helped you as a CTO?
KS Yeah, absolutely. I would just say that the broader your experiences, the better you're going to be at your job, because creativity and problem solving typically draw from other domains. In fact, there have been studies on creativity that classify it into different types. The biggest source of creativity is taking something from one domain and applying it to another. So the more you know and the more domains you've been involved in, the better you are, and that's definitely true in software development.
RD I mean, a lot of people like to call themselves software engineers, and engineering is obviously about managing potential failure points, right? So how did you get to Hasura, and what do you do there?
KS Well, I had retired from Bank of America, to be honest; that was my last role. But I had had an association with Hasura as a vendor, and when they learned that I had retired, they said, “You can't retire, we have things for you to do,” and they asked me to come on board as a Field CTO. It just sounded like a cool job: I get to talk to all of my peers across all of the industries I've been associated with and talk to them about cool ideas. What's awesome about being a Field CTO is that, for the most part, you don't talk about the product at all; I just talk about ideas around data management.
RD Well, those are my favorite guests to have. And speaking of ideas, you pitched this appearance with the idea of the data doom loop. It's a great title. Can you tell us a little bit about what that means?
KS Yeah, absolutely. It's meant to be provocative, and it seems to have provoked a few people, so I achieved my goal. When people see that data is not delivering the expected value, they come to the conclusion that they need to apply additional technology. Oftentimes they don't rationalize it, so it adds complexity and it becomes more inefficient. And then, over generations, other people identify that data isn't delivering and they do the exact same thing. And there is a lot of evidence– this is my idea, of course, but there's a lot of evidence, both objective and anecdotal. I'll give you a quick one. Look at spending studies on data management over the last three years. There are a lot of different studies, but they all show significant increases: on average, something like a 10 percent increase every year for the last three years, and the same increase is expected this year. That's a huge compounding rate of growth, right? You can also look at other studies around data maturity, and again, there are a bunch of them, but all of them are basically slightly up or flat. So you've got to ask yourself: we're spending all this money, and there's no evidence that it's improving anything. Why?
RD Yeah. We've had a lot of folks coming on the show to talk about databases and it seems like there is a huge specialization in databases right now.
KS That's part of the complexity, yeah.
RD Yeah. And the sort of separation between production and analytics, you have the ETL in the middle, and then generative AI is a whole other problem with vector databases.
KS Gen AI added to the interest in other specialized databases, but it also creates a real need for quality data, and so it exacerbates this problem even further.
RD Yeah. Do you think it's harder to get quality data when you have complex data structure systems?
KS Yes, I do. Another way to think about it is that data consumers are typically oriented around some technology team, and those technology or data teams are trying to solve the problems for their particular data consumers. Then you've got other technology teams across a large enterprise, and what ends up happening is they start sharing data between themselves to solve the problem for their particular consumers, and you end up with an enormous amount of data duplication and an enormous number of data flows. And I've studied this across many, many organizations and how they manage data. My rough estimation– of course, every organization is different, but it's that something like two thirds of your spend is potentially wasted because it's not organized.
RD Right. That's where the sort of silos come in where each part of the organization has got their own database, their own data ideas, right?
KS Yes, and they can never exist in isolation, so they have to start drawing data in from these other groups. And worse, what also happens is that as they start pulling that data in, they decide they might need to cleanse it or transform it in some way. Then all of this gets out to the data consumers and they start seeing subtle variations in the data. And now, and this is even worse, trust breaks down. Your data consumers don't believe the information they're getting.
RD So is there a contributor to this from the sort of distributed architectures, the microservices, the sort of cloud replication? Does that contribute to the data doom loop?
KS Well, I think so with microservices. There's almost no technical advancement that isn't useful; it just has to be applied properly. And like all of them, many people went a bit over the edge on microservices and ended up creating their own sort of microservices silos. One of the big problems with microservices and data is that microservices don't have a really good organizing principle for the data. It's a coded layer, so it typically creates a bit of a black box between the inputs and the outputs, and that creates some misunderstanding and misuse of that data over time.
RD Yeah, I've also heard people say that microservices are less of a software architecture paradigm and more of an organizational paradigm as a way to organize your engineering teams. You end up shipping your organization, your org chart. I forget if that's Conway's Law or something.
KS Yeah. There are tech organizations and business organizations, and sometimes they're directly related, sometimes it's a matrix, but generally speaking, data, microservices, all these things align around the way an organization is laid out, which I do think might be Conway's Law, but I'd have to look it up.
RD I know Conway has a lot of laws, a lot of stipulations. So when you start getting into this doom loop, is it sort of natural that you get to spending two-thirds or wasting two-thirds of your money on it or is there a pathway out?
KS I think there is. And I want to emphasize that all of the things people do have their place. Maybe I could think of one or two technical advancements we should get rid of, but most of them make sense. They're there for a reason. Big data warehouses can make sense, right? All these specialized databases make sense. Microservices make sense. What I think is missing, to me, relates back to feedback loops and the theory around them. Particularly in the data space, no one has articulated a way to create a clear feedback loop so that as these things are being used, people can identify how they're being used, whether they're being used appropriately, and, if there are issues, how to feed that back to the data producers. Now there are people who try to do that, but again, it's not effective. People try to do things like data quality systems, data governance systems, et cetera, but it's just huge numbers of people putting stuff in spreadsheets and committees meeting, et cetera, et cetera. And the problem with that is too many humans. Too many humans generally means low quality signals, and second, it means that the feedback loops are slow. To be effective, feedback loops need to be rapid and they need high quality signals. Now, humans need to be in the loop from time to time because there are certain judgments that have to be applied, but you need to automate everything else, and I just don't see that happening. Maybe I'm missing something, but I haven't been able to discover that part of data architecture.
RD Yeah. Well, I think in the software development cycle, they've managed to create a pretty good feedback loop with CI/CD pipelines: the build fails, or it gets out there.
KS Oh, yeah. Of course on the build side, sure.
RD Right. Is there an equivalent?
KS I'm referring to data in production and how the businesses use data.
RD Yeah. Is there a sort of equivalent that people can apply to data? Is there a way to get data to fail?
KS I'm not sure I understand your question. But I do have a solution, if that's what you're asking.
RD Well, let me see if I can clarify because a lot of the CI/CD pipelines are built around a build fail, a test fail, a failure in production, and with data, somebody has to check it and it's like, “Is this quality?” Is there a way to automate that failure or that quality check?
KS Oh, yeah, absolutely. And people do that to some extent today. Again, you've got data silos, and the problem is that people do it differently. They frequently do it on ingest, but the bigger problems are when you combine data, so I actually think data quality on egress is sort of a missing component. A lot of times when people think about data quality, they're thinking about big data warehouses: as they ingest data, they apply certain data quality rules to it, then they create lots of reports, committees are stood up, and people try to solve the problem, right? But can I get to my solution, Ryan?
RD You can get to your solution, yeah. Let's solve some problems.
KS So I like this idea of a supergraph. A supergraph has a few different components associated with it. The first thing is that it's federated: all of those individual silos are federated into this thing called a supergraph. The way it would work is that all of those data silos publish a semantic layer that says, this is the data I can provide to external parties, and all of that gets consumed into the supergraph. They also publish what they believe are valid relationships between their data and other domains' data. So what you end up with is a semantic layer, ideally expressed in business terms, that the supergraph represents. Then it's operationalized into an access layer. Now you can observe how people are accessing data across an enterprise, and you can observe it in semantic terms, so you've established a common language that your entire organization can talk in. In fact, if you've ever been associated with a support incident or something, a request will come in from a business user to a support organization, and the support organization will spend most of their time trying to figure out what they're talking about and how it relates to other stuff. So this idea of a semantic layer, establishing the lingua franca for your organization, is hypercritical. And because I've got this access layer, I can make it observable: I can see every element that goes through it and how it's combined with other elements. I can do things like passive anomaly detection, and I can add in other services, like data validation services, composition services, and aggregation services. This rich set of services also keeps consumers from thinking they have to build their own systems to analyze data. So it naturally compresses everyone's desire to build their own systems and it establishes consistency. And it's a platform where you can start to establish data quality, particularly at egress.
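To make the shape of that idea concrete, here is a rough sketch in Python of what Ken describes: each domain publishes a small semantic description of the entities and relationships it exposes, the supergraph federates those descriptions, and a single access layer logs every request in those semantic terms and checks quality on egress. All class, field, and function names here are illustrative assumptions for the sketch, not Hasura's or Apollo's actual APIs.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DomainSchema:
    """Semantic metadata one data domain publishes to the supergraph."""
    domain: str
    entities: dict[str, list[str]]             # entity name -> business-term field names
    relationships: list[tuple[str, str, str]]  # (local entity, relation, foreign entity)

@dataclass
class Supergraph:
    """Federates domain schemas behind one observable access layer."""
    schemas: dict[str, DomainSchema] = field(default_factory=dict)
    resolvers: dict[str, Callable[[str], list[dict]]] = field(default_factory=dict)
    access_log: list[dict] = field(default_factory=list)

    def register(self, schema: DomainSchema, resolver: Callable[[str], list[dict]]) -> None:
        """A domain publishes its semantic layer plus a way to resolve queries against it."""
        self.schemas[schema.domain] = schema
        self.resolvers[schema.domain] = resolver

    def query(self, domain: str, entity: str, consumer: str) -> list[dict]:
        """Serve a request through the access layer, validating and logging it on egress."""
        schema = self.schemas[domain]
        if entity not in schema.entities:
            raise ValueError(f"{entity} is not published by the {domain} domain")
        rows = self.resolvers[domain](entity)
        # Egress check: every row should carry the fields the domain promised to publish.
        issues = [f for f in schema.entities[entity] if any(f not in row for row in rows)]
        # Observability: record who asked for what, in semantic (business) terms.
        self.access_log.append({"consumer": consumer, "domain": domain,
                                "entity": entity, "egress_issues": issues})
        return rows
```

Replaying the access log is then enough to see, in business terms, which entities are actually being used, by whom, and which are leaving the platform with quality issues, which is the feedback loop Ken argues is missing.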
RD Right. I mean, you're building a single point of entry to all the data, right?
KS Yeah. So now to do that, you have to do other things.
RD Yeah. It sounds like it takes a lot of sort of organization and thinking about the data you already have, which may be the problem in the first place.
KS Yeah. Well, yes and no. Let me explain a couple of other things. I think it's important that it supports multiple languages on the outbound side. You have to be able to support SQL, since something like 60 percent of people just want to deal with SQL. You probably have to support GraphQL if you're doing web stuff, and some people just want to deal with REST and things like that. So you have to be able to support a wide range of query languages and protocols. And on the upstream side, you've got to be able to support many, many different data sources– internal APIs, SaaS APIs, relational databases, analytical data stores, graph stores, doc stores, NoSQL. But if you can create this, I think you've created something pretty interesting, and I think it can close that feedback loop.
RD Okay. Obviously, I think you've probably worked with or talked to people creating this. How much of a lift is it for an engineering team to buckle down and create a supergraph on their data?
KS There are different ways to do it. To do it from scratch would be pretty difficult; I think you probably need to look at vendor solutions, and there are not a lot of them. Apollo has a supergraph solution. It's code-driven, which I have my own problems with. Hasura also has a supergraph solution, called DDN. One of the key things, which I would say for any software development effort, is that the more metadata-driven you can be, the more generalizable your solution can be and the more resilient to change it will be. Hasura is a pure metadata-driven service, so I'm kind of a fan of that, and I'm also the Field CTO of Hasura. But a metadata-driven supergraph is pretty fascinating.
RD What goes into the metadata? Is it just the semantic relationships or is there something else?
KS It would be very similar. You may have heard of data products as sort of a new way for people to think about data, and then there's another concept called ‘data contracts,’ which are really the metadata for data products. That's the metadata you put into a supergraph. A data contract would include the data sets, their attributes, their relationships, and their relationships to other data products, whether in the domain or outside the domain. It might include SLAs, it might include quality metrics and quality tests, and it might include ownership information, like who do I call when there's a problem.
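As a loose illustration of what such a contract might hold, here is a hypothetical data contract expressed as a plain Python dictionary; every key and value is made up for the example rather than drawn from any particular contract standard or from Hasura's metadata format.

```python
# A hypothetical data contract for one data product. Every field name here is
# illustrative; real contract standards and tools will differ.
customer_orders_contract = {
    "data_product": "customer_orders",
    "owner": {"team": "Order Management", "contact": "orders-data@example.com"},
    "datasets": {
        "orders": {
            "attributes": {
                "order_id": "string",
                "customer_id": "string",
                "order_total": "decimal",
                "placed_at": "timestamp",
            },
        },
    },
    # Relationships to other data products, inside or outside the domain.
    "relationships": [
        {"from": "orders.customer_id", "to": "crm.customers.customer_id"},
    ],
    "sla": {"freshness": "15 minutes", "availability": "99.9%"},
    "quality_checks": [
        {"name": "order_total_non_negative", "rule": "order_total >= 0"},
        {"name": "customer_id_present", "rule": "customer_id IS NOT NULL"},
    ],
}
```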
RD Finding that ownership information and just gathering it, that can be a problem in itself.
KS Oh, well again, go back to my earlier description of these siloed data domains sharing information between them. If you're the consumer, where did it come from? Who started this? If you try to follow the path, you might find the owner, but the owner doesn't understand how it was transformed into the data you're looking at, because the lineage is so complicated, with all of these obscure data transforms sitting in code. It's really, really difficult. I think a supergraph is just a way to organize this information and then observe its use. That's really the big trick. You mentioned something else, though. You were kind of implying, I thought, that the big problem is still the data itself, so how does the supergraph solve that? And I do have an answer for that.
RD No, I think my idea was that this is essentially tech debt. The big problem is that they didn't think about this from the beginning.
KS Yes. So here's my theory, because I do get this all the time when I talk to people. They'll come to the conclusion, “Well, then I’ve got to fix everything and then I'll put the supergraph on top of it.” I think it makes a lot more sense to think about it incrementally. You take the current state and add this kind of component so you can see what's going on, and then you can be much more surgical in your investments in refactoring your data domains.
RD Interesting. So what happens to the traditional ETL or ELT pipelines? Do those still need to exist or can you do that as part of the supergraph?
KS I don't think it replaces most of that, although it does in some cases. Certain data consumers extracting data from the supergraph could still be running an ETL pipeline where they're pulling it in for their system; they're just using the supergraph as an abstraction layer. I think it does open up the potential that you could have less of that. And the data domains still do all the heavy lifting; that's where most of that activity happens anyway, in the core data domains. The consumers are derived data domains to some extent, or applications, but the majority of your data domains will be your core data domains, like all the operational systems that generate all this data across all the divisions of your enterprise, and they're going to operate pretty much the way they always do.
RD So what are you worried about, hopeful about, excited about for the future of the supergraph or data storage in general?
KS I think these specialist databases are fascinating: graph stores, doc stores, vector stores. They're designed to handle specific workloads very effectively, and I think that's an amazing idea. The main problem is, how do I organize all of that into a single solution? And again, I think that's where the supergraph comes in. Now, if you look at those products I just mentioned, there's still a lot of work to be done; they don't necessarily work with every specialized data store out there. But the idea that we can abstract away all of these specialized data stores, where they handle those specific workloads, behind a semantic layer that can actually answer business questions is pretty fascinating. The other thing everyone's enthused about, of course, is AI.
RD Of course.
KS And I think AI has incredible potential to help us understand our data and make it better. I'll give you an example. I always do weird little coding things just to validate my ideas. In a supergraph that I had built, when people make a request, they have to describe the request and then they make the request. I connected an LLM so that every time someone made a request, I asked the LLM to tell me whether it thought the description was consistent with what the request was actually asking for. As stuff was coming in the door, people were putting junk into the description, like ‘my request’ or something like that, and the LLM was telling me things like, “This is not a valid description of the request that's in there.” That's a really fascinating idea, because if you can improve the way you're describing requests, it can improve downstream AI and its ability to consume that data and make sense of it.
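A minimal sketch of that kind of check might look like the following, assuming an OpenAI-compatible Python client; the model name, prompt, and request shape are placeholders, not the system Ken actually built.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

def review_description(description: str, request_body: str) -> str:
    """Ask the model whether a human-written description matches the actual request."""
    prompt = (
        "Does this description accurately describe the data request below? "
        "Answer CONSISTENT or INCONSISTENT, then explain briefly.\n\n"
        f"Description: {description}\n\nRequest: {request_body}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# A junk description like "my request" paired with a real query should be flagged:
print(review_description(
    "my request",
    "SELECT order_total FROM orders WHERE placed_at > NOW() - INTERVAL '1 day'",
))
```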
RD Yeah, I hear a lot about the value of documenting and self-documenting in the age of AI. As a former technical writer, I appreciate that.
KS Another thing I find really fascinating: RAG is really interesting, but what I think is more promising, and maybe gets even better results, is more of a graph approach with AI. Oftentimes when you ask a question of an AI and give it a bunch of data sets, it tries to figure things out from the information you provided, but if you can also explain to it how those data sets relate to each other, it gives you much better, much more accurate results than RAG. The other problem with RAG is that once you vectorize this data, you've lost your data security. You now have stuff in the vectorized database that maybe someone shouldn't know about, that maybe shouldn't even be part of the analysis. So you can use a graph and the data without vectorizing it, and you can use AI agents and tools like that to get very, very accurate results with less effort spent on vectorization and better outcomes. I think that's a really fascinating area that people are exploring.
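As a loose sketch of the contrast Ken draws: rather than vectorizing everything and retrieving chunks, you hand the model the relevant records plus an explicit statement of how they relate, filtered to what the caller is allowed to see. Everything below is illustrative; the function and the data are invented for the example.

```python
def build_graph_prompt(question: str, datasets: dict[str, list[dict]],
                       relationships: list[str], allowed: set[str]) -> str:
    """Assemble a prompt from raw (non-vectorized) data plus explicit relationships,
    keeping only the datasets the caller is authorized to see."""
    visible = {name: rows for name, rows in datasets.items() if name in allowed}
    lines = [f"Question: {question}", "", "How the datasets relate:"]
    lines += [f"- {rel}" for rel in relationships]
    for name, rows in visible.items():
        lines.append(f"\nDataset {name}: {rows}")
    return "\n".join(lines)

prompt = build_graph_prompt(
    question="Which customers placed orders over $500 last week?",
    datasets={"orders": [{"order_id": "o1", "customer_id": "c1", "order_total": 725}],
              "customers": [{"customer_id": "c1", "name": "Acme Corp"}]},
    relationships=["orders.customer_id joins to customers.customer_id"],
    allowed={"orders", "customers"},  # authorization applied before the model sees any data
)
```

The resulting prompt can be sent to whatever model you like; the point is that the relationships travel with the data and access control is enforced before the model ever sees anything.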
RD I think that's it. Yeah, that's exciting times.
[music plays]
RD All right, everyone. It is that time of the show again where we shout out somebody who came onto Stack Overflow and dropped a little knowledge. Today, we're shouting out a Populist Badge. Somebody came onto Stack Overflow and dropped an answer that outscored the accepted answer. Today's badge is awarded to liquorvicar for dropping an answer on the question: “Checking value in an array inside one SQL query with WHERE clause.” So if you're interested in the answer, there's an accepted answer and one that might be even better. I'm Ryan Donovan, I host the podcast here at Stack Overflow. If you liked what you heard, or if you didn't like what you heard, or if you have other things you'd like us to talk about, email us at podcast@stackoverflow.com. If you want to reach out to me, you can find me on LinkedIn.
KS I'm Kenneth Stott. I'm the Field CTO at Hasura. I would encourage you to go to hasura.io and learn about some of the fascinating products there. And you can always look me up on LinkedIn. Always happy to talk to anyone about data or software development.
RD All right. Thank you very much for listening, and we'll talk to you next time.
[outro music plays]