The Stack Overflow Podcast

Is Postgres the best database for GenAI?

Episode Summary

Jeremy “Jezz” Kellway, VP of Engineering for Analytics and Data & AI at EDB (Enterprise Database), joins Ryan for a conversation about Postgres and AI. They unpack how Postgres is becoming the standard database for AI applications, the importance of managing unstructured data, and the implications of data sovereignty and governance in AI.

Episode Notes

Postgres is an open-source database. EDB offers enterprise-grade features and support for Postgres from self-managed to fully-managed, cloud-based DBaaS.

Find Jezz on LinkedIn

Shoutout to Stack Overflow user Jonny, who won a Populist badge with their exceptional answer to “quantile function for a vector of dates.”

Episode Transcription

[intro music plays]

Ryan Donovan Well hello, ladies and gentlemen, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I am your humble host, Ryan Donovan, and today we are going to be talking about Postgres. Is it the best database for Gen AI? Maybe, maybe. Our guest today is Jeremy Kellway, VP of Engineering for Analytics, Data and AI at EDB, and he's going to be making that case, I believe. So Jeremy, welcome to the show.

Jeremy Kellway Thank you very much. Absolute pleasure to be here. It's an honor.

RD Before we get into the Postgres stuff, we like to get to know our guests. Can you give us a brief flyover of how you got into software and technology?

JK I've been knocking around in technology since– well, my first experience of software development was when, and I'm going to age myself hugely here– was in the UK when my dad brought home a BBC Micro Model B computer, and I quickly got involved in copying code from gaming magazines and cut and paste– well effectively rewriting, cut and paste wasn't a thing then– saving them to tape and tinkering around with those. And fast forward all the way through, opportunities in a large company you might know of called IBM. 

RD Sure, heard of them.

JK And I worked in Hursley Laboratories for a while doing Java technologies and middleware and working with some very smart people, fantastic experience there. Then I got into cloud databases, NoSQL storage particularly, but then into sort of a segue into the world of Postgres. And I recently joined EDB to get involved and take the opportunity to be right at the forefront of technology again. So lots of fun, lots of diverse opportunities. 

RD Yeah, and a lot of database powerhouses. IBM has been a database leader for a long time, too. 

JK Oh yeah, big Db2. I was actually involved in the cloud databases. A small company was acquired called Cloudant, and I got right into helping run and build that business out inside IBM. Fantastic.

RD So I know the audience may be a little sick of Gen AI, so we'll talk about the data side. So Postgres. The question isn't necessarily why it's the best, but why is it becoming the de facto? Why is it becoming the standard? 

JK I mean, thank you. It's a great question because I think there's a number of angles. And in EDB, we're pondering this a great deal permanently. It's a constant sort of influence. So we know that the use of Postgres for AI applications has increased significantly, continues to grow, and I think one of the key factors is the reputation and the stability. Performance comes into play, as do data types. I know that you've talked on this show a bit about PGvector and vectorized databases. PGvector is a key part of that for Gen AI applications. I think one of the pieces that is important to us is the footprint of Postgres, and there's a number of ways you can look at that picture. Certainly in terms of enterprises and the number of people who understand and code to and extract data from Postgres, those skills have been built and battle-tested over time. And I think the ability to apply those in the context of AI development is huge. It lends the ability for developers who aren't AI developers yet to bring skills to the table that accelerate that movement forward, not having to worry too much about the data layer. We all have to worry about it, we can talk about that with the sovereign data AI stories, but it's that ability to bring skills that you already have to make the learning process faster and more efficient. So I think there's an element there of how developers are enjoying using Postgres underneath. But there's also, for us, meeting the customers where they have existing skills. It goes to that efficiency thing. Perhaps a little bit more technology-driven, you've got Postgres in multiple settings, hybrid settings, so on premises and in the cloud using the same technology that knows how to communicate with other Postgres entities. So that becomes a key factor as well– where is your data and what's it doing and which pieces of data do you want? And it goes into how you're moving things around in your enterprise or in your data layer. There is a strong element of people understanding and trusting Postgres as a technology, which is kind of a segue into the community aspect of being able to see what it's doing under the covers.
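For reference, the PGvector pattern Jeremy is describing boils down to a vector column plus a distance operator. Here is a minimal sketch; the table name, column names, and embedding dimension are illustrative rather than anything specific to EDB, and the query embedding itself comes from whichever model your application calls:

```sql
-- Enable the pgvector extension (installed separately from core Postgres).
CREATE EXTENSION IF NOT EXISTS vector;

-- A hypothetical table of document chunks and their embeddings.
CREATE TABLE doc_chunks (
    id        bigserial PRIMARY KEY,
    content   text NOT NULL,
    embedding vector(1536)  -- dimension must match your embedding model
);

-- Retrieve the five chunks most similar to a query embedding.
-- <=> is pgvector's cosine-distance operator (<-> is Euclidean distance);
-- :query_embedding is a placeholder bound by the application.
SELECT id, content
FROM doc_chunks
ORDER BY embedding <=> :query_embedding
LIMIT 5;
```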

RD I mean, I want to sort of follow up on that. It seems like the idea is that people are already using Postgres a lot. When we do our survey here at Stack Overflow every year, Postgres usually is number one for databases. People are already using it, people are familiar with it, so therefore just use it for Gen AI also. Use the PGvector add-ons, use whatever else. Do you think that you need anything else to go with your Postgres or can Postgres just be king database in any operation? 

JK I mean, it's always good to hear and know that Postgres is continuing to evolve in that sense, so there's an evolutionary factor there as well. So to answer your question, there's always going to be the simple answer of it depends what you're doing, but that being a given, there are other things. I think one of the areas that's interesting to me is how AI and Gen AI applications particularly draw from multiple data sources, and where we've seen some interesting progression is how one can use Postgres and SQL commands to help organize and transform that data. So I think in answer to your question, yes, there are multiple different data types, structures, and entities, but you can use Postgres as a window into that, and it goes back to developers being able to access stuff. We can use our AI accelerator to transform unstructured data into more structured data, and that's using AI not just in the Gen AI use case, but it's allowing the preparation and delivery of data to more traditional analytics applications, transforming, say, PDFs into Iceberg and Parquet files. So you can issue SQL commands to Postgres to do that, taking care of, as you say, the vector embeddings and the retrieval aspects of that as well. So I think the Gen AI story is always going to have multiple different data types, that's part of its power. It offers you the chance to not train your own model, which is very, very expensive, time-consuming, and resource-consuming, but you can take existing models and retrain if you want to, or in the classic RAG sense, just add your proprietary data to it, and you can use Postgres to do that management of data coming in and out of your application.

RD I'm glad you brought up the unstructured data, because Gen AI involves a lot of different data types, and with a relational SQL database like Postgres, not knowing entirely how it works could be a little limiting, especially if you've got to write a thousand inner and outer join statements to get what you need.

JK Yeah, absolutely. And then there's bits and pieces of the story in all different formats and places. And I think when we started to sort of talk about the data footprint needed for AI applications, it opens the scope of what you can talk about hugely, because then it does get into use cases for why you're building the app and who's building it. Certainly my visibility and my view is in the data pieces of that picture. Exactly as you say, what's the shape of data that's being retrieved, and vector is really exciting because similarity searches and technology that allows you to kind of upload an image and say, “Find me things in these data sources that look like this,” that's powerful stuff, but to actually find that, there can be a ton of computational things happening. And so not only is that asking questions about governance and where things are stored but also about what the workloads look like for these applications, and that's something else that we could talk about if we wanted to– the impact of adding AI applications into your existing enterprise, and how geographical boundaries have traditionally sort of dominated the discussion around governance, and rightly so, and that's still a factor. It absolutely is. There are some companies that are doing regional cloud, which helps with that geopolitical/geographical story.
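The computational cost Jeremy mentions is exactly what PGvector's approximate indexes are for. A rough sketch, reusing the hypothetical doc_chunks table from the earlier example (HNSW indexing is available in recent pgvector releases; the trade-off is a little recall in exchange for much faster search over large collections):

```sql
-- Build an approximate nearest-neighbour index so similarity search
-- doesn't have to compare the query embedding against every row.
-- vector_cosine_ops matches the <=> operator used at query time.
CREATE INDEX ON doc_chunks USING hnsw (embedding vector_cosine_ops);

-- The same ORDER BY ... LIMIT query shape can now use the index.
SELECT id, content
FROM doc_chunks
ORDER BY embedding <=> :query_embedding
LIMIT 10;
```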

RD Yeah, GDPR I think has been pushing for a lot of that, right?

JK Yeah, but with a regional cloud you can extend your enterprise out and not worry about breaching geographical boundaries if it's still within your geography. But it gives an idea of how you can start to see the picture building for larger scales of enterprise, and larger scale enterprises also have legacy data as well, data that's been around for a number of years, and how do you manage that? You're building, very quickly, a very complicated picture of how you want to retrieve data that's going to come into your AI application. And whether you're doing transformations, all of that is adding CPU workload and memory workloads. So if your Postgres application is key to your transactional success, you're processing constant transactions and updates and you want to draw that real time information into your application, that's going to put some load– it doesn't have to be huge, but some load– onto your transactional database. And you don't want to risk your key systems slowing down; performance is a key factor. So we start to think about the data layer in terms of how you can start to build a solution or a picture, because what problems are we trying to solve here? That one where you've got different types of data in different locations that have different severity of impact on what happens when you start to interact with them– that's painting a fascinating picture of what's going on behind the scenes for a Gen AI application, because it looks very simple on the outside. You can go and build one, you can do a proof of concept, you can do that. When it's actually going to that productized workload layer, you're starting to ask these really important questions about your data sovereignty as well. Where is your data sitting?

RD And typically in most systems, even without Gen AI, you have a division between production and analytics data. Sometimes one of them is a key-value store or columnar, and you have another one that's a big pile of data that's gone through some transformations. How does the Gen AI data fit into that world?

JK So thinking on my feet a little bit, I think I would answer that initially by saying traditionally, and even when I started in this role, I was thinking about analytics as separate to AI. And I think a lot of people are still thinking about that, because the analytics market, for want of a better term, is well established, people understand analytics, it's been around for a while and it's definitely driving insights and decision making in a really positive way. That's what people want to spend money on, it makes a lot of sense. I'm seeing the Venn diagram of where AI comes in– and I'm talking with my hands which isn't great for a podcast, but imagine the two circles of a Venn diagram condensing and overlapping– AI is definitely beginning to provide impact into which data is resulting in those intelligent applications and analytics, to take your example. How are you turning data you're extracting from sales data from your email task ID? It comes in the form of a text extraction and you can use an LLM to help pull the key data you want, you can parse it, then you can transform it via another AI application or your AI application into Parquet storage of some form– Iceberg or Delta tables or something that you are then pushing into your analytics application. And you can bolt on– Postgres would be a good example– transactional data, things that are happening in real time. So you're starting to get into that very quickly updated real time information that is also pulling from and being augmented by your historical data of whatever you choose. So hybridized search is what we're referring to it as. It's where you've got transactional data happening now and you want that as up to date as possible, but you also want to augment it, or you want to pull from and use the transactional data to augment, whichever way around you want it, that core data that you've extracted and sorted and summarized via a different application. So again, agentic workloads, in AI parlance, are where we're beginning to see those things. And I'm still stuck in times where I think about them as microservices, that's just the way my brain works. But it's smaller, special-built applications in the AI real estate that are doing specific jobs and then passing things to other AI applications to do their job. And these architectures forming for agentic workloads are super exciting, and they tend to be less heavy-duty, less of that large, huge model that you're passing data to.
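One hedged way to picture the hybrid search Jeremy describes, where live transactional rows augment semantically retrieved summaries, is a single query that does both. Every table, column, and join key below is invented purely for illustration:

```sql
-- Semantic retrieval over precomputed/summarized historical data,
-- joined against a live transactional table for up-to-date context.
WITH semantic_hits AS (
    SELECT customer_id,
           summary,
           embedding <=> :query_embedding AS distance
    FROM customer_history_summaries      -- summarized/extracted data
    ORDER BY embedding <=> :query_embedding
    LIMIT 20
)
SELECT h.customer_id,
       h.summary,
       o.order_id,
       o.total,
       o.created_at
FROM semantic_hits h
JOIN orders o USING (customer_id)        -- live transactional table
WHERE o.created_at > now() - interval '1 day'
ORDER BY h.distance, o.created_at DESC;
```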

RD Microagents maybe. 

JK You heard it here first. 

RD Just coined it. So one of the data-heavy applications we've seen and talked about on this program for Gen AI is the retrieval augmented generation, which is typically LLMs but with footnotes. How is that paradigm shaping up today? How’s it changing? What's the modern RAG look like?

JK Ah, interesting. I'm pausing to think about that because that's a broad spectrum question. It's a brilliant question, no argument. 

RD Don't flatter the host. 

JK Shameless. No, it is really good because it's one of the things that plays into how I and my colleagues think about AI in that sense. RAG is definitely a really nice way to get going in AI. And as I was talking about the retrieval augmented generation, how you can, rather than having to train a model from scratch, which is a huge undertaking, it's a massive commitment, you have the ability to utilize a trained model already and augment it with your data. So in terms of how does RAG impact what we're doing, I think that RAG is emerging as the most popular mechanism, I think I can confirm, and that's probably well known. It's the real time aspects that are really making it popular. So I think what we see and what we're trying to understand is how does that change the picture for people who are building AI applications? And I think we're still in that phase where the majority of use cases that we're experiencing are copilots and chatbots. I don't think it's a cliche yet because they're all doing different things, but the ability to utilize your proprietary data to get really useful information to your clients is where the power sits, and it manifests as we've got a chatbot and we've got Copilot, which is the tool that's being used in that case, but they really are the classic RAG workflows that we're seeing. And so we worry particularly about what's happening or we're trying to work out how people are interacting with it and who's worried about which model they're picking and what's happening to their data when it gets to it. 

RD If you have proprietary data, you don't want to be giving that to the model to sort of inference and add to its pile. 

JK Right, exactly. And you may be fine with that, but as you say, proprietary data or sensitive data, PII, all these key sort of governance factors, it's really important to be aware of where that data is going. So in terms of people building RAG applications, to your question, the lens that I'm looking through here is, are you aware of where that data is going and also what data you're passing to it? And I'm going to talk a little bit about a platform thing and it's a little bit thinking in front of the bow wave in some cases, but some way of being very confident about where your data is sitting and what it's doing, and when it comes to RAG models or what you're putting LLMs in, is there the opportunity to pull it into an air-gapped environment and just use it internally and then just pass the responses out so you know exactly where that data is going, or are you fine, as you say, just, well, this is fine, this is not proprietary data. We've actually filtered it before it goes to the LLM or it goes to the model so we're comfortable that we're not giving away any data. Understanding that picture is going to be critical, especially if you're getting anywhere near audit shaped needs for your business or your development. I think it's really fun, and I've done it myself a few times, to build your own proof of concept LLM-based RAG application. It's super fun. I did it for my D&D crew and so they could just ask rulebook questions. It's really fun. I think when I apply that model to something that has an increased level of scrutiny needed like financial transactions or something like that, it becomes really critical to know exactly where that model is and what it's doing and exactly where your data is and what it’s doing. So that's where I get into that picture of, when it comes to RAG applications, I think the real value long term is going to be knowing what's happening with that data. I may have segued away from your question which is how is RAG influencing it, but that's kind of how it's influencing me. 
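One way to make "we filtered it before it goes to the model" concrete on the Postgres side is to expose retrieval only through a view that masks sensitive fields, so the layer that builds the LLM prompt never touches the raw rows. This is a sketch of the general idea rather than a prescription; the view, the base table (the hypothetical doc_chunks from earlier), and the email-masking regex are all illustrative, and a real deployment would cover far more PII patterns:

```sql
-- A view that redacts email addresses from chunk text before it can be
-- retrieved for prompt construction; real pipelines would handle more
-- PII patterns or classify upstream, this only shows the shape.
CREATE VIEW rag_safe_chunks AS
SELECT id,
       regexp_replace(
           content,
           '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}',
           '[redacted email]',
           'g'
       ) AS content,
       embedding
FROM doc_chunks;

-- Retrieval for the RAG prompt goes through the view, not the base table.
SELECT content
FROM rag_safe_chunks
ORDER BY embedding <=> :query_embedding
LIMIT 5;
```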

RD I think that's fair and you're segueing into a data sovereignty question, the sort of knowing who controls the data. And I think when I first heard this mentioned, I was wondering whether that means the country, the geographical borders like we talked earlier, or individual sovereignty. Who owns that data? 

JK So yes is the answer to that. It's definitely both. Data sovereignty as we're viewing it is pretty expansive. It's the governance concepts of knowing what data is in play and where it's sitting right now, these kinds of things, and the governance protocols. Observability is a key factor in sovereignty– being able to touch that data and know where it is. That observability thing becomes really key in demonstrating that you're in control of what your data is doing, and critically, that you are confident that any proprietary or private data that you're responsible for, you know exactly who has access to it. So the observability equation comes in. But when we sort of put an AI lens on that, we start to get into the sort of application– like what's the AI doing, the data, and the models that it's drawing in to do that. And when we think about data sovereignty, it is easier to think of it as a platform. That is part of how my brain is kind of piecing this together, because you can then decide what you want to airgap. You can see everything that's going on. You're not relying, unless you choose to, on having complex credentials to do transactions, so your performance is within range. You can observe exactly what's connecting together and where, and how that all builds a picture. So the true sovereignty is not just about where the data is, but it's how it's moving about and where one of its resulting resting places is.

RD And then for your application layer for drawing out the data, do you need extra layers on top to make sure you're only serving the right data to the right countries for whatever limitations there are on that personal data?

JK I mean, that is definitely an option for sure. And I think in terms of applications of how you are managing the flow of data, yes, and I think there are more than enough bits and pieces available to be able to do that effectively. I think simplification of that picture becomes really important as you scale. There's different schools of thought depending on where your evolution of your AI application for your data is, because if you've got legacy systems that have a bunch of particular types of data and structured, I mean, Db2, something along those lines where you've got structured relational data but you've also got some JSON storage, NoSQL, and you've also got some transactional data and you've also got a bunch of other unstructured data that you want to extract from, the picture and the complexity of that data footprint becomes very complicated very quickly as soon as you start looking at the different arrows that point to the different directions. So the credentials of who should have access and who shouldn't have access, that becomes a thing in its own right, the management of that. And I think if we can simplify that picture, that is going to help right back to the sort of Postgres AI developer who's worrying about how can I make sure that all these pieces fit together? If we can simplify that story, I think that's going to help a lot as the AI applications move from the majority being proof of concept into production workloads and becoming key parts of people's businesses.
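On "serving the right data to the right countries," one Postgres-native building block is row-level security, which enforces a region filter no matter what query the AI or application layer runs. A minimal sketch, with hypothetical table, column, and setting names:

```sql
-- Restrict which rows each session can see based on a region setting.
ALTER TABLE customer_docs ENABLE ROW LEVEL SECURITY;

-- Rows are visible only when their region matches the session's region.
-- current_setting(..., true) returns NULL if the setting is missing,
-- which makes the policy deny everything by default.
CREATE POLICY region_isolation ON customer_docs
    USING (region = current_setting('app.current_region', true));

-- The serving application declares its region once per session.
SET app.current_region = 'eu';

SELECT * FROM customer_docs;   -- now returns only rows where region = 'eu'

-- Note: table owners bypass RLS unless FORCE ROW LEVEL SECURITY is set.
```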

RD I mean, I think people are going to start productizing, commodifying the whole AI piece, the database, the application, just going to be a turnkey RAG solution. 

JK That would make sense, yeah.

RD And I'm sure lots of folks are already doing it. 

JK I think we are seeing that, definitely. It's one of the reasons it's so exciting to be part of it. Things are moving so dynamically and so quickly and there's lots of different ideas about how things can fit together, and I think all of them are applicable. I don't know that we're going to end up with an obvious golden path there. I a hundred percent agree with you that commodifying that structure– putting that application, data, and models piece together to make the picture simpler– there's definitely value in doing that as you scale up.

RD Sure. What are you excited about for the future of data and generative AI? 

JK Oh, I mean, just the whole popcorn meme is amazing, just watching how things are evolving is amazing. This is bonkers. Personally for me, I've ended up in data because I think that's the key factor. You don't do anything without data. AI doesn't run without data. So I've kind of found my way down to this, and this is where I found my home, if you like. So most of what I'm excited about is particularly in the Postgres context, which I think is fair to talk about. PGvector itself is a powerful tool. I think one of the aspects I'm excited about is where that goes next. The dimensional representation is awesome and it always boggles my mind because I'm not a mathematician. But I think there are ways in which indexing could be tinkered with, and shameless shout-out, I get to work with some of the key Postgres community members, very, very big-brained, humbling people to work with. I'm humbled when they talk about stuff. But they're thinking along the lines of how that PGvector could be enhanced or improved or augmented for the hybrid flows that we've been talking about and I've been referencing. One of the most powerful aspects of Postgres, which I think doesn't get said enough, for me, is how it's evolved. It's been around 30 years. It's easy to say a lot of software has been around 30 years, but it continues to grow in popularity. It continues to be not just relevant but critical to business systems. Part of that for me is because it evolves with the times. So when we talk about the data story and Postgres and what I'm excited about, I'm excited to see what the community is coming up with next that's going to be very relevant to AI. And I have a bit of insight into that, so I can be a little bit excited that there's probably more coming from PGvector, or for PGvector, and I think that hybrid search area is definitely an arena I'm interested in. We're seeing more and more of that distributed data picture, like how does that become efficiently managed? So this evolution is something that I'm very keenly aware of and excited about, like where are we going in that area?

RD Sounds good.

[music plays]

RD Well, everyone, it's that time of the show again where we shout out somebody who came onto Stack Overflow, dropped a little knowledge, shared a little curiosity, and was awarded a badge for their efforts. Today we're shouting out a Populist Badge where somebody came on and dropped an answer on a question that was so good that it outscored the accepted answer. Today, we're shouting out Jonny for dropping an answer on: “Quantile Function for a Vector of Dates.” So congrats to Jonny, and if you're curious about that, you can check it out. We'll put it in the show notes. I am Ryan Donovan. I edit the blog here at Stack Overflow, host the podcast. If you want to reach out to us, you can email us at podcast@stackoverflow.com. And if you like what you heard, leave a rating and review. It really helps. 

JK Thank you, Ryan. I really appreciate the opportunity. And if anybody is interested in reaching out, I'm available on LinkedIn. In terms of who I work for, which I've been dropping into the conversation at every opportunity, I work for EnterpriseDB, or EDB Postgres AI. We're putting out platforms, and we're interested in and contribute very heavily to the Postgres community. It's our lifeblood and we love it to bits, but EDB is where I work. Thank you very much.

RD All right. Thank you everyone, and we'll talk to you next time.

[outro music plays]