The Stack Overflow Podcast

Where does Postgres fit in a world of GenAI and vector databases?

Episode Summary

Today we chat with Avthar Sewrathan, AI Lead at Timescale, about adapting developers’ favorite database management system, Postgres, to support a range of new technologies involved in the GenAI ecosystem, especially vector databases. Avthar details his long history with Postgres and how clients are weighing the build vs. buy question when it comes to choosing a database to support their newly minted GenAI initiatives.

Episode Notes

For the last two years, Postgres has been the most popular database among respondents to our Annual Developer Survey. 

Timescale is a startup working on an open-source PostgreSQL stack for AI applications. You can follow the company on X and check out their work on GitHub.

You can learn more about Avthar on his website and on LinkedIn.

Congrats to Stack Overflow user Haymaker for earning a Great Question badge. They asked: How Can I Override the Default SQLConnection Timeout? Nearly 250,000 other people have been curious about this same question.

Episode Transcription

[intro music plays]

Ryan Donovan Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. Today, we're talking about Postgres and its place in the generative AI landscape. And today, my guest is Avthar Sewrathan, AI Lead at Timescale. Welcome to the podcast, Avthar. 

Avthar Sewrathan Thanks so much, Ryan. It's a pleasure to be here.

RD Tell us a little bit about your origin story. How did you get into this world of software and technology? 

AS I'm originally from South Africa. I've always been fascinated with technology as a way to make an impact on people and as a lever to create change in the world. So that's kind of what drew me to technology. Growing up in post-apartheid South Africa, I noticed a lot of problems that I think tech can help with. I ended up pursuing a degree in computer science for my university education, which is what brought me to the States, and that's actually also how I met the Timescale co-founder. Professor Mike Freedman is the CTO of Timescale and was my professor when I did my undergrad at Princeton in Computer Science. Since then, I've been a startup founder in the consumer space building stuff in crypto and privacy. That was kind of my first foray into technology. And ever since I've been at Timescale– Timescale is a Postgres data company. We host and manage Postgres databases for a variety of use cases like time series, AI, and analytics. I've taken on a number of roles here. I was a developer advocate basically showcasing how to use the product, and I've since moved into product management and am now leading the product effort for developers building AI and vector applications with Postgres. So that's a little bit about me. I live in New York City. In general, I’m very interested in tech and I try and program on the side. My skills are a bit rusty these days. I used to do all the tutorials and demos for pgvector myself a year ago, but I'm sure folks know how it is. You get into more meetings and things like that. 

RD Let's talk about Postgres in the age of generative AI. We've talked to a lot of folks about what data looks like, and some folks are talking about vector databases as a separate thing. Some have talked about how you don't need a vector database and you could use an existing database, whatever you want. Where do you stand on this debate? 

AS This is something I've definitely heard from many developers that are building AI applications. I think right now there's undoubtedly a lot of energy in the space for either building new applications that take advantage of the capabilities unlocked by LLMs or adding some sort of LLM-based capabilities to existing applications. And I think for me, the question comes down to this choice: do you adopt a new specialized technology, or do you use something that you're familiar with, that you may already have in your stack, and that, in the case of Postgres, now has additional capabilities in the form of things like extensions? Obviously, I'm biased. I work for a Postgres company, so I do obviously recommend the route of using the existing tech that you already use. I've had the pleasure of talking to a number of developers about how they make this decision, and I think there's three axes that they look at. The first one is performance. I think the reason why you use specialized technologies, for the most part, is because the general purpose technologies don't quite cut the mustard or don't quite get you the performance that you want. The second axis is around ease of use. Often specialized technologies come with their own query languages, their own system properties, their own set of tools, their own quirks, and oftentimes that's a hassle to deal with when you're trying to build a production application. And I think the third axis is the familiarity and the ecosystem. So in addition to the tool being easy to use itself, you want to know: does everything else in my stack, do the other tools that I use, work with it? And I think that Postgres definitely ticked the second two boxes for a lot of developers in the sense that it's the most loved database in the world. It's the number one database on the Stack Overflow Developer Survey. Shout out to the folks that run that. It was the number one database last year.

RD And this year as well. We just released our survey results and it's number one again. 

AS Fantastic. Well, I didn't know about this year's results, so yes. There's a reason for that– it's the ease of use, the familiarity. Postgres has been around for a long time. A lot of people know the SQL query language. Where we saw at least some debate previously was around the performance, and that's where a lot of people reach for specialized technologies like specialized vector databases. You've had some of them on the show previously. And what we saw as our work at Timescale was to try and bridge some of those performance gaps by introducing new extensions that actually bring some of the performance characteristics and data structures and algorithms found in specialized vector databases to Postgres, so that you don't really have any performance gap and in some cases can actually get better performance. So our goal at Timescale really was to fulfill those three criteria in the sense of making Postgres performant, easy to use, and super familiar. It already fulfills the last two, and our job was to try and get it to an increased level of performance.
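
To make the extension route concrete, here is a minimal sketch of storing and searching embeddings in Postgres with pgvector, using Python and psycopg. The connection string, table, and column names are hypothetical, and a real query embedding would come from an embedding model rather than the placeholder used here.

```python
# Minimal pgvector sketch: a table with an embedding column plus a
# nearest-neighbor query. All names and the connection string are placeholders.
import psycopg

with psycopg.connect("postgresql://localhost/mydb") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id        bigserial PRIMARY KEY,
            body      text,
            category  text,                -- metadata used in a later example
            embedding vector(768)          -- dimensions match your model
        );
    """)
    # <=> is pgvector's cosine-distance operator; closest rows come first.
    query_embedding = [0.1] * 768  # placeholder; use a real model's output
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    rows = conn.execute(
        "SELECT id, body FROM documents ORDER BY embedding <=> %s::vector LIMIT 5;",
        (vec_literal,),
    ).fetchall()
```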

RD I'm wondering what the bridge you have to build to compete with the specialized vector databases is, especially because I know if you're running a vector database, you've got maybe thousands of dimensions per vector and then possibly linking them to unstructured data for retrieval augmented generation or something like that. How do you take a generalized SQL database and make it compete on the same level? 

AS That for us was the million dollar question or the billion dollar question. Timescale has a history of innovating on top of Postgres. The first product that the company ever came out with was an extension for handling large volumes of time series and analytics data on Postgres, and we drew upon some of that same knowledge of how you work with the Postgres extension framework. We also had to reach into the academic roots of vector search and approximate nearest neighbor search. Eventually the solution we came up with was to take a state-of-the-art vector search algorithm and modify it and adapt it and optimize it for Postgres. And so that's how we came up with the pgvectorscale extension and the specific index type called StreamingDiskANN, which is based off the DiskANN or Vamana paper out of Microsoft Research a couple of years ago. That algorithm was designed to handle billion-scale vector search for Bing and is used throughout Microsoft these days, I think. But that's kind of where we had to go with merging the academic roots for things that we didn't know about. We're fortunate to have quite a strong research team at Timescale that's able to take that academic research and translate it into real world systems, but also draw upon our knowledge of being Postgres people and how you actually extend Postgres for these kinds of new use cases that it wasn't originally designed for.
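
The resulting index type is exposed through ordinary SQL. Here is a hedged sketch, based on Timescale's public documentation, of building a StreamingDiskANN index with pgvectorscale on the hypothetical documents table from the earlier example; verify the exact syntax against the current docs.

```python
# Sketch: building a StreamingDiskANN index with pgvectorscale.
# The extension installs as 'vectorscale'; CASCADE pulls in pgvector.
import psycopg

with psycopg.connect("postgresql://localhost/mydb") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;")
    # The diskann access method builds a DiskANN-style graph index whose
    # vectors can live on SSD rather than having to fit entirely in RAM.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS documents_embedding_idx "
        "ON documents USING diskann (embedding vector_cosine_ops);"
    )
```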

RD I think that's interesting. I think anybody who is following generative AI right now is reading a lot more academic papers than they're used to. I've certainly read a bunch just trying to research blog articles. What was it about the Bing paper, the DiskANN nearest neighbor paper, that you applied that made it competitively faster?

AS There's a couple of things that we saw. Some of them were in the paper and some of them were actually created after reading the paper and thinking about what a good implementation in Postgres would look like. The first one is this insight about the actual vector index itself. Most of the time, and in Postgres in general, in order for an index to work, the data needs to be in memory, in RAM, on the machine that you're running on. And one of the insights in the DiskANN papers is: you have this vector index, and what if you didn't have to keep all of the vectors in memory? What does that mean for scalability? What does it mean for cost effectiveness? Naturally, memory is much more expensive than disk, and so if you can keep part of the vector index on disk, that allows you to scale up a lot more efficiently, provided you use something like solid state disks, which still give pretty good read and retrieval characteristics. That was one of the big insights for us: “Okay, how can we take that and apply it to Postgres?” And that's what we did with the StreamingDiskANN index, which allows you to keep the index on disk if you have solid state disks on the machine that you're running on. The second one was a type of quantization that we call ‘statistical binary quantization.’ Quantization is essentially vector compression: how do you take a vector that is 768 dimensions and reduce each dimension down to one bit or a few bits? Binary quantization uses a one-bit representation. And so we saw that there was room to apply that with the DiskANN algorithm. One of the problems that we also solved there was around filtering. One of the reasons why Postgres is super popular is that your business and metadata is already inside the database, and often when you're running vector search queries for search applications or RAG applications, you want to do a vector search query but you also want to filter on some sort of metadata component. And one of the innovations that we brought is this idea of streaming filtering, such that you can keep getting high accuracy when you are doing a filtered search on your data and you can trust that your search criteria are always going to be met. These were problems in some of the other algorithms, like hierarchical navigable small worlds (HNSW), and they were fixed as a result of the DiskANN graph-structured algorithm and that particular implementation.
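
In SQL terms, the filtered search he describes is an ordinary WHERE clause combined with nearest-neighbor ordering; the "streaming" part is the index continuing to produce candidates until enough rows pass the filter. A sketch against the hypothetical documents table and its hypothetical category column from earlier:

```python
# Sketch: filtered vector search. With streaming filtering, the index keeps
# producing candidates until LIMIT rows satisfy the filter, so accuracy
# holds up even when the filter is selective.
import psycopg

def filtered_search(conn, query_embedding, category, k=10):
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    return conn.execute(
        """
        SELECT id, body
        FROM documents
        WHERE category = %s                -- ordinary metadata filter
        ORDER BY embedding <=> %s::vector  -- vector similarity
        LIMIT %s;
        """,
        (category, vec_literal, k),
    ).fetchall()
```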

RD We've talked to a lot of folks who are doing these specialized databases. Is there a way to just use Postgres for everything? Is it just extensions all the way down? Are y'all coming for key value stores too? 

AS I don't know anything about the key value store stuff, but in general, I think Postgres for everything is something that we kind of saw. It was a trend that we saw in our customer base where, again, I can wax lyrical about Postgres this whole podcast: it's been around for more than 30 years, it has a lot of developer familiarity, it has a lot of great tooling, it's very robust, and there seems to be the sentiment that if you can just use Postgres for something, you should do that. You should start there. There's a wonderful meme that I like, the bell curve ‘midwit’ meme, where the Jedi guy on one end and the caveman dude on the other are both saying, “Hey, use Postgres,” and then you have the folks in the middle that are like, “No, you have to use specialized technology.” Postgres is kind of the default database for most developers to start off with, and I think the extension ecosystem is what makes it super powerful. You have extensions not just for time series and vector storage and such, in the form of TimescaleDB and pgvector and pgvectorscale, but also things like PostGIS for geospatial data handling, extensions for full text search, and others that let additional kinds of data types be stored and queried as well. And so I think that extension ecosystem is super powerful because it allows you to say, “Hey, I don't have to learn a whole new technology and I don't have to migrate my data elsewhere. I can just install this extension and that'll take me for the next five years, for the next ten years, before we reach the scale where we have to adopt something specialized.” And I think at the end of the day, when we say Postgres for everything, it doesn't literally mean don't use any other database. It just means that there's value in simplicity, value in not having a super complex architecture, and in using technology that you and your team are familiar with and not trying to over-optimize. There's actually a surprising amount of value that you can get from just using Postgres, and that'll take you pretty far. 
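
As a small illustration of that ecosystem, several of the extensions he names can coexist in one database. The extension names below are real; whether each one is available depends on how your Postgres instance is installed, so treat this as a sketch rather than a guaranteed setup.

```python
# Illustrative sketch: enabling several extensions side by side in one
# database. Availability depends on your installation; TimescaleDB, for
# instance, must also be preloaded by the server configuration.
import psycopg

EXTENSIONS = [
    "vector",       # pgvector: embedding storage and search
    "timescaledb",  # time series and analytics
    "postgis",      # geospatial types and queries
]

with psycopg.connect("postgresql://localhost/mydb") as conn:
    for ext in EXTENSIONS:
        conn.execute(f"CREATE EXTENSION IF NOT EXISTS {ext};")
```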

RD One of the data architectures I hear a lot about with generative AI is data lakehouse, combining the warehouse with the data lake paradigms. Does Postgres have a place in that? Is it the only thing you need or is there a sort of Postgres combo you would use? 

AS The analogies of warehouse, data lake, lakehouse, I wonder where it's going to end. I personally can't imagine a lakehouse in my head. But setting that to one side, when you think about the job to be done for a data warehouse, you can do that in Postgres. For example, we have a lot of customers that store a whole bunch of events and analytics data just in regular Postgres tables, and the way they treat that Postgres database is as if they have a data warehouse rather than a database that backs an application. It's more of something that is analytical, and those are the sorts of queries that are executed against that database. One interesting thing that we saw at Timescale is this desire to store data in object storage, and one of the features that we developed is called data tiering, which essentially allows you to archive your data into object storage and then query it there. Much of what we focus on at Timescale is for people that are building user-facing applications or internal applications rather than in-house analytics. And I think that for me, what's very exciting to see in AI is this rise of people integrating AI into various applications, and that's where I think Postgres can shine. Traditionally, where we've seen the lakehouse or the data lake shine is for people training their own AI models: you want to have all that data in one place so you can then unleash your team of data scientists. I think what's exciting about Postgres in AI is that you can have a software engineer who is now empowered by an off-the-shelf model like Claude or Llama, and now they can build AI functionality just using Postgres as their knowledge base, for example, and build a RAG application using those off-the-shelf models. So I actually think application databases are kind of cool again and more powerful for the machine learning and data science space thanks to generative AI.
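
The RAG pattern he sketches is short in code. In the sketch below, embed() and complete() are hypothetical stand-ins for whatever model provider's API you use (Claude, Llama, or anything else); Postgres with pgvector plays the knowledge-base role.

```python
# Hedged RAG sketch: retrieve the nearest documents from Postgres, then ask
# an off-the-shelf LLM to answer from that context. embed() and complete()
# are placeholders for your model provider's API, not real library calls.
import psycopg

def embed(text: str) -> list[float]:
    raise NotImplementedError("call your embedding model here")

def complete(prompt: str) -> str:
    raise NotImplementedError("call your LLM here, e.g. Claude or Llama")

def answer(conn, question: str, k: int = 5) -> str:
    vec = "[" + ",".join(str(x) for x in embed(question)) + "]"
    rows = conn.execute(
        "SELECT body FROM documents ORDER BY embedding <=> %s::vector LIMIT %s;",
        (vec, k),
    ).fetchall()
    context = "\n\n".join(r[0] for r in rows)
    return complete(
        f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    )
```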

RD From what I've seen of the data lakehouse model, it does use a lot of object storage to hold a big pool of everything. Parquet, I think, is a format I've seen mentioned. 

AS That's actually the format that we used for the Timescale data tiering feature. 

RD See? You're already building the lakehouse, you didn't even know about it. 

AS Yeah. 

RD So data is super important for generative AI. There's all these people looking at how to store it, how to manage it, how to train up your models. What is this focus on generative AI doing to the database landscape? Like you said, you're building a sort of underlying object store. What else is this sort of shift doing? 

AS I think in general, what this means is that developers are expecting more from their database. And bringing it back to something that we started the conversation with, which was around specialized versus general purpose technologies, one of my predictions is that vector storage and search is just going to be table stakes across any database that you use. It's going to be supported in SQLite; MongoDB has support; other general purpose databases like MySQL have support too. Obviously, there's pgvector in Postgres, which is what I'm most familiar with. I think developers now are going to have the expectation that vector search is just going to be a capability that's in my database, and I can use that as a building block when I'm building an application. So, for example, in the same way that I can run a relational query and get results, I can now store text embeddings, or maybe image embeddings or something like that, and then search over them. So I think what it's doing for the application developer and the database in general is elevating the sorts of jobs that are able to be done in the database, taking jobs that maybe previously lived in different databases or at the application level and pushing them down into the database. So things like vector storage and search, and we talked about the different kinds of applications that are being built: search applications, RAG applications. One particular thing I'm interested in is agents, because I think that a lot of agent functionality today is premised on this idea that you want to search in documents, and the emergence of the excitement about AI has been around this unstructured data. It's like, “Hey, I have this mountain of PDFs. How can I build a chat interface over them, and how can I proverbially chat with my data that way?” I also think there's mountains of data locked away in SQL tables all over the world. And I think, for example, a promise of agents is that you can actually combine answers that are locked away in PDFs and unstructured data with agents that can query your SQL databases, combine that, and then provide answers to users. So one thing I'm very excited about is these ideas of text to SQL and that ability to build applications where, with a database, you're not confined to querying it via SQL, but instead have these smart agents that know when to look at data that's unstructured versus structured, synthesize that, and build applications and experiences on top of it. Today we're seeing things like chat interfaces and Q&A, but you can imagine, for example, generative UI or something like that where you don't just have to have plain line graphs for analytics anymore. You can have interactive investigations into your data. This might be some years in the future, but I think that kind of stuff is super exciting. 
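
One way to picture that structured-plus-unstructured agent idea: a router that decides whether a question should become a SQL query or a vector search. The routing below is a deliberately naive stand-in for what an LLM with tool calling would do, and the table names and embed_literal helper are hypothetical.

```python
# Naive sketch of an agent routing between structured SQL and vector search.
# A real agent would let the LLM pick the tool and generate the SQL itself.
import psycopg

def route(question: str) -> str:
    # Stand-in heuristic; an LLM with tool calling would make this decision.
    aggregates = ("how many", "average", "total", "count")
    return "sql" if any(w in question.lower() for w in aggregates) else "vector"

def answer_question(conn, question: str, embed_literal) -> str:
    if route(question) == "sql":
        # Hypothetical text-to-SQL output; hard-coded here for illustration.
        n = conn.execute("SELECT count(*) FROM orders;").fetchone()[0]
        return f"There are {n} orders."
    # Otherwise, pull unstructured context via vector search for the LLM.
    rows = conn.execute(
        "SELECT body FROM documents ORDER BY embedding <=> %s::vector LIMIT 3;",
        (embed_literal(question),),
    ).fetchall()
    return "Context for the LLM: " + " | ".join(r[0] for r in rows)
```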

RD One of the other specialized database paradigms I've seen is that you have your fast production database that is kind of key value storage, kind of just pure transactions, and then you have an ETL pipeline to your analytics platform. Do you think that's something that will continue or that there's no longer a need for the fast production database, that you can just do analytics and production in the same database?

AS It is the case. And look, I think in that case cost becomes a concern, and there's also a case where you might want to have, for example, read replicas that service the production queries coming in so that your analytics doesn't affect them. But principally speaking, I think there's not a strict requirement to have a separate analytics database and application database. Something we've especially seen with TimescaleDB is users wanting to store multiple data types in the same database. And I think this is one of the strengths of Postgres in general, bringing it back to something we talked about earlier, specialized versus general purpose technology. That ability to store multiple data types in the same database and then to build different applications on top of it, to use it for internal analytics, to use it for other purposes, is super powerful, because each additional data type that you can store is a separate system that you don't have to integrate into your stack, and so it's a huge savings in opportunity cost. 
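
The replica split he alludes to is mostly a connection-routing decision at the application layer. A minimal sketch, with placeholder connection strings:

```python
# Sketch: route latency-sensitive reads to the primary and heavy analytics
# scans to a streaming replica so they can't starve production traffic.
# Both connection strings are placeholders.
import psycopg

PRIMARY = "postgresql://app@primary.internal/mydb"
REPLICA = "postgresql://analytics@replica.internal/mydb"

def app_read(sql: str, params=()):
    # Production reads: fresh data, low latency, served by the primary.
    with psycopg.connect(PRIMARY) as conn:
        return conn.execute(sql, params).fetchall()

def analytics_read(sql: str, params=()):
    # Analytics reads: slightly stale data from the replica is acceptable.
    with psycopg.connect(REPLICA) as conn:
        return conn.execute(sql, params).fetchall()
```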

RD Where do you see the future of Postgres? 

AS It's definitely trending towards Postgres for everything. I think that's a phrase that's quite pithy and that we like at Timescale. I think what we're seeing is Postgres getting extended for a bunch of different use cases and different data types. I think vector is one that we've talked about. There's another company that's trying to build search on top of it, and other companies that are trying to do serverless Postgres or ephemeral Postgres. I think one good thing about Postgres in general is that the core database will always be stable, and then you have this ecosystem of companies around it that are innovating and trying things that might actually be too risky for the core development team to do on their own. And so I think that's where the extension framework comes in. And so when I think about the future of Postgres, the future of the database is still very safe. It's still very solid. I don't think the core Postgres database is going to be altered drastically. But when I think about the Postgres ecosystem and all the experiments that are going on in there, if I had to make a prediction, I think Postgres is going to be the de facto database for AI applications. There was another survey that was run by a company called Retool where they found that Postgres with pgvector was actually the most popular vector database among the people that they surveyed. That's going to be hopefully the future that we live in, where Postgres is powering a lot of the software that we use, and it's reliable, robust, and extended by whatever extensions it needs in order to accommodate different use cases. I think there's still room for other kinds of databases in that world, but I definitely think it's trending towards Postgres as a platform for new data experiences and new kinds of products to be built. As I mentioned earlier, you're seeing all these companies trying to extend Postgres for different use cases, trying to build XYZ on top of Postgres. I think that's something that will continue. One topic that I did want to bring up that I think is super important is the importance of open source in AI. I think it's particularly important given the new Llama models that are seen as on par with the most powerful proprietary models that are going out today. The new extensions that we built at Timescale– pgvectorscale and pgai– are both open source under the Postgres license. What that means is that anyone can go and build applications with them, and anyone can go and offer them as a service on their cloud. And the reason why we did that is because we have a strong belief in the importance of open source and what it means for developers, especially with AI. And I think in the same way that you see this concern around using proprietary models and the desire to use actual open source LLMs, you can apply the same reasoning to databases as well, and I think it pervades the entire AI stack. And the reason is that there's always risk when you're building on top of a proprietary technology or a closed source technology, and I think what that viscerally brought home for a lot of folks is that no matter how much you trust a certain company, there is value in using open source software, whether it's the models or the database. 
There's also a certain confidence that you get from knowing that you're not going to get proverbially rug-pulled, and there are all the other good things that we know and love about open source: the community benefits, the idea that it fosters innovation because you can build on top of other people's work. That's very important, and I think it's a privilege for us at Timescale to be part of that by open sourcing our work on vector databases. And I think it's something that is top of mind for a lot of developers as they choose which specific AI technologies they want to adopt. I've seen open source being a high priority on many developers' checklists. 

RD We love hearing about open source here at Stack Overflow and I know a lot of developers do too, so that's a good point. 

AS I think the future of AI being built on a proprietary closed source technology seems like not the world that many people want to live in. And I think just as the internet is largely built on open source tools today, I think that that's going to continue for the AI-enabled future that we're going to live in.

[music plays]

RD It's that time of the show again. As we do, we like to shout out a badge winner on Stack Overflow. Today, I'm going to acknowledge a great question– somebody who came on and asked a really good question. A badge was awarded to Haymaker for asking, “Changing SQL Connection Timeout?” Maybe a question mark. Not a question, but it is a curiosity, so thanks, Haymaker. I'm Ryan Donovan. I edit the blog here at Stack Overflow. If you want to give us feedback, tell us topics to cover, email us at podcast@stackoverflow.com. And if you like what you heard, give us a rating and a review. 

AS My name is Avthar Sewrathan. I'm the AI Lead and Product Manager at Timescale. You can find out more about Timescale on our website at timescale.com, and you can find out more about the Postgres extensions that we built for AI on GitHub. So that's pgvectorscale and pgai on the Timescale GitHub page, so you can just Google ‘pgvectorscale GitHub’ or ‘pgai GitHub,’ and that'll take you straight there. 

RD Well, thank you, everyone, and we'll see you next time.

[outro music plays]