The Stack Overflow Podcast

How to write high-performance SQL for your Postgres database

Episode Summary

Ben and Ryan talk with Lukas Fittl, founder and CEO of pganalyze, which provides in-depth Postgres database monitoring and optimization. They talk about why Lukas considers himself a founder-CEO by title but engineer by trade, what’s important to understand about SQL, and the role of AI in database optimization.

Episode Notes

pganalyze helps users deliver consistent PostgreSQL performance and availability at any scale. Get started with a free trial or explore their docs. You can also find them on YouTube, where Lukas posts a weekly show called 5mins of Postgres.

Lukas was a founding engineer of Citus Data. Citus is an open-source extension to PostgreSQL that was eventually acquired by Microsoft. Find them on GitHub.

If you’re new to the topic, SQL (Structured Query Language) is a language for querying databases, introduced in the 1970s. 

Find Lukas on LinkedIn, Twitter, or GitHub.

Episode Transcription

[intro music plays]

Ben Popper Big topics in data architecture call for big conversations. Big Ideas in App Architecture, the new podcast from Cockroach Labs, invites innovators to discuss their experiences building reliable, scalable, maintainable systems. Visit cockroachlabs.com/stackoverflow to listen and subscribe. Make sure to use that link and let them know the podcast sent you.

BP Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I’m your host, Ben Popper, joined as I often am by my colleague, Ryan Donovan. How's it going, Ryan? 

Ryan Donovan Oh, it's going pretty well. 

BP So I know Postgres is a popular database of choice. We've had it in our developer survey for a number of years and seen it grow. I know we've had a couple of articles about it on the blog, or articles that mention it within the context of other stuff. So today we are lucky to have Lukas Fittl on the show. He's a serial entrepreneur and, at the moment, the Founder and CEO of Pganalyze, which is very focused in this area. Lukas, welcome to the program.

Lukas Fittl Perfect. Thank you for having me. Excited to be here. 

BP For folks who aren't familiar with your backstory, tell us a little bit about how you got into the world of software and technology. 

LF For sure. So to be clear, even though nowadays my title is Founder and CEO, I'm an engineer by trade. I got started actually way back with having an interest in game programming, although that never turned out to be an actual thing, but that really got me into programming back then with C++ when I was 12 or something. And then where things really kickstarted for me is when I left school at 16. So I was at a technical school, but I just felt that the pace wasn't right for me, at least. And this was back in Austria, and in Austria there are some good schools, but they're sometimes hard to reach if you're early in your own life, so to say. And so I kind of got started with ultimately working in a data center for a hosting company, actually putting real servers into racks. Not something you do much these days anymore. 

RD Somebody does. 

BP Lift and shift as they say. Different kind of lift and shift. 

LF Somebody does, that's true. I saw this article today where somebody was like, “Oh, an AWS data center is still like 100 people doing work,” and people were like, “It's only 100 people.” So data centers are these huge buildings, but there's not actually a lot of people, mostly servers. But so that was interesting to see how everything is made, so to say, early in my own career. One of the next things I did after that was actually working on a startup that was doing multimedia blogging of sorts, so similar to Tumblr or Twitter/X. And one of the things that we already had back then was Postgres as a database, and so really where my career led me eventually was a lot of focus on Postgres, but back then it was really as an application engineer working with this database and scaling this blogging site of sorts, where we just had that central database that was so important, and we had to spend so much time and energy and money on it making sure it performed well, because if it didn't, everything was slow, everything was bad. After working on that startup, I eventually moved to the US through a company called Product Hunt where I was part of the founding team. Product Hunt is still around today, so if you're launching a new product, it's a good place to do that. They also use Postgres as a database behind the scenes, of course. And then I kind of shifted gears and got more into, I would say, the actually interesting parts, doing more of Postgres itself. And so I was working with the team at Citus Data, which is essentially a way to scale Postgres onto more than a single server. And that was really, I would say, my personal way of actually running databases for other people. So we essentially had a managed database-as-a-service product where we operated databases so that, if you were a really big business with a lot of database needs, you could trust that you can rely on Postgres even as you scale. And Citus Data is still around today as part of Microsoft. 
So it got acquired by Microsoft a couple of years ago, and I think it has really proven itself as one of the best ways to scale out Postgres these days, either on Azure, which of course Microsoft would like to see you do, or in open source by just running your own virtual machines. And then to finish the story arc, so to say: after being at Microsoft for a few years, I left a couple of years ago and now I'm full time on Pganalyze together with a small team, where we essentially build a tool to optimize Postgres databases, identify slow queries, find the right set of indexes to create, and a lot more things we could dive into.

BP Very cool. 

RD So Postgres is obviously number one on our database survey, but it's still using SQL, which is– I don't know how old SQL is– 40-50 years or something. What are the things that you have to optimize around to get it to be a modern functioning database and query language?

LF That's a good question. So I think SQL was in fashion and then it went out of fashion and now it's in fashion again. One way that I like to think about it is that SQL is just your way of telling the database what to do. So essentially the optimization here really is: what are you telling the database, and what are the opportunities the database has to make choices? And sometimes where a lot of the problems that application engineers run into come from is that they don't actually write SQL, they write ORM function calls, and so there's a lack of awareness that whatever you're writing in your application code gets translated into SQL automatically behind the scenes, and the database gets a very specific set of instructions, which may not give it enough room to really make sensible choices. And so one of the things that I would say about SQL is that it's important to understand that if you're using a relational database today, you are ultimately somewhere writing SQL, even if you're not actually writing it, and understanding that I think is very important. And then it ultimately comes down to the fact that sometimes the SQL is the problem and sometimes the database’s choices are the problem, but really it comes down to query plans. Each SQL statement you send to a database gets turned into a query plan, and so when you think about writing efficient SQL or writing high-performing SQL, it really comes down to which query plans the database is choosing and why. 
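One practical way to build that awareness is to log every statement the database engine actually receives. This is a minimal sketch using Python's built-in sqlite3 driver as a stand-in for a Postgres client; the users table and statements are invented for illustration, and on Postgres you would reach for statement logging or the pg_stat_statements extension instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
statements = []
# Record every SQL statement the engine actually executes, including
# anything issued on our behalf behind the scenes.
conn.set_trace_callback(statements.append)

conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES (?)", ("ada@example.com",))
rows = conn.execute("SELECT id FROM users WHERE email = ?",
                    ("ada@example.com",)).fetchall()

for sql in statements:
    print(sql)
```

With an ORM in the loop, the same trick shows the SQL your function calls actually turn into, which is the first step toward understanding the instructions the database is being given.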

BP Yeah. So I know in our introduction we were talking about some of the companies that Pganalyze supports, Atlassian, DoorDash, big names that are operating at big scale. The job of your company is to help them identify areas where improvements can be made for speed and stability. What are some of the things that you look to first, or you think would be interesting to developers listening that can enhance performance or insights or availability at scale?

LF So there's two answers to this. What's the problem that a company like Atlassian or DoorDash has? Ultimately, it's a scaling problem, a people-scaling problem. You only have so many experts to throw at a problem; there's too many engineers and not enough Postgres expertise, for example. But very specifically on how to approach these types of problems, I think it really comes down to having a good methodology for how you're assessing these issues. There's, for example, these diagrams or these stories around people not even understanding basic latency numbers, so the difference between a RAM access being really, really fast and a disk access being significantly slower, and over the network it gets even worse. And so I think maybe before we even dive into database performance optimization, we need to spend a moment to remember that ultimately what you're doing is a function call of sorts over the network, where you're instructing that system to do something for you. And I think a lot of times the unfortunate truth is that in a lot of companies that skill set is not very well known or shared or trained, for people to really think in a good, structured way about: is this API call slow, is the code itself slow, should I profile the code? I think all of that is where I would start, and then we get into the details. Say you have a trace, for example, in your APM tool, and the trace shows that this SQL query is slow, and you know your database sits on the same network, so a one millisecond round trip time is expected, but it shouldn't take much longer than that. And so if it's saying 150 milliseconds of query runtime, even after you subtract network latency, that's still way too slow. And so then really we're getting to the database side of this. 
And so there the approach generally would be, “I know this query is slow, let me run what Postgres calls an explain plan.” So you put explain or explain analyze in front of a query, and then it would give you the breakdown of which part of the query execution is slow. For example, it might be that you have a sequential scan on a particular table or an index scan on another table, and sometimes these can be slow because they're being called a lot because you're doing a nested loop join, things like that. And then it becomes very specific, but that's really, I think, where most of the optimization work comes in. 
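The EXPLAIN workflow described above can be sketched in runnable form with SQLite's EXPLAIN QUERY PLAN via Python's stdlib sqlite3 module, as a rough analogue of Postgres's EXPLAIN and EXPLAIN ANALYZE; the orders table and index name here are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                 [(i % 100, i * 1.5) for i in range(10_000)])

query = "SELECT * FROM orders WHERE customer_id = 42"

# Without an index, the planner has no choice but a full table scan.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan_before[0][3])  # e.g. "SCAN orders"

conn.execute("CREATE INDEX orders_customer_idx ON orders (customer_id)")

# With the index in place, the plan switches to an index search.
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan_after[0][3])  # e.g. "SEARCH orders USING INDEX orders_customer_idx ..."
```

In Postgres, EXPLAIN ANALYZE additionally runs the query and reports actual timings per plan node, which is what makes it useful for pinpointing the slow step.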

BP Nice. So you first start with sort of an audit, given your expertise, and then get into more of a technical examination and perhaps some tweaks there. 

LF Right. And I think, for example, I, of course, would love to see everybody use our own product, but the truth is, you've got to choose which product to look at. So, for example, if your application code is slow, you can look at pganalyze as much as you want, you're not going to see a slow query there. So I think it's really important to understand which part of it is actually slow, and if it is the database, then we can get into questions like, is it the query plan? Maybe sometimes it's actually not the query plan, maybe your database is just overwhelmed. And so we have a dashboard where you can just see overall what's going on, which queries are running, what activity is on the system, stuff like that.

RD Can some of the slowness come from the database design itself, like the schema? Because relational databases are all connected. You can have many, many tables. 

LF For sure. I think a lot of people forget to do certain things, like adding the right indexes, for example. It's very common. But I also think that oftentimes people assume it's the settings. One pattern I see is a lot of people thinking, “My database configuration is not efficient, maybe I forgot to change this parameter or this setting.” And whilst that is often true, I think people overweight it. Really, the problem is that most often you're asking the database to do something which is very IO intensive, and so even if you're tweaking the cache setting, for example, the fact that you're doing something IO intensive is the problem. The fact that you're asking it to look at a thousand rows or a million rows, that is the issue, and not necessarily particular settings. And so I would say in general, it helps to understand that fundamentally the database uses a bunch of files. Well, in the case of Postgres, at least, each table is a file or multiple files, and in that file you have certain substructures called pages in Postgres, and so Postgres will fetch one or more pages to get your answers for you. And if you're fetching a lot of these pages, you're ultimately doing a lot of file IO, you're ultimately reading a lot of files. And in the case of a cloud provider like Amazon RDS, you're actually transferring that data over the network as well. So all of that stuff happens behind the scenes. And where database design and how you structure a database come in is that very narrow tables will fit more rows into those pages. So if you have a table that has five columns that are all integer columns, that's going to be much faster to load a lot of rows from than a very wide table where there's a lot of columns that are text, for example. 
Really when you're designing your schema it is helpful to think about the query pattern and how much of your data you're querying or have to query all the time. 
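The effect of row width on page counts can be demonstrated with SQLite from Python's stdlib, with SQLite's pages standing in for Postgres's 8 kB pages; the two table shapes are invented for illustration.

```python
import sqlite3

def pages_used(create_sql, table, row):
    """Load 5,000 identical rows into a fresh in-memory database and
    report how many pages the database ends up using."""
    conn = sqlite3.connect(":memory:")
    conn.execute(create_sql)
    placeholders = ", ".join("?" for _ in row)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", [row] * 5_000)
    conn.commit()
    return conn.execute("PRAGMA page_count").fetchone()[0]

narrow_pages = pages_used(
    "CREATE TABLE narrow (a INTEGER, b INTEGER, c INTEGER, d INTEGER, e INTEGER)",
    "narrow", (1, 2, 3, 4, 5))
wide_pages = pages_used(
    "CREATE TABLE wide (a INTEGER, payload TEXT, notes TEXT)",
    "wide", (1, "x" * 500, "y" * 500))

# The wide rows force many times more pages, which means many times
# more IO for any query that reads a large share of the table.
print(narrow_pages, wide_pages)
```

The same arithmetic is why a query touching a million narrow rows can still be cheaper than one touching far fewer very wide rows.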

BP And so let's say you went in and did some of this work with a client. It sounds like a lot of what you're talking about right now is latency. If the work went well and you were able to help them to rewrite some of their schemas or to change the way that things move around the network, what benefits besides speed would they see? Can you quantify that in terms of cost savings or just the ability to then scale the app with your customer growth? 

LF Yeah, it's interesting how that changes over time also. I think a couple of years ago when we were in the tech boom, so to say.

BP The good old days.

LF Good old days, yes. Money was free flowing. People were worried about not being ready for a huge scale event that was coming. And that still exists, but I do feel that nowadays people are more worried that their database is expensive. A lot of times people pay tens of thousands of dollars for even a single database, and some of these big companies have hundreds of databases, so their bills I'm sure must be in the millions or higher. And so really, it does often come down to cost savings, because you're looking at this really big expense in your cloud bill and it's kind of a magic box. You know CPU utilization is that high, but you don't really know why it is that high, and so I think ultimately, oftentimes optimization is driven by, “Why are we spending so much on this database?” And, to add to that, it's slow. 

BP Right. I like that because when I started as a technology reporter in 2010, and all the way up through let's say 2022, we were in this ZIRP economy, and it didn't matter if you were profitable or not, you could still go public. Everybody wanted to see enormous user growth, and being profitable didn't have to be part of the equation. Now, as the pendulum swings in the other direction, to “show me you're cash flow positive so you can get your next round of venture capital,” it makes a ton of sense to want every server bill to be as small as possible and to ensure you're getting the most out of those machines.

RD I'm curious– I hear of a lot of companies, especially for production data, just using a simple key value database. Is there a downside to that or are the Postgres databases you're trying to optimize somewhere else in the pipeline? 

LF That's a good question. There's many ways to look at this problem. With Citus, for example, if I think back to Citus Data, we mostly worked with system of record databases. So the Postgres databases that these people were trying to scale were their main data store that they were trying to scale. I think if you start your project with a key value store and you're willing to restrict yourself to that, then it's probably easier to scale because it's simpler in a sense. I would say if I was in that position, I would be wary of choosing cloud vendors and locking myself into particular behaviors that cloud vendors have. But so, generally speaking, the way I would think about that is, if you're trying to scale Postgres you're ultimately already on Postgres or you're on a relational database. And so the benefits you get often is that if you have dependencies between your data and you're trying to kind of join one thing against another thing, if you're using a key value store, you do a lot of that on your application side. And so that's where it becomes really inefficient and really slow, because you're ultimately pulling a lot of data back into the application to then do these operations that otherwise the database can do for you. 
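The cost of stitching data together in the application, versus letting the database join it, can be sketched with Python's stdlib sqlite3; the customers and orders tables and their data are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.5), (12, 2, 3.0);
""")

# Key-value style: pull rows into the app and stitch them together there,
# paying one extra lookup (and round trip, on a real network) per order.
totals = {}
for customer_id, total in conn.execute("SELECT customer_id, total FROM orders"):
    name = conn.execute("SELECT name FROM customers WHERE id = ?",
                        (customer_id,)).fetchone()[0]
    totals[name] = totals.get(name, 0) + total

# Relational style: let the database do the join and aggregation in one query.
joined = dict(conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
"""))

print(totals == joined)  # same answer, but the join ships far less data around
```

The answers match, but the first version transfers every row to the application and issues a lookup per order, which is exactly the inefficiency described above.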

RD Right. It's a lot of nested API calls. 

LF Right, exactly. And a lot of data transfer. 

BP I did a piece a while back for the blog talking about what was sort of a trendy thing to do at the time to go on Twitter and say, “We decided to get off the cloud and we're building everything locally now and look at the millions of dollars we've saved.” To what degree do you think there is some merit to that, that people might invest a little bit more as they once did in their own hardware and continue to work with the cloud on a hybrid model, obviously, but not need to rely on it for everything? 

LF I think there's a lot of merit in that, especially the point about keeping the flexibility to be able to do so. I think for a lot of people it's not the right choice. We're on AWS ourselves, and if you ask me, “Should we run our own data center?” No, that's a waste of time and a waste of money. We don't have an ops team that can do that. But if you are a large company, is it better to go all-in on a cloud vendor, or should you rely on the core standard, in the case of Postgres the core Postgres features, that you can use on other cloud providers or in your own on-prem data center? And I'm personally a strong believer in the latter. I think it's a lot better to keep that flexibility, even if you do utilize cloud providers, to not lock yourself into those value-added services to the point where you could no longer move or could no longer change to virtual machines. Oftentimes the markup that cloud providers add on the managed services is quite high, and so being able to utilize virtual machines can be very beneficial as well. 

BP Yeah. It's interesting, I heard you say a few times that data transfer was one of the expensive things, and that was what came up with the on-prem piece, just that there are certain things you might be doing internally, I don't know if that's training a model or updating things internally for some kind of optimization and it's the transfer that really gets you. So doing that stuff on-prem and other things in the cloud can be a big advantage. 

LF Yeah, and I would also say one example could be if you have your development team and people want to be able to develop locally with a full copy of the database– full copy as in a fully functioning copy of the database. So they want to be able to just run a Docker container with Postgres in it. That's one of the big benefits of Postgres. It just runs locally easily. You don't require a test account for your developers, necessarily. 

RD A lot of what I'm hearing from you today is that database indexes are very important to good performance. Can you give a few tips on how somebody can start out right with a Postgres database? 

LF You mean in the context of indexes specifically or just in general? 

RD In the context of indexes, yeah. 

LF So overall, think of indexes as ultimately a function of your workload. Whichever queries you're running against the database should probably in some cases have an index. Now the fallacy there is, if you just create an index for every single query, you'll end up with a lot of indexes. The reason that is a bad idea is that each index needs to be updated on writes, so each time you're issuing a write to the database, each index needs to be updated. And so if you had a hundred indexes on your table, that's very slow most of the time. So the most basic way of thinking about this is: what are the things that you're mostly querying for together? Let's say you have a customer ID. A lot of times you're going to query for a customer ID, so it makes a lot of sense to have a customer ID index. Where it gets more challenging, I think, is that in Postgres, for example, you can store geospatial data with PostGIS. And there, what you have to understand about indexing is the different index types. I'm sure we won't have enough time today to really dive into that, but the gist of it is it's looking at what you're querying for. Let's say you're doing some search like, “Give me all the restaurants in Manhattan,” and you have that data stored with PostGIS somehow, then you will probably want an index to support that query. And so thinking of the right indexes for your queries, and then which potential index types, GiST or GIN indexes and such, can support those queries, I think that's really the most important way to look at it. And then do tests. Actually benchmark it, actually do an explain plan, see if your index gets used; that's really how I would approach it.
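The "index as a function of your workload" idea, plus the "see if your index gets used" check, can be sketched with SQLite via Python's stdlib; the orders table, the composite index name, and the query pattern are all invented for illustration, and on Postgres you would verify with EXPLAIN instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY, customer_id INTEGER, created_at TEXT, total REAL)""")
conn.executemany(
    "INSERT INTO orders (customer_id, created_at, total) VALUES (?, ?, ?)",
    [(i % 50, f"2023-01-{i % 28 + 1:02d}", i * 1.0) for i in range(1_000)])

# The query pattern drives the index: we always filter on customer_id and
# sort by created_at, so one composite index can serve both jobs.
conn.execute("CREATE INDEX orders_customer_created_idx ON orders (customer_id, created_at)")

query = "SELECT * FROM orders WHERE customer_id = ? ORDER BY created_at DESC"
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall()
details = [row[3] for row in plan]
print(details)
```

Because the index covers both the filter column and the sort column, the plan should show an index search and no separate sort step, which is the kind of confirmation worth checking before shipping an index.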

RD It almost sounds like you're designing a key value store overlay on it. 

LF I mean, you're designing a cache of sorts. Indexes are ultimately caching data structures and you're just choosing which cache to create of sorts. 

RD I think I want to talk about the thing that everybody else is talking about, the AI-assisted database. It's interesting that you talk about optimizing and there's already the sort of abstraction layer above writing a bunch of joins where you have the ORM. What's the next level of optimization that you can get from AI? 

LF Potentially a lot, potentially not very much. As with AI generally, if you ask ChatGPT to create your indexes for you, I'm sure you'll get an answer, but is it the right answer? I'm not sure. Overall, I would say my hope is that the tedious part of the job gets automated, but there's still a human element to it. So if you ask me, is the database performance engineer, the data engineer, is that role gone in 10 years? I don't think so, but hopefully it's less of, “I'm running an explain on individual queries and then I'm manually adding indexes or testing different indexes.” So I think, to me, it's really a question of how we can get that tedious portion automated and how we can give you a better way to get an overview. A lot of the challenges I see are that people have thousands of queries on their systems that are subtly different, and they need to understand which index to create. My own philosophy, and really a lot of what we do at Pganalyze today, is around indexes and index selection. Index selection in the database space is kind of its own sub-problem, but really I would think not of “AI” but of “smart,” as in, how do we make the database smart so that I don't have to do all this unnecessary work? And so what we've designed in Pganalyze is essentially a model, a system that lets you automatically choose indexes based on certain criteria you define, but a really important aspect is that it's still transparent. You can still go in and understand what's going on, because it's ultimately a deterministic system. So I'm very skeptical, I would say, of LLMs here. I feel like LLMs are the extreme of, “I don't understand why, but this response is very human.” It's a great human-sounding response, but I don't understand it, and I think that's not good for database optimization. You need a much more transparent system than that.

BP Right. You need an audit trail there for the AI. Another thing that's come up over and over for Ryan and me recently is that companies are needing to figure out how to add vector databases, because that's what LLMs work with. How does that fit against what Pganalyze has been doing, and how would that sit side by side with a Postgres database?

LF I'll touch on the Postgres side first, actually, because one thing we touched on earlier was database choice, and I want to do a quick tangent there. The reason I personally choose Postgres, and will presumably keep using Postgres for the rest of my career, is that it's a community project that has shown longevity over a long time. Postgres this year turned 27 years old, and it has been able to essentially keep growing and evolving into whatever the situation needed. So roughly 10 years ago, MongoDB and JSON document stores were all the rage, and Postgres added JSONB as a data type specifically to store and index JSON. Now what's fascinating with vector databases is that Andrew Kane, who's just one of many people working on Postgres-related projects, created pgvector, I think two years ago. He had the foresight to start really early, realizing that vector embedding-type work would be necessary, and he created this extension, and last year we saw essentially each major cloud provider integrate pgvector into their managed offerings, because they're all like, “Oh, people want to use AI and they want to use embeddings,” and they just wanted to have a story around that. And I think to me, what it really speaks to is that the power of Postgres is its extensibility and its community, because it wasn't a corporation that said, “We need to be AI.” It's somebody in the community who said, “I'm going to create this extension because I personally have a need for this. I want to do this,” and then people adopted it. And so I think that's just overall a great story of how Postgres has this adaptability, this flexibility. 
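What pgvector stores and computes can be illustrated with a toy brute-force nearest-neighbor search in plain Python; the documents and embedding values are made up, and the real extension adds a vector column type, distance operators, and index support so actual searches don't have to scan every row.

```python
import math

# Toy "embeddings" stored alongside the rows they describe.
docs = {
    "postgres": [0.9, 0.1, 0.0],
    "mongodb": [0.7, 0.3, 0.1],
    "cooking": [0.0, 0.2, 0.9],
}

def cosine_distance(a, b):
    """1 minus the cosine similarity of two vectors: 0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

def nearest(query_vec):
    """Brute-force scan: the part a vector index exists to avoid."""
    return min(docs, key=lambda k: cosine_distance(docs[k], query_vec))

print(nearest([0.8, 0.2, 0.0]))  # postgres
```

Keeping the embeddings next to the relational data, rather than in a separate vector store, is exactly the convenience that drove the cloud providers to adopt the extension.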

BP Yeah, that's a great answer and I think it really ties back into probably why we see it so much on Stack Overflow, which is that it's community-created and people have to come together to share knowledge and figure things out if they're going to be building new features that help to future proof it.

LF Right, exactly.

[music plays]

BP All right, everybody. It is that time of the show. We want to shout out someone on Stack Overflow who came and shared a little knowledge. I've run out of lifeboats, but I found a Postgres question quite amusing. “Permission denied in Postgres. I missed a bit early on with my permissions,” then you can see here some of the things that are going wrong. “This question is really hard. I made a mistake with permissions for a folder and the whole system is broke. It's really easier to delete the server and start a new project.” That answer was accepted. So sometimes it’s best to turn it off and turn it back on again. We’ll put it in the show notes. 

RD Nuke the site from orbit.

BP Exactly. All right, everybody. Thanks for listening. I am Ben Popper, the Director of Content here at Stack Overflow. You can always find me on Twitter @BenPopper. Email us with questions or suggestions for the show, podcast@stackoverflow.com. And if you liked the show, if you enjoyed it, please leave us a rating and a review, because it really helps.

RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. It's at stackoverflow.blog. And you can contact me in my DMs on X/Twitter @RThorDonovan. 

LF I'm Lukas Fittl, Founder and CEO of Pganalyze. If you want to hear more about Pganalyze you can go to our website, or you can also go to YouTube where I host a weekly video show called 5 Minutes of Postgres, where I talk about all the interesting things that are new with Postgres. If you want to find me personally, I'm on Mastodon on the Hachyderm server as Lukas. 

BP Awesome. All right, everybody. Thanks for listening, and we will talk to you soon.

[outro music plays]