On this episode: Roie Schwaber-Cohen, Staff Developer Advocate at Pinecone, joins Ben and Ryan to break down what retrieval-augmented generation (RAG) is and why the concept is central to the AI conversation. This is part one of our conversation, so tune in next time for the thrilling conclusion.
Pinecone is a vector database that lets companies build GenAI applications faster and at lower cost.
Read our primer on retrieval-augmented generation (RAG) or explore RAG and Pinecone.
Follow Roie on GitHub or LinkedIn.
If you need a handy guide to what’s what in the AI space, check out Stack Overflow’s Industry Guide to AI.
[intro music plays]
Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I'm Ben Popper, Director of Content here at Stack Overflow, joined as I often am by my colleague and collaborator, Editor of our blog, maestro of our newsletter, Ryan Donovan. Ryan, we are entering the era of Software 2.0, as Andrej Karpathy would tell us and some guests on the podcast have mentioned. We're going to get into what that means, and one of the things that means is that there's a lot of new kinds of tools, technologies, and job functions emerging within companies. We have a guest today who's going to talk to us about a bunch of that stuff. Ryan, do you want to introduce?
Ryan Donovan Sure. So our guest today is Roie Schwaber-Cohen, Staff Developer Advocate at Pinecone, and we're going to be talking all about retrieval augmented generation. It seems to be the go-to technique/technology that everybody wants to use to control the wild hallucinations of their LLMs. So we're going to get into how it works, how to get started, and talk about some advanced techniques.
BP Cool. Well, I'm all for reducing hallucinations, at least during work hours. RAG is not my favorite acronym– actually, I've gotten used to saying RAG, it's kind of fun– but Roie, tell us a little bit about yourself. How'd you get into the world of software and technology and how'd you end up focused on this particular part of the industry?
Roie Schwaber-Cohen So I've been a software engineer for about 15 years, maybe more, I don't know, I stopped counting at some point, and I started working at Pinecone about a year and a half ago. It’s been a completely insane ride so far. Pinecone, for those who don't know, is a vector database that kind of exploded last year, and I'm a Staff Developer Advocate there. Before I started at Pinecone I worked at other AI companies, but they had nothing to do with generative AI and LLMs or anything of that sort. It was more the old, traditional kind of AI. But at Pinecone I've gotten exposed to this new world that relies mainly on embeddings, which is fascinating and I think we'll talk about it a little bit. It opens up a lot of possibilities to think about data in a really different way than people traditionally think about it.
RD So let's talk about RAG, let's get to the RAG time. I think one of my first exposures was a little five-line piece of code that was like, “This is how you implement a RAG in its simplest form.” And it was very abstracted and I was like, “There's got to be more underneath. This is abstracting a lot of stuff.” So can you give us a sort of overview of what exactly you need to implement RAG?
RS So RAG is a very abstract concept: retrieval augmented generation. There's first the generation part, which we know what it is. It's some large language model that knows how to generate text. The question is, what is this large language model going to generate? So we talked about hallucinations a little bit, and we mentioned Andrej Karpathy, and he has a pretty cool post on Twitter– sorry, on X– that talks about hallucinations in LLMs and people's expectations of LLMs. What he says is that LLMs don't hallucinate sometimes, they always hallucinate. They’re in this constant dream state, and what we're able to do with prompting is guide this dream state along to do the things that we want them to do. But the fact of the matter is that if you think about it that way, it requires you to stop thinking about them as the source of truth, and instead as a natural language interface or a reasoning mechanism that sits on top of your source of truth. And that begs the question of what would be that source of truth? How do you get to the content that the LLM will use to produce a reliable answer, a faithful answer? The answer to that is retrieval.
Now, retrieval is an ambiguous term– retrieve from where? Retrieve what? So a lot of people are actually experimenting with doing retrieval from SQL databases, graph databases, different documents. Basically, you're relying on your ability to retrieve a subset of documents that are highly relevant to the interaction the user is having with the LLM in order to force the LLM to respond based on that subset of documents. The assumption is that you probably have a knowledge base that has a lot of different things in it, and the question is how do we get from that very wide knowledge base down to a subset of documents that are going to be contextually and semantically relevant to the thing the user is intending to get back from your application as a whole?
Retrieval can take, like I said, a lot of shapes, and one of the common things that we see at Pinecone is that people can leverage embeddings to bridge that semantic world that users are in. Users speak in natural language and want to interact with their system in natural language, so we can extract the meaning of what they're asking for using embeddings, and then use those embeddings to query a vector database. That basically means we can retrieve semantically relevant content, so even if the user doesn't use the particular words that appear in the documents we're searching, the value of embeddings is that they're able to extract the actual meaning, regardless of the particular surface forms that are used. Then we're able to retrieve the documents, stuff them into the context window, and then, again, guide the LLM to say, “Hey, use this context as I've given it to you to respond to the user's query.” That's the basic flow.
It requires a lot of work to get this whole machine set up. First, you have to take your knowledge base, which is essentially, let's just say, a set of documents, and you have to create embeddings for those documents. There's a question of what portion of a document you create an embedding for. Do you take a full document and embed it, or do you first break it down into smaller segments and embed those segments, and how do you go about that process?
But essentially you end up with a set of embeddings that have references back to those documents or segments of documents, and once you have that and you have your user's query embedded, you're basically set for your RAG application to work.
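To make that flow concrete, here is a minimal sketch in Python. It is not code from the episode: it assumes an OpenAI embedding and chat model and a pre-existing Pinecone index, and the index name, model names, and documents are all placeholders.

```python
import os
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("rag-demo")  # assumes this index already exists with dimension 1536

def embed(texts):
    """Turn a list of strings into embedding vectors."""
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

# 1. Embed and upsert the knowledge base (placeholder documents).
docs = {
    "doc-1": "Refunds are issued within 14 days of purchase.",
    "doc-2": "Support is available Monday through Friday, 9am to 5pm.",
}
embeddings = embed(list(docs.values()))
index.upsert(vectors=[
    {"id": doc_id, "values": vec, "metadata": {"text": text}}
    for (doc_id, text), vec in zip(docs.items(), embeddings)
])

# 2. Embed the user's query and retrieve the most relevant chunks.
query = "How long do refunds take?"
results = index.query(vector=embed([query])[0], top_k=3, include_metadata=True)
context = "\n".join(match.metadata["text"] for match in results.matches)

# 3. Guide the LLM to answer using only the retrieved context.
completion = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ],
)
print(completion.choices[0].message.content)
```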
RD Talking about how you break up the text, I found you through a blog post you wrote about chunking strategies, looking into the best way to break up a text so that it's findable and effective for a RAG application. So can you talk a little bit about what is important about chunking and the ways you can go about doing it?
RS So chunking is going to be a huge factor in deciding what piece of content appears in your context based on the end user's query. It could be too much information or not enough information, so you want it to be just right. This may be a good moment to talk a little bit about the context window itself before we delve into chunking. The context window has limits. The context window is essentially the size of the prompt that we can give the LLM. We can give it just a line of text, no problem, but if we want to take, say, our entire database and just plop it into the prompt, that's going to be a problem. Number one, context windows have limits. They can have up to hundreds of thousands of tokens, which could represent hundreds of pages of data, but in most cases that’s not the entirety of the data set. That's one limit. The other limit is that there's been a paper called ‘Lost in the Middle’ which shows that models lose their ability to accurately retrieve information that sits in the middle of a long context. So even if you have a really big context, the effectively accurate context is much, much shorter than you would think.
That forces you to think about what exactly you want to have in the context and where you want to have it. And that brings you back to: I don't want the entire document. Even if I can embed an entire document, stuff it into the LLM, and get back a result, that might not be what I want, because I might not get an accurate response from the LLM even if the context is there and it's semantically relevant. That's the reason to start thinking about how I want to break my content into smaller chunks so that when I retrieve it, it actually hits the correct thing.
Another reason to break your content into smaller chunks is that when you're doing the retrieval step, you're taking a user's query and embedding it, and then comparing that with an embedding of your content. If the size of the content you're embedding is wildly different from the size of the user's query, you're going to have a higher chance of getting a lower similarity score, so it's going to be harder for the vector database to say, “Hey, these two concepts are similar.” Let's imagine that my query is, “What's the best way to get a flight to Tokyo?” and I have documents that are full-on descriptions of Japan and Tokyo. There might be some correspondence there, but that's not actually what I'm looking for; I'm looking for something much more specific. So you would want to limit the size of those chunks to roughly correspond to the size of the user's query and what they're intending to do with it.
In terms of how you'd go about chunking, there are basically two camps, or two main categories of chunking. One would be programmatic, in the sense that it doesn't really look at the content itself. It just says, “You're going to give me some value and I'm going to do my best to build you chunks out of that value,” and that's where the recursive-type chunking strategies come in.
They basically try to build chunks up to a certain size: they start from a point in the text and continuously add tokens to the chunk until they reach an upper limit, and they also keep some overlap between chunks so that nothing is completely lost at the boundaries. So that's a really straightforward way to go about it. An even more naive way to go about it is sentence chunking. Literally, you take a piece of content and say, “Okay, find all the periods and create sentences.” That actually might work for a lot of use cases. It might be very effective for a lot of use cases, because the question is, what is the coherent semantic unit that would be applicable to your user’s request? When you do the retrieval, what are you going to get back, and how is the LLM going to compose a response based on the content you're retrieving?
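As a rough illustration of those two naive strategies, here is a plain-Python sketch of a fixed-size chunker with overlap and a sentence splitter. The function names, sizes, and the sentence regex are arbitrary choices for illustration, not anything Pinecone prescribes.

```python
import re

def fixed_size_chunks(text, max_chars=500, overlap=50):
    """Recursive-style chunking at its simplest: keep adding characters until
    an upper bound is reached, then start the next chunk slightly before the
    previous one ended so nothing is lost at the boundary."""
    chunks, start = [], 0
    step = max_chars - overlap
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += step
    return chunks

def sentence_chunks(text):
    """Even more naive: split on sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```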
BP So it's interesting to me to hear you say that length is very important. I've never heard that before. In my experience learning about this stuff, we've always just been focused on the fact that it's going to create these different vectors that have numerical representations assigned to their semantic meaning, and therefore, when it's pulling things back, it can find what's relevant. The amazing thing about LLMs is that through that and a little matrix multiplication, suddenly you feel like you're having a reasonable conversation.
RS So what I was trying to explain is that when I have a user's query, if I embedded, let's say, a full chapter of content instead of just a page or a paragraph, what will happen is that the vector database is going to find some semantic similarity between the query and that chapter. Now, is all of that chapter relevant? Probably not. Maybe, but probably not. And the more important question is, is the LLM going to be able to take the content that you got back and the query that the user had and produce a relevant response out of that? Maybe, maybe not. Maybe there are confounding elements within that content, maybe there aren't. It's going to be dependent on the use case.
What we found, for the most part, is that you have better luck if you're able to create smaller, semantically coherent units that correspond to potential user queries. Then you could potentially have multiple matches, but you have a lot of room to play with, because there are two knobs you can turn when you're retrieving content from the vector database. One is the number of results that you're going to get back, so you can say top k is 5 or 10 or 1,000 or whatever. The other is the similarity score. You can basically say, “Listen, give me the top 1,000, but only give me those that have a similarity score of at least 0.9,” just to throw a number out. And what would that do? That will weed out all the things that might be similar but not similar enough. That way you can sort of guarantee that what you're getting back is at least semantically relevant, and you can, again, control the amount by fiddling with that top k knob.
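Those two knobs map directly onto the retrieval call. A small sketch, reusing the hypothetical `index` and `embed` from the earlier example, and assuming a cosine-style metric where a higher score means more similar:

```python
# Knob 1: ask for a generous number of candidate matches.
query_embedding = embed(["What's the best way to get a flight to Tokyo?"])[0]
results = index.query(vector=query_embedding, top_k=100, include_metadata=True)

# Knob 2: keep only matches that clear a similarity threshold. 0.9 is just the
# number thrown out in the conversation; the right value depends on your data.
relevant = [m for m in results.matches if m.score >= 0.9]
context = "\n".join(m.metadata["text"] for m in relevant)
```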
RD So you mentioned not embedding a whole chapter, but some of that chapter may actually be sort of relevant. Is there a way to explicitly link embedded text back to its source when you're retrieving?
RS Yeah. So that's another big thing, and a huge part of what vector databases are there for. With each vector that you're embedding, you also associate some metadata. That metadata, whether you've chunked your text or not, would include a reference to the original document that the embedding came from. It could even include the text itself, it can include categories, it can include user information if you'd like. It could really include anything. It's kind of like a JSON blob that you can use to either filter things out, so you can reduce the search space significantly if you're just looking for a particular subset of the data, or to link the content that you're using in your response back to the original content.
And in fact, we have a demo up that demonstrates that. It basically walks you through that entire process: you point it to some URL and it goes and retrieves the content from the– sorry, retrieve is the wrong word– it crawls that URL and creates segments, small chunks, out of the contents of the HTML page that you're crawling. Then it does the embedding, upserts, and when you ask a question, it responds with the actual answer and a reference to exactly where in the page that answer came from. The way that you achieve that is exactly by leveraging the metadata that's associated with these vectors.
That actually leads me to the second category of chunking strategies, which is more content aware. The way that is, to me, the most effective in dealing with structured content is using Markdown. So almost every type of content– HTML, PDFs, et cetera, can be converted into Markdown, and that means we can maintain the semantic structure that an author put in to indicate to us what the semantically coherent units are. I have paragraphs, I have headings, I have these hints within the file itself that tell me what a segment is, where it starts and where it stops. By leveraging that, I can be confident that each unit is internally coherent. It's not cut in the middle, I'm not going to have to combine chunks together to make something make sense, and pieces that actually need to stay together, for example code examples, maintain their coherence as well. If you just took a piece of code in Markdown and gave it to the recursive text chunker, you would get back broken code. It would just break in the middle, because it would reach the number of tokens it needs and just stop. Whereas a Markdown splitter would know, “Hey, I'm looking at a code segment here. It cannot be broken down. I'm going to need to embed it all as one unit.”
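To show roughly what "content-aware" means here, below is a toy Markdown splitter: it starts a new chunk at each heading, refuses to cut inside a fenced code block, and carries the nearest heading plus a source URL along as metadata so each chunk can be linked back to where it came from. It is a sketch of the idea under those assumptions, not Pinecone's implementation; libraries such as LangChain ship real Markdown-aware splitters.

```python
def markdown_chunks(md_text, source_url):
    """Toy content-aware splitter: new chunk at each heading, never split
    inside a fenced code block, and attach metadata linking each chunk
    back to its source document and section."""
    chunks, current, heading, in_code = [], [], "", False

    def flush():
        if current:
            chunks.append({
                "text": "\n".join(current),
                "metadata": {"source": source_url, "heading": heading},
            })
            current.clear()

    for line in md_text.splitlines():
        if line.lstrip().startswith("`" * 3):
            in_code = not in_code            # entering or leaving a fenced code block
        if line.startswith("#") and not in_code:
            flush()                          # a heading starts a new semantic unit
            heading = line.lstrip("#").strip()
        current.append(line)
    flush()
    return chunks
```

Each chunk's metadata dict is the kind of thing you would attach to the vector at upsert time, which is what lets an application cite exactly where in a page its answer came from.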
[music plays]
BP All right, everybody. It is that time of the show. Let's shout out someone who asked a great question on Stack Overflow and earned themselves a badge. A Famous Question Badge to Gelso77 for: “Angular: How can I clone an object in TypeScript?” Asked four years ago, viewed 10,000 times. If you've got that question, we've got an answer for you. As always, I am Ben Popper. You can find me on X @BenPopper. Email us questions or suggestions, podcast@stackoverflow.com. And if you like the show, leave us a rating and a review.
RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. You can read it at stackoverflow.blog. And if you want to reach out to me on X/Twitter, my handle is @RThorDonovan.
RS I'm Roie Schwaber-Cohen. I work at Pinecone, I’m a Developer Advocate there. To check out my work, you can just go to pinecone.io and all of our stuff is there. You can find me on Twitter and LinkedIn as well, and I'm happy to talk.
BP Sweet. All right, everybody. Thanks for listening, and we will talk to you soon.
[outro music plays]