The Stack Overflow Podcast

Chunking express: An expert breaks down how to build your RAG system

Episode Summary

This is part two of our conversation with Roie Schwaber-Cohen, Staff Developer Advocate at Pinecone, about retrieval-augmented generation (RAG) and why it’s crucial for the success of your AI initiatives.

Episode Notes

Build GenAI applications faster and cheaper with a vector database like Pinecone.

New to retrieval-augmented generation (RAG) and other GenAI topics? Our guide is a good place to start.

Learn more about RAG and Pinecone.

Connect with Roie on GitHub or LinkedIn.

Episode Transcription

[intro music plays]

Ben Popper Hello, everybody. Welcome to the Stack Overflow Podcast. If you are listening today, we are airing Part 2 of our interview with Roie Schwaber-Cohen, who is a fantastic Developer Advocate over at Pinecone– Staff Developer Advocate, I should say. And we're chatting all about LLMs, retrieval augmented generation aka RAG, how you do your chunking, how you do your embedding, and how you get them to behave, stop hallucinating, and help you out as a natural language operating system or interface, a really powerful tool. So if you didn't catch the first episode, it'll be in the show notes– it came out Friday. Otherwise, tune in, enjoy the conversation, and we hope you have a good time.

[music plays]

Ryan Donovan I'm thinking over here that we're almost looking at a way back to the semantic web and XML. HTML is already a markup language, so is there a future where people are just sort of hand-marking everything up with ‘this is the title of a book’ sort of thing?

Roie Schwaber-Cohen Yeah, I can totally see a world where we make our content easier to understand for machines. I think that all the pieces are there. The semantic web is already there, so it's not that huge of a leap to go from here to there. We do have to remember that chunking is an important piece of the puzzle, but by no means is it the most important piece. Even if you completely messed up your chunking, there are a lot of ways for you to still sort of recover. So of course you'd want to do that well, but I feel like people are sometimes a little bit overexcited about the power of chunking. At the end of the day, it really boils down to: can you get away with just doing simple automated recursive text segmentation, or do you need something a little bit more elaborate, in which case just use Markdown segmentation, and that's basically it.
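To make those two options concrete, here is a minimal sketch using LangChain's text-splitter utilities. The file names, chunk sizes, and header mapping are assumptions for illustration, not anything specified in the episode.

```python
# A minimal sketch of the two chunking approaches mentioned above, using
# LangChain's text splitters (the `langchain-text-splitters` package is
# assumed to be installed). File names and chunk sizes are illustrative.
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

raw_text = open("handbook.txt").read()      # hypothetical plain-text corpus
markdown_text = open("handbook.md").read()  # hypothetical Markdown version

# Option 1: simple automated recursive text segmentation.
recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
plain_chunks = recursive_splitter.split_text(raw_text)

# Option 2: Markdown segmentation, so each chunk keeps its section headings as context.
header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "title"), ("##", "section")]
)
md_chunks = header_splitter.split_text(markdown_text)

print(len(plain_chunks), len(md_chunks))
```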

BP So it was interesting what Ryan said. I used to have a colleague join us on the podcast– Paul Ford– who was always bemoaning the unfulfilled dream of the semantic web, and now we have semantic agents, some of which can search the web, and maybe they can start to relate these things to us and pull those links together without us having to go and annotate it. Certainly if it's grabbing a database within your company or something, you want to have ways to assist it, but the ones that are out there doing the Common Crawl are beginning to make sense of those relationships, which is interesting.

RS For sure. 

BP If you don't mind, I would love to work through a thought exercise with you. Let's say you're at an organization and they want to add GenAI. The first thing to say is, “All right, what's the business case here? Do we really need this? Is it going to serve any purpose?” And you could come up with something, and then the next step, I'd think, would be to go to a playground. You could have your engineers, and if you're lucky enough to have an ML or data science team, test out different models or roll their own model and see what they get as they do a bunch of prompt engineering. And then at that stage, what would you look at as the factors that would make you decide? Would you say, “All right, we want to train our own model from scratch on our in-house data. That way we're way more in control of the legal and governance risk. We know what data it was trained on. It's not copyrighted.” Would you say we're going to take an open source model, LLaMA 2 or something, that comes with its own training data, but we're going to add some of our own? Or would you say to just take something existing and figure out how to do RAG, because that's really going to let us use our own data in-house and constrain the answers to this data that we understand? Out of those three, what would make you pick any of them? What's right for what size company would be my first question.

RS I think that you have to recognize the challenges that each one of these approaches entails. Creating your own model, I think, is the highest-cost, highest-effort type of endeavor, because it requires you to have loads and loads of data, and of course, the human resources to actually build the model and train it and maintain it and do all the work around that. It's not an easy task to accomplish even if you have the right team, because it's not just about the team; it's about the data, it's about the labeling, it's about all the things that are required in order to actually make a useful large language model. For the most part, I think that if you're not a Fortune 10 company, that's probably not a very reasonable path for you to go down. Another way to approach things is to go the fine-tuning route, which is, you take a model off the shelf and you say, “Okay, this is good up to a point, but my company is actually dealing with a very specific subset of petrochemicals, and LLMs don't know anything about those petrochemicals, and I want to teach them about that so they can respond in a certain way,” and then you go ahead and create that subset of documents that are labeled with your specific data set. Again, super high cost and requires some specialty, and the benefit that you get back from it, to me, is a little bit dubious, because you still don't get the benefit of saying, “Hey, when I give back a response, this is how I built it. This is how I got to it.” The LLM is still hallucinating and doesn't stop hallucinating just because you fine-tuned it. It's just hallucinating better, I guess. It's dreaming about petrochemicals now instead of just dreaming about sheep.

BP It's in the right ballpark. 

RS It's in the right ballpark. 

RD Trying to do an inception there. 

RS Right. The reason why RAG is so effective is that it allows you to, at very, very low cost, produce a system that is explainable to the point where you can understand why an answer is what it is, and you can challenge that. You can basically say, “Hey, you answered this and you gave me these references as your response, and it doesn't make any sense.” I can give the thumbs up, give the thumbs down, and improve the system as it goes. That is, I think, the main selling point for RAG, beyond just the cost and the complexity and the fact that it's super easy to use for teams even without any ML engineers whatsoever. Another thing that I think is important to recognize is, with the rise of open source models and compute becoming cheaper and cheaper, you're seeing more and more providers that are just competing for your GPU hours. You don't need to have an in-house data science team in order to build a very effective AI application, and certainly not a generative AI application. That's, to me, the main reason why RAG is probably the best course for most people. Of course there are edge cases. For example, if we go back to the petrochemical example, or if you're doing molecular biology or things that an open source large language model would have no hope of knowing about, then you have to do some level of fine-tuning, even if it's just to be able to create the embeddings correctly. Because if you're just going to give it obscure microbiology or molecular biology terms, it's not going to know how to place them correctly within the vector space and you're just going to see weird stuff going on.

RD You have to teach them the new language, right?

RS Correct. That said, another feature that a lot of vector databases have, and Pinecone definitely has, is hybrid search, where you're able to leverage both what are called ‘dense embeddings,’ which are the embeddings that you know, and ‘sparse embeddings,’ which are embeddings that represent the keywords without using those very, very dense vectors. Then you can combine the more traditional keyword-based search methods with the more modern embedding-based semantic search, if that makes sense, to deal with those very domain-specific fields where you're using very domain-specific terminology.
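A rough sketch of what a hybrid query can look like with the Pinecone Python client follows. The index name, the embedding values, and the alpha weighting are made up for illustration; in practice the dense vector comes from an embedding model and the sparse vector from something like BM25 or SPLADE.

```python
# A hedged sketch of hybrid (dense + sparse) search with the Pinecone client.
# All values below are stubs; a real index would use full-size embeddings.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("hybrid-demo")  # hypothetical index created with the dotproduct metric

dense = [0.01, -0.12, 0.33]                             # semantic embedding (stub)
sparse = {"indices": [42, 1337], "values": [0.8, 0.3]}  # keyword embedding (stub)

def hybrid_scale(dense_vec, sparse_vec, alpha):
    """Convex weighting: alpha=1.0 is pure semantic search, alpha=0.0 is pure keyword search."""
    scaled_sparse = {
        "indices": sparse_vec["indices"],
        "values": [v * (1 - alpha) for v in sparse_vec["values"]],
    }
    scaled_dense = [v * alpha for v in dense_vec]
    return scaled_dense, scaled_sparse

d, s = hybrid_scale(dense, sparse, alpha=0.7)
results = index.query(vector=d, sparse_vector=s, top_k=5, include_metadata=True)
print(results)
```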

RD So RAG is probably the thing that anybody getting into Gen AI is going to want, at least in the beginning. What are the more advanced techniques? I want to make sure we touch on that. 

RS For sure. So the question is, what are the things that RAG isn't good at, and what are the challenges you're going to run into where you're going to want to start thinking about improving the naive flow of: a user asks a question, we embed the question, we search the database, and we respond. First is that the user doesn't know what they don't know, and they don't know how to ask the question. So the user might ask, “Plan a trip to Tokyo for me.” Where do you go with that? Do you take that query and just search that in your vector database? Is that going to give you all of the information that you need to do a good job of planning a trip for the user in Tokyo? So one technique that people apply falls under the methodology known as RAG fusion, which combines a whole set of different sub-techniques that solve different parts of the problem. This part of the problem is solved by what's called ‘multi-query generation.’ We start with the user asking, “I want to plan a trip to Tokyo,” and we basically tell an LLM, “Take this query and break it down into multiple queries that are distinctly different from one another.” And then it would be like, “What museums can I visit in Tokyo? What restaurants can I visit in Tokyo? What are some activities that happen during nighttime? What are some day trips?” That gives us a set of questions that we can then take, embed, and use to query our knowledge base. Our ability to effectively answer the user's question just goes up significantly, and that is the first step. Instead of just taking the query at face value, we do what's called ‘multi-query generation’ and/or query expansion. We try to understand how to improve upon the user's query, in conjunction with our knowledge of our own knowledge base, to make that query more effective. So that's one thing. Then there's the question of, how do I make sure that what I retrieve is actually relevant and not just junk? The naive way is, I have the query, I request the thing, and then I assume that all the things that came up are going to be correct, but maybe they're completely not. And so one of the methods that you can use here is what's called ‘re-ranking.’ There are actual re-ranking models, and we have a post about that by James Briggs on our website. You take the result set that comes back from a vector database, your top k, whatever, and you pass those results into a different model that basically says, “Based on the query, here's a better ranking for the results that you got back,” and that gives you a better chance of giving the user exactly what they want. So we're basically adding more steps in between to augment the results further after we've retrieved them from the database. Another technique, called ‘corrective RAG,’ basically looks at the result set that you got and determines whether it's adequate or whether it has to be modified in some way. So, for example, if it finds that you got back a result set but it doesn't adequately answer the user's question, it might start using tools from the world of agents, like web search, to augment the response: “Hey, I found this content but it's not enough. Maybe use this content and this query to build a web search query, retrieve the content, and then, based on that, build the response.” Same thing if the answer that you got back is ambiguous, like, “I don't really quite know if I answered it or didn't answer it.” You can apply the same logic to improve upon the question. So that's what's called ‘corrective RAG.’
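As a concrete illustration of the multi-query generation step, here is a hedged sketch that asks an LLM to rewrite one broad request into several narrower queries before retrieval. The model name, prompt wording, and the `search_index` helper are illustrative assumptions, not a specific library API.

```python
# Multi-query generation sketch: expand one broad question into several
# distinct sub-queries, then embed and search each one. Model and prompt
# wording are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def expand_query(question: str, n: int = 4) -> list[str]:
    prompt = (
        f"Break the following request into {n} distinct, more specific "
        f"search queries, one per line:\n\n{question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.strip("-• ").strip() for line in lines if line.strip()]

sub_queries = expand_query("Plan a trip to Tokyo for me")
# e.g. ["What museums can I visit in Tokyo?", "What are good day trips from Tokyo?", ...]

# Each sub-query is then embedded and run against the knowledge base; the
# merged results can be passed to a re-ranking model before the final answer
# is generated. `search_index` stands in for whatever retrieval call your
# stack provides.
# results = [search_index(q) for q in sub_queries]
```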

BP And you mentioned you had worked in more traditional AI fields before coming over to work on Gen AI. My understanding is that in some of those sort of agent-based or multi-step or chain of thought type approaches, you're using LLMs for certain things, but you're also using more sort of rules-based, classical symbolic AI to render some of those decisions or help it through that sort of chain of thought?

RS You could. Most of what I've seen so far is mostly about retrieval in the first place. So the question is, how do you go from an LLM and a fully open, semantic query that is expressed in natural language to a more traditional system, like, for example, a SQL database or a graph database? Knowledge bases as a concept started with graph databases: basically triples that hold facts, like, “Entity one is related to entity two with this type of relationship. Joe knows Bob, Bob is friends with Joe, Bob works at Costco.” These are facts that can be represented on a graph. And then you have the concept of reasoners that can answer, “Who works at Costco?” And the reasoner would say, “Oh, it's Bob. Based on this knowledge graph, I can say that it's Bob.” The question is, how do you go from this very open-ended natural language la-la land into the much stricter world of, “Okay, now I'm in a graph and I need to ask questions of this graph.” And part of what I've seen that's very, very interesting work, both from OpenAI and from others as well, from LangChain as well, is that people are basically leveraging LLMs to build queries in those target languages. So I'll ask the question, “Where does Bob work?” and then the LLM would basically be told, “Please build me the Cypher query that would correspond to this question.” Cypher is the sort of lingua franca of graph databases, used primarily by Neo4j, but it has become the most popular language for any database that supports what's called ‘openCypher.’ And basically that question will be converted into that more structured world of, “Here's my Cypher query and now I can provide you the insights.” So you get back a graph, and that graph will give you the answer that you're looking for, and then you give the LLM that context. That is actually the context you retrieved. It's not semantic, it's actually a graph. Now, these techniques are pretty brittle, so it's really hard to ensure that your query is always going to be faithful to the schema of your graph, or that your graph will actually have the information that you're looking for, which is why I think that, in the end, the more effective methods are going to be a combination of vector databases and graph databases and potentially other SQL-based databases and document-based databases, all working together to produce really good content and really good context for the LLM. The same thing goes for other types of databases, and there are extensions and modules in LangChain that actually do this. They go from this open-ended question to SQL, so now you have the ability to connect your existing SQL databases to add more logic into your system in ways that are more traditional. And beyond that, to go back to your question about AI 1.0, things like classifiers and reasoners and whatnot, I haven't seen that much of it as part of the RAG pipeline yet. I can totally see, in the world of re-ranking, or in making decisions along the way, like deciding whether or not a question is sufficiently accurate, more traditional machine learning models coming into play, but I haven't seen it proliferate in a really substantial way yet.
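Here is a hedged sketch of that text-to-Cypher pattern: the LLM turns a natural-language question into a Cypher query, the query runs against a graph database, and the rows come back as context for the final answer. The connection details, schema description, and model name are all assumptions for illustration; LangChain also ships packaged versions of this flow, as mentioned above.

```python
# Text-to-Cypher sketch: generate a Cypher query with an LLM, run it against
# Neo4j, and return the rows as retrieval context. Everything here
# (credentials, schema, model name) is illustrative.
from neo4j import GraphDatabase
from openai import OpenAI

client = OpenAI()
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# A hypothetical schema description the LLM can target.
SCHEMA = "(:Person {name})-[:WORKS_AT]->(:Company {name}), (:Person)-[:KNOWS]->(:Person)"

def question_to_cypher(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{
            "role": "user",
            "content": (
                f"Graph schema: {SCHEMA}\n"
                f"Write a single Cypher query, with no prose and no code fences, "
                f"that answers: {question}"
            ),
        }],
    )
    return resp.choices[0].message.content.strip()

def graph_context(question: str) -> list[dict]:
    # Brittle by nature: the generated query may not match the schema.
    cypher = question_to_cypher(question)
    with driver.session() as session:
        return [record.data() for record in session.run(cypher)]

print(graph_context("Where does Bob work?"))
```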

[outro music plays]

BP All right, everybody. It is that time of the show. As we always do, let’s shout out somebody who came on Stack Overflow and shared a little knowledge or curiosity and was awarded a badge. To Andreas Frische: “What is the syntax for Typescript arrow functions with generics?” Well, it was a question that was asked 8 years ago and it’s been viewed 415,000 times, so a lot of people have benefitted from this curiosity. Appreciate it, Andreas, and congrats on your badge. All right, everybody. As always, thanks so much for listening. We hope you enjoyed Part 2 of our conversation. We will stick Part 1 in the show notes if you want to catch the first half. I am Ben Popper, I’m the Director of Content here at Stack Overflow. You can always find me on X @BenPopper. You can always hit us up with questions or suggestions, and if you like the show you can leave us a rating and a review.

RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. You can read it at stackoverflow.blog. And if you want to reach out to me on X/Twitter, my handle is @RThorDonovan. 

RS I'm Roie Schwaber-Cohen. I work at Pinecone, I’m a Developer Advocate there. To check out my work, you can just go to pinecone.io and all of our stuff is there. You can find me on Twitter and LinkedIn as well, and I'm happy to talk.

BP Sweet. All right, everybody. Thanks for listening, and we will talk to you soon.

[music plays]