The Stack Overflow Podcast

Semantic search without the napalm grandma exploit

Episode Summary

Ben and senior software engineer Kyle Mitofsky are joined by two people who worked on the launch of Overflow AI: director of data science and data platform Michael Foree and senior software developer Alex Warren. They talk about how and why Stack Overflow launched semantic search, how to ensure a knowledge base is trustworthy, and why user prompts can make LLMs vulnerable to exploits.

Episode Notes

Last month, we announced the launch of OverflowAI from the stage of WeAreDevelopers. To learn more about AI-driven products and features in the works, check out Stack Overflow Labs.

Among the projects Alex works on is a semantic search API and the new search experience on Stack Overflow for Teams.

LLMs can be vulnerable to jailbreak attacks like the napalm grandma exploit.

Kyle is on GitHub, Linked, and text-based social media.

Michael is on LinkedIn.

Alex is on LinkedIn.

Shoutout to Lifeboat badge winner Pushpendra, who scooped Error: Invalid postback or callback argument from a churning ocean of ignorance.

Episode Transcription

[intro music plays]

Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I am your host, Ben Popper, joined as I often am these days, by my co-host, Kyle Mitofsky, Senior Software Engineer here at Stack Overflow.

Kyle Mitofsky Hey, hey.

BP We have a great conversation for y'all today. We're going to be talking a lot about the announcements we made recently at We Are Developers concerning the launch of OverflowAI and a bit about how we got here. So we have two great guests today: Michael and Alex. I'll let them introduce themselves in a second and tell you a bit about what they do here, and then we're going to dive into a conversation about what we've been building, how it's going to impact hopefully Stack Overflow the public platform and Stack Overflow for Teams, and then a bit of where all this is headed with some questions about some of the underlying technologies. So Michael and Alex, welcome to the podcast.

Michael Foree Thank you, Ben. Good to be here.

BP Michael, why don't you kick it off. Tell people just really quickly who you are, what your role is here at Stack Overflow, and a little bit about what your specific sort of function was as we were thinking about launching OverflowAI and what all that is going to mean for Stack Overflow as a whole.

MF Yep. So my name is Michael Foree. I'm the Director of Data Science and Data Platform here at Stack Overflow. And for part of OverflowAI, I've been leading data scientists, data engineers, machine learning engineers, and trying to hold together the glue of all the data stuff that's flown around, making sure the data is being stored properly, that it's being sent to the right machine learning models, that they're being built out in a timely fashion, and helping to ideate on what type of models we're going to need to support the different OverflowAI initiatives.

BP Nice. And Alex, how about yourself?

Alex Warren Hi, I'm Alex Warren. I'm a Senior Developer here at Stack Overflow, just like Kyle. I'm also a Tech Lead. I'm Tech Lead of one of the teams that has been building bits of OverflowAI.

BP Very cool. And what team are you the lead of Alex? What is the focus of your team, let's say?

AW The focus of my team. Okay, so my team is responsible for building a bunch of services and UIs for OverflowAI. In particular we are building an AI gateway, so that's an API that other teams across the company can use to connect to large language models. And we're also largely working on search APIs, so there's a semantic search API we’re working on, which we'll talk about in a minute I expect, and there's also the front end and back end of what we call the search experience, so the new search experience on Stack Overflow for Teams, we are building the front and back end of that as well.

BP Very cool.

KM So maybe let me set the stage here with my viewpoint on how folks are using LLMs at large and what the level of– there's a Wired series, it's like, “Explain at five levels: beginner level and then intermediate level and advanced and expert.” And so in the world of LLMs, I think there's kind of three different levels of depth and complexity you can take on in leveraging an LLM. The first is just we're going to solve it at the prompt level. We're going to use an off the shelf model. Everything off the shelf, but I'm going to prompt it and I'm going to get something back. And there's a lot of ways to leverage an LLM just by prompting it and doing it that way. I think the second level of depth you can get is using embeddings. So you put your own content there in a way that the LLM understands, and then you can retrieve those, you can augment them, you can do certain stuff. And then the third level of intensity, I think there's a funnel and the most people are doing it at the prompt, some people are doing embeddings, very few people are building the actual models themselves. I was curious if that's the right framing, but if it is, I think in our project we did the first two, is there a place you gravitated between those first two in terms of where you were picking up knowledge, where you were working on stuff, and then is there any interest in us pursuing the third? Are we ever going to look at a whole darn model, or is it mostly the first two where we can take advantage of a lot of the ways that technology applies to software?

AW So I can speak to a bunch of that stuff. So just as a bit of background, like a lot of developers about six months ago, I hadn’t touched anything to do with AI whatsoever. So when we started on this project, we were starting from very rudimentary knowledge and we had to upskill really fast. So it's been a nice learning journey for me, and I think what I have discovered actually is, like you said Kyle, at the prompt level, there's a lot of stuff you can do actually there. I would say what I have been focusing on is basically those first two things you talked about: prompts and embeddings. We definitely use embeddings, so we can generate embeddings for content that's on Stack Overflow and Stack Overflow for Teams. By generating those embeddings, we can do better search basically, which then lets us take those search results and we can use that in prompts and do some pretty cool stuff.

BP Michael, aside from those two, do you think that we have a way we might get into the third piece of that, like our own models or our own training? Or does it make more sense for Stack Overflow to let other companies work on those foundational aspects or pick things off the shelf and focus instead, as Kyle was saying, on making sure our prompting and our embeddings are great so that people can make use of the incredible data that our community has and will continue to create?

MF I certainly do think that there's a future where Stack Overflow fine-tunes our own large language model, and there could be a handful of different business reasons why we would want to fine-tune our own LLM. My personal desire is that absolutely, I want to fine-tune a large language model and I want to do it just for fun because I can. But when it comes down to how am I going to convince my boss to let me do this, I think that there's a lot of lift, as you say, in having better prompts going to a foundational large language model. There's a lot of lift in the retrieval aspect where you pull in your own content and you augment what the large language model is generating. There's even opportunity to game and A/B test different large language models. All of these things are much, much easier to do than fine-tuning your own large language model, and so it's difficult for me to predict if we're going to start fine-tuning our own for the enhancement of OverflowAI.

BP Gotcha. So let's talk a little bit about what we have been working on and a little bit of what you shared in your fireside chat that you did together at We Are Developers. I think the beginning of this is a search comes in and then we're going to generate vector embeddings to look up in a vector database. At the end of that, we might rank and merge the results. We also have a lexical database. We shared a great article about this from David Haney and David Gibson, two other folks in our team; I'll link to that in the show notes. But let's start at the beginning of that journey. How did we go about thinking about a vector database? Let's talk a little bit about that and how we would've constructed it, and then from there, let's move to our approach to embeddings and how that’s working at the moment.

MF So we do have this rather in-depth architecture. And I want to preface this, there are some members of our audience who might not be fully aware of what you can do with large language models. The first approach that anyone should do, anyone listening to this should absolutely find access to a large language model and just start prompting it and asking it questions and see what happens. And you'll very quickly find that it gives some really interesting information. Sometimes it's right, sometimes it's wrong, sometimes you don't actually know the difference because you're asking for stuff outside of your depth. And here at Stack Overflow, one of the things that we wanted to do is we wanted to seek to deliver high quality information to our users and leverage the expertise of the contributors that we already have on our platform. And going against a straight large language model, irrespective of what that large language model is, we very quickly found that the quality that we wanted to deliver just wasn't there and the community involvement also wasn't there, and so we decided to leverage our own content to augment what large language models do. And so the very first step of that is we overhauled search, we overhauled our ability to discover content on stackoverflow.com. The traditional search solution is a lexical search, think Elasticsearch. It looks at words and it says, “Oh, you search for the word ‘run.’ I'm going to help you find the word ‘run.’ I'm going to get fancy, I'm going to help you find the word ‘running.’ You search for Python. I'm going to look for the word ‘Python.’” Semantic search kind of takes this on its head and it says, “Oh, you searched for Python. That's probably not a snake. That's probably a programming language, so I’m going to help you find programming languages about Python.” If you search for ‘run,’ maybe it's dog running and it's this type of aspect. Or maybe it's not dog running, maybe it's computer running, so I'm going to help you find the concept of, “Is the computer on and functioning?” And so semantic search changes the content discovery paradigm that we have to work with, and it helps us drive great new capabilities with discovering content.

BP For the ‘explain it to me like I'm five,’ being the resident non-engineer in the house, it's been described to me as like a vector database. Think about it as a 3D space where these different embeddings are getting stored and they're, as you said, being grouped together based more on their meaning, like two related subjects or topics, as opposed to keywords that you would have to group together in the search. One of the things that you pointed out at the beginning that's really interesting is that anybody can run up against ChatGPT. You can pretty easily connect to that API, but the quality of the results, as the model itself will tell you, includes a lot of hallucinations because it's reading the entire internet and trying to guess what you want to hear in response. What we're doing instead is using the Stack Overflow database as our corpus and hoping that that will get us more accurate results. Kyle, I know you wanted to dig a little into this. I'll throw it over to you for the next question.

KM Yeah, sure. So one of the ways that we take semantic search is via embeddings. That's where we take the content, the gist, the smell of a particular set of text, and we store it as a vector and we put it in a vector database. As someone who really only touches SQL, can you compare a vector database to a SQL database? Can I just select star from it? How do you touch it?

MF Yeah, so let's talk about that a little bit. So the first is that many people think of a vector database as taking a string of words and storing it in a different database and that's not quite right. The first step is that you take the string of words and you convert it into numbers. You don't convert each word into a number, you take the entire string, and you understand it’s semantic understanding, it’s language understanding, and you convert all of that into numbers. And so you might have a thousand different numbers, and number one might mean something like, “Is this set of words about Python, yes or no?” And not really yes or no, it's on a scale of 0 to 100%, is this about Python? And the next one would be, “Is this about Java on a scale of 0 to 100. Is this about programming? Is this about petting your dog, or so on and so forth, rinse and repeat a thousand times. And so you have these thousand different numbers that get stored in your favorite data store. And I say data store, I didn't say database and I didn't say vector database. You can store these numbers however you please. You can store them in a hashed array in memory that disappears as soon as you turn your computer off if you so desire.

KM And I plan to.

MF And Kyle, honest to God, a year ago, this is what we were doing at Stack Overflow when we were testing this stuff out. We were just storing this in memory, because why not? It got the job done, and for the longest time there weren't a lot of vector databases out in the market for us to choose from. A lot of other people are storing their vector embeddings just in memory or in a text file or in a normal SQL database, and it gets the job done. But this is where the SQL database or the in-memory solution starts to break down. We have tens of millions of questions. If we were to store all of that in memory, it would take up a lot of space. I don't know how much, I haven't run the calculations recently, but we would run out of space. We can't do that. We could store this in a normal SQL database in your favorite relational database. But here's the next step that you're going to do with a search. You're going to get a search query that comes in, you're going to convert this search query into your vectors, and so you've got a thousand numbers, and then you want to search against your existing embeddings on all thousand different numbers. So Kyle, have you ever indexed a database on a couple of different columns?

KM Sure have.

MF Have you ever indexed a database on a thousand different columns?

KM No.

MF I've never done that either. I don't even know if you can. I haven't tried, I don't want to try. So that's basically what you have to do if you're going to turn a SQL database into a vector database. Disclaimer– Postgres is working on this. They have an add-in where you can turn a Postgres database into a vector database. I haven't actually tried it, but they're working on this. But one of the things that vector databases specialize in is they don't index on all 1,000 different columns and allow you to search on all 1,000 columns. Instead, they use K-nearest neighbor tricks, and they cluster the different data into a bunch of different smaller clusters. So then you take a pretty good guess at, “Oh, here's this new embedding that comes in. I'm not going to search against all 10 or 50 or a hundred million rows. I'm only going to search against a couple of different clusters. I'm going to narrow it down to my favorite clusters, and then I'm going to search against those and I'm going to do it really, really, really fast.” And there's a lot of different vector databases that have popped up over the last six months and the field is changing drastically. Day over day, there's new advancements that are coming out.

BP I'm pretty sure we talked about this in the blog post. I just wanted to get it out there that we chose Weaviate. Is there anything you can say about that and why we chose that one, why it works particularly well for us, or why it's suited to our needs?

MF Yeah. Weaviate does a couple of different things that we were looking for. One is, it's self-hosted. We didn't need a self-hosted solution, but hosting it ourselves relieved a lot of different problems around security, compliance, making sure that the other place that hosts this data is good. Weaviate also comes with a vector database as a service option, so if we so chose in the future, we could ask Weaviate to host the vector database and we just pay them some money. They also make it easy for us to– the term isn't re-index, but you can imagine when you add or change a bunch of data inside Weaviate, inside the vector database, you have to put it back in the clusters that it's expecting. And so they have a pretty straightforward way of doing that and we want to update our data every so often and Weaviate makes it simple to do that. And then it had pretty good user community support.

BP Right. And I think the last one that was mentioned is that we leaned pretty heavily into PySpark for some of this work and it needed to have that native Spark connection. So from the tech stack we chose, it was a good match, right?

MF Yes. Valid contribution, yes.

BP Valid contribution received. So we've talked a little bit about vector databases and embeddings and some of the reasons. Can we talk a little bit about what RAG is, why we would choose to lean into that, and how we think it will help with the search results? Alex, maybe you want to take that one?

AW Sure. So I think a lot of people when they are starting to play around with these large language models jump straight into prompt engineering, and you can get some cool results back when you ask an AI a question. But what we're doing here is rather than using the knowledge that's in the large language model itself, we want to use the knowledge that's either on public Stack Overflow itself or for our business customers in their Stack Overflow for Teams instance. Because obviously we want our customers to be able to ask questions of their data, and that’s data that the large language model doesn't even have. And even if it did, large language models are not very good at knowing where this data has come from. So one of the key problems that we're looking to solve on both public and on Teams is just that kind of trust aspect of where does the data come from and can we be sure that this data is true. So the way that it works is, say they say, “How do I add to a list in Python?” What we can do is a semantic search like Michael was just describing, and we can find the closest question that already exists in the particular dataset that we're searching to that. We then fetch back the answers to that question, and we've got various ways that we can choose which answers to pick out of the search results that come back. And then what we can do is we can take those answers which we know where they've come from and then we can construct a new prompt that says something like, “Here's the context. Here's a bunch of data I've got back from Stack Overflow. Please answer this user's question: How do I add to a list in Python?” And so then the large language model is going to use hopefully just the data that we give it to answer the question, and we know where that data has come from, so we can have more faith that that is the right answer and a safe answer, rather than using what the large language model reckons to be the answer. So that’s the key difference.

KM What I really like about Results Augmented Generation, RAG, is that LLMS kind of offer these two features. One is just semantic understanding of text, and also pretty impressively for just regular consumers, a relatively large repository of knowledge. And so they've offered these two features, but if we just lean on them for the knowledge repository, then we're prone to hallucinations, we're prone to all these other problems. And what RAG allows us to do is just take one of those features at a time. Just say, “I just want the engine. I will bring my own knowledge repository here. I'm not going to lean on you for the knowledge repository.” That seems like a really hard job to put on one piece of technology to know everything, all know the stuff that our people are asking about. One challenge that that sometimes presents though is this last mile problem, where if a user searches “How do I make a button blue?” and we have a Stack Overflow post that is semantically kind of similar to that, “How do I make a button red?” we're going to surface that post and say, “Oh, that's pretty darn similar. Probably what you're doing is setting the background color in HTML or something. Here’s how to style it. Here’s how to use CSS.” The answer is really darn close, but if we just summarize that result, it’s probably going to answer a slightly different version. Do we have any ways of solving that last mile problem of re-contextualizing the provided answers that we know are good quality sources, given the problem that the user is actually trying to face at this point in time? Do we do anything there to kind of take those results back and summarize them?

MF Yep. So there's a couple different ways that you can use RAG-style generation, and one that we're using on public platform is to say, “Oh, this is the results that we got back from our search. Combine these together and present them to the user.” Something else that we're toying around with, and we're doing some experiments right now, is that we give the language model a little bit more leeway and we say, “Hey, these are factual things. This is how you make a button red, it's a fact. But can you change this to address what the user actually asked for?” And what we're getting here is it's supposed to address that last mile question of, “This is the specific problem that was addressed. Can you make some changes?” What we want to investigate is if this introduces any extra risk or errors when you go and change what we're giving, because we're giving you a little bit more leeway than what we had before.

BP And Michael, that's kind of in a hidden prompt that the user doesn't see. But when the user asks a question and it goes through our system, there's sort of a long prompt that we've written in there that, as you said, is instructing the model to try to give back an answer structured in a certain way. Some of the interesting things I think you noted is that if you don't know the answer, and I'm not sure how it knows if it doesn't, but don't be afraid to say when you don't know or solve that last mile problem and try to synthesize a little bit to get back to what they would want– change red to blue. And I think a really important point you made also is, can you give this answer in a way that also includes the sources, because one of the things we're committed to with OverflowAI is continuing to have attribution and recognition for the community members who provided the knowledge in the first place. So maybe talk a little bit about that hidden prompt layer.

KM I have a hot take on hidden prompts. So I think we've taken on a lot of tech debt as an industry looking at LLMs, and let me back up and explain how. When we prompt an LLM right now there are kind of three roles that you can have. You can have a user role, which is the prompt. You can have an assistant role, which is the response based on that user prompt. But there's also, in a lot of LLMs, a system role that is not the assistant going back and forth. It's the system, it's the “You are a Stack Overflow GPT. You are trying to answer this problem.” And I think those system prompts are underleveraged to say, “Here are the rules that you must abide by. Here’s the rules of the game. Every single prompt that follows this should be scoped and defined within those.” The system prompt is something that the user has no access to. One of the tech debt pieces is that these have not been prioritized very high by existing models. When you send a system prompt, it's very easily overridden by the user prompt. So it doesn't do things like prevent against jailbreaking, and now there's this arms race in the user prompt where we take the user prompt and we wrap it and we say, “Here’s how I want you to respond to the user prompt.” But the jailbreaking problem is that somebody's going to say, “Ah, ignore the previous thing I said.” And all of that happens within the user prompt, and it's kind of created this arms race within there, whereas I think the long-term shelf stable solution for this is really strong system prompts where that gets to override individual instructions within the user prompt. So I think as an industry that’s where we're moving. I think the weights are being played with those systems in evolutions of LLM models where they will weight the system prompt higher. In the meantime, we have a lot of different back and forth on the user prompt and hidden prompts and trying to solve that issue of making sure we get out of the system what we want.

AW Yeah, I agree. These APIs are improving all the time, and when we were starting out with this project, what you are just describing is usually exposed via– certainly on OpenAI– it's via the Chat Completions API. But I think that wasn't even released until fairly recently because when we started there was only the raw Completions APIs, which is literally just string in, string out. There’s no structure to that. So as soon as you start copying user data in there, then it's essentially just your classic sort of SQL injection type thing like if you were building a PHP app in 1997 or something like that. There's no concept of creating a template that then user data can be safely inserted into. So we're starting to see that now. As you say, we’ve still got problems of just being able to simply get the prompt to ignore the system-level stuff. So I'm sort of hoping, like you, that the industry just moves forward because then actually a bunch of problems that we're facing will just kind of go away because people who are paid to solve those problems can solve them for us, and then we can just consume these new nice, secure APIs.

KM Exactly, right.

BP Yes, let's hope so. I mean, napalm Granny has gotta be one of my favorite things that I've learned over the last year as a heuristic for thinking about life, but I also have been seeing research coming out recently on jailbreaking saying to what degree is this an inevitable feature that's baked in and something we can't really engineer our way out of. One of the things that I wanted to ask each of you before we conclude is, in brief we gave a roadmap announcement at We Are Developers and there's a lot of exciting stuff there now. Folks can go to the Labs page and sign up to be alpha testers. Thinking about where we are now, what are you most excited for in the next six months to a year? What is it that you're working on now, and where do you hope we’ll be as this stuff continues to roll out to the public?

AW Well, I'm excited for a lot of it and this stuff is just changing all the time. As I say, over the last few months we've seen new models come out, new APIs come out. This stuff is transforming and the thing that's really exciting to me personally is just being able to ride this wave. We're seeing the industry transforming all around us and it's quite rare that I've seen a transformation like this come along where I've actually got to be building stuff at the same time that it's happening rather than catching up years later. “Oh, what was that that happened a couple of years ago? Oh, yeah. I remember this,” and finally getting a chance to play with these new toys. So I think for me, I don't even know what's going to happen next. This stuff is changing so rapidly. I don't even know what to look forward to, but I know that this stuff is changing a lot, so I'm just excited to be here for the ride.

BP Alex, I think you make a great point there. This feels like, “Oh, I just happened to be somebody who's working at an early internet company in 1996, or I just happened to be someone who's working at a mobile app company in 2007,” and so it's just exciting to get to be part of something that's really rapidly changing and is sort of infusing itself throughout the whole industry. And I think Stack Overflow, with our dataset, is really uniquely positioned to do something interesting here. So I agree. I feel kind of privileged to get to work on this stuff. Michael, how about yourself? What are you thinking about? What are you looking forward to in the next 6 to 12 months and what do you hope people will get to see out of Stack Overflow?

MF I think Alex put it really well. The industry is transforming in front of us and we get to be a part of this. I think that the future of Stack Overflow is going to be unrecognizable five years from now. And I think that over the next 6 to 12 months we’re going to put into practice a lot of the stuff that we’ve been playing around with and testing. I think that we're also going to innovate and find new things. It's difficult for me to guess at what's going to happen 12 months from now.

BP I like that. You won't recognize us. I will say reports of our death have been greatly exaggerated. By the time you hear this podcast, there will probably have been a blog post a bit about our traffic stats and what they really look like. So I hope you would join us in the journey because actually we have an opportunity for people to really interact with helping us shape the future of Stack Overflow the platform in a way that hasn't been true in a long time– trying out features, giving feedback, participating in research and all that good stuff.

[music plays]

BP All right, everybody. As we always do this time of the show, we’ve got to shout out a Stack Overflow user who came on and shared some knowledge and helped to save a question from the dustbin of history. Congrats to Pushpendra on your Lifeboat Badge, came in and saved a question with your answer. The question is, “Error: Invalid postback or callback argument.” Not really sure exactly what the question means, but if you had issues with postback or callback, Pushpendra has got an answer for you and has earned themself a Lifeboat Badge and helped over 30,000 people. So we appreciate you contributing some knowledge, the AI will make the most of it in the future. I am Ben Popper. I'm the Director of Content here at Stack Overflow. You can always find me on X –I hate saying that– @BenPopper. You can always email us, podcast@stackoverflow.com with questions or suggestions. And if you like the show, leave us a rating and a review. It really helps.

KM I am Kyle Mitofsky. You can find me on Twitter –still going to call Twitter– @KyleMitBTV. And as always, you can find me at Stack Overflow at User ID 1366033.

AW Alex Warren, Senior Developer. If you want to find me, actually no, I'm not on social media these days, so you can't.

MF My name is Michael Foree, Director of Data Science and Data Platform. You can find me on LinkedIn.

BP Thanks for listening, everybody, and we will talk to you soon.

[outro music plays]