The Stack Overflow Podcast

OverflowAI and the holy grail of search

Episode Summary

Product manager Ash Zade joins the home team to talk about the journey to OverflowAI, a GenAI-powered add-on for Stack Overflow for Teams that’s available now. Ash describes how his team built Enhanced Search, the problems they set out to solve, how they ensured data quality and accuracy, the role of metadata and prompt engineering, and the feedback they’ve gotten from users so far.

Episode Notes

OverflowAI is a GenAI-powered add-on for Stack Overflow for Teams that does the heavy lifting of discovering and distilling information into a coherent answer. It encompasses three modules: Enhanced Search, an upgraded search experience; Stack Overflow for Visual Studio Code, an IDE extension; and Auto-Answer App for Slack, which automates access to essential team knowledge. 

Read about why OverflowAI is a big step toward integrating GenAI offerings into knowledge communities and dig into what’s launching and why it’s valuable.

Connect with Ash on LinkedIn.

Big props to Stack Overflow user Jennifer M., who earned both a Great Question badge and a Famous Question badge by wondering “How to combine the sequence of objects in jq into one object?”

Episode Transcription

[intro music plays]

Ryan Donovan Maximize cloud efficiency with DoiT, an AWS Premier partner. With over 2,000 AWS customer launches and more than 400 AWS certifications, DoiT helps you see, strengthen, and save on your AWS spend. Learn more at doit.com. DoiT– your cloud simplified.

Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your host, Ben Popper, joined by the full content crew, Ryan Donovan and Eira May, and it is an exciting week. We are sharing with folks a lot of the details about OverflowAI. We announced this a while back at We Are Developers. We've been working on it and Alpha/Beta testing it, getting it into the hands of clients and partners to hear what they think, and now folks who are Teams customers can check it out. It's got a lot of different features that we're really excited to show off. So our guest today is Ash Zade, who is a product manager and was one of the folks integral to working through: What are we going to build? How are we going to build it? Okay, we've built it– now how do we make it great? So we're excited to have him on to discuss all these things. Ash, welcome to the Stack Overflow Podcast. 

Ash Zade Thank you. 

BP Eira, why don't you cue us off with just the big picture? What are we working on and why?

Eira May So we have been working on a way to combine the power of generative AI with the suite of tools and capabilities that we already have for clients in Stack Overflow for Teams. And so we're launching a new product that encompasses three modules that's going to make a pretty significant difference in terms of how people are engaging with the platform and the access that they're able to get to information that's validated by human sources and also AI-generated information. 

BP A combination of our knowledge management roots and our Gen AI future.

EM That’s right.

BP But Ash, when you joined the company, I don't necessarily know if this was on our horizon, but I think I remember you saying once that it was sort of a hard left turn into this becoming a focus. What have you been working on, and how do you view the big picture of what we've been trying to accomplish?

AZ So I think the big picture– and it took me a while to come to it– the big picture has not changed. We are still trying to solve for the same problems and take advantage of the same opportunities in the product, but now we have these superpowers to achieve things that before maybe weren't feasible, either technically or because the investment would have been so high. So overall, I think that's the picture. And like I said, it took a while to come there. I'm like, “What does this mean to us? What does this actually change? Do we have to redo the whole product?” And we basically ended up with, “No, no. Now we actually have these things that'll help us do the things we wanted to do earlier, and then there are some other things it's not really going to help with.”

RD I think, like a lot of companies we've talked to, everybody's been exploring how to get Gen AI superpowers into their product. Can you talk a little bit about how our thinking evolved and the directions we went in and how we ended up on the path we're on now? 

AZ It's a fun story. So Ben alluded to kind of a hard turn. This was four or five months after ChatGPT kind of blew up on the scene, and I was tangentially keeping an eye on it, but things were moving so quickly and I'm like, “Okay, when this matters, I'll know.” And apparently it was that day. So there was a bit of anxiety that there's this brand new thing, I know nothing about it, and we have to start using it. And I actually give a lot of kudos to my tech lead, Alex, who said, “This is an opportunity we can take advantage of. Right now, the expectation is, ‘Let's experiment, let's figure out what this thing can do and then let's overlap that with what we plan to do anyway.’” And so we basically started experimenting, and specifically the module we worked on is Enhanced Search, so we looked at it and we said, “Well, how can this help with search?” So we pointed it, I think, to the home improvement network on Stack Exchange as a data source, and then we just used ChatGPT. We used the free API and started to hit that data source, prompt it, and use that for search. And quickly we started to learn, “Oh, this is neat. It can actually speak to me and give me an answer in a language that I understand.” And then we also saw, “Oh, it's also making up a lot of things.” And so obviously it's important for search that we do the former and not the latter, and so we experimented quite a bit. We have a fantastic team of data scientists who along the way were explaining to us, or specifically to me, how this thing actually works, and I think a few of them have been on the podcast. It took them four or five rounds of explaining it to me for it to click, but once it started to click, the experiments plus their explanations kind of led us to, “Here are the four tasks that this thing is great for that we can use it for, and then here are some other areas you should probably avoid.”

BP I'd love to hear that, because we were recently talking with folks, and as you mentioned, there was a lot of experimentation, a lot of almost hackathon-style quick sprints to see where do we find value, where do we think we can move quickly with our existing tech stack without having to reinvent the wheel? And that really guided us in terms of, “All right, well, these are the opportunities we're going to pursue, at least for sort of the V1 of OverflowAI.” Talk to us about the feature set we settled on, which is what we announced earlier this week, and why those particular features either deliver the most value or whether it was a combination of value plus our ability to execute on them.

AZ I think that's accurate. The third thing I would add is: What are we already working on? So like I said, the problems hadn't changed, our roadmap generally hadn't changed, we just have new tooling. So the first one– enhanced search. Obviously, search is a really big part of our product. And if you think about it, when someone has a question, they're really searching for an answer, so search is key there. And what leads them to posting a question is if they don't find an answer. So to me, search is the entry point. Now, I'm biased because that's the product portfolio I own, so search is obviously the most important thing despite my bias. So search is the entry point, so search has to work really, really well, and this is something we were already working on. So that's split into two initiatives. One we call ‘improved search,’ which is exactly that– let's make sure search returns the most relevant results. So that's number one. And then what we can do once it returns results, let's enhance the experience once you have results. And so we split off into different initiatives, like I said, improved search. And one of the big things I learned from our data scientists is, the experience with something like Gen AI writing out an answer or providing you an answer to a question, the majority of it comes down to a really good search system, the most relevant search you can have, because that's the first step. If you get that wrong, it's going to answer a question you didn't ask. It's not going to be relevant. So it was really important to get the search piece right, so one team started working on that, and then on the enhanced search side, luckily we had a lot of great research of, what does it mean to look for an answer? What are those steps involved? And then we looked at AI and thought, “Well, how can we improve that?” So tangibly, when you have a question, generally you open a search box, let's just say Google. You put in a phrase, you'll open the top, I don't know, two, three, maybe five, it depends how much of a crunch you're in. Generally, I'm lazy, I open the first thing and cross my fingers like, “Hopefully this is it and I don't have to read more things,” but generally, you have to open a few tabs. And what ends up happening is, you end up reading a few different answers written by a few different authors, different writing styles, different levels of detail, lots of differences. And in your mind, you have to aggregate this into one answer, and that's if you're lucky. Generally what happens is that one of these answers has a section that you don't understand, which then you have to open a new tab and search for that part and work your way down to understand the fundamentals of what's in the answer and then work your way back all the way up. And so we have this documented, we have this process mapped out, and we said, “Okay, where can we shortcut this?” And then this is where the enhanced search piece came in. So search is going to give us really great results, and the first thing we're going to solve for is how many results do we look at? Traditional search, it's up to the user and how much of a time crunch they're in, their tolerance for reading. I'm lazy, I always do one and I hope that's the thing, and if it's not, then I go two, three, four, five. But the risk there is that you're going to miss out on a high quality answer if it's number four, if it's number three. And so we decide now what's relevant. We have the tech for it. 
So part of improved search is semantic search where we can say, “This content semantically matches your question to this degree,” and there's a scoring system. So we will determine the most relevant content, and then what we'll do is we'll take all the answers, and they're going to be written by different folks, different ways, all those things, and we're going to give it to AI and say, “Using this source content, rewrite it in this way.” And now we've shortcutted all those steps, and now you're coming to the enhanced search piece. 
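
To make the semantic-scoring idea concrete, here is a minimal sketch of how a query and candidate posts might be compared with sentence embeddings and cosine similarity. The embedding model, the relevance cutoff, and the example documents are illustrative assumptions, not details of Stack Overflow's actual implementation.

```python
# Minimal semantic-scoring sketch: embed the query and documents, rank by cosine
# similarity. Model choice and the 0.8 cutoff are assumptions for illustration.
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_scores(query: str, documents: list[str]) -> list[tuple[str, float]]:
    """Score each document against the query; 1.0 means a perfect match."""
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    doc_vecs = model.encode(documents, normalize_embeddings=True)
    scores = doc_vecs @ query_vec  # cosine similarity, since embeddings are normalized
    return sorted(zip(documents, scores.tolist()), key=lambda pair: pair[1], reverse=True)

# Keep only results that clear the relevance cutoff.
ranked = semantic_scores("How do I implement SSO?",
                         ["Setting up SAML SSO for our apps...", "Office Wi-Fi password"])
relevant = [(doc, score) for doc, score in ranked if score >= 0.8]
```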

RD We had a technical deep dive into our enhanced search a few months ago and I think that was an earlier version of it. How has our thinking changed in terms of reordering the k-nearest neighbors, how we combine answers, how we determine what that perfect relevant answer is?

AZ So the experience today is not going to be the same experience our customers and users have 3, 6, 9, 12 months from now because enhanced search, this iteration, is our best guess. So as product builders, tool builders, engineers, data scientists, we've guessed that if we combine them this way, if we prompt AI to rewrite it in this level of detail with this kind of formatting and this level of sourcing and attribution, we're guessing we're going to get you to that really high quality answer in one step, but what we depend on is user feedback, and this is true of every AI system that's doing something similar, and this is why you'll see things like thumbs up, thumbs down. So as more customers and users use enhanced search and give us feedback, we're going to look at that feedback in two different ways. And I'll be specific about the type of feedback– it asks for things like how accurate was this answer? How detailed was it? What about the length? So all these feedback dimensions that then we will use to pull on some levers. Some of those levers are going to go to improved search, so this is the accuracy piece. The prompt isn't really responsible for accuracy, the prompt is the person just writing from the source data. The source data is dependent on improved search, so accuracy will feed into that initiative. The other feedback dimensions are going to feed to product design, engineering, and data science where we can pull on some levers. So if we're getting, “Thumbs up– length works for me,” we know we've kind of solved for that. If we get, “Thumbs down– level of detail,” then we start to investigate and we start to look if we are providing too much, not enough? My gut feeling is not enough because if you're not getting your answer and you're having to open the sources– and we track these events, so we see this flow. Someone does a search, they get an enhanced search result, they give us a thumbs down, and then they open up the sources. So we can look at this and start to infer a few things and then just rapidly iterate. Let's play with this length and watch these charts. Ideally, everything goes up and to the right, so thumbs up and everything's going well. But really, that's how we're going to improve this thing with the eventual goal being that instead of people on this side pulling on some levers, it feeds back into machine learning models. And tangibly there, something like folks doing thumbs up accuracy, “Oh, well then this content we brought up, let's weight it really well.” Thumbs down accuracy, “Whatever we source, let's kind of bump it down in the results.” And over time, it'll just fluctuate and get better. 
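
As a rough illustration of the feedback loop Ash describes, here is a hypothetical sketch in which only accuracy feedback flows back into ranking, while other dimensions would go to prompt and product tuning. The field names, boost values, and blending formula are assumptions made for the example, not Stack Overflow internals.

```python
# Hypothetical feedback loop: thumbs-up/down on the "accuracy" dimension nudges a
# per-document boost that is blended with the semantic score at ranking time.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class SearchResult:
    doc_id: str
    semantic_score: float  # 0.0 to 1.0 from the semantic search step

feedback_boost: defaultdict[str, float] = defaultdict(float)

def record_feedback(doc_id: str, dimension: str, thumbs_up: bool) -> None:
    """Only accuracy feedback flows into ranking; length/detail go to prompt tuning."""
    if dimension == "accuracy":
        feedback_boost[doc_id] += 0.05 if thumbs_up else -0.05

def rerank(results: list[SearchResult]) -> list[SearchResult]:
    """Order results by semantic score plus whatever boost feedback has accumulated."""
    return sorted(results, key=lambda r: r.semantic_score + feedback_boost[r.doc_id],
                  reverse=True)
```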

BP I love that we might be moving to that machine learning-assisted iterative improvement where the data can just evolve the app as the feedback flows in. You mentioned at the beginning hallucination, and a lot of what you were focusing on here was, “Okay, how many questions do we have to look at? Is it something that the reader is going to be able to easily understand?” I guess one thing I'm curious about is, how did we go about improving accuracy? Did we switch from one approach to a more retrieval augmented generation system where we said, “When you get your answer, take it only from accepted Stack Overflow questions and make sure you're only using things that you can have a citation for.” From the very beginning, what was the sort of roadmap or what was the path we followed to improve accuracy?

AZ So there's a few things going on. One is the chain of operations of, how do we get to the point where the answer is written? I can't take credit for any of this, by the way. I work with so many smart, intelligent people who have made these awesome choices that have so many positives beyond accuracy. I'm sure we'll get to privacy and those kinds of things, and this architecture kind of addresses both. So one of the things I learned from our data science team was to treat AI as task-based. By that I mean, don't give it ten things to do at once. Just like people, especially myself, I'll just speak about myself, I am not great at multitasking. Give me one thing, I will do that, then I will do the next thing. If you give me three things, I'll try to do them, but something's going to get messed up in there, maybe. So treat AI the same way. Give it really specific tasks. Now, you can have ten things you want it to do, but give it to the system one at a time. And so this is how our flow is set up. When you do a search in Stack Overflow for Teams using Enhanced Search, this is the chain of operations. We do a very traditional search using our new improved search system. So we're not using the LLM yet. We do a search, we have improved search which returns results which are even more relevant today than they were before, then we take the top results– and we've defined ‘top.’ And this is one of those things we depend on user feedback on the accuracy, but we've made a really good guess on what ‘top’ is, and ‘top’ is just based on semantic scoring. So we say, “Take up to five results of anything scoring higher than,” I'm just going to make it up, “0.8.” A perfect match is 1.0. If you find five things or more, take the top five of 0.8 or higher, so the most relevant things. So that's step one, no LLM. The next thing we do is, for each one of those pieces of content returned, look at the answers, pick anything that's accepted or top-upvoted, and if you find a question that has multiple answers that are upvoted, use them all. Because what we've also learned is that sometimes, especially myself and I'm going to be very candid, when I get stuck on things I'll open Stack Overflow, I'll read the first accepted top answer. I'll go, “I don't really understand that. What about you, second answer? Is there something I can copy and paste? What about you, third answer?” So human behavior, I'm looking at the top answers anyway, more than one. And I'll go on a tangent later about how we know that's the best approach versus a different one, but we take the top answers. Still no LLM. We're just trying to figure out what content we give the LLM. Once we've done that, then we go to the LLM: “Per question returned, summarize the answer or multiple answers into one answer per question.” So in this example, let's say I've searched for, “How do I implement SSO?” You're probably going to get a lot of results. So let's say I have three Q&A pairs. For each one of those, we group each of their answers together into one summary. So we have one summary, one summary, and one summary, so now we have three summaries for three questions. Then we take those and then we combine them into one answer. Rather than just throwing this pile of data at it and saying, “Do what you need to with this,” we summarize individually so it maintains accuracy and then we summarize together. And all through this we source it, so when we bring it back to the end user, it says, “Here's what I found in this question. 
Here's what I found in the second question. Here's what I found in the third question,” but it's all comprehensive. And what's really neat is it flows well. It's conversational. So it's as if you had someone next to you explaining to you what it found. And then therefore we're only using the LLM for two things: summarize per question and then summarize across questions. 
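
Sketched as code, that chain of operations might look something like the following. The search_teams() and summarize_with_llm() helpers are hypothetical placeholders for the improved-search backend and the LLM call, and the 0.8 cutoff and top-five cap echo the made-up numbers from the conversation rather than real configuration.

```python
# Rough sketch of the chain of operations: semantic search first (no LLM), filter to
# the most relevant questions, keep accepted/upvoted answers, then call the LLM only
# twice -- once per question, once to combine. search_teams() and summarize_with_llm()
# are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Answer:
    body: str
    is_accepted: bool
    upvotes: int

@dataclass
class Hit:
    question: str
    answers: list[Answer]
    score: float  # semantic relevance; 1.0 is a perfect match

SCORE_CUTOFF = 0.8  # illustrative, echoing the made-up number above
MAX_RESULTS = 5

def enhanced_search(query: str) -> dict:
    # Step 1: improved (semantic) search -- no LLM involved yet.
    hits: list[Hit] = search_teams(query)  # hypothetical search backend
    top = sorted((h for h in hits if h.score >= SCORE_CUTOFF),
                 key=lambda h: h.score, reverse=True)[:MAX_RESULTS]

    # Step 2: per question, keep the accepted and upvoted answers -- still no LLM.
    summaries = []
    for hit in top:
        good = [a.body for a in hit.answers if a.is_accepted or a.upvotes > 0]
        # Step 3: first LLM task -- one summary per question, from source content only.
        summaries.append({
            "question": hit.question,
            "summary": summarize_with_llm(  # hypothetical LLM wrapper
                good,
                instruction="Summarize these answers into one answer, using only this source content."),
        })

    # Step 4: second LLM task -- combine the per-question summaries into one sourced answer.
    combined = summarize_with_llm(
        [s["summary"] for s in summaries],
        instruction="Combine these summaries into one coherent answer, citing each source question.")
    return {"answer": combined, "sources": [s["question"] for s in summaries]}
```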

RD How do we ensure those summaries are accurate? 

AZ Part of this has to do with the prompting. So we say, “Only use the source data, don't inject anything,” and we set the temperature to zero. So zero creativity, be boring, don't inject anything in there. And there's accuracy and consistency. Let's break it down a few ways. There's consistency– if I search for how do I integrate SSO, if you search it, if Ben searches it, we should get a consistent answer. Now if you treat AI like a person, it may not use the exact same phrasing, but it should be accurate. So we want consistency of accuracy, but it won't necessarily give you word for word the exact same answer. That's just the nature of AI. Then there's consistency across how do I integrate SSO versus how do I implement SSO, so there's going to be differences in the query. So consistency is a challenge, but I think we have to redefine consistency a bit too. And so accuracy with the prompt, we basically say, “Only use these things. Don't inject. Your temperature is zero,” and we found pretty good success there. And also the earlier steps where we're only telling it, “Here are the three things to use,” we're limiting the potential for hallucinations and we found a lot of success there. 
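
A minimal sketch of those prompting constraints, using the OpenAI Python client as one example provider, could look like this. The model name and the exact prompt wording are assumptions; the relevant points are temperature set to zero and instructing the model to use only the supplied sources.

```python
# Minimal prompting sketch: temperature zero and an instruction to use only the
# provided sources. The OpenAI client and model name are example assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize_from_sources(question: str, sources: list[str]) -> str:
    source_block = "\n\n".join(f"Source {i + 1}:\n{s}" for i, s in enumerate(sources))
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative model choice
        temperature=0,         # zero creativity: be boring, don't inject anything
        messages=[
            {"role": "system",
             "content": ("Answer using ONLY the provided sources. If the sources do not "
                         "contain the answer, say so. Do not add outside information.")},
            {"role": "user", "content": f"Question: {question}\n\n{source_block}"},
        ],
    )
    return response.choices[0].message.content
```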

BP One thing I'm really curious about is, in this version of search, there's metadata that you mentioned– look at accepted answers, look at the ones with the most votes, and don't necessarily limit yourself to just the top answer because if you take the top two, you might have even more context. We have this other sort of metadata within Stack Overflow: recency, comments, and tags. Do those play any role? 

AZ This is going to be a yes and no answer. So I mentioned a tangent earlier. We have another module for OverflowAI, an IDE extension in VS Code. There, intentionally, we're only taking the top accepted answer, and there's a few reasons. One is, on the product side, look at this, we get to A/B test. This is great. We get to see, is it better to use the top answer or the top answers? That's one way we're going to decide over time which is better, based on how users respond and the feedback they give. The other one was on the metadata and how we use it. This is where the machine learning re-ranker part will come in. One of the other things we did early on– again, kudos to data science and engineering. This was not my call, I'm not going to take credit, but it's a great call that was made. Let's limit the variables here. Let's start with what we need before we jam in all these things to come up with an experience. And by that, I mean having a re-ranker on top of improved search with enhanced search and all these things adding into it, it's very difficult to understand what is impacting what in this experience. And so you need really good search for enhanced search, so we’ve got to do that. We're getting good results back. Let's wait for some feedback and look at the data. And then let's ship, let's get folks using this, let's get that feedback loop going because it might be good enough. Maybe we don't need to rely on these things. With traditional search and traditional models, one thing we had to unlearn was that this is traditionally how search works. In this new world, does it have to work that way? Well, let's find out. Let's implement what we need to. Let's get it to an experience that we like and we'd be proud of. Let's get some feedback that it is solving for the problems I mentioned earlier, on getting you a valid answer as quickly as possible and saving you time. Let's see how much progress we've made there, get some feedback and then start to implement some of that metadata potentially into our re-ranking model. My gut feeling is that we will get there, but let's learn along the way. 
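
As a hypothetical illustration of that A/B comparison, one arm could use only the accepted answer (as in the VS Code extension) while the other uses all accepted or upvoted answers (as in Enhanced Search). The bucketing scheme, the question object, and the log_event() helper below are assumptions for the sketch.

```python
# Hypothetical A/B sketch: bucket users deterministically, select answers per arm,
# and log which arm produced the result. question.answers and log_event() are
# assumed stand-ins for real objects and telemetry.
import hashlib

def experiment_arm(user_id: str) -> str:
    """Bucket users so each one always sees the same variant."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 2
    return "accepted_only" if bucket == 0 else "all_top_answers"

def select_answers(question, user_id: str) -> list:
    arm = experiment_arm(user_id)
    if arm == "accepted_only":
        chosen = [a for a in question.answers if a.is_accepted][:1]
    else:
        chosen = [a for a in question.answers if a.is_accepted or a.upvotes > 0]
    log_event("answer_selection", user_id=user_id, arm=arm, count=len(chosen))  # hypothetical
    return chosen
```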

RD Now that some of these search projects are out in the wild, what kind of feedback have you gotten? Do people appreciate this? Do people miss knowing the esoteric ways to get the right keyword in? 

AZ It's been very interesting in that way. I'd love to say it's been such a great success that I'm being carried around on everyone's shoulders daily, and the team too, but I think it's two things. We can look at it and intuitively say, “This is a better experience. I don't have to wade through a traditional linear list of results. I don't have to make all these judgments. This other thing is going to make these calls for me. And it's in code, which is supposed to be smarter than my human brain.” So intuitively, we all think this is the way to go. In some cases, it is definitely a better way. In other cases, we still have some work to do. So, for example, if I ask, “Are we SOC 2 compliant?” Immediately, it comes back and says, “Yes, we are.” It's really easy. These are typical things. Yes/no questions, when did this feature come out, even going to the point of, what is supported by this API? Those kinds of queries and searches work really well. Where we have some work to do is things like, “How do I implement SSO? Give me a code snippet.” Actually, the answer is not just a yes or no or a list of things, it's step by step by step. And the reason I say that we have some work to do is we just need to collect more data. The yes/no questions are very easy to run tests on and gather user feedback for, but to have someone in their IDE query for code snippets, and in this example, this is going to internal Stack Overflow for Teams, so this isn't public data. That's why it's a bit more difficult. You're going to your own internal company's data, you're asking for a code snippet. The developer has to do that, get the code back, implement it, go through the development pipeline, and then they'll know, “Hey, this helped or not.” So there's a lead time there. But intuitively, it's definitely working. Also it's enabled us to do other things. So I'm talking about enhanced search, and I'm assuming everyone's kind of imagining it in a web browser, but what we're able to do as well, because search is now semantic, is improve the experience in chat ops. So we're doing things in Slack and MS Teams that are maybe even more exciting than enhanced search. So in Slack specifically, we created a module called Auto Answer, and it's really simple, but it's really valuable. There are specific channels, support channels, services channels, where team members will go and ask questions. “I got this error, what should I do? Are we SOC 2 compliant? Have we released this feature?” And it's generally a support team addressing these things and they’re repetitive questions. And traditionally, you would try to document these answers somewhere. Hopefully there's behavior change where folks are going to search there before they ask here, but we know that's not happening. I'm in Slack all day. I've gotten my hands slapped. “Did you check Stack?” You're right, I should have done that. Behavior change is difficult. Auto Answer is great because when someone posts a message in one of these channels, we do a search right away. It does not rely on people having to remember a command. It does not rely on support members saying, “Did you search first?” We do a search right away, which we were always able to do, but what improved search unlocked is that we can search semantically. In Slack, nobody types with keywords. Nobody goes in a channel and just types, “SOC 2 compliance.” Everyone would say, “Are you finished writing your message? Did you hit enter too early?” But you can go in there and conversationally ask a question, and semantic search supports it. 
And so it has unlocked that for us as well, where we immediately return a high-quality result. You get your answer instantly within Slack, within MS Teams. So there's a lot going on here in how it's working, and I think the unlocks in other channels are even more successful than just improving search on the web. 
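
A hypothetical sketch of that Auto Answer flow, using Slack's Bolt framework for Python, might look like the following: every message posted in a monitored support channel immediately triggers a semantic search, with no command for anyone to remember. The channel IDs and the enhanced_search() helper (imagined here as the pipeline sketched earlier) are illustrative assumptions.

```python
# Hypothetical Auto Answer sketch with Slack's Bolt for Python: any message in a
# monitored channel triggers a search immediately; no command to remember.
# Channel IDs and enhanced_search() are illustrative assumptions.
import os
from slack_bolt import App  # pip install slack-bolt

app = App(token=os.environ["SLACK_BOT_TOKEN"],
          signing_secret=os.environ["SLACK_SIGNING_SECRET"])

SUPPORT_CHANNELS = {"C0123SUPPORT"}  # channels where auto-answer is enabled

@app.event("message")
def auto_answer(event, say):
    if event.get("channel") not in SUPPORT_CHANNELS or event.get("bot_id"):
        return  # ignore other channels and bot-authored messages
    result = enhanced_search(event["text"])  # conversational query; semantic search handles it
    if result:
        say(text=f"{result['answer']}\n\nSources: {', '.join(result['sources'])}",
            thread_ts=event["ts"])  # reply in a thread under the question

if __name__ == "__main__":
    app.start(port=3000)
```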

RD So what unlocks do you see for the future? What are the things you can talk about that are sort of the Fantasy Island things that are coming next, or that you hope are coming next?

AZ Well, I'd love to talk about it. I don't know if it's coming across, but I'm genuinely excited about all of this stuff. It's very easy for me to talk about. So with the goal of getting you to the highest quality, most validated answer as quickly as possible, part of this puzzle, besides everything I've spoken about, is that parts of the answer already exist somewhere else. What if we can get those pieces of an answer from your other tools, bring them back, and tell AI to also check GitHub, also check Jira, check Confluence. Check all these other places that you may have, see if there's anything helpful there. To me, the Holy Grail of search is that every company has content and answers already somewhere. The problem is, who knows where that answer is? So generally, you go to your people. I have specific people in Slack. When I'm stuck on something, I'm like, “I know who would know where this thing is.” Well, now we're going to try to replace this ‘who’ by using these LLMs and something like enterprise search.

BP It's this double win of instead of tiring out your subject matter experts, you have this universal librarian, and also as good as your SMEs might be, they're not sitting across all of the information which can be siloed in these different channels and might've been contributed by teams that they're not even aware of, right?

AZ That's exactly right. There's so many benefits here. So one is, SME, you do the thing that's more valuable than answering my question again. Also, let's catalog all your knowledge. It's not lost. Whereas traditionally you think, “Okay, it's out there somewhere. We're not going to get all that value from it.” Well, now you can. And also what it means to people is that your contributions everywhere can get recognized. You left a comment in your source code. You made a comment in a PR that's helpful. Guess what? We're going to bring that back and other people will get value from it, not just your immediate team. That's the Holy Grail. 

RD I remember at my previous job trying to get people to use a central system. We took a survey and asked, “How do you find knowledge? How do you find answers?” And two thirds of the people were like, “Slack.” And we had a wiki, we had a Q&A product, so being able to centralize those in an actual single place sounds kind of amazing. 

BP Yeah, for sure. 

AZ And I've been talking a lot about people who have questions and how they get answers. This also has a really positive effect on the people who have answers. So when it comes to behavior change, we've asked these SMEs, “When you answer a question, go over here and answer it.” But it's difficult to do that because I'm probably going to get bugged on Slack anyway. Or if I answer it over here, it's going to be out of date. I don't want to have to maintain all these things. Well now with things like Auto Answer, if you create the content, we the product builders will make sure people see your content, and those people don't have to remember any tools. We will make sure to surface your content. Your content will get more value, more views, and you'll get bugged less. So we talk a lot about consumers of content, but creators, these SMEs, we have them in mind as well, and this is one of the principles from day one. How can we improve their lives as well? They're just as important as the consumers, the folks with questions. 

RD Yeah, man. Let your senior engineers do the engineering. They answer questions all day.

[music plays]

BP We love it when people come onto Stack Overflow and share a little knowledge or express a little curiosity that leads someone to leave a great answer. A Great Question Badge, awarded to Jennifer M 12 hours ago, “How to combine the sequence of objects in JQ into one object.” Thanks for asking, Jennifer. Congrats on your badge. 70,000 other people have benefited from your curiosity. As always, I am Ben Popper. I'm the Director of Content here at Stack Overflow. You can find me on X @BenPopper. If you want to come on the show, you're a software developer, you work in the industry, hit us up, podcast@stackoverflow.com. We bring on guests, we take suggestions, we just listen. We're here to listen. And if you like today's show, the best thing you could do is leave a rating and a review, or go check out the new offerings we have with OverflowAI. We'll put some links in the show notes. 

RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. You can find it at stackoverflow.blog. And if you want to reach out to me with your musings on whatever technical topic, you can find me on X @RThorDonovan. 

EM My name is Eira May. I am a writer and editor at Stack Overflow also, and I am on twitter.com @EiraMaybe. 

AZ So I'm Ash Zade, I'm a Product Manager here at Stack Overflow for Teams. You can find me on LinkedIn. I'm generally not on much social anymore, but I'm on LinkedIn. You can find me there at Ash Zade. And I would love for folks to try out OverflowAI and start giving us that feedback that we've been waiting for. 

BP Yes, feed the beast, make it smarter. We can't wait. All right, everybody. Thank you for listening, and we will talk to you soon.

[outro music plays]