The Stack Overflow Podcast

How Stack Overflow is partnering with Google to encourage socially responsible AI

Episode Summary

Ben talks with Ryan Polk, Chief Product Officer at Stack Overflow, about our strategic partnership with Google Cloud, the importance of collaboration between AI companies and the Stack Overflow community, and why Stack Overflow’s Q&A format is so well suited to training AI models.

Episode Notes

Stack Overflow has teamed up with Google Cloud to develop an API—Overflow API—to give Gemini, Google’s AI model, access to Stack Overflow knowledge communities. 

Learn how Ryan’s team is working toward socially responsible AI.

Connect with Ryan on LinkedIn.

Stack Overflow user verygoodsoftwarenotvirus earned a Great Question badge by asking something at least 87,000 people have also wondered: How can I get all keys from a JSON column in Postgres?
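
For anyone curious about the question itself, here is a minimal sketch of one common approach, with a hypothetical table (events), column (payload), and connection string standing in for real ones:

```python
# A minimal sketch: expand each row's top-level JSON keys with Postgres's
# jsonb_object_keys() and de-duplicate with DISTINCT. Table name (events),
# column name (payload), and connection details are placeholders.
# For a plain json column, the equivalent function is json_object_keys().
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT DISTINCT jsonb_object_keys(payload) AS key "
        "FROM events ORDER BY key;"
    )
    keys = [row[0] for row in cur.fetchall()]
print(keys)  # e.g. ['created_at', 'status', 'user_id']
```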

Episode Transcription

[intro music plays]

Ben Popper Maximize cloud efficiency with DoiT, the trusted partner in multicloud management for thousands of companies worldwide. DoiT’s innovative tools, expert insights, and smart technology make optimization simple. Elevate your cloud strategy and simplify your cloud spend. Visit doit.com. DoiT– your cloud simplified.

BP Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I am Ben Popper, Director of Content here at Stack Overflow. Today, I am very excited to be joined by Ryan Polk, who is our new Chief Product Officer. We're going to be chatting about some of the big announcements the company has made, as well as some of the stuff we've got on the roadmap for 2024. So Ryan, welcome to the Stack Overflow Podcast. 

Ryan Polk Thanks! 

BP When we start out, we always ask folks to give us a quick flyover. How did you get into the world of software and technology? 

RP Well, I started really early as an engineer. My career began in software development, although I actually started writing code as a teenager in the ‘90s, so I'm dating myself a little bit there, but I'm willing to live with that.

BP That's all right, plenty of ‘90s kids on here. Were you writing code for a web forum or cracking games or selling web design? What were you doing in the ‘90s? 

RP Well, I started off in C++, mostly dabbling and playing around with stuff. I got really into games, and in the early 2000s, a little bit of botting– I'll admit I did a little bit of botting inside the games I was playing. Shortly after college, I moved into software development full time. I worked for a bank writing internal code there, and I’ve worked in all kinds of different industries: I worked for a winery for a couple of years, worked for a bank, and worked in the gaming area for a while writing back-end software for slot machines at WMS Gaming, a company that came from Midway Games. I progressed through the ranks, constantly moving up as a manager of engineering. I tried to hold on to my coding skills over the years, but they tend to fade over time. I moved into product later in my career, about 15 years ago, and then kind of bounced between the two. I’ve been Chief Product Officer for multiple companies, including Rally Software in the Boulder area, and also Chief Product and Technology Officer for Carbon Black, a cybersecurity firm in the Boston area. I stepped out for a while and actually got to work in the VC industry for a couple of years. I really wanted to learn more about the other side of the business– how companies get funded and how you build the full strategy of a business from the funding side. But I never thought I was going to stay on the ‘retire into VC’ side, so I wanted to get back into the game and work with companies directly, and I joined Stack about six months ago. I’m really interested in the community– really interested in how we could reinvigorate and start to grow the community in the world of AI– and that's what drew me here, really out of passion. I wanted to become a part of this discussion and help a community that's been growing for 15 years. And I'm excited to be here and excited to see where we take the strategy overall.

BP If you don't mind me asking, were you ever a Stack Overflow user, or are you one of the many lurkers out there? 

RP Oh, yeah. Well, I will admit I've been a user for a very long time, but I've been more of a lurker– as in, I’ve found a ton of value in the community, but I didn't do as much contributing. And that's something I’ve been looking at– the behavioral influences– trying to figure out how we can get more people like me contributing.

BP Exactly. All right, so let's get right to some of the big news because, for all of us here at Stack Overflow, it's very exciting. It's also one of the first things where you're really stepping out and writing on our blog, laying out some of the principles you're trying to develop for the company. Last week we announced a partnership with Google: data from Stack Overflow will be used inside their Gemini model to help it train and get better at things like, for example, the pair programming that goes on in Duet, or helping folks who are asking questions in Google Cloud. Tell us a little bit about how this partnership came to be, and a little bit about those principles you laid out in your blog post– how we do this in a way that's respectful to the community, and hopefully not just respectful but beneficial, in that there can be a virtuous circle between the AI companies and the folks who provide the community knowledge they're training on.

RP Absolutely. We'll get into the details on the Google discussion as well, but I'll start at the beginning, with what's driving the conversations and the shifts we've seen in the market over the last year or so as LLMs and AI models have become much more prevalent. The first thing, of course, that we saw was an impact on our community as companies launched their initial models. And that's to be expected– people are looking for quick answers to things. They’re looking for a quick solution, and that's something that we, of course, have to take into account as we build our community. How do we make it easier for people to find answers and interact with the answers that they're getting? Another thing that we're seeing is that there's a ton of different models being produced out there. Everybody and their brother is creating an AI model right now, and so what we're finding is that we're at the center of a lot of those discussions. If you want to create a technology/code generation model, our community is incredibly valuable to that. On top of that, we're also seeing that the LLM producers are shifting in their mindset. They're seeing that they have to continue to support the communities that are providing them information, that are helping them train or helping them learn. Let's put it in that context of learning: just as a human user or community member of our platform is looking to learn, so are these models, based off of that overall community data set. What's also interesting there is that the requirement for proof of your work– how do we prove the work, how do we actually look at the answer that's being given and at the information behind it– is also becoming key. Everybody needs to be able to show how they came to that answer, even me. If you ask me a question and you don't fully trust my answer or you don't fully trust me, which you shouldn't– you shouldn't trust any one person you ask– you should be checking my sources, you should be checking where I learned this. What is the understanding that I've come to, and how did I derive that? So it's a simple human behavior.

BP One of the first things we did when I got to Stack Overflow was run a story about research some academics had done on the security risks of blindly copying and pasting from Stack Overflow. Not to say you can't learn a lot, or that you can't find the solution to your problem, but that doesn't mean you just copy/paste and move on.

RP And so the same problem persists– what do you do when you don't get a full answer, or you don't get an answer you believe, or the answer doesn't really solve your problem? Where do you go? Well, of course, you come back to us. You come back to Stack, you interact with the community. If you can't find the perfect answer there, you ask, and the community is there to help you solve that problem overall. But it's a slower process. It's something that takes a little bit of time, it takes a little bit of interaction, and so you can see that pull to, “I'll go over and just get a quick answer and then if I don't get what I like, I'll go to the community.” Well, we want to figure out ways to, of course, build that interaction back into the community and make sure that people are providing back as well. And I want to make sure that the AI models, the LLMs, are giving back to that community as a part of that process. Another thing that we're seeing is that a lot of these major players are moving into enterprise environments where they're providing foundational models or they're providing full services to enterprises. Well, the enterprises absolutely require them to show their work. They absolutely require a good chain of custody around the data that they've used to generate their models and so forth, so that's a huge influence on the LLM providers as well. 

BP We had a conversation last week about this with someone whose whole job is to do due diligence before an IPO or an M&A, and if you can't speak to the provenance of your code, that's going to be a big thorn in your side trying to get through that. So if you can say, “Okay, the Gen AI assistant, the code gen assistant, helped me get to the solution here, and here's the Stack Overflow resource it relied on,” that gives you a big leg up in that regard.

RP And that's always been a requirement for our community. As part of our licensing, the need for attribution is absolute. Any provider needs to be able to build into their models, whether it be a RAG-style model or something else, the ability to show at least the most relevant information as to why an answer was generated. This is something that we've been working with our partners on and looking to integrate into the software solutions they're building on top of their models. Google is a great partner in this; we've sat down and worked with them on it. As they launch the Duet product– I believe their branding is shifting– we wanted to work with them not only to provide attribution and that capability, but also to integrate our community into their system. So when people are looking for answers, they can stay in their Duet or Gemini console, depending on what the name is going to be going forward, and they can interact. But when the model doesn't provide the right answer, the ability to go the next step needs to be built into that interaction, into their IDE– to actually work with the community, interact with the community, ask questions, do research, and be a part of that community from your console. We bring the community right into the IDE for the developer, which is one of the key factors in this relationship: making it easy for people to interact with the community overall.
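
To make the attribution idea concrete, here is a rough sketch of how a RAG-style flow can carry source URLs alongside a generated answer. The retrieve and generate callables are stand-ins for a real search index and LLM call, not any particular vendor's API:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Post:
    title: str
    url: str
    body: str

def answer_with_attribution(
    question: str,
    retrieve: Callable[[str, int], List[Post]],  # stand-in for a real search index
    generate: Callable[[str], str],              # stand-in for a real LLM call
    k: int = 3,
) -> dict:
    """Retrieve relevant posts, generate an answer from them, and keep the source
    URLs alongside the answer so it can be attributed back to the community."""
    sources = retrieve(question, k)
    context = "\n\n".join(f"[{i + 1}] {p.title}\n{p.body}" for i, p in enumerate(sources))
    prompt = (
        "Answer the question using only the numbered sources below, and cite the "
        f"numbers you relied on.\n\n{context}\n\nQuestion: {question}"
    )
    return {"answer": generate(prompt), "sources": [p.url for p in sources]}
```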

BP I've been at Stack Overflow for about five years now, and one of the running jokes– one that also has a grain of truth in it– was that every developer, every day, has a few tabs open: their IDE is open here and their Stack Overflow is open there, and that's just so they can get their work done. And I love the idea now that maybe you've just got your IDE open. You do have a Gen AI assistant in there, but when you need to, you can also call out to the community and then it's going to ping you back. And we're hoping there'll be, like you said before, maybe even a value there where people can start contributing answers back, like, “Oh, I asked this question. Eventually this was provided. Now I approve it, or I tweaked it a little bit once I had it in the IDE, and now I'm sending it back and it can flow back to our public platform where people can find it again.”

RP Even more so, they can provide code snippets and things like that– the full answer as to how it actually worked for them– and really interact with the question they're asking and the answers they're given right there in the IDE. And so what we're trying to do with the beginning of this relationship is bring it directly to them and get them involved at an earlier stage as part of their whole process– not let the LLM be the end, but the beginning, of the conversation inside their IDE. Now, we've experimented with this a bit inside our own site as well. You can see inside our Labs area we've got a version of this going, but as we evolve the future of the Stack Overflow platform, we're really focused on how these capabilities can drive interaction and collaboration inside the community, not replace it. And so that's one of the core areas that we're focusing on in our research right now.

BP One of the things we mentioned when we did this announcement– we have sort of a landing page for folks who are interested in Overflow API and perhaps training on our dataset– was that we found if some of the well-known models like Code Llama, for example, were fine-tuned with Stack Overflow data, they outperformed a model that was just code-tuned. Why do you think that is? I have my own theories to share, but I want to hear you go first. And what does that say about– I always want to say the structure of our data, but that has a different meaning, so I'll say maybe the syntax of our data– or the way in which Stack Overflow's crowdsourced knowledge community has always arranged its questions and answers?

RP I would go directly to the source and say that the original purpose of our community is all around curation of knowledge and building that catalog of the world's software development knowledge. And the format that was chosen– and that has held up for the last 15 years– that question and answer format where we're really looking to drive towards the best possible answer, but continue a conversation as part of that, with multiple possible answers involved, is, bar none, the best way to train. On the human side, being able to actually look at that and understand the conversation flow as it's happened, as it's evolved, as technologies have changed and new answers come in– that whole system that has been highly useful to our community members is also very useful for training models. And so if we went back in time and tried to design a better system, I'm not sure we would have been able to.

BP Right. It's interesting to find that there is disruption, obviously, as you said, to our public site with the introduction of these AI assistants and code gen models, but in some ways our knowledge community is perfectly suited to it. It's in that Q&A format, which is typically how you're going to be interacting with these things, asking it to help you generate code or to help you find a piece of knowledge within your company's database. And then when an LLM is training, it has no way to assign an accuracy or a recency or a best-of score to something, which is what our community can do. And to your point about the conversation that ensues, that mirrors, in a lot of ways, chain of thought, which is kind of what they're hoping the next set of LLMs will do. Ask yourself a question, but then challenge it and then reflect on it and then critique it and then let yourself maybe vote, literally. They’ll have agents in there voting on which answer they think is best. So I'm kind of excited about that. To think that maybe Stack Overflow is unique in that way and that folks who therefore have Stack Overflow for Teams inside of their company– I went and did the case studies with the Bloombergs and the Microsofts of the world– may have a unique advantage because they have curated their data in a way that is especially well-suited to training an LLM.
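
As a rough illustration of how that Q&A shape maps onto model training, here is a hypothetical sketch of flattening a single question-and-accepted-answer pair into a supervised fine-tuning record; the field names and JSONL layout are illustrative, not any provider's actual spec:

```python
import json

# A hypothetical, simplified record shaped like a public Q&A pair. Field names and
# the JSONL layout are illustrative only, not any provider's training format.
qa_pair = {
    "question_title": "How can I get all keys from a JSON column in Postgres?",
    "question_body": "I have a jsonb column and want the distinct top-level keys.",
    "accepted_answer": "SELECT DISTINCT jsonb_object_keys(payload) FROM events;",
}

def to_training_example(pair: dict) -> dict:
    """Flatten a question/accepted-answer pair into a prompt/completion record."""
    prompt = f"{pair['question_title']}\n\n{pair['question_body']}"
    return {"prompt": prompt, "completion": pair["accepted_answer"]}

with open("sft_data.jsonl", "w") as f:
    f.write(json.dumps(to_training_example(qa_pair)) + "\n")
```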

RP Absolutely. And when we look at it, priority number one is always going to be how can we make it easier for people to interact inside the community? How can we continue to drive towards the best answers for the questions that are being asked by the people as a part of our community? How do we continue to make that system easier by leveraging AI capabilities, but certainly always keeping the human as the person who is validating, driving the answers, driving the communication and collaboration? And so we're looking at how we continue to speed up that cycle. Our number one pages are pages like our question asking page. That's where everyone lands. You search through Google, you search through somewhere else, you land on our question and answer page. How do we make it so that when you ask a question, you're asking a valid question? How do we make it so that that question can be taken up by the community quicker? A lot of our basic concepts around what we were calling Staging Ground this last year are certainly becoming much more valid and something that we want to invest in here in the near future as part of that question asking process. In the future, we want to help speed up the process for answering questions as well. This is something that is important, but it has to be the humans answering. And so we're looking at how AI can make that easier and how we can help with that process. We’re doing a lot of research on this right now to evolve our strategy, but our community is number one. It's core to what we do, and making it so that people can learn from each other and learn from the community is our number one driving factor.

BP So it seems clear from the fact that we have this API page that we hope to make this a broader business than just Gemini. Do you want to speak to that for a second? 

RP So as we're thinking about our Overflow API service, really what we're driving towards is creating strategic partnerships with a limited number of LLM providers where they can have access to our communities, along with creating a commitment between us and these other companies to help support the community, help drive interaction in the community, help feed back to the community, and be a part of that community with their efforts. Google is stepping forward as our first primary partner on this, and they're stepping up. From a socially responsible AI perspective, they really are focusing on how they can help bring more people into the community, get them engaged, and help them drive more interaction in our community. And on the other side of that, the ability to access our data and train models off of it is valuable to them. So this is the give and take of the relationship that we're creating. It's great to have such a strong first partner here, and we're seeing a lot more companies out there who are interested in coming to the table and working with us to become a part of our community, not just consume from it, and that's key.

BP One of the things that I wrote about on the blog is that it has to be this virtuous cycle, or one hopes it has to be this way, otherwise, there's a sort of tragedy of the commons. If people stop contributing to Stack Overflow and the questions and answers are just happening in private chats inside of a system, then in the future, the LLMs have less to train on. They have less to verify. They need that public knowledge being created by humans to continue if they want to stay current and continue to improve and be able to answer questions that are about contemporary code. Things change so fast with different languages and frameworks and technologies. If a training cycle takes 6 to 10 months, it's going to be fairly out of date by the time it gets published, so it's important that there's a give and take there. 

RP Absolutely. 

BP So let's move over for a second and chat a little bit about OverflowAI. We announced a few things on the roadmap. There's a few things that people can check out in the Labs page. What are you excited about there and what can folks look forward to this coming year? 

RP I think that we've done a lot of experimentation. We've been looking to solve, or at least get answers around, whether we can provide answers to questions faster. So inside of the Labs area, we've created our own conversational chatbot, and we've gotten great feedback from the community members who have tried it out. We've also created a summarization capability there where we're summarizing the answers when you're searching inside of our site. Both of those have been pretty successful, but with success comes caution. When we talk about our own conversational chatbot, what we found is that it actually lowers people's desire or interest in interacting with the community, and that's something we don't want to step in the way of. We don't want to put a capability into our site and our community that pulls people away from interaction, and so we're taking a little bit of a step back on that and trying to determine how we can leverage these tools in non-typical ways. A chatbot is interesting, but everybody's got a chatbot. But how do we answer questions, or how do we make it easier for people to ask questions of our site? How do we make it easier for them to interact? And so the evolution of what we've been doing in Labs is really going to be driving more towards interactive capabilities in the site where we’re helping with the question asking process and we're helping with the moderation process. There's certainly something that we need to invest in there around how we manage our site and take a little bit of the load off of our moderators, who do an amazing job and commit a lot of time to this. How do we make their lives easier? And then how do we make it so that when I ask a question, that conversation starts faster? How do we give tools to the people answering questions so that they can answer questions more easily, they can get into the flow of answering those things, and we can get the dialogue going quicker so that people don't have to wait as long to get an answer? If I ask a question and I have to wait days for an answer, most likely I've found another solution. And so how do we integrate that into the community in a way that facilitates the conversation and speeds that up? So that's where a lot of our thinking is going as we evolve the AI capabilities we've been implementing into the community. I think you're going to see us take a step back a little bit from the conversational chatbot-style capabilities. We will be investing in search and summarization on our site to make it easier to find solutions, and then we're going to be investing kind of across the board in a bunch of different capabilities: question asking first; moderation, which is definitely an area we want to focus on; and personalization– how do you create the experience for you on the site, and how do AI tools make it easier for you to find questions to answer, find questions and answers that you're interested in, subjects that you're interested in learning about? So there are a lot of capabilities we can build there. And then eventually we get to the point of how we speed up the answer process– certainly make it so that humans can answer questions faster with more tools available to them as part of that process.

BP Right. It'll be interesting to see what happens. I know there have been some papers recently where you're writing code and then, immediately after, it writes the unit test for you, and you can just check your work and flow from there. So maybe that would help people get good answers up quicker that can be verified and accepted.

RP How do we make your life easier is definitely a part of that. 
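
As a toy illustration of the “write the code, then immediately generate its test” loop Ben describes, here is a hypothetical pair: a small helper function and the kind of unit test an assistant might propose alongside it:

```python
import unittest

def slugify(title: str) -> str:
    """A hypothetical generated helper: lowercase a title and join its words with hyphens."""
    return "-".join(title.lower().split())

class TestSlugify(unittest.TestCase):
    """The kind of check an assistant might emit right after the function."""

    def test_basic(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

    def test_collapses_extra_whitespace(self):
        self.assertEqual(slugify("  Stack   Overflow  "), "stack-overflow")

if __name__ == "__main__":
    unittest.main()
```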

BP And I've enjoyed, like you said, the search part of it, being able to go into Slack and ask a question about something related to some obscure HR document and I don't know where it is and then I get a nice summary answer with some links to what I needed. That's been a great little feature. 

RP On the Teams side of the product– Stack Overflow for Teams, our enterprise product– a lot of these capabilities are going to be built in. Summarization is something we're launching as part of our OverflowAI initiative, along with Slack integration, IDE integrations, and a myriad of other partnerships being driven by our data conversations with our major partners. So on the Teams side I'm really excited about our ability to integrate with people's internal AI and LLM environments, and to help people get to answers quickly inside corporate environments, where it's much simpler to say, “I want this answer, but I want to be able to attribute why we came to this point, and I want to move quickly into interacting with my colleagues inside of our smaller community”– which is what we would consider the Teams product to be.

BP Nice.

[music plays]

BP All right, everybody. I want to say thanks so much for listening. As we do at the end of every show, let's shout out a Stack Overflow user who came on and helped to share a little knowledge or spread their curiosity. A Great Question Badge was awarded eight hours ago to verygoodsoftwarenotvirus. “How can I get all keys from a JSON column in Postgres?” If you ever wondered, there's a question and an answer for you, and it's helped over 86,000 people. So verygoodsoftwarenotvirus, we appreciate your curiosity and congrats on your badge. As always, I am Ben Popper. I am the Director of Content here at Stack Overflow. Find me on X @BenPopper. If you have questions or suggestions for the pod or you want to come on and talk about something, shoot us an email: podcast@stackoverflow.com. And if you enjoy the program, leave us a rating and a review.

RP I'm Ryan, new Chief Product Officer for Stack Overflow. Excited to interact with the community and be a part of this amazing community overall. The best way to connect with me, of course, is through our site, but I would also say to find me on LinkedIn. More than happy to connect with people and answer questions directly. Love to get the feedback and love to work with people on where we take our community as we grow. 

BP Sweet. All right, we'll put those links in the show notes. Thanks for listening, everybody, and we will talk to you soon.

[outro music plays]