On today’s episode, we chat with Marcos Grappeggia, the product manager for Duet, an AI-powered assistant that can help you craft code by suggesting snippets—even full functions—as you write. Grappeggia explains why he thinks tools like this will augment, but not replace, the human developers at work today.
Interested in trying Duet? You can get on the waitlist here.
You can learn more about tuning and deploying your own version of Google’s foundation models in their Generative AI studio.
If tuning your own model sounds overwhelming, you can head to Model Garden, where a wide selection of open-source and third-party models are available to try.
Marcos is on LinkedIn.
[intro music plays]
Ben Popper Logitech just announced the new MX Keys S Keyboard, with a superior low-profile typing experience, enhanced smart illumination, and 17 programmable hot keys. The new Smart Actions in the Logi Options+ app gives you the power to skip repetitive actions by automating multiple tasks with a single keystroke. It's like macros with a little magic. Go to logitech.com to find out more.
BP Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I am here again with my co-host, Forrest Brazeal. We are working on a series of podcasts where we hook up with some of the folks who are building AI tooling at Google working inside of Google Cloud, and more specifically talking about how this impacts the world of software development and the way people who are listening to this might do some of their work. So, Forrest, before we get into the chit chat with our guest today, there's a little bit of news I know you want to cover, so hit us up.
Forrest Brazeal Yeah, thanks as always, Ben. I think the biggest thing, probably, because this is the first time we've been able to really dig in deep to what was announced at IO since IO happened in May, is going to be the release of this thing called Duet. We're going to talk about this much more with Marcos when he comes on, because he's actually the product manager for that technology. But you can think of this as code generation and code completion embedded right in your IDE wherever you work, whether that's VS Code or, if you're working in Cloud Workstations, a hosted IDE like that. You'll have both that code suggestion ability and also almost a little bit of a chat helper that will come in and talk to you about this code. And what I've realized getting into this, Ben, is that the model that underlies Duet, which is called Codey, is one of a family of models that live inside this thing that was put into GA, or public preview, at IO, called Model Garden. And Model Garden is kind of the place where you go for a bunch of Google foundation models. They've got some open source models in there too, and I think eventually the plan is to pull third party models in as well. If you go there today, you'll see several dozen of these models that exist, and they're what they call multimodal models. "Multimodal models" is my new favorite buzzword. So it's not just text-based models: you've got code generation obviously, but also models that do images and models that do speech, and you'll see more and more of those pop up over time. The other thing that's been announced that I think is cool to dig into is something called Generative AI Studio.
So there's Model Garden, which is where you go if you want to just kind of pick the models off the shelf, and you can actually get direct API access to them so you can put them directly into your applications from there. That's kind of your store-bought, ready-to-wear AI models coming out of Model Garden. But if you want to tweak the models and do different things to them, then you take them over into Generative AI Studio. This is where you get to kind of tweak and experiment. And if you want to go even farther than that, you'll have the ability to actually break out into Vertex AI, which is the full giant machine that can fine tune all sorts of AI models for you. And again, we'll talk more about this with Marcos, but at that point you are actually retraining the model partially. You're not just injecting more context into it, you're not just bringing in your own examples for it to train on, but you're actually changing the weights of the model itself. And once you're doing that, it's very important that you have confidence that whatever tweaks you make there are sandboxed and proprietary to you. They're not making their way out into the wider ecosystem or being used to train models for others, and Generative AI Studio does a good job of locking that down.
BP Leaking the weights is the new leaving with a thumbdrive in your pocket.
FB Yes, that's the new pushing your public key to GitHub.
BP Yeah, exactly. I think there's a lot packed in there. I'm excited to talk to Marcos. A little bit of news, few things over at Stack Overflow. We just put up our annual developer survey. Over 90,000 people gave us their responses, so shout out to the community. We have an entire section focused on how developers are adopting or not adopting AI tools and whether they trust the code it writes for them or not, so some really interesting data in there. We've got a deep dive analysis, and then we launched a new labs page where we're showing off some of the experimentation we're doing, whether that's on the public platform or in Teams, using this new AI technology that has been making waves to improve the experience for our public users, whether that's optimizing question titles or for our Teams users, helping them to ingest content, understand what questions are being asked repeatedly, where they're wasting time, and how they can get those into a central knowledge base to fix all of that.
BP All right, enough with the news. Let's get to the meat and potatoes of the show. We want to meet somebody interesting who's working in the field, talk to them a little bit about how this stuff is actually getting built. So Forrest, cue up our guest here.
FB Of course, thanks, Ben. So we're very excited to welcome to the show today Marcos Grappeggia, who is a Product Manager at Google Cloud working on something called Duet for Google Cloud. And Marcos, I'm hoping you can tell us a little bit more about what that is, because it just launched. Not a lot of people have had their hands on it yet, but we're very interested in what this can do.
Marcos Grappeggia Yes. So, hi everybody, my name is Marcos. I'm a Product Manager within Google Cloud's Developer Experience Team, specifically focusing on Duet AI as well as Cloud Workstations, so solutions providing AI assistance for developers as well as a cloud-based IDE and developer environment. The key idea for Duet is essentially providing developer assistance for the developer inner loop directly in the IDE. So let's say you're writing code in VS Code or in JetBrains IDEs such as IntelliJ, PyCharm, Rider, and so on and so forth. Duet is providing you with code completion and code generation directly in the IDE. Let's say you're writing a function: essentially completing the current line of code you have, or even providing full function generation or boilerplate generation for tasks such as writing a new function, writing a test, or trying a new API, as well as providing you with a companion chatbot directly in your IDE, which has the context of the code you have open, which will allow you to ask questions in the context of your currently open code. One point worth flagging: we just announced this at Google IO a few weeks ago. This is currently in a private preview, so I can give you folks some instructions after about how to get access, but essentially you can go to go.gle/ai-assisteddev to ask to join our waitlist.
FB We'll get that into the show notes, I'm sure. So I've got to ask, Marcos, because we've seen a lot of these LLM code generation, code completion tools that are popping up. And Google has released another one just recently which is Bard. They've added coding support for that as of, I think, a couple of months ago. So can you help us situate, I'm a developer, I'm trying to generate some code. When and where would I want to use something like Duet as opposed to going and having Bard just ream out a giant file for me?
MG So both Duet as well as Bard use the same underlying foundational model, Codey. What Codey is, essentially, is a large language model which is trained on a large corpus of data, but also trained specifically on code samples. So I think that's where they're similar. Where it differs, essentially, I see those as being used in different development modes. So let's say you're in the flow, writing your code, trying to be productive. That's a good fit for Duet, essentially helping you when you say, "Hey, I'm writing my code, I'm performing my day-to-day development." But sometimes you may just want to take a step back, pause, and say, "Hey, let me do a Google search, for example." Or maybe you want a scratch pad, someone to ideate back and forth with. That's essentially a good fit for Bard. You can go there and do general purpose ideation: "Let me see, okay, what are some libraries I could use for this?" That's a mode which could be a good fit for Bard.
BP That makes a lot of sense. And so Bard is a little bit more general purpose and Duet is a little bit more honed in on when you're actually writing code, it can help you. Talk to us about some of the engineering challenges. When you're fine tuning an LLM to write code, are there things that people might not be aware of that are specific to that versus an LLM which is trained on legal data or an LLM that's just very general purpose, all the text on the internet.
MG That's a great question. I mean, there are several. Let me focus on two specific ones which come to mind. The first one is hallucinations, essentially how to tame your model, how to ground it. That's the first one. And the second one is essentially toxicity and security of the output. So for the first one, I think the first thing worth keeping in mind is that whenever you are building a tool which is focused on productivity and developers, it's critical for you to produce output which is accurate more often than not. Not only that, you also want to avoid introducing potentially sneaky bugs into the code. So this is a big area which took us significant time, and I can say it's kind of a journey; we're making some good progress there. Essentially: how can I ground the model and make sure it's performing well? A key area for us was creating benchmarks. So for example we say, "Hey, let's pick a bunch of Google Cloud code samples and see how well this model performs. What if we knock out one line? Can the model bring back the same line?" One thing we saw was that just by running those basic benchmarks, we had a good way to see gaps in the model. Essentially, having a set of benchmarks which was scalable and reproducible was a critical part of creating a pattern where you can avoid regressions, because whenever you do benchmarking manually, it gets hard to catch regressions over new builds. That was one key thing here. And the second one was toxicity and security checks. Even though, yes, code is less likely to be toxic than a general purpose chat, we did still find cases where there were some offensive outputs. And the big challenge here was how to have a model which does toxicity checking but doesn't impact your latency too much. That's one thing that took us some time to get. 
Essentially, running a small model that's just good at flagging some of those patterns. And the solution we had here was essentially a compromise: we do some pre-processing steps, that's one thing, but we also do some post-processing steps after a suggestion is accepted. That's kind of the compromise we found here, essentially an 80/20 balanced approach.
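The line-knockout benchmark Marcos describes could be sketched roughly like this. This is a minimal illustration, not the actual evaluation harness; the `complete_fn` model interface and the 0.9 similarity threshold are assumptions made here for the example.

```python
import difflib

def knockout_benchmark(samples, complete_fn):
    """Score a completion model by deleting the last line of each code
    sample and checking whether the model can reproduce it near-verbatim."""
    hits = 0
    scored = 0
    for code in samples:
        lines = code.splitlines()
        if len(lines) < 3:
            continue  # too short for a meaningful test
        expected = lines[-1].strip()       # knock out the final line
        prefix = "\n".join(lines[:-1])     # feed the rest as context
        suggestion = complete_fn(prefix).strip()
        scored += 1
        # Tolerate whitespace-level noise with a similarity ratio
        if difflib.SequenceMatcher(None, expected, suggestion).ratio() > 0.9:
            hits += 1
    return hits / scored if scored else 0.0

# Toy stand-in for a real completion model
def mock_complete(prefix):
    return "    return a + b"

sample = "def add(a, b):\n    # adds two numbers\n    return a + b"
print(knockout_benchmark([sample], mock_complete))  # → 1.0
```

Because the harness is automated and deterministic, it can run on every new build, which is exactly what makes regressions catchable in a way manual spot-checks are not.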
FB If I'm a developer then using Duet, Marcos, and I see a hallucination pop up– I'm assuming there are still cases where that might happen at this point– so what does that look like? What should I be aware of? And what should my development process be to make sure that I'm not using a library call that doesn't exist or something like that?
MG No, it's a fantastic question, and yes, they're getting less and less frequent, but there still is a chance that they happen. I think one of the key principles worth keeping in mind is that large language models are very good at enhancing developers, allowing you to do more and to move faster, and also very good at helping developers stretch slightly beyond their comfort zone or their expertise area and say, "Okay, let me explore new libraries." But they're not a replacement for day-to-day developers. So if you just say, "Hey, here's a comment. Write my function for me," and you don't understand your code, that's still a recipe for failure. The model can still help explain the code for you, give you the high level, but essentially it doesn't replace developers fully understanding the code. I think that's the pattern here. Yes, it gives you some boilerplate, gives you some samples, but where it creates risk is when you get to a state where people are submitting pull requests which they may not really completely understand. That's why I essentially still want to keep a culture of code reviews, and AI tools as an extension of developers, not as a replacement that you blindly follow and then create some potentially buggy code.
FB Although I know that Google is starting to experiment with AI pull request comments and AI pull request reviews. I just saw a blog about that.
MG We are, yes. And actually there are two interesting patterns here. So one thing is Google has this internal culture. I think Google is famous for having a monorepo. All the code is readable by everybody, but, as I said, readable. There is also a whole process of readability reviews, meaning before submitting a pull request, someone who knows Java or who knows Python reviews the code to ensure it's easy for anyone else to read, beyond what basic or advanced linters can verify. And one thing which was very interesting was that in many cases when you go through a code review process, your reviewer may say, "Hey, please fix this, or please refactor for readability." What if you could provide a fix based on the comment from the reviewer? That's something we saw as being promising in terms of reducing the time for a fix to get done by the person who submitted the pull request. It doesn't replace the developer, but it gives you a starting point to think from.
BP Yeah, I think that's one of the most fascinating possibilities here, like you said, these little automated agents that are always zipping around the codebase, acting as linters or doing some of the tidying up work that other people have asked for, even suggesting sometimes, "Hey, this is a place you might want to tidy up." So the code that Duet was trained on, where are you getting that from? And I remember when they announced Bard and we were talking with Paige, she had said, occasionally, if it's from an open source GitHub repo, they'll even link to it. They'll try to do some attribution at the bottom. How do you approach that kind of attribution and licensing for the code that's used in Duet?
MG Let me talk first about the training datasets, the pre-processing, the preparation steps for the model, and then about post-processing, to give you a complete picture. So for the training datasets, similar to what Paige mentioned, we're using similar training datasets for both models. It's essentially a crawl of publicly available Git repos. We're talking about a few billion lines of code of training; you need to have a very large dataset for the model to be able to generalize and find enough patterns for it to be useful. What we do here is, there is this public crawl, and then we do two passes of processing on the data. One is that we filter out non-permissively licensed code, let's say LGPL, AGPL, or commercially-licensed code. That's the first thing. And the second is removing PII, personally identifiable information, or sensitive data from the training dataset. It's from that training dataset that the model comes. One thing which is always worth keeping in mind is that there are many, many ways people provide license information in their code, so there could still be cases where this process catches a percentage of the issues, but not everything. That's where you have a second step, which is a post-processing validation step: after you have accepted the suggestion, we run a recitation checker, which checks for strings of up to 120 characters that match the training dataset almost verbatim, allowing for things like double spaces and so on and so forth. When it finds a match, it shows you, "Hey, this code matches our training dataset. Here is the repo," with a link to the source repo, "and here's the license." So even when you get suggestions drawn from permissively licensed code, this helps you provide attribution. 
You're going to be able to tell back, “Hey, that's where I got this code from.” And in some cases you may get, “Hey, this code has an unknown license, so please proceed carefully.” So that leaves you to make a call directly on your side.
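A toy sketch of that kind of recitation check might look like the following. The corpus, the whitespace normalization, and the character threshold are all illustrative assumptions here; the real checker's matching rules are not public in this level of detail.

```python
import re

# Hypothetical training corpus: repo name -> (license, source code)
TRAINING_CORPUS = {
    "github.com/example/mathlib": (
        "MIT",
        "def fibonacci(n):\n"
        "    a, b = 0, 1\n"
        "    for _ in range(n):\n"
        "        a, b = b, a + b\n"
        "    return a\n",
    ),
}

def normalize(text):
    """Collapse whitespace so double spaces etc. don't hide a match."""
    return re.sub(r"\s+", " ", text).strip()

def check_recitation(suggestion, corpus=TRAINING_CORPUS, threshold=120):
    """Return (repo, license) if the suggestion matches a long enough
    span of training data near-verbatim, else None."""
    needle = normalize(suggestion)
    if len(needle) < threshold:
        return None  # too short to count as recitation
    for repo, (license_name, source) in corpus.items():
        if needle in normalize(source):
            return repo, license_name
    return None

# A suggestion that copies the training sample wholesale gets flagged
copied = TRAINING_CORPUS["github.com/example/mathlib"][1]
print(check_recitation(copied, threshold=50))
```

When a match comes back, the tool can surface the repo link and license so the developer can decide whether to keep the suggestion, attribute it, or discard it.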
BP Always copy and paste carefully. That's what we say at Stack Overflow. It's interesting, we've been thinking about a lot of the same things and how we might approach using gen AI tools, and I think one of the things that has been said over and over is that because we work with a community of folks who are known, who are earning reputation, and who are the ones who contributed all the knowledge, we want to build a system where if that stuff goes into a model, in the end you can still reward the people who provided that information and they can still be surfaced. It doesn't just come out of sort of a black box. So I appreciate hearing that you had a thoughtful approach to it.
FB Yeah, that's fantastic. I just wanted to call out this Codey foundation model that you're referencing is not only a model that is used in Bard and used in Duet, but I believe through Model Garden and Generative AI Studio in Google Cloud you're also able to build your own code LLMs on top of that same foundation model, right, Marcos?
MG That's correct. It's a great question. It's one of the things which our team is working towards, which is allowing you to essentially fine tune those LLMs to your codebase, you as an organization or as the customer. Let's say you have your own proprietary or internal frameworks or best practices. There are actually more techniques here, but two are worth highlighting. One is supervised fine tuning, which essentially provides a list of question-answer pairs, saying, "Hey, when getting prompts like this, that's the recommended way of answering." That's one kind of fine tuning. And the second is unsupervised fine tuning, meaning, "Just feed the model your codebase and have it learn from your codebase," an unsupervised kind of learning here.
BP Right, all your developers’ good habits and their bad habits.
MG Exactly, yes. You do want to do some rounds of cleaning in a codebase first, a pass of, as people say, garbage in, garbage out. But I think there's also a second challenge here, which is that you need a lot of code; we're really talking about millions of lines of code. So the reason why supervised fine tuning tends to be the easier solution is that it's easier for you to create a set of 1,000 or 10,000 Q and A pairs, versus having to have, let's say, millions of lines of code, which may not be viable if you don't have a large enough dataset to start with.
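The supervised path Marcos describes boils down to curating question-answer pairs and packaging them as training records. A sketch of that preparation step might look like this; the example Q&A content is invented, and the JSON Lines field names (`input_text`/`output_text`) are an assumption about the tuning format rather than a documented contract.

```python
import json

# Hypothetical internal-style examples an organization might curate
qa_pairs = [
    {
        "question": "How do we log errors in our services?",
        "answer": "Use the internal wrapper: from acme.obs import log; log.error(...)",
    },
    {
        "question": "Which HTTP client should new code use?",
        "answer": "Use acme.http.Client, which sets retries and tracing by default.",
    },
]

def to_tuning_jsonl(pairs, path):
    """Write Q&A pairs as JSON Lines, one training record per line.
    The field names here are an assumption for illustration."""
    with open(path, "w") as f:
        for pair in pairs:
            record = {"input_text": pair["question"],
                      "output_text": pair["answer"]}
            f.write(json.dumps(record) + "\n")

to_tuning_jsonl(qa_pairs, "tuning_data.jsonl")
```

The point of the contrast: a team can realistically hand-write a few thousand records like these, whereas the unsupervised route needs a codebase large enough for the model to extract patterns on its own.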
BP You mentioned how this underlying model is used in both Bard and Duet, but how does that relate to PaLM? Is Bard using two underlying models? Can you just walk me through that?
MG That's a really good question. PaLM, more specifically PaLM 2, is the foundational model which underlies Codey. Imagine PaLM 2 as the underlying model trained on broad text. That's essentially it; it is a model which knows how to chat with you and converse with you. On top of that, there's an extra round of unsupervised training happening on PaLM 2 to teach it how to code. That's essentially Codey. I think a good analogy here is Sec-PaLM: there's a version of PaLM which is fine-tuned for security, and Codey is a version of PaLM which is fine-tuned for helping with code, both conversing about code as well as doing code completion. There's a more nuanced thing, which is getting a bit into the weeds here, which is that there's also a round of fill-in-the-middle training to ensure it's not only generating code but also doing in-the-middle code completion, which is something that's pretty handy for developers, but I won't get too much into the weeds on this one specifically.
FB The way I think of it is a palm has a lot of fingers, so you've got Med-PaLM and Sec-PaLM and Codey and all these things connected to it.
BP That would be some strong branding. You should suggest that for the next IO.
FB We'll see about that.
BP So you just mentioned folks maybe putting this on top of their own repo or inside of their company's codebase. One of the things that I'm really fascinated by is LLMs' ability to produce more accurate and useful output if you do a multi-step prompt, if you do chain-of-reasoning, or, as you say, if you fine-tune them on the dataset you're going to be working with, so it's not just pulling from this vast generalization, but it knows to go to a tighter source material. So for folks who would do that, I don't know if you've rolled it out yet, it's still kind of, like you said, on a waiting list, but how do they know that their proprietary code is going to be safe? To put all that into a training model seems like it might make some companies hesitant.
MG I just want to reinforce some of the Google Cloud AI principles here, which is, for our Google Cloud customers, we don't use customer data, such as their prompts or the responses they got back, as input for improving our foundational models or our product. And that's something we take very seriously, because the moment you use that as an input, there's a non-negligible risk of potential exfiltration, say, someone being able to extract it via prompt attacks. So we don't use this data by default to train our foundation models. If you go the route of doing fine-tuning, what happens is that you're creating a separate model, which is essentially a customer-owned model. Google doesn't use that model for the product directly; it's something which is specific to your organization, with all the typical controls, such as virtual private clouds, customer-managed encryption keys, and other controls to ensure that it's the customer's model. The idea here is taking a foundational model and creating a separate checkpoint which is trained on your codebase and sits behind your IAM, identity and access management, policies and roles. Essentially, it becomes your model as a customer. Of course, one thing we need to do at some point is create some kind of pipeline that says, "Hey, every once in a while we rebuild against the foundational model to catch all the updates." Essentially it's your model, similar to how you use Vertex AI today, for example, for convolutional neural networks, a ResNet: it's the same thing. It's your model which only you have access to, of course with the proper transparency and control mechanisms for users.
FB Marcos, this has been so wonderful and I want to kind of finish up by asking, as you start to see more developers using tools like Duet and integrating generative AI directly into their development workflows, what are a few best practices you can recommend, or maybe some things you see developers doing that you would advise against to get the most out of these tools?
MG That's a great question. I think one thing gets a bit back to the basics and the best practices of development in general, which is that you always want to have a code review process: senior members helping junior members get up to speed as well as sharing practices. I think the developer outer loop doesn't fundamentally change; you still want to have some round of reviews here. So the best practice I want to recommend is to ensure that developers really understand what they're getting back from the models. That's one thing, and it's something the models can even help with: you can prompt the model to help you understand your codebase. And I think the second thing, which is very interesting, is to try to quantify the impact. There's even a white paper Google wrote on that topic, this whole idea of, "Hey, how can I use acceptance rates, or the percentage of code generated, as a way to measure how much real productivity developers are getting out of this model versus just the impression of productivity?" Similar to how AI is very much an iterative process in image processing, image labeling, or general purpose chatbots, we see this taking a similar direction, which is, yes, you want to have a way of quantifying the impact and do experiments to tweak and say, "Hey, maybe if I do more tuning, or less tuning." That's very much experimental, and I see it as something we recommend, having seen the ML research experimentation approach applied to development tools as well. It's very much in its infancy, but I do see it as an area which can quickly evolve: iterating and doing experiments to optimize those models for the developers over time.
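The two signals mentioned here, acceptance rate and share of code generated, are straightforward to compute once you have suggestion telemetry. The event shape below is a hypothetical example, not Duet's actual telemetry schema.

```python
from dataclasses import dataclass

@dataclass
class SuggestionEvent:
    """One code suggestion shown to a developer (hypothetical telemetry)."""
    chars_suggested: int
    accepted: bool

def productivity_metrics(events, total_chars_written):
    """Compute acceptance rate and the share of committed code
    that came from accepted model suggestions."""
    shown = len(events)
    accepted = [e for e in events if e.accepted]
    acceptance_rate = len(accepted) / shown if shown else 0.0
    ai_chars = sum(e.chars_suggested for e in accepted)
    ai_share = ai_chars / total_chars_written if total_chars_written else 0.0
    return {"acceptance_rate": acceptance_rate, "ai_code_share": ai_share}

events = [
    SuggestionEvent(chars_suggested=80, accepted=True),
    SuggestionEvent(chars_suggested=120, accepted=False),
    SuggestionEvent(chars_suggested=40, accepted=True),
]
print(productivity_metrics(events, total_chars_written=600))
```

Tracking both numbers over time, per experiment, is what turns "it feels faster" into something you can actually compare across tuning runs.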
BP All right, everybody. Thank you so much for listening. We're going to do something a little different today. Usually we shout out the winner of a Lifeboat Badge, which is somebody who came on Stack Overflow and contributed a little bit of knowledge, but today we are going to shout out someone who came on Stack Overflow and contributed a little bit of knowledge specifically to the Google Cloud Collective. So if you're not aware, there are a number of companies that are working with Stack Overflow to create kind of a community within the larger community where folks who are experts in all of this stuff can go, and people can be recognized who are contributing even though they may not work at Google. So in the last seven days, our top contributor is Samuel. Thank you, Samuel. 570 points, lots of great stuff here about Google BigQuery and Google Data Studio. And I'll shout out one more just in case we need it or for whatever it's worth. Alex Mamo, thank you for contributing, Google developer expert for Firebase. So appreciate you coming onto the collective and sharing some of that knowledge with other folks. And if you're interested in learning more about the collective, I encourage you to head on over. We'll share the link in the show notes. And then just to get this out here, because I know we mentioned a lot of this stuff in the conversation you just heard: Forrest, if people are interested in learning more about the stuff that we discussed with Marcos, or they want to join that trusted tester program, where should they go?
FB So if you go to cloud.google.com/ai that should take you to a place where you can sign up for this trusted tester program. That's a waitlist that'll get you early access to any of these tools that are not out yet like some of the new Duet for Google Cloud features. You can play with some of this stuff now, particularly the Generative AI Studio things I mentioned at the top of the show, and you can just look for Generative AI Studio in the Google Cloud console and you should right away be able to get into Model Garden and start experimenting with those models.
BP All right, if you're looking to Duet, you know where to go. Thanks for listening and we will talk to you soon.
[outro music plays]