During the holidays, we’re releasing some highlights from a year full of conversations with developers and technologists. Enjoy! We’ll see you in 2025.
In this episode: Whether AI coding tools are making your code worse, how AI can improve pull requests, building software through prompt engineering, using AI to write cleaner code, and what we can expect from this technology in 2025 and beyond.
[intro music plays]
Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast. If you are in the United States or elsewhere where a big holiday is happening, I hope you're enjoying it. I am Ben Popper, Director of Content here at Stack Overflow. I am off on my Christmas and New Year's vacation, so we have recorded a couple of episodes for you that take some of the best bits of the episodes we've recorded all throughout 2024 and string them together kind of thematically. So today, we're going to be sharing an episode about code generation. Is it making your code worse? Is it making it better? Will it replace junior engineers? How will it support senior engineers? All of that good stuff. The first clip you're going to hear is an interview with Tariq Shaukat. He is the Co-CEO over at Sonar. You may know SonarQube, which was originally an open source project, and they've got a lot of other clean code solutions helping big organizations in the market. So without further ado, enjoy the interview with Tariq from Sonar.
[music plays]
BP It's interesting to think about what different companies demand and how they would approach this. I was speaking recently with someone who was formerly at SpaceX and Starlink, and those companies obviously achieved some incredible results, but he also mentioned that the mentality that it's okay to fail comes from this idea that we're going to launch some rockets and some of them are going to crash and that's going to teach us something, and that that even trickled all the way down to the software, that the push from above was to work as hard as you can, be passionate about this, if you make mistakes and there's a bug or the site goes down, that's okay. That's kind of the ethos within the company. We'd rather be creating new things and learning from our mistakes. So Sonar has its own proprietary or internal AI system that you developed to do code checking, or are you calling on some of the frontier models for this work?
Tariq Shaukat Actually before I answer that, to your last point, what we see actually generate a lot of value for the developers is what we've called inside of Sonar ‘learn as you code.’ The idea of a black box system that's just telling you to fix these issues but we're not going to explain to you why and we're not going to help you fix it, is much, much less interesting than the idea of, “Hey, there's a mistake or there's an issue here. Here's what we think is causing the issue. Here's how you can avoid the issue in the future,” and that's actually one of the most sought after features, both by individual developers and by companies who are struggling with this idea of how you take your junior engineers and make them senior engineers over time.
BP Right, the AI as a mentor pair programmer trying to level you up.
TS Exactly. So then to your question, as you mentioned earlier, 15 years ago there was AI, but it was a different type of AI. It wasn't generative AI, it was the more statistical reinforcement learning, et cetera, and even that 15 years ago was pretty rudimentary. The core of our system is really a deterministic, to oversimplify, rules-based system that has 5,000 or so different scenarios, depending on the language and what have you, that we look at. And so it's a very thorough algorithmic review of your code base to identify issues. What we are seeing now with all the advances in AI is that we can do a couple of things. One of them is that there are some problems that lend themselves to super deterministic systems and there's others where there's gray areas and actually the generative AI type of approach, the more reasoning type of approach, is better at solving those issues. So it's actually expanding the universe of problems that we can look at. We're still using the rules-based system for a large number of problems, because we actually think it works really well. And not everything is going to become a generative AI problem, and we're supplementing it with new approaches that will help us cover issues you couldn't identify before. So that's one piece, but then the other part that is really important is that in the past, we've been able to identify an issue for you, tell you that we think it's this type of issue, so here's an explanation of why we think this is an issue, here's the rule it triggered, here's an explanation of it, and give you some learning. Now what we can start to do is connect remediation to identification of the issue. And this is, I think, one of the more exciting use cases that I've seen around generative AI.
Of course, writing new code is always going to be exciting and interesting and there's people doing great work there, but Stripe put out a study a couple of years ago that said that something like 40-50% of a developer's time is spent on toil work– doing debugging, refactoring, documentation, et cetera. And all of these issues are areas that we think Gen AI can really be helpful in and so for our purpose, we've got all this context about our analysis of your code and what the issue is, and that then leads into how we help create a fix that we can suggest to the developer. Again, we don't really believe in black boxes, so this is suggested to the developer and let them decide is that the right fix or not the right fix? And so we call this AI Code Fix to be super linear from a naming standpoint, and there we rely on external foundation models, OpenAI, Claude 3.5, that sort of thing. And a lot of our focus is on how do you reduce that toil and really let developers focus. There's a lot of people talking about, “Oh, they can focus on architecture, et cetera.” The developers I talk to actually want to code in addition to doing that. But to your point, really thinking about how you free up time so they can do their best work and get the most satisfaction and impact is really important. The other part that we hear time and again with companies implementing generative AI is that you need what we're calling a ‘trust but verify’ approach. You can trust the OpenAI models, you can trust Claude, you can trust any of these systems, but if you're working in NASA or you're working in the banking system or you're working at a retailer and you're working on the checkout system, you need assurance that the code you're writing is actually good code. And so we're kind of used in two modes inside of companies. 
One of them is helping developers catch and fix issues as early as possible so that they don't have as much rework that needs to get done later on, but then secondarily, it is assurance for the companies that this code is being written in the way that we like. Some of these large banks are basically software companies. They've got tens of thousands of developers and they need to be able to say to their regulators and to their boards and to their CIOs and whatever that yes, we have the quality controls and assurance in place. This is something that with Gen AI we are hearing is becoming a problem. Hallucinations are, in many ways, a feature of the Gen AI systems and not a bug, meaning that the way the math works is that they will generate some issues that are incorrect, some code that is incorrect. And the real question is how do you couple the benefit of that with the sort of systems that actually check it and can help you find and fix problems as they occur.
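As a rough illustration of the deterministic, rules-based checking described above, here is a minimal Python sketch. It is not Sonar's engine or rule set: the function name `find_issues` and the two rules are assumptions chosen for this example. Each finding pairs a location with an explanation, in the spirit of the "learn as you code" idea.

```python
import ast


def find_issues(source: str) -> list[tuple[int, str]]:
    """Return (line, message) pairs for two illustrative, hypothetical rules."""
    issues = []
    for node in ast.walk(ast.parse(source)):
        # Rule 1: a bare `except:` catches everything, which hides real errors.
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            issues.append(
                (node.lineno, "bare except: catch a specific exception instead")
            )
        # Rule 2: a mutable default argument is shared between calls.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # kw_defaults may contain None entries; isinstance skips them.
            for default in node.args.defaults + node.args.kw_defaults:
                if isinstance(default, (ast.List, ast.Dict, ast.Set)):
                    issues.append(
                        (default.lineno, "mutable default argument: default to None")
                    )
    return sorted(issues)


sample = (
    "def collect(x, acc=[]):\n"
    "    try:\n"
    "        acc.append(x)\n"
    "    except:\n"
    "        pass\n"
    "    return acc\n"
)
for line, message in find_issues(sample):
    print(f"line {line}: {message}")
```

Because the same input always produces the same findings, a check like this can also act as the "verify" half of "trust but verify" when run over AI-suggested code before it is merged.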
BP That's one of my favorite Andrej Karpathy essays that these are dream machines. I don't know why you think you should use these as a search engine or to generate completely secure code. They were created to make things up. That's what they were designed to do and now we're trying to shoehorn them into these new use cases. But to your point, you can use them in a workflow with different kinds of agents or more deterministic AI, and in that way you can maybe tap their potential while also reducing the downsides. So just as we wrap up here, what are you looking forward to over the next 6 to 12 months? Are there things coming on board in terms of capabilities or is this just about expanding the business? What are you thinking about and planning to work on in the year to come?
TS It's an amazing time. I think you said this earlier, it's an amazing time to be in the software development world because there's more change happening now. I wake up every morning and look at all the startups that have been funded and it's sort of mind-blowing. And for us, we have these three parts of our business. How do you identify issues, and we're applying AI to that, but also continuing the work we do on the deterministic side. That's one core area. I'm very, very excited about this idea of how you remediate issues that are found faster and better. And what's going on in the AI agent world I think is going to be really revolutionary, perhaps not on writing new code or building a new app. It might be, I'm actually not super deep there, but when we find an issue, how do we actually go through a process through a combination of generative AI, reinforcement learning, et cetera, these reasoning chains, to help you fix these issues or at least propose fixes for developers. And I see a lot of exciting work happening in that area. And then the third piece is the most boring part to talk about, but how do you actually make Gen AI coding ready for primetime? One CTO of a large bank told me that they're having an outage a week, that they are root causing back to a generative AI model. And that's not a Gen AI problem, it's a failure of systems and processes problem. I don't know any developer who grew up wanting to be a copy editor for AI-written code. And so they need systems and tools to help them get the full potential. So we're investing really in all three of those areas and I think six months from now it's going to look very different than it does today.
[music plays]
BP All right, everybody. I hope you enjoyed the conversation with Tariq. Our next guest is Bill Harding from GitClear. They have done a ton of research looking at hundreds of millions of lines of code and trying to come up with some analysis for what the impact of AI code gen and other AI tools is on the quality of the code. They are not as bullish as others. They find that AI-generated or enhanced code can often be of worse quality, at least according to the metrics they're measuring by. So this interview is a really interesting pairing with the first one and with the one next, kind of giving that glass half-empty view, at least to some degree, not to say that Bill Harding and GitClear don't have ideas about how AI can be used to make software developers more effective. So without further ado, please enjoy our interview with Bill Harding, the CEO of GitClear.
[music plays]
BP What is the business model for your company, GitClear, and what was sort of the research pool that you were able to access here? Who's working on this code and if there was sort of the control group that you were measuring them against?
Bill Harding So the mission of GitClear is to help developers write better code and to work with less tech debt on a day to day basis. And we especially want to make it easier for developers to review code because pretty much all the developers I talk to would much rather be writing code than reviewing code. And so we've built a code interpretation engine that took upwards of three years for us to initially architect. I think that if we were a standard VC-funded company, we would have probably not been allowed to spend three years just writing a code interpretation engine, but since this was something that I was just personally fascinated by as a developer myself, I wanted for us to be able to recognize code in the same way that developers can, and so not just looking at diffs as a bunch of deletions and additions, but looking at diffs as a combination of deletions, additions, updated code, moved code– moved being when you cut a method and you paste it somewhere else– copy/pasted code, which is what we looked at for this study, and find/replace code. And so having all of that information available to us opened the door to be able to look at really large-scale changes in how the prevalence of copy/pasted code has changed over the last few years, and that is where we were able to see the increase in copy/pasted code. GitClear uses this information to allow developers– well, uses it in a lot of ways– but the main way is that we allow developers to see diff of their work either on an individual commit, an ad hoc group of commits, or a pull request where you don't have to review as much code if you can minimize your attention that is getting devoted towards looking at the code that was merely moved from one file to another or from one part of a file to another. 
When you can have that level of granularity in interpreting the changes that are happening, it opens the door for a lot of time that can be saved reviewing code, and so that's why we started measuring it in the first place, but then it had the happy ability to allow us to look at the changes that were happening both across open source projects where we currently make it possible for people to visit what we call our ‘open repos’ section of our site where there's about 50 projects– React, React Native, TensorFlow, VS Code. All these large scale open source projects, we allow people to browse GitClear’s data for them and so we can also analyze those alongside our customers’ repos for the customers that have opted into anonymized data sharing. And between those two sources, we had I believe about 153 million changed lines of code that we analyzed over the four-year period between 2020 and 2023. So that was sort of what we do and how we used that.
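To make the idea of reading a diff as more than additions and deletions concrete, here is a minimal Python sketch. It is not GitClear's engine, which the conversation describes as taking years to build; `classify_changes` is a hypothetical helper, and it assumes that a line counts as "moved" when the same text, ignoring indentation, appears among both the removed and the added lines of a commit.

```python
from collections import Counter


def classify_changes(removed: list[str], added: list[str]) -> dict[str, int]:
    """Count moved, added, and deleted lines in one commit (toy heuristic)."""
    # Ignore blank lines and indentation so a re-indented line still matches.
    removed_counts = Counter(line.strip() for line in removed if line.strip())
    added_counts = Counter(line.strip() for line in added if line.strip())

    # Multiset intersection: lines that were both removed and added.
    moved = sum((removed_counts & added_counts).values())

    return {
        "moved": moved,
        "added": sum(added_counts.values()) - moved,
        "deleted": sum(removed_counts.values()) - moved,
    }


removed = ["def fmt(x):", "    return f'${x:.2f}'", "old = 1"]
added = ["def fmt(x):", "    return f'${x:.2f}'", "new = 2", "print(new)"]
print(classify_changes(removed, added))
```

A reviewer who knows that two of the changed lines were merely moved can spend their attention on the remaining ones, which is the time saving described above.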
Ryan Donovan And another one I thought was really interesting was the drop in moved code and viewing that as a drop in refactoring. Can you talk a little bit about the thinking behind that conclusion and what it sort of means for AI code?
BH Yeah, absolutely. I think that's one of the more interesting and perhaps underappreciated aspects of what we saw. Historically, moved code is a huge percentage of the overall change that an average developer will make in the course of their daily work. We found that in 2020, moved code was, I want to say around 30%, 25%, so in 2020 that was more than we detected as deleted code, more than updated, more than copy/pasted. The only thing that happened more frequently than moving code was adding code. And so something that I believe is really integral to the average developer’s commit is that you're trying to rearrange code in a way that allows you to reuse similar methods as much as possible, and reusing similar methods typically means moving a method that had started within the module or the class for some specific feature and then extracting that to a utility file or a utility library. And so it's a signature of human developers that they will often be finding opportunities to reuse code, which implies moving code. And since there is no analog in how AI assistants currently work, they don't have a way to suggest removing code, only adding. I think that is reflected in the data that we see, where the 25% of all changes that was moved code in 2020 has now shrunk in 2023 to only 17% of all changes, and that is a pretty significant change relative to where we started. And it definitely tracks with the experience that I have as a developer using Copilot that when there's just a single tab press that can get me the answer to whatever I'm doing, I'm going to be less likely to go looking for an existing method that I might be able to repurpose, and that seems to be what's happening at a larger scale in the last 18 months. There's less opportunities that people are undertaking to take an existing method, move it somewhere else, adapt it, and then use it across those multiple locations.
I think that is what all of the data that has been produced to date suggests. There's this story that AI is suggesting usually good code, usually valid code, and functional code, and by virtue of accessing that valid functional code more quickly than it would take if you had to go through a directory that had a bunch of different potentially reusable methods, of course it's going to be faster to just recreate the method in your file. And especially if you're a new developer or a developer that is new to the project, you might not even be aware that there is an existing method that can make whatever transformation you're looking to make within the existing architecture of the project. And so if you can save the time having to go look up that method, then yes, you're going to be 55% more productive in the short term, but it's really a question of what does that imply a year or two or three down the line when for years there is a low percentage of code getting moved, thus, the implication that there's a low percentage of similar methods being consolidated and DRYed up so that they become reusable. Of course, the other benefit to reusing methods is that you're going to have better test coverage around them. The more times that you use your print currency method or whatever method it is that multiple modules need to access, the more avenues through which you are testing the degenerate cases for that. And so when AI is making suggestions that will work well enough to pass your test, will work well enough to finish your given ticket that you're working on, but then imply down the line you're going to have three to five similar methods that none of them have been really well-tested around the edges, I think that is the risk that a lot of companies are taking right now without necessarily knowing that they're taking that risk. I think that at some point it is going to become apparent.
To the extent that teams are measuring how their velocity is changing over time, I think they will measurably see that as their lines of code continue to increase, their velocity tends to decrease. So at some point, teams that want to maintain a project for 5 or 10 years and have that ability to change things and to add things be as fast in the future as it is when the project begins, I think it's going to be necessary for teams to find opportunities to reuse their existing code. But so far I have not seen any evidence that any team has succeeded in proposing a way to do this, and moreover, I haven't even seen any precedent for how that would be presented to users. I know it's fairly commonly known in projects that you'll have a couple files that are like the dungeon of the repo where if you go into this file it's going to be a mess and hard to understand and hard to maintain and so you just kind of try to avoid that, and it's not very common that teams will go back and specifically make a ticket to revisit methods like that because usually what management is telling you is to get more done, get this next ticket done. We don't have time to go revisit sloppy code just because it is unpleasant to work in, but I think that unless you have some kind of incentive or unless you have a technical leader that is aware that all of the long-term cumulative detriment of tech debt is going to eventually slow down the team to the extent that they can't get their projects done, I don't know that they're going to carve out time specifically for that rewriting and specifically for that cleanup. 
And so what I wonder with regards to the larger token windows is does that mean that teams actually would stop and look for opportunities to clean up their code, and if not, then we would have to hope that these newer LLMs can just afford people opportunities to allow code to be moved, to approve moved code in the course of their normal development and it's a little bit hard for me to imagine what kind of UI that would look like. It's not just going to be a tab because it has to be removing code and adding code. So I think that's a pretty difficult problem but one where I would imagine the interest for it is going to increase as awareness increases that we are adding code more than is advantageous for our long-term interests.
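One way to picture the risk of ending up with "three to five similar methods" is a small duplicate finder. The Python sketch below is only an assumption about how a team might start looking for consolidation candidates: `similar_pairs`, the 0.8 threshold, and the sample function names are all hypothetical, and it compares source text with the standard library's `difflib` rather than doing any real code analysis.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(
    functions: dict[str, str], threshold: float = 0.8
) -> list[tuple[str, str, float]]:
    """Return pairs of function names whose source text is nearly identical."""
    pairs = []
    for (name_a, body_a), (name_b, body_b) in combinations(functions.items(), 2):
        # ratio() is 1.0 for identical text and falls toward 0.0 as texts diverge.
        ratio = SequenceMatcher(None, body_a, body_b).ratio()
        if ratio >= threshold:
            pairs.append((name_a, name_b, round(ratio, 2)))
    return pairs


functions = {
    "print_currency": "return f'${amount:.2f}'",
    "format_price": "return f'${amount:.2f}'",
    "count_to_ten": "for i in range(10): print(i)",
}
for name_a, name_b, ratio in similar_pairs(functions):
    print(f"{name_a} and {name_b} look alike (similarity {ratio})")
```

Each reported pair is a candidate for being moved into one shared, well-tested utility, which is the kind of change the moved-code numbers above suggest is happening less often.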
[music plays]
BP All right, everybody. For our last segment, we are going to be chatting with Saumil Patel. He is a co-founder and CEO over at Squire AI and spent some time at Y Combinator. He's held lots of different jobs in software development– senior software engineer, and architect. Squire AI is a company focusing on helping engineers merge PRs quickly by writing descriptions, automating reviews, and implementing feedback. I think it was a really interesting conversation, especially ideas about AIs as peer programmers, as pair programmers, and where they might not just help, but also challenge and push back. So without further ado, hope you enjoy this conversation with Saumil Patel.
[music plays]
Saumil Patel We went through a couple of iterations, so we moved towards code ownership and we decided to kind of pivot into this idea of helping you understand who's responsible for what part of your code base, so that was another iteration that we did. And the most recent iteration is Squire AI, and with Squire AI, our objective is to create a suite of agents that developers can use to help them automate smaller tasks within the software development life cycle and that's kind of where we're headed right now. So you may have seen this idea of an agent that can just replace software developers. We don't necessarily agree with that. We're not there yet, and we probably won't be for several years. And on the other end, you have the autocomplete of Copilot that's directly within your IDE. We think that the future is somewhere in the middle where we use that LLM, that agentic idea of being able to take a task and bring it to completion, but having it be atomic and be very specific to either test specific pieces of code or review specific pieces of code or document pieces of code or maybe even help you write functions. And where we're headed right now with Squire AI is to build that suite of agents that you can leverage as you're writing code to help you along the way instead of replacing you.
RD Are you talking putting AI agents at build time?
SP Exactly. We are adding AI agents at build time, but we want to add them at every layer. So when you're doing research, when you're writing the code, when you're building in your CI/CD, agents can be permeated throughout the entire software development lifecycle to help you each step of the way. Today, we're starting with reviews, so when you create a pull request, our agent comes in, our agent traverses the code base to make sense of the changes that have been made to help you understand the changes that have been made, and also to help you guide and give you constructive feedback, not just from a, “Here is the diff and I'm going to pass it into an LLM and give it to you,” but we go way beyond that. We search for things, we search for symbols, we search for meaning in the codebase, we search your documentation to give you constructive feedback based on that context awareness that we've built.
RD I think the shift from pure code gen to agents is interesting, and I think that code review part is pretty key to having an agent where it can sort of reflect and not just pump out code. It can be like, “Well, is this good code? Does this fit with the documentation? Does this align with this best practices document?” How do you see agents as operating, especially on this atomic level you're talking about?
SP So there's several different techniques that are emerging right now in terms of how companies are using agents, or how people are developing agents even. What we personally believe is that agents will be atomic and there'll be tiny agents that will work together with other agents to achieve bigger and bigger tasks over time as LLMs become more and more capable. So we are seeing these specific patterns that people are using with agents that include reflection, tool use, planning, and multi-agent collaboration, and together as you combine all of those pieces, agents are able to give each other feedback and they're able to utilize each other. So one of the things that we do is we have agents that go and do research and then we have an agent that is responsible for reviewing the entire diff and then we have agents that are responsible for reviewing parts of the diff and that allows us to have this fine-grained control over what each individual agent knows and doesn't know to avoid confusion that might happen if you're, let's say, looking at a diff that is a thousand lines long and you don't necessarily want to start going into different parts of the code base and getting confused about what you were actually doing.
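The idea of giving each reviewing agent only part of a diff can be sketched in a few lines of Python. This is not Squire AI's implementation: `split_diff_by_file` is a hypothetical helper, it assumes git-style unified diffs whose file headers start with `diff --git a/<path> b/<path>`, and the step that would hand each chunk to an LLM-backed agent is deliberately left out.

```python
def split_diff_by_file(diff_text: str) -> dict[str, str]:
    """Split a git-style unified diff into one chunk per changed file."""
    chunks: dict[str, list[str]] = {}
    current = None
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            # Header form: diff --git a/<path> b/<path>
            # (paths that themselves contain " b/" would need smarter parsing).
            current = line.split(" b/", 1)[-1]
            chunks[current] = []
        if current is not None:
            chunks[current].append(line)
    return {path: "\n".join(lines) for path, lines in chunks.items()}


diff_text = """diff --git a/app.py b/app.py
--- a/app.py
+++ b/app.py
@@ -1 +1 @@
-print("hi")
+print("hello")
diff --git a/util.py b/util.py
--- a/util.py
+++ b/util.py
@@ -0,0 +1 @@
+VALUE = 1
"""

# Each chunk could go to its own narrowly scoped reviewer, while a separate
# reviewer still sees diff_text as a whole.
for path, chunk in split_diff_by_file(diff_text).items():
    print(path, len(chunk.splitlines()), "lines")
```

Keeping each reviewer's input this small is one plausible way to get the fine-grained control over what each agent knows that is described above.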
BP One thing that came up yesterday, I joked about flirty AI because yesterday OpenAI showed off their latest iteration of ChatGPT and it was much more conversational, and not only that, but it brought humor and almost an affectionate attitude towards the person that it was having a conversation with. It's interesting because I heard Sam Altman on the All In Podcast, and he was saying something similar to you. We think that there's going to be these high-level reasoning agents that are created by the largest AI companies, and then they will go out and pick from a series of models or tools, and that will empower them to do all these things that they weren't sub-trained on. There won't be one agent you have for your coding and one agent you have for your language and one agent you have for your biology. There'll be a master model that's really good at reasoning, and it will know, “Okay, I can go out and hit the API call for the product that you're building when I need to do X, Y and Z.” So in that sense, I feel like there is maybe a consensus forming around what's going to happen, although there's no sense in trying to predict where we're going to be in a year or two here. But let me ask you a question. One of the things that Sam Altman said he wanted from an AI was something that was like a great senior employee willing to challenge him when it felt like it had a better suggestion or it was asked for an idea and it said, “I'll think about that, but just so you know, I'm not sure that's the best idea.” And they kind of showcased that yesterday with a developer getting ready for an interview. “How do I look?” “Well, maybe you'll pull off the sleepy coder thing, but not great.” “Okay, what if I put on this hat?” “I wouldn't go with the hat. You look better without it.” That's, to me, a really interesting new wrinkle, which is that the AI has opinions. 
And so obviously when it comes to, “Hey, will you write a function for me or leave comments on this code?” the AI might then bring an opinion. What do you think about that?
SP Absolutely. I think that is the direction we're headed in. I think it's important to mention this paper. HuggingGPT is a paper that I recently actually read a few weeks ago, I believe, and it actually goes into how you can give this agent a task and it's able to go on Hugging Face and find different models and leverage them to achieve that task. So Andrew Ng, he actually demonstrated this in one of the videos where he actually had a picture of a child on a scooter and he said, “I want to see a girl reading a book in the same pose.” So basically the model went and found other models that can help them find what the pose looks like and then generate an image of a girl reading a book. So it went through several different steps and it selected the right models to achieve that task. And so we're definitely headed in this direction. And with tools like Refraction, for example, our agent that reviews code, the objective is to give criticism. The objective is to look for things that are missing or inaccurate or not done the right way, whatever the right way might be, and then you can use specific processes like tree of thought or chain of thought to really try and figure out if that is the way to go. Tree of thought could be a really good example of planning to use in that scenario where you can say, “Here are five different pieces of criticism we can provide. How do they lead to a better outcome in the end?” And you can use the LLM and reason to try and find that best possible path. And maybe it is no criticism, but maybe it's aggressive criticism, maybe you're just going, “Here's a suggestion.” So we're definitely headed in that direction where LLMs should be able to have this kind of divergence of thought and then come back to give you something that makes the most sense.
[music plays]
BP All right, everybody. That is the end of this episode. Hope you enjoyed it. If you have thoughts to share about what's going on with code generation, how it's working inside your company, how you're utilizing it to be more effective or efficient, if you think it's all a bunch of bunk and hype, get in touch with us– email me, podcast@stackoverflow.com. Give me some feedback, let me know if you want to come on the show, we'll talk about it. If you are enjoying a holiday, please rest up, and I will be back in the new year with some new episodes and maybe even some new podcast series. So excited to see you in 2025, hope you enjoyed the episode, and thanks for listening. We'll talk to you soon.
[outro music plays]