On today’s home team episode: a new study confirms that AI isn’t putting us out of business, why tech layoffs have been good for share prices, and the programming students learning to code with Copilot.
AI-generated code is “not equivalent to reliable and robust code, especially in the context of real-world software development,” according to a new study whose title got our attention.
Tech layoffs continue in the wake of the pandemic hiring boom, sending some share prices into the sky.
Take a look at how AI coding assistants are already changing the way code is made.
Shoutout to Stack Overflow user nonopolarity, who earned a Great Question badge for asking "Can someone explain SSH tunnel in a simple way?"
[intro music plays]
Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, API era edition. I am Ben Popper, Director of Content here at Stack Overflow, joined as I often am by my colleague, Ryan Donovan. So there was a paper released last month, January 2024: "Can LLMs Replace Stack Overflow? A Study on the Robustness and Reliability of Large Language Model Code Generation." Ouch. I want to caveat this at the beginning by saying that there's a rush of academic studies trying to figure out how you get LLMs to generate reliable code, and then more importantly, how do you measure that? What are the benchmarks? Because this is a brand new field, this study proposes its own idea. They have a dataset called RobustAPI for evaluating the reliability and robustness of code generated by LLMs. They take 1,208 coding questions from Stack Overflow covering 18 representative Java APIs (they picked Java because it's so widely used), summarize the common misuse patterns of those APIs, and evaluate what the LLMs spit out. With GPT-4, which is cutting edge, 62% of the generated code contains API misuses. They wanted to focus on this to point out that some other studies have only looked at whether the code is functional: when you run it, does it perform as intended? Versus maybe you're introducing something into the code that would degrade its quality, security, or memory use. So I thought that was an interesting metric. I don't know, what do you think about looking at it from that perspective?
Ryan Donovan From the robustness? What's the metric, the misuse of APIs?
BP Misuse of API as a yardstick for code quality.
RD I think that's a good one for testing LLMs, because API guidelines and how you use them are generally pretty well documented by whoever provides the API.
BP Good point.
RD So if the LLMs are misusing them, they are probably making stuff up. They're making pretty rookie mistakes.
BP Gotcha. It says, “Generated code snippets are missing boundary checks, missing file stream closing, failure in transaction completion, et cetera. Even if the code samples are executable or functionally correct, misuse can trigger serious potential risks in production such as memory leaks, program crashes, garbage collection, et cetera, et cetera, et cetera.”
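The misuse patterns quoted here, like missing boundary checks and unclosed file streams, are easy to picture in code. The study looked at Java APIs, so this is just an illustrative Python analogue (not an example from the paper) of a snippet that runs and returns the right answer in the happy path, yet still misuses the file API:

```python
# Typical "functionally correct but fragile" snippet: the file handle is
# never explicitly closed (resource leak), and there's no boundary check
# for an empty file, which silently returns "".
def read_first_line_risky(path):
    f = open(path)
    return f.readline()

# More robust version: the context manager guarantees the stream is
# closed even on error, and the empty-file boundary case is explicit.
def read_first_line(path):
    with open(path) as f:
        line = f.readline()
    if not line:
        raise ValueError(f"{path} is empty")
    return line.rstrip("\n")
```

Both functions "work" on a normal file, which is exactly why functional benchmarks alone would miss the difference between them.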
RD That's a little different, I think. That's the sort of standard problem where this is not secure code you're writing. I think we've had that with code samples on Stack Overflow. If you just get the code sample, you don't get the context around it. You don't get all the comments, you don't get the other answers.
BP Yes, it's important to point out that they found a 62% error rate or something like that, and they linked back to some earlier work that studied the robustness of code on a forum like Stack Overflow, which found that if you just copied and pasted from there, something like 29% of the code would have a security issue, or 42% would contain a deprecated API. So not as bad as the 62% from the LLM, but obviously far from perfect.
RD And I think the additional context that you lose is that when you're copying code from Stack Overflow, you know you're copying demonstration example code, whereas when you get it from an LLM you think, "Oh, this is something I can put in production. It should be good code." Maybe people should be treating LLM code the same way: you have to make sure that it works and isn't giving up your database.
BP We'll share a few links here. I do think this is an issue we'll probably return to again and again this year, which is, what are folks doing with code generation inside of the organization? To what degree can it be a big productivity boost versus actually taking time away because it's introducing more errors than you would have just writing it yourself, and then what yardsticks are we going to use to measure this from a code reliability perspective? Maybe think of that as a DevOps or a testing or an SRE perspective, and then legal and security are issues that we'll have to sort out because you don't know the licensure and you're not sure what security risks you're introducing accidentally.
RD The more people use this, the more they're going to find the limitations and the benefits. They're going to find more of these cases where it's just giving you broken code, or find that it's really good with certain prompts.
BP So one other thing I wanted to talk about in terms of how things are changing: it feels like it's at the margins, but at scale it could really change things. Shopify announced this week that they would be adding in some basic image generation capabilities. Okay, you have a product? Well, now you can change the background. You can remove the background from the photo you shot and we'll give you a bunch of different options. Not a big deal, but for a small merchant this is a great tool. But the really interesting thing they did, which we've been working on at Stack Overflow and you and I have written about, is adding semantic search. So now you can go on and ask a question in the search bar that previously would have been pretty inaccessible, like "warm and comfy clothes for winter in a Scandinavian style" or something like that, and it's going to be able to return something to you based on that, and that's not a way we've ever been able to shop before. I don't know how it's going to change things, but I think what is interesting about it is that Shopify is at such a huge scale. There are so many merchants on there, so many customers coming through every day, that people are going to start to touch this stuff and we'll have to see what kind of impact it makes.
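Semantic search of this kind typically works by embedding the query and the catalog entries as vectors and ranking by similarity, rather than matching keywords. Here's a toy Python sketch of that ranking step; the bag-of-words "embedding" and the product catalog are made up stand-ins for a real trained model and real data:

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words vector.
    A production system would call a trained neural model here."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)  # Counter returns 0 for missing keys
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical product catalog: name -> description.
products = {
    "wool sweater": "warm comfy wool sweater scandinavian winter knit",
    "beach towel": "light cotton beach towel summer stripes",
}

query = embed("warm and comfy clothes for winter in a scandinavian style")
ranked = sorted(products, key=lambda p: cosine(query, embed(products[p])),
                reverse=True)
```

With real embeddings the match works even without shared words ("cozy" would land near "comfy"), which is what makes a query like the one above answerable at all.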
RD And I think this is a good consumer-friendly version of generative AI. I want some boot-cut jeans; I don't want to just go to the regular providers. This will also surface smaller merchants. On the other hand, are we going to get an Amazon situation, with storefronts flooded by low-quality merchants?
BP Right, that's true. I've read a couple of disturbing stories recently about old websites, The Hairpin was one that was kind of like a Gawker-ish site that had great writers and then closed down and now it's been resurrected as an AI content mill. So if people end up there, there's just a bunch of stories that are written by authors that don't exist. So that's one of the unfortunate outcomes.
RD My Kindle keeps recommending books that are clearly AI generated. I was like, “Oh, no.”
BP It knows what you're into. I wonder why it's doing that.
RD Well, it's better than the romance novels it kept recommending to me.
BP Why is it so off for you?
RD It's just the regular ads that show up on the main screen.
BP Oh, I see. It's just the, “This is for everybody.”
RD This is a story about a cowboy billionaire murderer who's my best friend's dog or whatever.
BP All right, I got two more before we sign off. On a sad note, a big story in The New York Times this week: Technology Companies are Cutting Jobs and Wall Street Loves it. This is really unfortunate. We've seen a lot of layoffs in the tech industry, startups and massive companies. I believe– don't quote me on this, don't take this to the bank– that the cuts that are being made still do not bring the companies below the number of engineers they had in 2020, that the massive hiring that was done during the pandemic years still puts them ahead of where they started. But unfortunately this can be a self-reinforcing cycle. They make cuts, the shares go up, the people who decide on cuts love it when the shares go up, and so that's not a great outcome for the developers who work there.
RD Yeah. And I read something that this sort of layoff strategy to boost short-term stock prices is something pioneered by Jack Welch at GE. Every year he would lay off the bottom 10% of the company, and then also do layoffs to improve the stock price. And I don't think constantly laying off people is going to make your company thrive.
BP I think there's an interesting question there, which is how many people are going to show you their true abilities in year one and how many are going to blossom in year two or three after they get to know the ropes and understand who they're working with. One year, two years is not necessarily enough time to decide where they belong in the stack rank or whatever it may be. After four, five, six years, you can say, “All right, how much is this person producing for the company,” and make a decision like that.
RD And I think it could be the other way. It could be that this is actually a good thing for refocusing. Like you said, maybe folks overhired during the pandemic. Maybe this is a lot of internal feature bloat. A lot of companies stretching out, going different places and being like, “Well, let's stick to our core business.”
BP I heard some crazy stories recently as I was discussing what's happening in the industry with some friends. Somebody who has worked at tech companies for a long time and is an engineering manager said that prior to the more recent cutbacks, it was not uncommon for people in a high level engineering role or product role at a very large and well established public tech company to sometimes not work for six months. A project they were working on would be shut down. They would want to get on a new project but there wasn't one immediately available and they were basically just on sabbatical. It was like a teacher in the rubber room. And at those salaries, it's crazy, but at that time, the companies didn't care because that wasn't what they were maximizing for. That wasn't what shareholders were clamoring about. So there was certainly, I think, some bloat there.
RD And I think Silicon Valley the show parodied that where there were just a group of developers that hung out on the roof and drank beer.
BP Right, exactly. The roofers. There's a great article in MIT Technology Review. If you're listening, I would suggest checking it out. It's about a professor from Duke University who decided that for one of his entry level programming courses he was going to change things up and switch from Python to Rust. So this guy has 25 years of experience as a developer, and his takeaway from using an AI assistant in the IDE was that it gave him superpowers. “There's no way I could have learned Rust as quickly as I did without it. I basically had a super smart assistant next to me that could answer my questions while I tried to level up.” So Ryan, I think to your point, it's not a replacement, it's an enhancement in this story. To be able to say, “You know what? After 25 years, I'm switching from Python to Rust in this class,” that's a really meaningful thing. Now all these other kids are going to learn something different and that was made possible with the help of this stuff, so it was kind of cool to read.
RD I think this shows the power of generative AI for somebody who is fairly competent in the field. Somebody who's not competent in the field doesn't know the right questions to ask. It's all unknown unknowns. But for somebody who's a professor who knows Python inside and out: what are the parallels? How do I find a way into Rust?
BP That's interesting. They would probably ask a lot of questions like, "In Python, I do X. How can I do that in Rust?" Just as a corollary to this, though, there was a study released this month, January 25th, 2024, that reached the opposite conclusion: "We find disconcerting trends for maintainability. Code churn (the percentage of lines that are reverted or updated less than two weeks after being authored) is projected to double in 2024 compared to its 2021 baseline. We further find that the percentage of added code and copy/pasted code is increasing in proportion to updated, deleted, and moved code. In this regard, AI-generated code resembles an itinerant contributor, prone to violate the DRY-ness (don't repeat yourself) of the repos visited," putting downward pressure on code quality. So use at your own risk, or use wisely. Don't just copy and paste; that's not what you're supposed to do.
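The churn metric quoted here is concrete enough to sketch: count what fraction of authored lines were changed or reverted within two weeks of being written. The record format below (authored date, date of next change) is a hypothetical simplification; real tools mine this out of git history line by line:

```python
from datetime import date, timedelta

def churn_rate(lines, window_days=14):
    """Fraction of authored lines changed or reverted within
    `window_days` of being written. Each record is a tuple of
    (date_authored, date_next_changed_or_None)."""
    churned = sum(
        1 for authored, changed in lines
        if changed is not None
        and changed - authored <= timedelta(days=window_days)
    )
    return churned / len(lines)

# Hypothetical per-line history for a small patch.
history = [
    (date(2024, 1, 1), date(2024, 1, 5)),    # rewritten 4 days later: churn
    (date(2024, 1, 1), None),                # never touched again
    (date(2024, 1, 2), date(2024, 2, 20)),   # changed 7 weeks later: not churn
    (date(2024, 1, 3), date(2024, 1, 10)),   # reverted within 2 weeks: churn
]
```

A rising value of this ratio is the signal the study flags: code that gets written, merged, and then promptly rewritten.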
RD And I think even when folks are just copying and pasting from Stack Overflow, that's not entirely advisable. I wrote something a while back about how good coders borrow, great coders steal, and the stealing is that you have to understand it and make it your own. The same with the AI code. You can't throw things in there willy-nilly and hope that you have a functioning program at the end.
BP I like it.
BP All right, everybody. It is time. Let's thank somebody who came on Stack Overflow and shared a little bit of knowledge or curiosity. A Great Question badge was awarded yesterday to nonopolarity for "Can someone explain SSH tunnel in a simple way? Explain it to me like I'm five." Well, you've helped over 65,000 people, so we appreciate it, Nono, and thanks for bringing your knowledge and congratulations on your badge. How to SSH tunnel. If you want to know, now you can. All right, everybody. As always, thanks for listening. I am Ben Popper, Director of Content here at Stack Overflow. You can find me on X @BenPopper. Email us, firstname.lastname@example.org, with questions or suggestions. And leave us a rating and a review if you like the show.
RD And I'm Ryan Donovan. I edit the blog here at Stack Overflow, conveniently located at stackoverflow.blog. And if you want to reach out to me on X, you can find me @RThorDonovan.
BP Sweet. All right, everybody. Thanks for listening. We will talk to you soon.
[outro music plays]