The Stack Overflow Podcast

Writing tests with AI, but not LLMs

Episode Summary

Animesh Mishra, senior solutions engineer at Diffblue, joins Ryan and Ben to talk about how AI agents can help you get better test coverage. Animesh explains how agentic AI can expedite and enhance automation and refactoring processes, how Diffblue leverages machine learning techniques to write effective unit tests, and why clear use cases and trust are so important in developing AI tools. Plus: Why Diffblue sees Copilot as a complement, not a competitor.

Episode Notes

Diffblue Cover is an AI agent for testing complex Java code at scale. Check out their docs to get started automating unit tests today.

This article will help you understand the difference between Diffblue Cover and Copilot.

Find Animesh on LinkedIn.

Stack Overflow user Keet Sugathadasa earned a Populist badge by answering a question in the CI/CD Collective: Gitlab CI CD variable are not getting injected while running gitlab pipeline.

Episode Transcription

[intro music plays]

Ryan Donovan The Stack Overflow community has questions, our CEO has answers. We're streaming a special AMA with Stack Overflow CEO Prashanth Chandrasekar on February 26th over on YouTube. He'll be talking about what's in store for the future of Stack, and you'll have the chance to ask him anything. Join us. 

Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your host, Ben Popper, joined as I often am by the Editor of the Stack Overflow Blog, my cohost, Ryan Donovan. Ryan, good to see you. 

RD Good to see you, too. 

BP I'm no longer at Stack Overflow full time, so we don't see each other every day like we used to. It's kind of a treat to get on the podcast now and see your face. 

RD That’s right. I can finally enjoy your presence. 

BP Exactly. But I'm no longer your boss, so maybe the power dynamic is different now. 

RD That’s right. Watch yourself, Ben boy. 

BP Exactly. So unless you've been living under a rock, if you're in the world of software development, you can't go a week, maybe you can't even go a day without hearing somebody talk about the rise of agentic AI and how it's going to change everything, and people on social media claiming to have agents building them businesses that are earning them money, most of which seems like nonsense, but hey. And so we are lucky today to have Animesh Mishra, who is a Senior Solutions Engineer over at Diffblue, on the show to discuss a bunch of this stuff. We're hoping to talk about the rise of agentic AI in coding, what that might mean in terms of a surge in large-scale refactoring, and the difference between LLMs and SLMs– I'm going to have to learn that on this episode. In any case, Animesh, welcome to the Stack Overflow Podcast. 

Animesh Mishra Thanks, Ben. It's nice to be here. I'm looking forward to the feedback that we get from this episode as well, because I'm quite keen on hearing how developers are finding AI encroaching on their software development lifecycle. Developers are usually the ones happy to push new technology onto the unsuspecting masses; now the tables have turned. It's always good to hear how they're taking it.

RD That's right, how they turn tables. 

BP So Animesh, just really quickly, 30 seconds or so, can you tell folks a little bit about how you got into the world of software and development and what led you to your current role? 

AM So my education has been in embedded systems development. I learned how to make chips and program them in Assembly and C. And after that, I got into software development by accident, to be honest. I started my own company. We were doing image analysis in agriculture and trying to give farmers a view of their crops before they go out and lace their fields with pesticides, and then it kind of didn't really get anywhere. But that experience gave me a taste for building things, looking at real world problems and attacking them using my skills, which were creating hardware and software and making them work together. That's how I got into software engineering. For the past five or six years, I've been largely working with enterprise companies, helping them adopt new technology. I worked first as an engineer and then once I understood how the sausage is made, I switched to the sales side to see how it's sold. And so currently I specialize in helping large companies understand and make sense of emerging technology and identify the right thing that delivers value to them.

RD Well, and we're definitely talking about an emerging technology today– the AI agents. There has been a lot of talk about it, some of it pie in the sky, but you have a very concrete use case for it. Can you tell us about that? 

AM So I would think about this AI challenge, and it is a challenge to people looking at it because there's too much noise and you have to figure out, “Okay, is any of it useful to me?” I would actually think of it not as an AI problem, but an automation problem. Ultimately that's the promise being delivered: you can automate away some tasks. Whether you do it using AI or just a shell script, that's an implementation detail. That's how I see it. So in that view, we look at the software development lifecycle and particularly where I'm working currently, in testing. Now in testing, there has been some advancement thanks to DevOps, where quite a lot of testing has been automated. In fact, a lot of people call it ‘automated unit testing,’ which is a bit of a scam because it's only the running of the tests that's automated; developers still have to write the tests themselves. And so that's the final frontier, that's the Rubicon that ought to be crossed: can we take away that manual task of writing a unit test and automate it? And there are many ways to do it. We do it using techniques that are very different from large language models, and that's what makes us interesting and that's where customers see the value.

BP So I want to get into what you're using and how it's different from LLMs, but I think that sales engineer might be the most valuable thing inside of a company based on my experience at several different SaaS software companies now, or even at a hardware company like DJI, which is to say, you get into a room with somebody and you're trying to sell them an expensive solution, and the person you're talking to is probably pretty technical, and if you're technical, you're going to get along and you're going to be able to walk them through things. You're going to be able to troubleshoot things. I just think it's so challenging in some ways for salespeople to sell software to developers because developers immediately have an allergic reaction to marketing and to salespeople. They would so much rather talk to a sales engineer, so I think that's just such an interesting role. But okay, I'll take the bait. You're not using LLMs. Big mistake. You're not going to get your next funding round at the valuation you wanted, but that's okay. But tell us what you do use and why you think that's a better solution.

AM So as I talk about it, I want you to imagine an appliance. In fact, I want you to imagine the kind of automation that every single one of us listening to this podcast is used to, and that's a toaster. Every single one of us has a toaster, I hope. 

BP Got a great toaster. 

AM You wake up in the morning, half groggy, you put your bread in there, press the button, walk away, come back, and you get toast– predictably, repeatedly, every single day of your life, and that's what you like about the toaster, as an appliance, as an automation. It's not taking your time. It's not holding you up while it's making the toast; you can go brush your teeth while the bread is being toasted. And so when we looked at automating parts of the software development lifecycle at Diffblue, that was the design ethos we went with: we want something that is repeatable, predictable, and customizable. There are dials on the toaster where you can adjust it– I have a friend whose toast, once he's done with it, looks like it was nuked. 

BP He wants to turn it into charcoal.

AM And some people just have it very light. So those are the three key design principles that we have– it needs to be predictable, repeatable, and customizable. Now, once you've decided that that's what you want to build, as an engineer, the choice of tools you can use to build it becomes quite constricted. So the reason we don't use LLMs is not because we are anti-LLM; we're not. I personally think they're very useful and there are certain roles in the software development lifecycle that only they can fill. But because we assigned ourselves these design criteria, that leaves us exploring techniques which can deliver predictability. I'll come to the latter two, but I think the key one of those three– repeatable, predictable, and customizable– is predictability. And the reason it's important is that with any new thing, if you don't understand how it works and you can't predict what it's going to do when you kick it or poke it, you're not going to trust it. If you're not going to trust it, you're not going to put it in your house or in your day to day workflow. And to make it predictable, the person making it needs to understand all the many different ways it can go about generating the output, which is one of the reasons why we have looked at LLMs– we have investigated adopting some of them into our internal model– but we've never really taken the plunge. The other big reason is we predate LLMs. Diffblue was started a long time ago. It's a spinoff from Oxford University. We're based in England, the company's headquartered in Oxford, and we've been in business for about seven years. So we kind of predate a lot of the hype around LLMs as well. 

RD And especially in something like testing, you don't want your tests to vary from run to run on the same code. You want them to be able to test the thing correctly without you having to go back and change them. 

BP I kind of love the idea of a nondeterministic toaster where it's like, “I'm going down today. I don't know what I'm getting,” or maybe it's got an idea for something. It's like, “I haven't had toast this way in a while.” We're going to talk about agents– I did hear Sam Altman or somebody at one of the big AI labs talking recently about what it's going to be like when you have this agent that's kind of like your executive assistant, and they were saying, “I hope that agent is tuned so that it pushes back on me a little, so that it criticizes me a little, so that it doesn't just have a yes-person, yes-agent personality.” But I totally understand what you're saying. I think everyone in the software development world, anyone at a big organization with a lot of centralized software development, is saying, “I know I need to take advantage in some way of what's going on in terms of AI automation or I'm going to fall behind. I know I want to offload the toil work that my developers don't like so they can focus on the challenging and engaging stuff, but it has to be predictable, repeatable, and trustworthy.” You have to know at the end that whatever comes out, if it's code, isn't going to produce more issues on the back end than the time it saved you on the front.

AM That's right. And so that was the challenge set for us. So we said, “Okay, let's look at the techniques that can be used to help software.” Diffblue's origins are actually very different from testing. When the tool was first created, the founding vision was to create a solution that can help developers understand the effect of their code changes. So imagine you're a developer working in a very large application. You might not understand it fully end to end– 10 million lines of code, who knows what it does– and you're tasked to go in and make some changes to that code. It might be a legacy application in the business that does critical business tasks. Now at that point, if you have a very good test suite in that application, then that is gold, because then you can go in, start tinkering, run your test suite, and see how your changes affect the behavior of the application. If you don't have that– and a lot of legacy applications don't, because although unit testing has been around forever, the uptake of it hasn't; the farther back in time you go, the fewer unit tests you seem to find in codebases– then as a developer, you know what you need to do. You can make that code change, but you're hampered by the fact that you don't know all the side effects of your change.

BP Okay, so this is the diff in Diffblue. You're starting from the changes. I got you now. 

AM Exactly. So initially the idea was, if you make a code change, you give Diffblue the diff of that change, and it can show you how that affects all parts of the application. 

RD How it cascades down throughout the application.

AM So you change one line and you run Diffblue, and your whole codebase lights up like a Christmas tree, showing you everywhere that change reaches. 

BP Showing in blue, highlighted in blue. Okay, got it. The diffs are blue. I got it. 

AM The diffs were not blue, actually. Our marketing should definitely take that idea– I'll recommend it. But because it's from Oxford, from the University, and the color of the University is blue, that's where the blue comes from. So that was the origin of it. You write your code, you run this thing, and it tells you how many ways you've broken your code. And then unfortunately, that's not a very marketable product, that one, because people's reaction is, “Oh, there's some value in it, but what do I do with it? It's more of an informational thing for developers.” And so after getting some market feedback, we pivoted and we figured, do you know what? We can make this better. We can take it one step forward. If we know what a piece of code is doing, and we can confidently also say how this change will affect it, we can do two things. We can write unit tests to start with for the whole application, and then as you make your changes, we can update your test cases to show you how the assertions change as you commit your code. And that was the light bulb moment. That's when we identified a real business problem and a real developer need in the industry, which is that nobody likes writing unit tests. Developers will say they do TDD; nobody does it. And so you get all this tech debt– lots of code with no unit tests. Somebody needs to write those unit tests, but that's not the job finished. Once the tests have been written, they also need to be updated as you go and change your code, and that's the modern Diffblue as well. 

RD So how do you write those tests? If it's not an LLM doing the test, what's the sort of AI under the hood that does that?

AM I wouldn't call it AI, because AI has become this very big, deep learning, probabilistic-model-driven notion. So I would go back to the humble beginnings of AI, which was machine learning. What Diffblue uses, if I could say it in the most unsexy manner possible, I would describe in one line as an iterative optimization algorithm that writes the fewest tests that give you the most coverage in an application. That is what it's doing. The marketing buzzword would be reinforcement learning– there are a lot of reinforcement learning techniques in here. The only difference is we're not doing deep learning, because we don't want to be probabilistic, we want to be deterministic, and that's our secret sauce. We figured out a way to do reinforcement learning in a deterministic way so that you always get the same test for the same piece of code. 
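
To make that idea concrete, here is a minimal, hypothetical Java sketch of a deterministic "fewest tests, most coverage" search loop. The class and interface names are invented for illustration and this is not Diffblue's actual algorithm; the point is only that a fixed candidate ordering plus a fixed iteration budget yields the same output on every run.

    import java.util.ArrayList;
    import java.util.List;

    /**
     * Minimal sketch of a deterministic test-search loop: walk candidate tests
     * in a fixed order, keep only those that add coverage, and stop at a fixed
     * iteration budget, so the same input code always yields the same tests.
     */
    public class DeterministicTestSearch {

        interface CandidateTest {
            boolean compilesAndRuns();                      // anything that fails here is chucked
            double coverageGainOver(double alreadyCovered); // extra coverage this test would add
        }

        public List<CandidateTest> search(List<CandidateTest> candidates, int maxIterations) {
            List<CandidateTest> kept = new ArrayList<>();
            double covered = 0.0;
            int iterations = 0;
            for (CandidateTest candidate : candidates) {    // fixed ordering: no sampling, no randomness
                if (iterations++ >= maxIterations) {
                    break;                                  // give up gracefully instead of looping forever
                }
                if (!candidate.compilesAndRuns()) {
                    continue;                               // discard tests that don't compile or run
                }
                double gain = candidate.coverageGainOver(covered);
                if (gain > 0) {                             // keep only tests that add coverage
                    kept.add(candidate);
                    covered += gain;
                }
            }
            return kept;                                    // fewest tests, most coverage found within budget
        }
    }

Because nothing in the loop depends on random sampling, running it twice over the same candidates keeps exactly the same tests, which is the property being contrasted with probabilistic generation.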

RD So it's a chain of conditionals all the way down? 

AM A bit more sophisticated than that, because we can deal with code that is new– it's not that Diffblue can't write a unit test for a piece of code it's never seen before– but there's certainly some conditionality in there as well, definitely some decision trees. If you have a Java project, for example, you will run Diffblue on that Java project. The first thing it does is a static analysis of the codebase just to understand how it's structured. Then it will index all the methods and classes that are part of your project. And then it will look at the built bytecode of the application to start analyzing how the application actually delivers the functionality that it does. So this is something that, again, sets it apart from the LLM-based tools, which look at the plain text source code. For most of our analysis and test writing and validation, we use the built bytecode. The bytecode gives us an advantage in terms of the computational understanding of the code we are looking at, and that is key because of the three things I mentioned, remember– predictability, repeatability, and customizability. The third, customizability: you can only customize things once you understand what's broken and what needs to be changed. So when Diffblue tries to write a test and it can't, it will leave behind testability insights, because it knows what's missing. Those insights can be as simple as, “This method relies on a property that's missing a setter, which means my test can't set it.” Add a setter, run it again, you get a test. But going back to the model: you will run Diffblue on a Java project, it will do a static analysis, it will look at the built bytecode, and it will do what we call data flow analysis to understand how the data flows through the application– what's the entry point, how all the methods are chained together.
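
As a hypothetical illustration of the kind of testability insight described here (the class and the wording of the hint are invented, not actual Diffblue output):

    // Hypothetical class that is hard to test as-is: 'rate' originally had no
    // setter, so a generated test had no way to arrange the state that apply()
    // depends on.
    public class DiscountCalculator {
        private double rate; // previously set only through dependency injection at runtime

        public double apply(double price) {
            return price - price * rate;
        }

        // The fix suggested by the insight: add a setter (or a constructor
        // parameter) so a test can set 'rate' before asserting on apply().
        public void setRate(double rate) {
            this.rate = rate;
        }
    }

With the setter in place, a generated test can arrange the rate before calling apply(), so the insight disappears on the next run.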

RD Do you build out an abstract syntax tree with that? 

AM A kind of abstract syntax tree. We have looked at it, because one of the next steps we might be going towards is refactoring– if we can tell you what to do with the code to make it more testable, maybe we can do it ourselves. So this is something we're exploring as well, and that will take us more towards the AST route, but what we have currently, I won't call an AST. So once we have identified all of this, we will take a method for which you want to write a test and we will run it in an isolated sandbox– literally spin up a Java virtual machine on your own laptop or whichever environment you run Diffblue in– and we will run that piece of code to understand how it affects the application state. That forms the basis of the assertions that we write, because we know the before state, we know the after state, we know how the data is flowing through when that method is called, and so we can identify which methods will need to be mocked, which calls will need to be allowed to fall through, and what kind of assertions we need to test the functionality properly. So in any given unit test case, there are only three parts– arrange, act, and assert. The arrange block sets up your test data. The act part does the mocking and calling of the method to be tested. And the assertion is where the rubber actually meets the road. We do all three in that step. So we will write a unit test after having understood how your code works, and then we will run that test against your code to measure the effectiveness of it, and this is where that iterative optimization loop comes in. We've written a test, we're now running it against your code and seeing how good it is. If it doesn't compile or run for some reason, or it doesn't produce good coverage, we chuck it and predict a better one. We keep going until we've found a test that is good. Now, the definition of good is subjective, actually. For us, a good unit test is a unit test that always compiles and runs, will always exercise the method under test, and will have assertions that reflect the true runtime behavior of that code. If those three criteria are not met, we will discard the test. 
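
For readers who want to see the arrange/act/assert shape he describes, here is a minimal hand-written sketch using JUnit 5 and Mockito (both assumed to be on the classpath); the collaborator and class under test are invented for illustration and this is not generated Diffblue output.

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import static org.mockito.Mockito.mock;
    import static org.mockito.Mockito.when;

    import org.junit.jupiter.api.Test;

    class PriceServiceTest {

        /** Hypothetical collaborator and class under test, defined inline to keep the sketch self-contained. */
        interface Discounts {
            double apply(double price);
        }

        static class PriceService {
            private final Discounts discounts;
            PriceService(Discounts discounts) { this.discounts = discounts; }
            double finalPrice(double price) { return discounts.apply(price); }
        }

        @Test
        void finalPriceReturnsDiscountedValue() {
            // Arrange: set up test data and mock the collaborator.
            Discounts discounts = mock(Discounts.class);
            when(discounts.apply(100.0)).thenReturn(90.0);
            PriceService service = new PriceService(discounts);

            // Act: call the method under test.
            double result = service.finalPrice(100.0);

            // Assert: the assertion reflects the observed runtime behavior.
            assertEquals(90.0, result, 0.001);
        }
    }

The arrange block builds the data and mocks, the act line calls the method under test, and the assertion pins down the observed behavior– the same three parts the generated tests are built from.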

RD So how do you come up with those assertions? Is it just like, “This returns the thing it says it's going to return,” or is there some more sort of understanding of the functionality built-in? 

AM There's more understanding of the functionality built in. We'll start off very basic, literally how a developer would approach this problem. We look at a method and see, okay, what is its return type? What value is it returning? If it's a void method, then okay, is it causing any side effects? What could those be? So that's the first-pass analysis. Then, when we run the code, we're able to observe not just the direct effect of calling that method and seeing what it returns, but also the side effects, because we can see how the application state is changing in memory. That allows us to pick up on side effects which may or may not be intentional. And so when we write our test– and this is a key point I would like to impress upon you– we don't understand the intent of the developer, because the only input to the model is the code. So when we run that code to write the test, we will write test cases and assertions for both the direct effects we're seeing, which are probably the intentional behavior, and all of the side effects, which the developer either might not have thought about or which are intentional– and then they get a test case for those anyway. And this is one of the ways where, again, having a computational understanding of the code really sets it apart from other tools out there, because what we're producing comes from what we have seen by running the code and not from– I don't want to sound disparaging– just pattern-matching chicanery.
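
A small hypothetical example of a test asserting both a direct effect and a side effect, in the sense described above; the Order class and its audit log are invented, and this is a hand-written JUnit 5 sketch rather than generated output.

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import static org.junit.jupiter.api.Assertions.assertTrue;

    import java.util.ArrayList;
    import java.util.List;

    import org.junit.jupiter.api.Test;

    class OrderTest {

        /** Hypothetical class whose method has both a return value and a side effect. */
        static class Order {
            private final List<String> auditLog = new ArrayList<>();
            private double total;

            double addItem(double price) {
                total += price;                       // direct effect: the new total is returned
                auditLog.add("added item: " + price); // side effect: an audit entry is appended
                return total;
            }

            List<String> auditLog() {
                return auditLog;
            }
        }

        @Test
        void addItemReturnsNewTotalAndAppendsAuditEntry() {
            Order order = new Order();

            double total = order.addItem(9.99);

            // Direct effect: the returned value.
            assertEquals(9.99, total, 0.001);
            // Side effect: the change in application state, observed by running the code.
            assertEquals(1, order.auditLog().size());
            assertTrue(order.auditLog().get(0).contains("9.99"));
        }
    }

If the audit entry was unintentional, the extra assertion is exactly what surfaces it during code review.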

BP So we've chatted a little bit I think about where Diffblue came from and how it works, what it does, and we got into one of the topics that we wanted to discuss which is why you prefer the SLM over the LLM, or just an automation that is not the nondeterministic toaster. Is there agentic AI involved in Diffblue? I know that was something you wanted to discuss. Did you want to discuss it more generally, sort of like, “Hey, this is something we see happening around the industry and something that we've got to work side-by-side with,” or is Diffblue using agentic AI or suggesting it as part of the toolkit you're bringing to customers?

AM I'll be honest with you, I still haven't fully understood the term ‘agentic AI’ because I've seen people use it differently. So I will tell you what I understand of agentic AI, which might be wrong, and then I'll also share where Diffblue fits into that big picture. The way I understand it, the first wave of LLM tools were what you would call ‘copilots,’ assistants a human could use to accelerate some part of their work, but they could never do something all by themselves; the person would need to be in the middle of the task. It's just an accelerator. Agentic AI would have agents that can go and do the job from start to finish without any human intervention– that's my understanding of it. And by that definition, Diffblue is an agentic AI tool, because the way most teams scale Diffblue is by putting it in their PR processes. That's the most powerful way to use Diffblue, and it also speaks to the idea of interfaces. If you think about it, the reason ChatGPT was successful was not that it was the first LLM or the first thing to use transformers; it was successful because it figured out a way to provide an interface laypeople could use to access AI, and that interface was the chat box. And that was two or three years ago. We still haven't figured out a better interface for AI. Now they're starting to come through. So my understanding of agentic AI basically is just giving people better interfaces to access the AI functionality that they need in their lives. Now, there are three ways to use Diffblue. You can use it in your IDE as a plugin, just like other tools. You can also use it as a command line tool on your developer laptop. And everything runs locally, so all of the analysis and test synthesis I've been talking about happens on the device you run Diffblue on. That again sets us apart from LLMs, because you don't need all that GPU power. But the most powerful way to run Diffblue is to automate the automation. That's what I tell my customers. You give it your code and it writes all the test cases for you, and all you need to do as a developer is review them, so you can make sure the behavior in the assertions reflects the functional spec, because Diffblue doesn't know the functional spec. So if you write a method that is called ‘sum’ and you're doing subtraction in the method body, Diffblue will write assertions that demonstrate subtraction. Then it's up to you as a developer to code review that and say, “Okay, that's not the functionality I'm after, which means there's a bug in the code. I'll go fix that.” Now, the best interface to do all of this is actually to put it in the PR process. It becomes part of the furniture, completely independent of requiring any developer to be trained on it or asking the developer to press the right button. And so the way that would work, the interface would be: you write your code as usual, you open a pull request, Diffblue looks at that PR, pushes a new commit with all the test cases updated, and then you do your code review and off you go. The PR interface is where I think– well, customers tell us that we deliver the most value, and this is the UI, if I may call it a UI, that allows us to be called ‘agentic AI.’ Have I understood agentic AI properly? 
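
The ‘sum’ example can be made concrete with a small hypothetical sketch: a method named sum that actually subtracts, and a behavior-derived test that asserts what the code really does, leaving the developer to spot the mismatch with the functional spec in review. The class and the test are illustrative, written here by hand in JUnit 5.

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import org.junit.jupiter.api.Test;

    class CalculatorTest {

        /** Hypothetical buggy class: the method is named 'sum' but actually subtracts. */
        static class Calculator {
            int sum(int a, int b) {
                return a - b; // bug: should be a + b
            }
        }

        // A behavior-derived test asserts what the code really does, not what the
        // name implies, so the assertion documents the subtraction.
        @Test
        void sumReturnsDifferenceOfOperands() {
            Calculator calculator = new Calculator();

            assertEquals(2, calculator.sum(5, 3)); // passes, because 5 - 3 == 2
        }

        // In review, the developer sees the assertion contradicts the functional
        // spec ("sum of 5 and 3 should be 8"), fixes the method, and regenerates.
    }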

RD I mean, I think you said at the beginning that it's just a way of automating things. One of the major problems of computer science is automation, and agents are sort of automating the LLM process. I'm curious, what are the other ways that you've heard people talk about agentic AI?

AM So I have heard agentic AI also being talked about from a point of view of orchestration, so basically Kubernetes. What Kubernetes is for Docker, agentic AI would be for LLMs. 

BP I think I can sort of maybe shed a little light on this, which is to say, I think your definition of agentic is right. I talk to ChatGPT, I ask it a question, it gives me an answer. That's it, it's a single interaction with an agent. I say, “I'd like you to go and build me a website, and when you're done, get back to me,” and then it goes off. Maybe it has computer use, it's a computer use agent, and so it goes off and tries to do that by using a whole bunch of different tools, and if it fails, it tries to figure out why it failed, go back and do it again. And that's kind of what you said. If it produces a test and it doesn't think it's good enough based on the criteria you've set, hey, I'm going to throw this one away and do another one. I think sometimes people use agentic AI, like you said, to mean a mixture of experts, which is to say one agent's job is to build it, another agent's job is to critique it, another agent's job is to refine it. And so agentic, meaning they're all working together to sort of orchestrate a process and get to a great result at the end. And so one is really just a single agent versus multiple agents, I guess. 

RD I mean, they're all kind of automating the prompt and response handling, whether you have it as a single prompt and response or you have a long running process that's sort of orchestrating. 

BP And I guess one of the things I see a lot with agentic AI is they get caught in the death loop. They've reached some level of complexity where they cannot debug it, or they've reached the end of their context window and they're just losing the thread. Does that ever happen with Diffblue? You said, “Look, we'll produce something, and if it's not good enough, we'll make another one.” Has there ever been a case where it keeps making it and it's never good enough and now it's just going and going and going? 

AM It doesn't end up in a death loop like that because we have an iteration count. If it reaches that and doesn't have a good test to show for it, it will just raise its hands and say, “Do you know what? I'm not good enough for this one. Help me out.” 

BP Very useful.

AM And that happens. By no means– and I never claim this when speaking with prospects either– will Diffblue write 100 percent coverage on any application, definitely not on the first run. The best performance I've seen it achieve was on an Alibaba open source project– I think it's called the Nacos API– and we got 94 percent out of the box. You run Diffblue the first time, boom, 94 percent coverage. Excellent. Most of the time you get about 40, 50, 60%, so then what do you do with that remaining 40%? And that's where that customizability comes into the picture. It will leave behind breadcrumbs: “This is what I tried.” So when it fails to produce a test, what it doesn't do is chuck all of that knowledge away. It will actually distill that knowledge into a partial test with lots of comments and a stack trace and say, “Look, I tried writing this test and here's where it failed. There was a null pointer exception when I called GetValue on this object, and that's probably because I don't know how to instantiate that object properly, and so the GetValue method is returning null. Please help me.” And then we have built a whole host of customizations where the developer can then go and say, “Oh, you dummy. You don't know what you're doing. Here's a factory method. When you're dealing with that particular type, use this factory method to create a test instance of it.” Then once you've done that and you run Diffblue again, boom, it picks that up, writes a better test. 
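
A loose, hypothetical sketch of the breadcrumb-and-factory-method idea described here. The comment format and the way the hint is supplied are invented for illustration– Diffblue's real output and configuration mechanism may look quite different– but it shows the shape of the workflow: a partial test explains what failed, the developer supplies a factory method, and the next run can arrange the instance.

    import static org.junit.jupiter.api.Assertions.assertNotNull;

    import org.junit.jupiter.api.Test;

    class ReportServiceTest {

        /** Hypothetical type that is awkward to construct directly in a test. */
        static class ReportContext {
            private ReportContext() { }        // private constructor blocks a plain 'new'
            static ReportContext forTests() {  // developer-supplied factory method
                return new ReportContext();
            }
            String getValue() {
                return "ok";
            }
        }

        // Illustrative "partial test" breadcrumb, roughly the shape described above:
        //   Tried to call getValue() on a ReportContext but hit a NullPointerException
        //   because the instance could not be created.
        //   Hint: point the tool at a factory method for ReportContext.

        // After the developer supplies ReportContext.forTests() as the hint, a full
        // test can be arranged and asserted on the next run.
        @Test
        void getValueReturnsNonNullWhenContextComesFromFactory() {
            ReportContext context = ReportContext.forTests();

            assertNotNull(context.getValue());
        }
    }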

RD I think you've done one of the most useful things with an AI. You've got it to say, “I don't know.”

AM Exactly. And I think this is one of the challenges with all the hype of AI, because I have to follow it as well because I'm in the industry, I'm a developer too. I do some development in my free time. I just want people to be honest because you're not going to build trust without it. 

RD Right.

BP All right, so one last topic to cover here which I am very interested in, which is, given the tools we now have, is there going to be a wave of large-scale refactoring, and what might that look like? So you noted an Amazon post where they talked about using their Gen AI assistant to go in and do a ton of refactoring, saving the equivalent of 4,500 developer years of work. Okay, take that with a grain of salt, but I feel like I've said this on the program, Ryan, like half a dozen times: one of the dreams, along with writing great test coverage, is that we work eight hours a day, five days a week, and then we tell an agentic AI on a Friday or before the holiday, “Go in there, clean things up, find some memory leaks, debug”– refactoring is a little bit different, but basically, “While I'm asleep, go in there and clean and improve, and when I come back, let me know the changes you think we should make,” and then I can be the human in the loop to validate them, and boom, what an amazing load of work taken off the developer’s shoulders. But Animesh, tell me where you're seeing this or what your vision of this is and how you think it can be applied in the real world today.

AM I think I'm yet to see a large-scale rollout of an LLM-based solution to the refactoring problem, and that's not to say that they're not capable or that Amazon's lying. They might not be– actually, they might've achieved it. They probably trained it on their own code, so it's probably better on their own code than it will be on mine. But it's that unpredictability. I've done refactoring, so I know what needs to happen: to refactor your code, you need a good test suite, and then you start hacking away at it. When a developer is doing it and there's a test suite in place, they can do trial and error and eventually get it to the place where it needs to be. With large-scale refactoring using LLMs, are we just running one agent, going away, making a cup of tea, and coming back to find my 10 million lines of code application refactored? Because if that's the case, I don't know where to start analyzing whether it's right or wrong. So it's not that they can't do it, it's how do you measure that they've done it correctly? When the tool is unpredictable, that evaluation needs to happen every single time it is used, which is why the user experience for LLMs is always these copilot-style UIs, because no one's going to fully trust them. And that's why they're best in scenarios where 80 percent of the way is still good enough, because it's saving me a ton of time and I'll do the 20%. Now with refactoring, I have seen a very good refactoring exercise undertaken using LLMs by a team that does database modernization. What they do is write the test cases themselves– they're doing TDD, AI-assisted TDD. The test cases are completely developer-written, no AI involved. Then they badger Copilot to write code that satisfies that test suite, and they keep going and going and going until it works. They have to do it quite a lot. It's not pretty, but it's still saving them a significant amount of time– 30 to 40 percent. It does feel very unsatisfactory– that's what the developers told me; they're basically just running a command and seeing if it works until it does– but that's one way to use LLMs and tools which are inherently unpredictable and still bring some predictability into the system: give them something they should code against. The other way, which is what some of our customers have done, is to use tools which are deterministic. 

BP I actually saw it. I went back and looked at the Amazon post here and, like you said, that was Diffblue’s choice. In the blog about Amazon Q, they wrote that it leverages OpenRewrite, and 80 to 90 percent of what it does is deterministic OpenRewrite recipes. So that's how it achieves that. 

AM Yeah, I was going to mention OpenRewrite as well. So the other way to do it is deterministically. And the reason that's useful is that when something is deterministic, it will do the same thing given the same set of inputs, so the evaluation only needs to happen once. You can pick, say, a cohort– a bouquet of classes that you want to refactor– and use a deterministic tool on them, and if the output is satisfactory, you can be confident that that is what you're going to get for the rest of them. It makes that ‘How do I know this is good?’ question easier to answer. Thanks for calling out OpenRewrite. We are actually working with Moderne, the company behind OpenRewrite, towards the refactoring capability I mentioned when Ryan asked the question about ASTs. The holy grail we're going for is this: currently, when companies are doing large-scale modernization, they come to Diffblue to write their base test suite deterministically. To give you an example, there was a defense contractor in the USA and they had to move to AWS Government Cloud, and the timeline was quite tight. It was, I think, less than a year, and they had 4 or 5 million lines of code, so there was absolutely no way they were going to manage it manually. We wrote test cases for their 4 or 5 million lines of code in a day– that's the power of AI– and they were then able to accelerate that modernization effort and bring the timeline forward by about five months. They finished well ahead of time. So that's the power of determinism. Once you know what you're going to get, you can start rolling it out at scale knowing full well it's not going to surprise you. Your tests are going to work the way they always do. With OpenRewrite, we're taking it a step further. Over the past few years we've gotten much better at handling different kinds of frameworks. A lot of our customers, for example, when they say, “We are modernizing,” they're not changing the tech stack. They might be going from, say, Java 8 to Java 17 and a newer Spring Boot, something like that– they're not going from Java to Python. In those cases, a lot of times you know what needs to happen. There are certain strategies that you can codify and then repeatedly apply to certain kinds of codebases. And OpenRewrite just takes that up to a whole new level. So what we are trying to do with our partnership is work together: Diffblue analyzes the codebase, writes the unit test cases, identifies the gaps, and passes all of that information to OpenRewrite– which does create an AST, by the way– and then OpenRewrite produces deterministic recipes which can be applied to the project, automating all of that away. So again, the key here is automation– there's not a whole lot of deep learning happening in the whole process, but you get the outcome you're after, which is modernized code, in 10 percent of the time it would have taken to do it manually. 

RD Is there anything we didn't cover that you want to talk about? 

AM We did this comparison with Copilot because a lot of our customers ask us about Copilot, and quite a lot of them already have Copilot. So internally in Diffblue, we treat Copilot as weather. We don't see it as a competitor because we think it complements us quite well. So we did this Copilot study where we took the same three open source codebases and ran Copilot Chat using the latest model to write unit tests for them, and we asked Diffblue to do the same. The results are all online and available– we have published the study as well– but the key data point was that Diffblue was 26 times more productive than Copilot at writing unit tests. The key reason for that was, even though sometimes Copilot produced just as much coverage in the application as Diffblue did, there was a lot more human intervention required, a lot of back and forth. And second, in places where there was not much coverage, Copilot gave me no avenue to understand what I could do to improve it, other than some prompt engineering, which is quite opaque. So we did that whole exercise across the three applications, and the result was that you're 26 times more productive using Diffblue to write tests than Copilot. I will share the link so you can put it in the show notes for those who are curious or want to replicate the results themselves. 

BP I think we’ve got to check your work and make sure you can back up those claims, but we'll definitely put the link in the show notes.

AM Which is actually what most of these models should do. Thank you for pointing that out. That's another thing that really annoys me about all these AI models. They’ve created these benchmarks and the benchmarks are complete– 

BP They're nonsense. 

AM I'd say nonsense. And because of that, there's no way to verify. They have these benchmarks that they produce numbers for, but you can train an AI model on that benchmark and it does a really good job. So with any automation, whether that's an LLM or whether that's Diffblue, the proof of the pudding is in identifying a clear use case. If you don't have a clear need, don't use AI– not everybody needs to use it. Once you've identified a clear use case, try different things, because the road to glory takes many directions. You don't need to always use an LLM for something that is probably not a good fit for an LLM.

BP That’s right.

AM I'll tell you one thing that's actually a very good fit for LLMs, but is something we struggle quite a lot with– naming a test case. LLMs are excellent at naming test cases. This is something we struggle with at Diffblue. We've made a lot of improvements since, so now they're coming out much better, but our very basic solution to naming– you would laugh at it now– was to just number our test cases. We basically gave up initially on figuring out how to name a test case. So if we were writing a test for a method called ‘show page,’ we would call our tests ‘test show page,’ ‘test show page one,’ ‘test show page two,’ and then it's up to the developer to figure out what the test is actually doing, whereas Copilot and other LLM tools will always name the test very descriptively so you can just read the name and figure out what it does. 
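
A small, invented Java example of the two naming styles being contrasted– numbered test names versus descriptive ones. The showPage method is hypothetical, and the descriptive name is the kind of thing an LLM tends to produce.

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import org.junit.jupiter.api.Test;

    class ShowPageNamingTest {

        /** Hypothetical method under test. */
        static int showPage(String id) {
            return (id == null || id.isEmpty()) ? 404 : 200;
        }

        // Numbered style: the reader has to open the body to learn what is covered.
        @Test
        void testShowPage() {
            assertEquals(200, showPage("about"));
        }

        @Test
        void testShowPage2() {
            assertEquals(404, showPage(""));
        }

        // Descriptive style, the kind of name an LLM tends to produce: the name
        // alone tells you the scenario and the expected outcome.
        @Test
        void showPageReturnsNotFoundWhenIdIsEmpty() {
            assertEquals(404, showPage(""));
        }
    }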

RD Well, naming things is one of the hard problems in computer science. 

AM In life, actually, because I'm going to have a baby boy in May and my wife and I can't seem to figure out what to name it. So there you go. 

RD Best of luck. 

AM Thank you.

[music plays]

BP All right, everybody. It is that time of the show. Let's shout out someone who came on Stack Overflow, shared a little knowledge or curiosity, and in doing so, helped out our whole community. Awarded yesterday to Keet Sugathadasa, a Populist Badge. That means Keet's answer was so good that it got way more upvotes than the accepted answer. “Gitlab CI CD variable are not getting injected while running gitlab pipeline.” If this is a problem you've run into, you can check this out. It's helped over 57,000 people. It's part of our CI/CD collective, and there is an amazing answer from Keet, so congrats on your badge. As always, I am Ben Popper. I am one of the hosts of the Stack Overflow Podcast. You can find me on X @BenPopper, and if you liked what you heard, you know what to do. Leave us a rating and a review or subscribe to the podcast to tune in in the future. 

RD I'm Ryan Donovan. I edit the blog, host the podcast here at Stack Overflow. If you want to reach out to us with comments, suggestions, topics to cover, you can email us at podcast@stackoverflow.com. And if you want to reach out to me directly, you can find me on LinkedIn. 

AM Thanks, Ryan and Ben. My name is Animesh Mishra. I am a Sales Engineer at Diffblue. You can find more about Diffblue at diffblue.com, or if you're still on Twitter @DiffblueHQ. And I can be found on LinkedIn. My username is SirAnimesh. 

BP Nice. 

AM I'm yet to be knighted.

RD All right, everyone. We'll talk to you next time.

[outro music plays]