The Stack Overflow Podcast

Would you board a plane safety-tested by GenAI?

Episode Summary

Ben and Ryan are joined by Robin Gupta for a conversation about benchmarking and testing AI systems. They talk through the lack of trust and confidence in AI, the inherent challenges of nondeterministic systems, the role of human verification, and whether we can (or should) expect an AI to be reliable.

Episode Notes

Robin is the author of a practical handbook for Selenium test automation.

Connect with Robin on LinkedIn, Twitter, or via his website

Shoutout to user2651084, who earned a Great Question badge by asking How do I reset the Jupyter/IPython input prompt numbering?

Episode Transcription

[intro music plays]

Ryan Donovan Level up your generative AI skills with Neo4j GraphAcademy online courses. Learn to ground LLMs with a knowledge graph for accuracy and build a reliable chatbot. Start today at neo4j.com/LLMs. 

Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your host, Ben Popper, joined by my colleague, Ryan Donovan. 

RD Hey.

BP Ryan, you and I have chatted a lot about Gen AI as it has emerged as kind of the hot new paradigm and the central focus for a lot of big tech companies– the thing they want to foreground that they're building, creating, and platforming, et cetera. One thing we've also learned, chatting with people internally at Stack Overflow who are building this and with other folks doing the same, is that it can be very difficult to benchmark this stuff, to test this stuff, to understand best practices around something that's changing so fast and is so new. So we had a listener write in, and we're really glad he did. Our guest today is Robin Gupta, who's going to talk to us about test automation and AI. Robin, welcome to the show. 

Robin Gupta Thank you so much for having me on the show. It's a pleasure. 

BP So Robin, tell our listeners a little bit about how you got into the world of software and technology and how you ended up focusing on this particular topic. 

RG So actually, I started programming with BASIC– Beginners All-purpose Symbolic Instruction Code. That was my first programming language, all through school. College days were about hacking and writing viruses. And then I didn't choose testing– testing chose me. It's like the Sorting Hat. At Accenture, they assign you roles and domains, and I got testing. That's how I got into testing, and from there onwards it has been testing, automation, full stack development, and so on and so forth.

RD And you got to testing and AI. So we've been talking to folks lately and they say a lot of people are spinning up Gen AI projects, but not much is getting into production. Why do you think that is? 

RG I think there are a few things. People are not very confident in these systems. Let me share my thought process. There's a question I ask a lot of people: if you board a plane, let's say you're flying from the US to India, and they announce that the autopilot on this plane has been developed and tested by AI, would you get off the plane or would you still take that intercontinental flight? That confidence is missing. 

BP I think that's a good question, and it gets at something, which is: how deterministic is AI versus other software systems? Often you think of AI as making decisions that interact with some kind of input in the real world, and so people tend not to trust self-driving cars, at this point, as much as they trust people, although I would argue people probably make more mistakes than self-driving cars. So let's get to it. Let's say there's an issue of trust. Do you think that's backed up by the data? If you were to look at these systems and try to automate testing on them, do you think AI these days tends to make more mistakes in– well, the first field we should talk about is code generation.

RG 100 percent. And interestingly, with code generation, it's one to one. You ask for a program for the Fibonacci series and it will write that program for you, run it, and get the outputs. It gets really interesting when it comes to testing. In testing, there are infinite inputs and infinite expected outputs. So even with the Fibonacci series, if I were to test it out, I would want to understand the boundary values. Is it infinite? Is it 9,999, plus 1, minus 1, and so on and so forth? That is where the real trust breaks, and also where the deterministic nature of the system breaks. For example, if you're testing a login page for Stack Overflow: user ID, password, submit, you're on the home page. But for ChatGPT, it's not like that. Also, if you look at the data, interestingly, the performance has been increasing– it was maybe 20% good with GPT-2, 30% with 3.5, and so on– but it's still a long way from the 90% or so accuracy required in fields like banking or healthcare, or maybe avionics, as we discussed. 
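
To make the boundary-value idea concrete, here is a minimal sketch in Python with pytest. The fibonacci implementation, its error behavior, and the chosen boundaries are illustrative assumptions, not code discussed on the show.

```python
# A minimal sketch of boundary-value testing for a Fibonacci
# implementation, run with pytest. Limits and error behavior here
# are illustrative assumptions.
import pytest

def fibonacci(n: int) -> int:
    """Return the nth Fibonacci number (0-indexed)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# The happy path: the one-to-one cases code generation handles well.
@pytest.mark.parametrize("n, expected", [(0, 0), (1, 1), (2, 1), (10, 55)])
def test_known_values(n, expected):
    assert fibonacci(n) == expected

# The boundaries: the "9,999, plus 1, minus 1" cases a tester probes.
def test_negative_input_is_rejected():
    with pytest.raises(ValueError):
        fibonacci(-1)

def test_large_input_terminates():
    # No fixed upper bound in this sketch; just confirm it returns an int.
    assert isinstance(fibonacci(9_999), int)
```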

RD When a lot of people think about generative AI, they think about it being sort of creative and coming up with solutions, and I don't think people want testing to be creative. They want it to be reliable and repeatable, right? 

RG Exactly. 

RD Is there a way to limit the creativity of Gen AI to make it that sort of reliable, repeatable process?

RG Let me expand that question. We actually created Gen AI to be generative and now we are trying to limit that generation. So it is like asking somebody who's an author of a book to come up with mathematical solutions that have to be very precise and accurate. 

BP This is one of my favorite things about it. We came up with this amazing new sort of construct in AI which is like this dream machine that can dream up any image or any poem, and then we said basically, “Can you be a search engine and a code completion agent?” which is not what it was created for at all. 

RG This reminds me of my school days. In India, we were put in schools and told, “Oh, you can become whatever you want. What would you like to become?” Exactly five years later: “You have to be an engineer. You must study mathematics and you must do these three things.” Anyway, what was the point you were making earlier?

RD Maybe the question is, should we be asking generative AI to be more reliable? Is that the wrong thing to do with it? 

RG Definitely not. Generative AI doesn't just create– it creates and hallucinates. Even ChatGPT's answers work this way: if you ask it a question and it gives back an answer, it has actually hallucinated that answer. It is the Top P or the temperature of that creativity that we generally control. So we can definitely make it more precise, and we should for some use cases, but it also depends on the use case at hand. For example, in our case, let's say it's testing. If we are testing an e-commerce application and we are trying to place an order, it should actually come up with those creative scenarios. It's actually a very good tool for that. It can create all those negative scenarios: can I place one order? Can I place -1 orders? Can I place 999 orders? It's like a “tester walks into a bar and the bar explodes” kind of thing. So it should definitely do those things, and we should make it do that. 
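
For readers who want to see what controlling that creativity looks like in practice, here is a minimal sketch using the OpenAI Python client. The model name, parameter values, and prompt are illustrative assumptions; the point is that temperature and top_p are the dials Robin mentions.

```python
# Sketch: dialing creativity up or down with temperature and top_p,
# using the OpenAI Python client. Model name and values are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str, creative: bool) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice for this sketch
        messages=[{"role": "user", "content": prompt}],
        # Low temperature/top_p for repeatable answers; high for
        # brainstorming negative test scenarios.
        temperature=1.0 if creative else 0.0,
        top_p=1.0 if creative else 0.1,
    )
    return response.choices[0].message.content

# Creative mode for scenario generation, precise mode for checks.
print(ask("List negative test cases for placing an e-commerce order, "
          "e.g. 0 orders, -1 orders, 999 orders.", creative=True))
```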

BP It's interesting you use the word hallucinate there, and I think that has come to be a term of art in the industry, meaning that you've presented me with an incorrect fact rather than a correct one when I've asked you to recall or search or explain something. But to your point, that's what it's doing all the time. It's trying to predict what the next token or the next pixel would be. In a sense, it's just doing what it was designed to do. There are solutions to this now which in a way almost mirror a kind of test automation, which is to say, let's do chain of thought. Let's have one agent write, another agent critique, and a third agent check. And some of those other systems– Ryan has written and talked about this a bunch– are older, more deterministic, more symbolic AI, so they're not creating or generating in the same way, right? 

RG 100%. And to that point, how do we best test AI at scale, or how do we automate the tests for an application which is using AI– let's say RAG, Retrieval Augmented Generation? Why not use another AI which does all this? But then, when these two AIs are talking, who will check the result? That should be a human. So there should always be a human in the loop who keeps an eye on the results and ensures that things don't go out of hand. It's like two interns or freshers doing the testing, but somebody senior verifying the results.
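
Here is a minimal sketch of that second-AI-plus-human-in-the-loop pattern, assuming an LLM-as-judge approach with the OpenAI Python client. The judge prompt, model name, and threshold are illustrative assumptions, not Provar's implementation.

```python
# Sketch of "one AI checks the other, a human checks both": an LLM
# judge scores whether a RAG answer is grounded in its retrieved
# context, and low scores get escalated to a human reviewer.
# Prompt, model, and threshold are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Context:\n{context}\n\nAnswer:\n{answer}\n\n"
    "Is every claim in the answer supported by the context? "
    "Reply with only an integer from 0 (unsupported) to 10 (fully supported)."
)

def groundedness_score(context: str, answer: str) -> int:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(context=context, answer=answer)}],
        temperature=0.0,  # keep the judge as deterministic as possible
    )
    return int(reply.choices[0].message.content.strip())

def review(context: str, answer: str, threshold: int = 8) -> str:
    # The "senior" human verifies anything the intern AIs flag.
    score = groundedness_score(context, answer)
    return "auto-pass" if score >= threshold else "escalate to human reviewer"
```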

RD In some ways, humans don't scale as well as technology. It's very expensive. 

RG Controversial points are coming up. 

RD But we’ve seen a lot of the sort of LLMs double-checking LLMs. Do you think that's an effective process to test generative AI? 

RG Definitely yes and no, as most of my answers are. It is definitely effective when you do, let's say, the RAG example I mentioned. The second AI can check whether the answer provided actually comes from the grounding or training dataset or not. But for first-pass kind of testing– “let's explore this whole Salesforce application which has hundreds of thousands of pages”– that becomes just too complicated and too expensive at the same time. So I totally agree, Ryan– humans are good in some places, but not very good at scaling all of this clicking-and-entering kind of work, which AI can help us with. At the same time, when we have a checklist for the usability of an application for certain use cases, only humans can look at that and comment on it.

BP One thing I do remember– I'd have to double-check the context, but I'm pretty sure I said this on a previous podcast– is a chat I had with a friend of mine who works at dev.to. He was saying that one of the things he likes to offload to generative AI is writing the unit tests for something he's just done. He finds that it's pretty easy for him to scan its output, and then when it runs he can get a quick sense of whether or not the code is doing what it's intended to do, and that has saved him a lot of time writing unit tests. Do you think that's a useful application of this, and maybe, in an ironic way, one of the things it tends to be pretty good at?

RG I would totally agree that that's a very good use case for unit testing. Also, interestingly, here's what happened at my organization: we had Copilot in VS Code, and a programmer was trying to write a very simple API that had to call some other API. Interestingly, Copilot hallucinated that whole other API, and the team member spent a week or two debugging an API that didn't exist. So unit tests, definitely, but then somebody has to read them. It's not like, “Oh, just go write unit tests and look at the results when they fail.”
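
As a rough illustration of that workflow– generate the unit tests, then read them before trusting them– here is a minimal Python sketch. The model name and the slugify example function are assumptions for illustration, not the setup described on the show.

```python
# Sketch of offloading unit-test writing to an LLM: generate the
# tests, print them, and read them before running anything. The model
# name and the slugify() example are assumptions for illustration.
import inspect
from openai import OpenAI

client = OpenAI()

def slugify(title: str) -> str:
    """Example function under test: 'Hello World' -> 'hello-world'."""
    return "-".join(title.lower().split())

def draft_unit_tests(func) -> str:
    source = inspect.getsource(func)
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write pytest unit tests for this function:\n\n{source}"}],
        temperature=0.0,
    )
    return reply.choices[0].message.content

# Human step: scan the draft by hand, then save and run it with pytest.
print(draft_unit_tests(slugify))
```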

RD That hallucinating of libraries and APIs– I think that's going to become an increasing problem. I read about a security researcher who spotted a package that was being hallucinated regularly, created an NPM library under that name, and people started downloading it because it was suddenly showing up in programs.

RG Exactly. See, all programmers are, by design, lazy to some extent. So if you tell me, “Robin, this library has ‘library.a.package.method’ and it will do something,” I'm like, “Wow, I don't have to write it now.” 

RD I think it's a Bill Gates quote: if you have a hard problem, give it to somebody who's lazy, because they'll find the most efficient solution. 

RG Exactly. 

BP So Robin, tell us a little bit about, if you can, what it is you're working on these days with your company. And I think it would be interesting to know where are the areas where you feel like you're able to take advantage of this, where are the areas where you still don't trust it, and where are the areas where maybe you see improvements being made over the next year or so as we sort of get a better grip of what you can do in this new arena? 

RG Okay, we'll start small. At my company, we are actually using AI to build out assistants. We found that, of all the various use cases, RAG applications are the simplest and most reliable way to start with Gen AI. So we have an assistant at assistant.provar.com that helps users with documentation and technical queries around the product. But now we are also dipping our toes into actually using AI for UI test automation– user interface test automation. At Provar, we really excel at testing Salesforce and other B2B SaaS enterprise products, and we are seeing if we can use AI to really test some of those systems– not just whitewash the product with an AI sticker. So that's about us. Also, what I see increasingly happening is this: when I started using Selenium and some of these UI test automation tools, people used to write all of those things by hand. But now, if you ask ChatGPT, “Hey, give me the Selenium script to log into Amazon,” it will just blah, blah, blah, give it out. So we are also exploring those domains. Many players in the industry are looking at whether it can accelerate UI test automation– creation, execution, reporting, and analytics. So rather than somebody writing it by hand, can ChatGPT, or some other large language model incorporated into the system, write it for me and reduce the go-to-market time? 
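
For reference, here is the kind of login script Robin describes an LLM generating on request, sketched with Selenium's Python bindings. The URL and locators are hypothetical placeholders; a real site's login flow (and its anti-bot measures) will differ.

```python
# The kind of login script Robin says an LLM will now generate on
# request, sketched with Selenium's Python bindings. The URL and
# locators are hypothetical; a real site's login flow will differ.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/login")  # placeholder URL
    wait = WebDriverWait(driver, 10)
    wait.until(EC.presence_of_element_located((By.ID, "email"))).send_keys("user@example.com")
    driver.find_element(By.ID, "password").send_keys("not-a-real-password")
    driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
    # Assert the login landed on the home page.
    wait.until(EC.url_contains("/home"))
finally:
    driver.quit()
```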

RD So you're actually bringing things into production– Gen AI stuff into production? 

RG So, for example, the assistant that I mentioned– assistant.provar.com– is in production. It can be accessed by end users and they can ask questions. Now, bringing it into production is one aspect, but interestingly, it's also based on large language models, so it can give wrong answers, and we have a disclaimer at the bottom. So we very actively check what kinds of questions are coming through, how many thumbs up or thumbs down we're getting, and how we can fine-tune the dataset to give more accurate answers. It's a nonlethal use case for large language models, as opposed to that airplane case we discussed. 

RD So in these cases where you can make mistakes, it's a little easier. Because I think when we were chatting before, you had a great rhetorical question: Are they brave enough to put it into production?

RG I have worked in healthcare, ed tech, and now developer experience. Some domains– let's say e-commerce and the like– are fine. The worst that could happen is somebody places five extra orders for free. But if you're doing it in the healthcare domain, whether it's a vendor or a service provider, or it's ambulatory care, it is very dangerous. Am I brave enough to get into an ambulance that is full self-driving? Not today. 

BP I think that's definitely something that we've been hearing as Stack Overflow goes out into the market and talks to different customers who are utilizing us and what we offer through OverflowAI, that there are industries which are moving much slower and more cautiously, typically those that are heavily regulated or have some kind of potential real world consequence. And I think that would be true for any technology, not just Gen AI, although this one tends to have a unique personality in terms of the kind of mistakes that it can make. You mentioned agents, and Ryan and I are going to have a guest soon to talk about sort of agentic workflow. Another thing that I have been hearing is that the next phase that people are excited about is the ability to have an agent which can sort of go off and do work for you over a longer period of time. So another thing that's interesting about Gen AI is that the way we use it is we ask it a question and we want an answer back quickly. And in fact, with code generation and things like that, they tried to get the latency down to almost nothing. But their performance improves on a pretty predictable path if you say, “Take more time to think about this,” or, “Come up with your 10 best ideas and then evaluate those and show me your top three,” or whatever. So what do you think about the idea of using this sort of agent workflow where you might say, “Listen, I'm going home for the night. I just finished my eight hour day. Take these 12 hours and poke and prod at this part of the code base. Come back to me in the morning with any improvements you have, any memory leaks you found, any bug fixes. I'll sort of walk through those and see which ones I can approve.” 

RG So in my head– and I've told this to a lot of people I meet and talk to– I've made up this hypothesis that software development and testing as a craft has only three to five years left in its life, because after that, agents will just take over and we'll need to go open a coffee shop or something.

BP Oh, no. Make your money now, Robin. Okay. 

RG Jokes aside, as a craft it will keep on growing, and somebody has to review that agentic work. Even when we offshore our services or have contractors work on our project, we assign a workable packet to them and they do it, but somebody still has to review the PR and merge it to production. And once it is merged into production, somebody still has to run the checks– hopefully those are not agents– and click through and see that it works as designed, and then go through the motions of GTM. So I'm definitely excited about agents. 

RD It seems like it's a trend. You mentioned earlier that developers are sort of inherently lazy, thinking, “Oh, there's a package for this. There's a library for this. I'll just use that.” This seems like an extension of that: there's not a library for it, but I can just go to Copilot or ChatGPT and get a library for it.

RG 100%. When you think of the library, that's just the test angle, but with an agent that's like a mini-me, maybe I can teach it Java and it can do these things while I just go snooze for four hours. Can it attend a Zoom meeting instead of me?

BP Robin, are there any particular pieces of this conversation or topics we didn't touch on that you'd like to discuss?

RG Two pieces– well, one very specifically. I definitely see an exponential rise in the usage of AI for testing and test automation, and then in testing AI apps themselves. But at the same time, I think the demand for testing as a craft– for human testers– will just keep on growing from this point onwards because of nondeterministic systems. With deterministic systems, we can code the tests, but if we want to test something like ChatGPT, interestingly, we will need something like RLHF– Reinforcement Learning from Human Feedback. 

RD So in the future we're all going to become testers? 

RG Coffee shop or testers. Two options. 

BP I sent an Instagram ad I saw the other day to some friends– just kind of darkly dystopian. It's all of these people saying, “I had no idea I could work from home and make $15, $20 an hour, totally flexible on my schedule,” and they're all just doing RLHF, just interfacing with the bot: “Is this a good answer?” Just being a knowledge battery that the AI is vacuuming up. It disturbs me, for sure. 

RD Do you agree with that path to dystopia with AI, or do you think there's a brighter future for us? 

RG I'm definitely very positive about it. I believe there's a brighter future for us. We moved from horses to cars; we'll move from IDEs to Copilots, but the human will still be at the center, and human intelligence will have its special value in the universe. So it's like Star Wars– we'll have R2-D2s and all, but they will assist us and fight the dark forces. There will always be that going on.

RD And hopefully we don't get to the Dune scenario where we have to fight off the AI. 

RG Oh no, I would much prefer the lightsabers to the Dune scenario.

[music plays]

BP All right, everybody. We always like to shout out a user who came on and shared a little knowledge or a little curiosity and helped spread that around Stack Overflow. Shout out to user2651084, who asked, “How do I reset the Jupyter/IPython input prompt numbering?” A great question that's been viewed 78,000 times. So when you ask a question on Stack Overflow, you help a lot of people with the same curiosity– and there is an answer there for you. As always, I'm Ben Popper, Director of Content here at Stack Overflow. Find me on X @BenPopper, or hit us up with questions or suggestions for the show at podcast@stackoverflow.com. That's how we got in touch with Robin, who's here telling us about test automation, so we'd love to hear from you in the audience, and we'd love to have you on to chat if you work in software development. I think that's pretty much it. Oh, and if you enjoy the show, you know what to do– leave us a rating and a review. It really helps. 

RD I remain Ryan Donovan. I edit the blog here at Stack Overflow. You can find it at stackoverflow.blog. And if you want to reach out to me with article ideas, podcast ideas, hot takes, reach out to me @RThorDonovan. 

RG So my name is Robin. People can find me at robin-gupta.com. On Twitter, I'm @smilinrobin, and on LinkedIn I'm polymorphicrobin. I work at Provar, and I've also published a book on Selenium test automation, which you can check out on Amazon. 

BP All right, everybody. Make sure to look for those links in the show notes, and we will talk to you soon.

[outro music plays]