On today's episode we chat with Prof. Gregory Kapfhammer, Associate Professor in the Department of Computer Science at Allegheny College, and a specialist in the subject of flaky tests. He breaks down his definition of what makes a test flaky, how you can solve for it, and automated solutions that take some of the guesswork and busywork out of avoiding them altogether.
There is a ton of great research to be found on Prof. Kapfhammer's website, including:
We've written a bit about how Stack Overflow is upping its unit testing game and how you can evaluate multiple assertions in a single test.
Thanks to our lifeboat badge winner of the week, Survivor, for answering the question: Is it possible to find out if a value exists twice in an arraylist?
[intro music plays]
Ben Popper This episode is brought to you by Backtrace. Complex software design can sometimes cause catastrophic errors. Backtrace.io provides tools that make sure your development process moves forward even in the face of those challenges. Visit them at backtrace.io and sign up today.
BP Hello, everybody. Welcome back to the Stack Overflow Podcast. The first episode we are recording in the new year, Thursday, January 5th. I am Ben Popper. I'm the Director of Content here at Stack Overflow, joined as I often am by my wonderful colleagues and collaborators, Ryan Donovan and Cassidy Williams. Hey, y’all.
Cassidy Williams Hello!
Ryan Donovan Hey, Ben. Happy New Year.
BP Thank you. So today's discussion is going to be good. It's about flaky tests, the dangers they pose, how you can fix them, how you can identify them. This is an issue we have observed within Stack Overflow as we expand our focus on unit testing. We've written a few blogs about this in the past which we'll include in the show notes. And I think it's something industry-wide that people are thinking hard about when it comes to improving overall developer productivity and code quality. So our guest today is Professor Gregory Kapfhammer, who is an Associate Professor in the Department of Computer Science at Allegheny College, and has written quite a bit on the issue of testing. Greg, welcome to the program.
Gregory Kapfhammer Hi, everybody! It's nice to be here on the Stack Overflow Podcast today. Thanks for inviting me.
BP So for folks who are listening, just give them a quick background. How was it that you found yourself in the world of software development and computer science, and what led you down the route of academia?
GK So since I've been young, I've always really been interested in computers. For me, they were an opportunity for exploration. I always really enjoyed learning, and computing was always a great way for me to meet cool and interesting people as well. After I studied computer science as an undergraduate, I actually ended up working full-time in industry, and after working in industry as a software engineer focused on software testing, I decided to go back to become a teacher. I then taught full-time as an instructor while I also went to graduate school in order to earn my graduate degrees from the University of Pittsburgh. Since my very first year as an undergraduate, I've always been super interested in software testing and so I continue to teach courses in software engineering and software testing and to conduct research and to develop software testing tools.
CW That is awesome. And so we want to talk specifically about flaky tests. Could you talk about what flaky test cases are?
GK Sure. When we say the term ‘flaky test’, we normally mean a test case that passes or fails in a fashion that's unpredictable and concerning. So oftentimes a test case will pass sometimes, and then it will fail at other times. Those test cases tend to add noise to our development environment. They tend to make it more difficult for us to run our test suites in continuous integration, and they often limit the velocity of software developers as they're trying to add features to their systems.
RD We ran a blog post a while back about kind of creating a deterministic core in your code to prevent this sort of flakiness. What are the main culprits for flakiness?
GK Yeah, so that's a really good question. I would say, broadly speaking, there are two classes of flaky test cases. There are those test cases that are what we would call order-dependent flaky tests, and then there are those that are called non-order-dependent flaky tests. So if it's okay, let me quickly focus initially on order-dependent flaky tests. In that case, Ryan, those are test cases that tend to execute and then pass or fail in a flaky fashion, normally because they're sharing some state with other test cases in the test suite. So those test cases tend to be flaky because they don't have good setup and teardown features that are associated with them. And then their state during testing is contaminated by the state of other test cases in the test suite, and ultimately that leads to flakiness. Would it be okay if I gave an example of the non-order-dependent sources of flakiness as well?
GK Okay. So when we talk about non-order dependent flakiness, there's a whole bunch of things that we found that will lead to flakiness. For example, you may have test cases or parts of the program that incorrectly use things with the date or time. There may be incorrect settings that are related to waiting for asynchronous code to finish. There also may be randomness that's a part of either the test suite or the program. And in all of those cases, flakiness can also manifest itself in the test suite. Of course there's many other causes of flaky test cases as well, but hopefully those will give us enough to continue the conversation.
CW Yeah. And it's something that I think, as developers, we should care more about them. Because if you have flaky tests, it'll just kind of end up slowing you down over time as you're trying to figure out, “Wait, is this test actually broken? Is my code actually broken? Or is it fine?” And so it's good to know these high-level categories and topics and sources of them, just because you never know when you might need to completely rewrite some code or test suites if you have flaky tests.
GK And I think you're hitting on another point. When I have a flaky test case in my test suite, I ask myself exactly the same question that you did. Is this a problem in my test suite? Is this a problem in my program? And so I start to lose confidence both in the correctness of my program and also in the way in which I designed my test suite. And I think often it's a trade off in terms of the efficiency of our test cases, the efficiency of our development processes, and then the ultimate reliability and confidence in the testing process itself.
RD So do you need to start running test suites on your test cases?
BP Who's testing the testers?
GK So actually, I think that the answer to that is, yes, you do have to test your test cases, and there's a bunch of things that you can do to assess the quality of your test cases. You can also use techniques which will scan your test suite in order to identify, whether through static analysis or dynamic analysis, you have test cases that are flaky now or likely to become flaky in the future. So in short, I think, yeah, we do actually have to test our tests in order to ensure that they're helping us to achieve a confidence in the correctness of our systems.
CW Yeah, the slowing down part of it is so real and the necessity of tests that work every time. I actually worked on a team once where we did have a test suite that was flaky, where sometimes it just didn't work. I don't know if it was just randomness or random memory leaks or the concurrency, but things just wouldn't work sometimes. And so our solution was to just run the test suite like three times to make sure it works. Then as long as everything passed across those runs, you'd be good. And that wasted just so much time when we could have been developing other features, writing better tests, doing anything, but that was just kind of what we had to deal with.
GK Yeah, I actually think you're bringing up a really good point. One of the key strategies for detecting flaky test cases is to simply rerun them frequently. And sometimes a test may be flaky, but then it might not manifest as a failure for a very long time, so one of the things that you do is simply rerun those test cases repeatedly. Another thing we do is we say, “Well, hey, that test case failed, but if it passes the next couple of times, it's probably not something that's really wrong. Maybe I can just ignore the circumstance in which it failed.” And one of the things I would say is, I do all of those things as well. Another thing that I have found helpful in addition is to, if possible, randomly order the execution of my test suite, because that helps me to often introduce the randomization in ordering that will help me to discover those order-dependent flaky test cases earlier, and so then I don't get stuck in a situation where I have to quarantine a test case or delete a test case, and then I lose information about the overall quality of my testing process or the overall quality in my program.
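The rerun strategy described here can be sketched in a few lines. This is an illustrative toy (the function names are hypothetical); in practice plugins such as pytest-rerunfailures do the rerunning for you, but the core idea is just "run it many times and see whether the outcomes disagree":

```python
import random

def detect_flaky(test_fn, reruns=20):
    """Rerun a test repeatedly and report whether it both passed and failed.

    A toy version of the rerun-based detection described in the episode.
    """
    outcomes = set()
    for _ in range(reruns):
        try:
            test_fn()
            outcomes.add("pass")
        except AssertionError:
            outcomes.add("fail")
    return outcomes == {"pass", "fail"}  # mixed outcomes => flaky

# A deliberately flaky test: its outcome depends on randomness.
def flaky_test():
    assert random.random() < 0.5

# A stable test for comparison.
def stable_test():
    assert 1 + 1 == 2
```

Note the limitation Greg mentions: a flaky test may not fail for a long time, so a bounded number of reruns can only ever give you evidence of flakiness, never proof of its absence.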
CW Yeah. So besides rerunning test cases or randomizing the order, what are some other ways that we can detect flaky test cases as developers?
GK So one of the things that my colleagues and I have done is develop techniques that will either statically analyze the test suite or dynamically analyze the test suite. And so when I say static analysis, I essentially mean looking at the source code of the program and looking for patterns that might lead to flakiness. When I talk about dynamic analysis, that actually refers to running the test suite and observing characteristics about the test suite and the program while they're being executed. So one of the things that you can do is run those analysis techniques and then use those to identify characteristics that historically, for your project or for other projects, have led to flakiness in test suites. Those approaches tend to work well, but as humans, I think you'll agree that we get overwhelmed by all of those details. And so one of the things that we've done is develop machine learning techniques that can identify patterns arising in the characteristics from static analysis or dynamic analysis that tend to lead to test case flakiness.
RD Do you have any results from that? Has the machine learned?
GK So that's a really good question. And one of the things I would say is that, first of all, some of what I'm going to share next has been documented by others in books or in articles or in blog posts, and maybe we can link to some of those in the show notes. The first thing I would say is, test cases tend to be more flaky when they take up more memory or when they run for a longer period of time. Test cases tend to be more flaky when they act more like integration tests and less like unit tests. So when a test case calls a function that repeatedly calls other functions, you tend to get into situations where those test cases are more flaky. When test suites are running to test programs that use asynchronous code, or code that uses randomization or dates or times or calendars, in all of those situations, test cases are more likely to become flaky.
RD It's interesting that you talk about when it's acting more like an integration test. It sounds like there's a lot of this that relates to dependencies for the code that you're testing. Is there a way to avoid that by using mock data or test stubs?
GK Yeah, so your point is a really good one. In one of the research papers that my colleagues and I recently published, we actually ran queries against the Stack Overflow database to find out the kinds of questions that people were asking about flaky test cases. And you actually hinted at one of the big categories, which is related to some type of shared state. And so what we found was, either because you were using a version of a framework that had a bug in it, or you were using a database and then not clearing out the state, in all of those circumstances, shared state tends to be a big problem. And so one of the things that you can do in order to mitigate test flakiness associated with shared state is to have more setup and more teardown that you either build as a developer or a tool tries to create for you, so that you can get better isolation between test cases. Of course there's a trade off, and Cassidy, you hinted at this a moment ago. The better you do at this isolation to avoid these dependencies, you might end up making a testing process that's slower because you're clearing out a lot of state and then adding in a lot of state during the time when you're executing the test cases.
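A framework-free sketch of the isolation idea: instead of mutating a shared global, each test builds its own fresh copy of the state via explicit setup. (In pytest this is usually expressed as a fixture; the `FakeDatabase` stand-in and function names here are hypothetical.)

```python
class FakeDatabase:
    """Stand-in for shared state such as a test database."""
    def __init__(self):
        self.rows = []

def make_fresh_db():
    # Setup: every test gets its own isolated instance, so no test
    # can contaminate the state that another test observes.
    return FakeDatabase()

def test_insert():
    db = make_fresh_db()
    db.rows.append("alice")
    assert len(db.rows) == 1

def test_starts_empty():
    db = make_fresh_db()
    assert db.rows == []  # passes in any order: no contamination
```

The trade-off Greg mentions shows up here too: building and tearing down fresh state per test costs time, which is the price of order-independence.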
CW What were some of the other categories that you found as opposed to shared state? What were the things that you found when you were searching for what people are looking for?
GK Yeah. Another thing that we found is that when people ask questions about flaky tests on Stack Overflow, they're normally about issues related to timing in the user interface. And in fact, I've had a lot of test cases that I've written on my own where I'm using things like Selenium or other tools in order to test the user interface of a website. And when I do that, I often don't know precisely the amount of timing or waiting that I should put into my test cases so that the web application's UI can get into the right state. So at least in terms of how they ask questions on Stack Overflow, many people struggle with flakiness when it comes to the user interface of the system that they're testing. The other thing that we found is that oftentimes when a test case is flaky, it could actually be pointing to a more serious issue in the program under test. So for example, there might actually be a logic error or some kind of bug in the program under test, but the test case's assertions aren't tuned to that bug appropriately. So the test cases pass and fail in a non-deterministic fashion, but it's really actually because of some kind of bug in the program under test.
CW That makes a lot of sense. And so when developers do find these flaky tests, what do we do? How do we stop them?
GK So I think now we're talking about issues that are related to debugging and fault localization. And I think that some of that process is going to necessarily be manual in nature. Some of it may be automated and actually supported by various automated tools that might help you. For example, if we pick up our previous case, maybe they might help you to automatically find the logic error inside of the program that you're testing. For example, some of my colleagues and I have developed techniques that do what's called automatic fault localization. And then essentially, you run your program through your test suite, you record information about what happens, you look at pass/fail information on the test suite, and then you use statistical techniques which say, “Hey, with high confidence, I think the bug is actually located at this place in the system.” And then you might actually go as a developer and try to fix that bug and then see if after you fix that logic error in the program under test, the flakiness itself actually goes away.
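The statistical technique Greg describes is commonly known as spectrum-based fault localization. Here is a minimal sketch using the well-known Tarantula formula: lines covered mostly by failing tests score as more suspicious. The coverage data is hand-written for illustration; real tools collect it automatically while the test suite runs.

```python
def tarantula(failed_cov, passed_cov, total_failed, total_passed):
    """Suspiciousness of one line, given how many failing and passing
    tests executed it (the Tarantula formula)."""
    if failed_cov == 0:
        return 0.0  # never touched by a failing test: not suspicious
    fail_ratio = failed_cov / total_failed
    pass_ratio = passed_cov / total_passed if total_passed else 0.0
    return fail_ratio / (fail_ratio + pass_ratio)

# line -> (times covered by failing tests, times covered by passing tests)
coverage = {
    "line 10": (0, 3),   # only passing tests touch it
    "line 11": (2, 3),   # everything touches it
    "line 12": (2, 0),   # only failing tests touch it: prime suspect
}
scores = {line: tarantula(f, p, total_failed=2, total_passed=3)
          for line, (f, p) in coverage.items()}
```

With this data, "line 12" ranks highest, which is the kind of "with high confidence, look here" hint Greg describes surfacing to the developer.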
RD What about for things that are sort of hard to test the results like machine learning itself? If you have flaky tests there how do you spot a failure?
GK Okay, so this is a really good question and I have answers that I’ll only briefly hint at right now. There are testing techniques that are called metamorphic testing techniques, and basically they ask you to consider the following question. So the question is, is there a way in which, for example, I could change the input to the test, which is the machine learning algorithm's input, and if I expect that changing the input shouldn't change the behavior of the machine learning algorithm but it does, that's a means by which I can often find bugs in my system without having to come up with fancy and difficult-to-specify test oracles that are hard to encode as assertions in my system. So basically you look for these invariants that shouldn't change the behavior of the system under test, and when they do, Ryan, you say, “Aha! That must mean that I'm likely finding a bug in my system.”
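A tiny sketch of a metamorphic relation, with a hypothetical stand-in model: instead of asserting an exact expected output (hard to specify for ML), we check an invariant that should hold across transformed inputs, here that reordering the features must not change the prediction of a model that depends only on the feature multiset.

```python
import random

def predict(features):
    """Stand-in for a learned model whose output should not depend on
    the order in which features are presented."""
    return sum(features) / len(features) > 0.5

def metamorphic_permutation_check(features):
    """Metamorphic relation: a shuffled copy of the input must yield
    the same prediction as the original."""
    shuffled = features[:]
    random.shuffle(shuffled)
    return predict(features) == predict(shuffled)
```

If this check ever returned `False`, that would point to a bug in the model (or the pipeline feeding it), with no exact expected output needed.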
BP And so you had mentioned earlier on that you are working through some of this academic stuff with industry partners, Duolingo and Google. Is that stuff you can talk about on the podcast? And if so, we would love to hear how you're helping them.
GK Yeah, that's a really good question. So, before I directly answer this, I should say initially that many of the things that I'm talking about today are examples of projects that I've done in conjunction with three close colleagues and friends. First is Owen, who's a PhD student at the University of Sheffield. Second, there's my colleague Phil, who's a professor at the University of Sheffield, both in the UK. And then my colleague Michael, who is an Associate Teaching Professor at CMU in Pennsylvania in the United States. So we initially started working on this project with funding that came from the Facebook organization, and we've since branched out and had conversations and discussions with our colleagues at both Duolingo and Google. What we're doing now is building prototype tools, which we've released as open source on GitHub, which are essentially customized for Python programs that use either unittest or pytest as their testing frameworks. And the techniques that we're developing can statically and dynamically analyze the test suite and the program under test, and then we use machine learning techniques which will help you to say, “Hey, the flakiness is likely at this spot in the program.” So what we've developed so far, Ben, are Python programs that can analyze large real world programs on GitHub like pandas and the test suite that comes with pandas, and then identify the flakiness inside of those systems. What we aim to do next is to develop techniques that can root cause the flakiness in an automated fashion. And then our ultimate goal is to be able to create automated techniques that can produce repairs for the test flakiness, and then surface those to the developers who can choose to incorporate them into their test suite if they think it's a good fit.
BP Interesting. So when you say ‘incorporate repairs’, is that something that a system could do automatically– evaluate and suggest improvements? I obviously have been watching, as everyone has, the excitement over AI's capability to generate code, and recently saw it being asked to generate tests as well. That was something I hadn't been seeing during the initial surge, but now I'm seeing that too. So can you tell us a little bit about how it would suggest those improvements? Where does that come from?
GK Let me comment on the point that you made about automatic test case generation. In fact, there are a lot of good techniques that can automatically generate test cases for you. If you're a Python programmer and you're familiar with pytest, there's a system that I didn't develop but I use, called hypothesis. And hypothesis will automatically make pytest test case inputs for you. Now I've used that tool and it's super helpful for me, but you're hinting at another issue which is related to either automatically suggesting a repair or automatically creating a repair, and we're currently in the process of developing those techniques now. My own view as both a researcher and a developer is that I prefer those tools to work in what I would call a ‘human in the loop fashion’, meaning that we want to be able to generate a test case repair, and then ask the developer for feedback, and then use the feedback from the developer to automatically enhance the repair that we're creating. And that's what we're doing in some of our current approaches which we hope to be able to release soon.
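For listeners unfamiliar with the tool Greg mentions, here is a minimal sketch of the kind of test Hypothesis generates inputs for (assuming the `hypothesis` package is installed; the property chosen here, that sorting is idempotent, is our illustrative example, not one from the episode):

```python
from hypothesis import given, strategies as st

# Hypothesis automatically generates many input lists; we only state a
# property that must hold for every generated input.
@given(st.lists(st.integers()))
def test_sorted_is_idempotent(xs):
    assert sorted(sorted(xs)) == sorted(xs)
```

Run under pytest, Hypothesis will try many generated lists (including tricky edge cases) and, on failure, shrink the input to a minimal counterexample.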
RD Yeah, you don't want it to be a black box.
CW Are there any other tools that you recommend that developers try out to figure out rerunning of failing tests or any sort of flaky test detection that might be useful for the average developer?
GK Yeah, I think there's a bunch of things that I would suggest. So as an example, if you're a Java developer, you might want to use the Surefire plugin that integrates with the Maven build system. So Surefire makes it really easy for you to rerun failing test cases. You can also track the history of test cases. There's other tools, again, if you're a Java developer, like TestNG, which is very similar to JUnit, but it also allows you as a developer to actually specify that there are dependencies between test cases so that those dependencies are respected during the test execution process on your development workstation or in CI. In that situation, what we hope would then happen, Cassidy, is that when you're running the test cases in CI, the dependencies are respected and you don't get order-dependent flaky test cases, even if you're rerunning other parts of the system in different orders. The other thing that I would suggest if you're, for example, a Python developer, is that there's a whole bunch of super useful pytest plugins. There's pytest plugins that will reorder your test suite. There are pytest plugins that can rerun parts of your test suite or individual test cases. And then the other thing that I have found useful is something that I mentioned a moment ago, which is pytest plugins that will actually help you to do automatic fault localization. And so if a test case is flaky because of a logic error in the program, it'll actually say, “Hey, look here.” And then that might help me to resolve the bug and maybe even resolve the source of flakiness.
CW That makes sense. Those sound great.
RD You mentioned that you'd open sourced some tools that you were developing with some of the companies to detect flakiness. Can you give us a link to those, point us the way?
GK Yeah, I'd be absolutely glad to share those. We have all of the material on GitHub, and we have an organization that's on GitHub which is called flake-it, and we're releasing all of the tools that we've developed in the flake-it organization on GitHub. We're delighted to work with open source software developers or developers who are working in industry. If you'd like to try out our tools that would be awesome. We're glad to fix bugs in our system or to add features, and we're also really excited about trying out many of our flaky testing tools on your own programs. If you have flaky test challenges and you'd like to chat about them, I would love to talk with you so that as both a researcher and a software developer, I can try to identify cool new strategies for solving your test flakiness woes.
BP I love this. As the resident marketer on the call, I have a suggestion. “Flake it till you make it.” That's a good slogan. And I want to see an icon like the Red Hat or the Linux penguin, some kind of lovely flaky pastry, but anthropomorphized. Seems like it would be ideal for this. Those are my important contributions to this conversation.
RD There you go. The Great British Flake-Off.
GK So Ben and Ryan, if you don't mind me commenting as the resident professor on the call, I should say the first paper that we ever wrote had as its first title, “Flake It Until You Make It.” But I never thought of The Great British Flake-Off until you suggested it. And I will now return this feedback to my British colleagues and see what they think as to whether it's a good idea or not.
BP Excellent. Well, yes. Thank you so much for coming on. It was a fascinating discussion and we will be sure to include many of these links in the show notes so folks can check it out or as you said, share their experiences with you and add to the research.
BP Alright, everyone. It is that time of the show. Let's shout out a Stack Overflow community member who earned a lifeboat badge. They came on and found a question with a score of negative three or less. They gave it an answer. Now that answer has a score of three or more, and the question has a score of three or more, so they saved some knowledge from the dustbin of history. Thank you to Survivor. Awarded five hours ago, “Is it possible to find out if a value exists twice in an array list?” Survivor will help you survive this and has helped over 35,000 people along the way. So thank you, Survivor. I am Ben Popper. I'm the Director of Content here at Stack Overflow. You can always find me @BenPopper on Twitter. You can always email us with questions and suggestions at email@example.com. I promised that I would shout out Andre who wrote in after a recent episode to say, “Yes, I have spotty internet and I do program on mobile,” and included a bunch of links which I'll share in the show notes to different ways you can program on a mobile device if that appeals to you. So thanks for writing in, Andre. And as always, if you like the show, leave us a rating and a review. It really helps.
RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. You can find the blog at stackoverflow.blog, and I'm still on Twitter @RThorDonovan.
CW I am Cassidy Williams. I'm going to brag a little bit. I just got the yearling badge and the booster badge on Stack Overflow this week, so feeling pretty good. You can find me @Cassidoo on most things, and I'm CTO over at Contenda.
GK And my name is Gregory Kapfhammer, or Greg Kapfhammer. If you just search for that last name you'll get to my website, which is gregorykapfhammer.com. I'd love to connect with and reach out to people on Mastodon or Twitter or LinkedIn. And I look forward to any things that you would love to share with me about flaky test cases, the challenges that you’ve faced, and the solutions that you're interested in having us help you to develop.
BP Excellent. All right, everybody. Thanks for listening. Hope your tests are robust today, and we will talk to you soon.
GK Thank you so much!
[outro music plays]