Whenever a developer makes changes to their code, the changes usually affect just a tiny fraction of the codebase. Yet the tests covering those small changes often account for 90% or more of the total time it takes to run builds and tests! It’s easy to see why a trending concept called “Test Impact Analysis” is taking root in organizations that care deeply about developer productivity.
In this talk, Gradle shares its strategy for Predictive Test Selection: its role as a promising Test Impact Analysis solution, the tenets and tradeoffs intentionally made to provide a solid developer and build-engineering experience, and best practices for software teams that want to use this exciting ML-based technology.
Having analyzed hundreds of millions of lines of code across thousands of projects, Gradle knows about build and test processes. You may be surprised to learn that 3% of tests take up 97% of testing time, that the typical change affects just 1% of the code (and 1% of the tests), and that testing more commonly uncovers problems with external systems than issues with the code itself.
What’s needed is a smarter, ML-driven testing strategy and tooling that provide observable and actionable data. Predictive Test Selection provides instrumented data to help dev and CI teams more rapidly troubleshoot performance issues and optimize feedback cycle times. In this talk, Gradle explains the pros and cons of common test-time reduction strategies, the origins and objectives of the Predictive Test Selection feature in Gradle Enterprise, and how all of this relates to DPE practices for building confidence in the performance and reliability of your toolchain.
Luke Daley is a Principal Engineer at Gradle Inc. with a penchant for building developer tools and improving developer productivity. Over the last decade he has helped build Gradle Build Tool and Gradle Enterprise.
Eric Wendelin is a Sr. Software Development Engineer at AWS, and formerly led data science and research initiatives for Gradle Enterprise as a Principal Data Scientist at Gradle Inc. Previously, he led the Gradle core team and data engineering teams at Apple and Twitter.
Gradle Enterprise customers use Predictive Test Selection to save significant time by using machine learning to predict and run only tests that are likely to provide useful feedback. Other acceleration technologies are available in Gradle Enterprise, such as Build Cache, which reduces build and test times by avoiding re-running code that hasn’t changed since the last successful build, and Test Distribution, which further reduces time wasted by parallelizing tests across all available infrastructure. You can learn more about these features by starting with a free Build Scan™ for your Maven and Gradle Build Tool projects, watching videos, and registering for our free instructor-led Build Cache deep-dive training.
Check out these resources on keeping builds fast with Gradle Enterprise.
Watch a 5-min video on how to speed up build and test times with Predictive Test Selection.
How to leverage a free Build Scan™ to keep builds fast by identifying opportunities for build caching.
Sign up for our free training class, Build Cache Deep Dive, to learn how you can optimize your build performance.
ERIC WENDELIN: Hey, everyone. Today I’m gonna tell you about test selection, what it is, and why we think it’s a viable strategy for reducing testing time. So I’m Eric, that is Luke.
LUKE DALEY: Hello.
ERIC WENDELIN: And so, test time. We’re here because test time grows exponentially: more developers authoring more tests and submitting more changes, more and more of those changes becoming automated and requiring validation, multiplied by tests becoming more functionally complex and therefore requiring more sophisticated infrastructure to run. That makes it easy for more developers to add a significant computational burden to an already growing test suite. And yet there’s a lot of waste in this. We’ve talked with a lot of customers and users who say, “Hey, most of the time I’m waiting for feedback on whether my change was good or not, and most of that waiting is attributable to testing.”
So that’s pretty typical. But it’s also typical for these changes to affect less than 1% of a given code base, and less than 1% of the tests in that code base. Worse yet, statistically, most reported test failures are not regressions; they’re caused by external factors such as flakiness or external services. So, to deal with this problem, we as software engineers have developed a number of strategies to reduce testing time. First, we might try to avoid running tests when unrelated changes occur: employ something like build caching or transitive dependency analysis to skip tests when nothing relevant changed.
We might also try to increase parallelism: split the test suite into multiple parts and fan them out to multiple CI agents. That’s very common. Unfortunately, it breaks down in the face of finite server capacity, or when you have to spin up additional infrastructure using expensive cloud services. During peak periods this means either a lot of cost or a lot of waiting on other folks’ tests to run. You can distribute these tests in a slightly different way, but test distribution is not compatible with all projects either, and it requires either additional hardware or additional cloud spend.
There are also a number of strategies around static and dynamic analysis, which take a reverse-code-coverage approach: analyze previous test runs and code changes, figure out which tests are statically or dynamically linked to the most recent change, and run those tests. Unfortunately, many teams have found that this breaks down in the face of external service dependencies, dynamic invocations (Java reflection, for example), and multilingual code bases. As soon as you cross a language barrier, those techniques become faulty, they start missing test failures, and you have to choose a different approach. Something I don’t see as much as I think we should: optimizing the slowest tests.
So, a lot of folks know that a small minority of tests incurs the largest cost. These are typically your integration and functional tests. Some people cite the 80/20 rule, but in my experience I’ve seen more like 3% of the tests incurring 97% of the time. So just optimizing those tests can get you a lot of benefit. And probably the most popular strategy folks use right now is what some people call shifting right, which basically means running tests later in the validation pipeline, moving them to run, say, after merge or nightly. Very few teams in practice actually run all tests before merging changes. So how many of you folks have nightly builds, or tests that run after you merge changes and not before?
Okay, it’s actually fewer than I would expect. But the thing is, for the folks who do that, someone had to choose which test suites run after a merge. So you’re already doing a form of test selection; it’s just done manually, and it gets out of date very quickly. You already can’t have 100% confidence in your changes, and even with all that, not all bugs are caught. So given all this, and the fact that folks are still really suffering from long feedback times, we felt strongly that there’s an opportunity to establish new methods, particularly in the area of test impact analysis, to expand the available tools and make them more mainstream.
And that’s why I’m talking to you about predictive test selection today. Predictive test selection is a form of test impact analysis that intelligently selects and executes relevant tests based on code change history and test history, using a probabilistic machine learning model. The key word here is predictive: we’re predicting the subset of tests most likely to provide useful feedback. This is not a new concept; many organizations have written about various strategies for test impact analysis and applying it, and they typically report between 40% and 70% of test time saved with little or no quality risk. More recently, some commercial solutions have emerged as well. Not all of these are strictly predictive test selection per se; some use reverse code coverage or other forms of test impact analysis, but the idea is the same: you take a smart risk to run only relevant tests in order to save time or resources.
Okay. So as we’ve developed predictive test selection and talked to many customers and users, we’ve found a number of common patterns, things we’ve learned are really important to providing a good user experience. Really, all of these boil down to one thing: trust. Skipping tests is scary if you don’t feel like you’re in control, and black-box systems are hard to trust. So we have to work really hard to make these systems transparent and observable, so that developers know exactly when it’s being activated, what the benefits are, and why tests are being selected or not selected. Organizations need to be able to control where in the validation pipeline this is enacted, and even at a more fine-grained level, which tests are eligible. You might have a test suite where some tests are really critical to validation, and you might want to designate those as always-run. We found it critical that folks have these mechanisms. And then, of course, there’s the observability piece: what are my expected risks?
What are my expected rewards? Measure the actual outcomes, compare them to what we expected, and then adjust. Okay? This type of thing needs to integrate with developer workflows in a natural manner, without having to add analysis before and after test runs. And it needs to be good: it needs to save significant time and incur low risk, otherwise you lose trust, right? Part of that effectiveness is adapting to evolving code bases and highly varied cultures. One of the things I’ve learned is that every team, sub-team, and project has its own little sub-culture, and we’ve had to work very hard to develop a model which can adapt to these different strategies and styles and remain effective, both in terms of controlled risk and savings. Okay?
Alright. So I mentioned evolving code bases as a primary challenge. How do we deal with this? First of all, we established a data partner program at Gradle, where we have, I would say, hundreds of projects and many, many millions of test executions to learn from. Having that variety was critical to developing a robust system. Choosing a somewhat simpler model also allowed us to provide something more robust in the face of these unknown environments, ’cause the reality is that we’re deploying this thing and we can’t predict how folks are going to use it or what kinds of changes the system is going to face. So we had to establish model variables which reflect the actual software engineering domain as closely as possible. For example, the features of the model can be classified into four areas.
The first is input-to-test relationships. The second is historical change patterns: the types of changes that are normal and the types that are abnormal. Of course, we also want to take into account the actual change being made right now, and model it in a way that’s natural and yet robust. It can’t be too detailed, because if you make something too detailed you tend to overfit: it will work very well, but only for a small variety of cases. So we avoided that. And of course we want to take into account the test outcome history: not necessarily how often tests fail, but whether tests are sensitive to certain types of changes. Okay?
And so this led us to a simpler model: we chose an XGBoost classifier, which is tree-based and more easily interpreted, and provided features that model these software development paradigms. Now of course, flakiness is probably the largest problem with trying to make any sort of prediction, ’cause flaky tests and flaky environments are inherently unpredictable, and we don’t want to predict flaky test failures; that doesn’t help users. Developers don’t want to hear about test failures, they want to find regressions. And yet very many test failures are caused by flakiness. In fact, Meta reported in their paper that a model trained on flaky test data would fail to report up to three times as many test failures. That would mean that three times more often, a developer would make a change, the tests would succeed, but a regression would be introduced.
And that’s what we want to avoid in all cases. So we had to guard against flakiness at many levels, and even then it’s still not perfect. Test flakiness is clearly a problem regardless of optimizing testing time. One of the other challenges that makes this even more complex is imbalanced classification: a vast minority of tests fail, maybe one in a hundred or one in a thousand depending on the type of test, and any false positives resulting from flakiness noise significantly worsen model quality. All data scientists learn pretty early on, I think, that your model can only be as good as the data you put in. So you have to guard against flakiness very heavily.
Any system that can’t properly take flakiness into account is fundamentally doomed, ’cause here’s what happens: you train a model, it learns to predict flaky tests, and then it just selects the tests that fail most often. That’s not what we want. As a side note, I actually found a lot of inspiration in credit card fraud models, because a very small number of credit card transactions are fraudulent, there are a lot of false positives, and it’s a very highly imbalanced classification problem. And similar to credit card fraud, you have to provide interpretable predictions. If you decline a credit card transaction, you need to justify it to the customer: “Hey, we declined your card because it was used in multiple countries within 24 hours.” We wanted the same thing, to establish developer trust, ’cause, again, it’s easy to lose trust in a black-box system.
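To make the imbalance concrete, here is a small, self-contained sketch. The numbers are toy values chosen for illustration, not Gradle’s actual model or data; it just shows why raw accuracy is a useless metric when failures are rare, and why recall on genuine failures is what matters:

```java
// Toy illustration (not Gradle's actual model): why accuracy misleads
// when test failures are rare, e.g. roughly 1 in 1,000 test executions.
public class ImbalanceDemo {
    public static void main(String[] args) {
        int total = 100_000;
        int realFailures = 100; // hypothetical 1-in-1,000 failure rate

        // A degenerate "model" that always predicts "test will pass"
        // gets almost everything right by raw accuracy...
        int correct = total - realFailures;
        double accuracy = (double) correct / total;

        // ...but catches none of the failures developers care about.
        int caughtFailures = 0;
        double recall = (double) caughtFailures / realFailures;

        System.out.printf("accuracy=%.4f recall=%.2f%n", accuracy, recall);
    }
}
```

A model trained on noisy outcomes can score near-perfect accuracy while catching no regressions at all, which is exactly why flaky failures have to be filtered out of the training data before they poison the model.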
Okay. So here is how predictive test selection works with Gradle Enterprise. First of all, you have your build, Gradle or Maven for us right now. It uses the existing build cache snapshot functionality to determine what changed. It takes this snapshot, along with the set of test classes that are candidates to be run, and sends those to Gradle Enterprise, which then extracts the change and test history for those changes and tests and invokes a machine learning model to select tests. The selection is returned to your build, the selected tests are run, and the test results are immediately re-ingested into Gradle Enterprise in order to update the model. So right after a change is made and a test fails, the model can immediately respond to that change.
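The round trip just described could be sketched roughly as follows. Every name and the selection rule here are invented for illustration; the real integration lives in the Gradle and Maven build plugins and the Gradle Enterprise backend, and the real selector is a trained classifier, not a hand-written rule:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the selection round trip: the build snapshots
// what changed, sends candidates to the server, gets a selection back.
public class SelectionRoundTrip {

    // Stand-in for the server-side ML model: a trivial rule selecting any
    // test whose historically-recorded inputs overlap the changed inputs.
    static Set<String> selectTests(Set<String> changedInputs,
                                   Map<String, Set<String>> testInputHistory,
                                   Set<String> candidates) {
        Set<String> selected = new HashSet<>();
        for (String test : candidates) {
            Set<String> inputs = testInputHistory.getOrDefault(test, Set.of());
            if (inputs.stream().anyMatch(changedInputs::contains)) {
                selected.add(test);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // 1. The build snapshots what changed (files, jars, JVM, OS...).
        Set<String> changed = Set.of("src/CheckoutService.java");
        // 2. Candidate test classes plus their learned input relationships.
        Map<String, Set<String>> history = Map.of(
            "CheckoutTest", Set.of("src/CheckoutService.java"),
            "SearchTest", Set.of("src/SearchService.java"));
        // 3. The service returns the selection; the build runs only those.
        Set<String> selected =
            selectTests(changed, history, Set.of("CheckoutTest", "SearchTest"));
        System.out.println(selected);
        // 4. Outcomes would then be re-ingested to update the model.
    }
}
```

The toy rule above behaves like static dependency analysis; the point of the real probabilistic model is precisely to go beyond such rules, by weighing historical change patterns and test sensitivity instead of exact input overlap.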
Yes. So the benefits of this are that direct integration with the build agent gives us a much, much richer model of test inputs than just the file changes you’d put into version control. Why? Because we can model different types of inputs: a jar that’s a third-party dependency versus an internal project dependency. We can model environment inputs, such as the JVM or operating system, that may affect tests but that you’re not checking into version control. And furthermore, you can normalize inputs: you can filter out things like code comments or dead code to avoid noise when making predictions.
Another nice benefit is that it’s natural to configure your testing strategy in your build configuration; it’s natural for folks to want to do that. It also integrates well with other technologies like build caching and test distribution. And we can provide as much observability as we want. In fact, it’s really important that we have control over the data, so we can make changes to the build tool to surface more insights and make better predictions. In the end, we want a product of very high quality so that we can maintain developer trust, ’cause that’s paramount. Now, the major drawback of this type of system is that there’s no room for integration with custom tooling. Luke and I get asked about this all the time. We’re making this intentional sacrifice, giving up universal compatibility, because we want to provide an extremely high-quality user experience to developers.
Okay, so how do we do that? One of the key parts of the solution is what we’re calling the Predictive Test Selection simulator. At a high level, it compares full test results to the test results that would have occurred if test selection were activated. When you enable predictive test selection in Gradle Enterprise, the first thing it does is learn from history: it goes back in time, looks at all your historical changes and test results, and learns the input relationships and which types of changes different tests are sensitive to. All of this is analyzed and packaged up into inference data per test task or goal. Then the simulator dashboard allows you to isolate the test runs and builds that might benefit most from test selection and visualize exactly what the expected risks and rewards are. Most importantly, having a simulation mechanism allows us to show folks the expected risks and rewards without making any change to their build and without introducing any quality risk.
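The core of such a simulator can be sketched as a replay loop: for each historical build, compare the time a selection strategy would have spent against the full run, and count any real failures the selection would have skipped. The data shapes and the toy selection rule below are invented for illustration; the actual simulator works from recorded build and test history in Gradle Enterprise:

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of a test-selection simulator: replay historical
// builds and measure what selection *would have* saved and missed.
public class SelectionSimulator {
    record TestRun(String testClass, long seconds, boolean failed) {}
    record SimResult(double percentTimeSaved, int missedFailures) {}

    static SimResult simulate(List<List<TestRun>> history, Predicate<TestRun> select) {
        long fullTime = 0, selectedTime = 0;
        int missedFailures = 0;
        for (List<TestRun> build : history) {
            for (TestRun t : build) {
                fullTime += t.seconds();
                if (select.test(t)) {
                    selectedTime += t.seconds(); // would still have run
                } else if (t.failed()) {
                    missedFailures++;            // skipped a real failure
                }
            }
        }
        return new SimResult(
            100.0 * (fullTime - selectedTime) / fullTime, missedFailures);
    }

    public static void main(String[] args) {
        // Invented history: two builds of a fast unit test and a slow
        // integration test; the unit test fails once.
        List<List<TestRun>> history = List.of(
            List.of(new TestRun("CheckoutTest", 5, false),
                    new TestRun("SlowIntegrationTest", 95, false)),
            List.of(new TestRun("CheckoutTest", 5, true),
                    new TestRun("SlowIntegrationTest", 95, false)));
        // Toy stand-in for the model: skip anything named like an integration test.
        SimResult r = simulate(history, t -> !t.testClass().contains("Integration"));
        System.out.printf("time saved: %.1f%%, missed failures: %d%n",
            r.percentTimeSaved(), r.missedFailures());
    }
}
```

Because the replay never changes what was actually run, the expected savings and missed-failure rate can be estimated with zero risk, which is the whole point of simulating before enabling selection for real.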
So here’s the simulator; this one’s for the Spring Boot project. I’ll give a quick overview. The tool needs to show the expected risk. Right now, just under 99% of test failures are predicted, which is okay but not ideal: what we want is for missed test failures to be sufficiently rare that you don’t lose confidence, while saving about 50% of the time. But this risk estimate is actually quite conservative, primarily because we can’t perfectly detect flakiness yet. So we might dig into a particular case where the model may have made a mistake; the confidence is rather low, 94% or so. We investigate: what would have happened here? For this possibly incorrect prediction, in this example at the bottom of the screen, you’ll see the avoidable tests: six of 19 test classes would’ve been skipped, saving 50 or 51 minutes, but one of those six skipped test classes failed.
And we can dig into why, and what happened, so that we can properly assess the actual effectiveness of test selection. Here we can look at the test results and see that, in this case, the test failed with an out-of-memory error: a flaky execution that could not have been predicted by the system. So it’s a conservative estimate. Then, once test selection is activated for real, we need to be able to control adoption, visualize the impact, and compare it to what we expected. Here’s an example of that for the Micronaut project. As an overview, you can see that test selection was activated on roughly 400 builds, and in that case it’s about 65% faster.
The builds are 65% faster on average: 12 minutes when using test selection versus 28 minutes without. So you can see this has a significant impact on feedback time. And for each test run where test selection was active, you can get the reasoning from the model: here’s why it selected this test or skipped that one. That’s, again, really important for maintaining developer trust. Okay. Now that I’ve told you a bit about what test selection is and why it’s a valuable tool for your toolbox, I’ll leave it to Luke to explain some test selection adoption strategies.
LUKE DALEY: Thank you, Eric. So, if this looks interesting to you, what does it actually mean to use it in practice? I’ve got an example up here. As Eric mentioned, it’s currently compatible with both Gradle and Maven, but I’ll focus on Gradle for the time being. If you’re already using Gradle Enterprise, we have the basic glue necessary at the top here, and then enabling predictive test selection for existing Gradle test tasks is as simple as this. Quite easy to just drop it in, turn it on, and get going. And just to be clear, since we were talking about the simulator before: this is not enabling the simulator, this is actually turning selection on for real.
The simulator doesn’t require any of these changes, as long as you’re using Gradle Enterprise. Eric also mentioned that there are certain cases where it’s important to instruct the selection engine: “No matter what, I really, really want this test to run.” That’s pretty easy to do at build time. There are a few different ways to do this, but we recommend using an annotation: you can define an annotation in your project and say anything annotated with this, no matter what, always run it. There are usually a few tests within a test suite that exercise a lot of the system and make a good smoke test of the whole thing; just say, “Okay, always include those.” So that’s what it looks like in the build to actually adopt it.
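An always-run marker of this kind is just an ordinary annotation defined in your own project. The annotation and class names below are hypothetical, and the wiring that tells the selection engine to honor it happens in the build configuration (consult the Gradle Enterprise documentation for the exact syntax):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical project-defined marker: tests carrying it would be
// configured as always-run, regardless of the model's prediction.
public class MustRunDemo {
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface MustRun {}

    // A broad smoke test you never want the selection engine to skip.
    @MustRun
    static class CheckoutSmokeTest {}

    public static void main(String[] args) {
        boolean alwaysRun =
            CheckoutSmokeTest.class.isAnnotationPresent(MustRun.class);
        System.out.println("selected regardless of prediction: " + alwaysRun);
    }
}
```

Keeping the marker in your own codebase means the set of critical tests is reviewed like any other code change, rather than living in an out-of-band configuration file.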
So how do you integrate this into your feedback patterns, into the way you construct how developers get feedback on their changes? One thing that’s pretty clear in our minds, from talking to customers of Gradle Enterprise and potential users of predictive test selection, is that this is the most common pattern people are thinking about, for a few reasons; it’s the simplest. We’ll call this “shift right to shift left”. What we’re really doing here is applying predictive test selection to the pre-merge verification step and identifying tests that we can move to later. And why would you do this? Overall, it’s to reduce pressure on CI: you’re shifting everything else left by shifting non-valuable tests right. So if you are compute-constrained (and if you’re not, congratulations, you’re the only person I know), then this is an effective pattern to get other feedback faster and not tie up all that compute with tests that aren’t providing a valuable signal.
It’s very easy to adopt this: with your existing pre-merge testing, you can effectively just turn test selection on; it doesn’t require any other changes. So it’s effective at reducing compute overhead and very easy to adopt. One thing to think about here, though, is that this is of course a risk-versus-reward trade-off. We’re saying we’re going to push some tests later, and there’s a certain risk of those tests actually failing later, so you need to have a plan for dealing with that. Most teams already have this in place in one form or another: for failures that happen post-merge, there’s usually a way to assign, triage, and deal with them, and this just falls into that bucket as well. This is typically where most people start, and it’s very effective there.
There are some alternatives, though, other ways to think about adopting this. This one is basically the same concept, pushed even a little further: if that’s not enough to reduce CI pressure, you can think about taking some of those post-merge builds that run all the tests and pushing them even later. Again, you’re making an even bigger trade-off by shifting further to the right. The whole dynamic in choosing which approach to take is: how much compute do you need to recover, and what are you willing to trade off for that? So this is the same pattern, just pushed a little further. Alternatively, here’s a different way of thinking about it, where we’re not necessarily optimizing for reducing compute usage.
Now what we want is to get feedback sooner. This is real shift-left stuff: all those tests that weren’t run until later, we can bring them forward. As Eric pointed out, most teams are already doing some form of manual test selection, with a common pattern being that unit tests, or the fast tests, happen pre-merge and all the slow stuff goes later. The problem with that is that a lot of those slow tests are actually providing really valuable feedback on the change; they’re the ones that are going to catch regressions a lot of the time. So that simple “run the fast stuff first and the slow stuff later” isn’t a very good strategy. By using predictive test selection, you’re always making an intelligent decision: based on this change, what should come forward?
And it’s going to adapt over time as the code base changes. Even if you manually make that determination one day and say, “Okay, we’re going to bring this subset of tests forward,” that’s going to change over time, and most teams don’t go back and update it. So you’re effectively outsourcing that to predictive test selection. It does require changing build plans a little, so this isn’t quite drop-in: you’re going to have to change the PR or pre-merge builds to run those sets of tests earlier, with test selection enabled on them. So there’s a little more work to do there. I like to think of this as a kind of automatic canary testing: I’m making a change, so pick a valuable subset of the slow, long-running tests, bring them forward, and hopefully tell me earlier that something is wrong, before I have to wait for those post-merge builds.
So there’s another alternative to that, which is about optimizing feedback times. Again, here we’re not really reducing compute overhead; we’re trying to fail faster. You can think about the verification that your pre-merge builds are currently doing and break it into two passes: do a first pass with selection enabled to get a cross section of tests through faster, and if that’s all good, move on to running the rest. In this case, you’re trying to get feedback sooner without taking on any extra risk: you’re still running the same amount of tests before merging, you’ve just broken things up. The benefit is that if something does go wrong, you find out about it sooner. In some cases this does relieve CI pressure: if the first pass fails fast a lot, you’re failing without having to run all those other tests. But that’s really a side benefit of this pattern more than anything else.
The language I was using there is very CI-centric, but what does this mean for local builds? We’ve been using predictive test selection internally now for over a year, and I’ve actually been surprised at the effect it’s had on how I approach running tests locally. I think most of us apply some form of test selection locally when making changes as well: it can be prohibitive to run all the tests before committing, so I manually think, “Okay, this is probably a good test to run, this package or that one.” Now, combined with test distribution, which of course we also use internally, I just say, “Just test it.” Test selection applies locally, and it will select the most relevant tests for me.
I don’t have to think about which ones are relevant for a certain change. The net effect is that I’m running more tests locally before committing, because I can get feedback in a reasonable amount of time while still running a good cross section of tests. We rely a lot on integration and other non-unit tests in the code base we work on, so it’s been particularly effective there, and that wasn’t something I was expecting; it just emerged out of natural day-to-day usage. A side benefit, something we haven’t spoken about, is that predictive test selection is smart enough to apply a kind of incremental, fine-grained caching: if I run those builds locally with this feature enabled, then when the same tests go to run in CI, those individual tests are going to be skipped, because it knows they’ve already run somewhere else and succeeded. So it can be used to offset CI a little bit there as well.
Okay, so is predictive test selection for you? Where is the fit? You can think about it as a kind of checklist. Consider build caching first: build caching is really effective for tests with a shallow set of dependencies that aren’t exposed to a lot of changes, because there you can effectively consider the entire set of inputs to the tests, and if none of them have changed, just don’t run the tests at all. This breaks down very quickly with integration and functional tests that are exposed to a lot of changes. Thank you, Vincent. So if you have a significant number of tests that don’t fall into that category, predictive test selection works for you. Again, unless you’re one of the magical people with infinite compute, test distribution or something similar will also get you faster feedback, at the cost of compute, and all of these things are complementary: build caching, test distribution, and predictive test selection are not mutually exclusive by any means. Okay. Next: if you are shifting things right, using the common patterns to move some tests later, are you set up to deal with missed failures?
As I mentioned, most teams already have some way of doing this, but it’s something you need to think about. And how flaky are your tests? If you’re drowning in flaky tests and not really doing anything about it, that’s going to affect the efficacy of the selection model. Being in a bad situation now doesn’t necessarily mean you can’t do it, but if you are doing something about flakiness and determined to drive it down, you’ll get better results over time. (And if you’re in that situation with your flaky tests, this isn’t your only problem.) Okay, so adopting it, how to think about it. Use the simulator; it costs nothing. Once you have Gradle Enterprise, you don’t have to make any changes to your project: you basically just turn that bit on, and then you can start to measure what you would save and what the risk would be. It’s a very low-cost way to get a sense of things.
You can also use that to identify which projects, tasks, and parts of your build have the most to gain, so you can selectively turn it on for some things; you don’t have to turn it on for everything in every project and every build. Also look at the prediction rate and make an assessment: “That degree of missed failures, is that something we’re willing to accept, given how much we stand to gain?” As I’ve pointed out several times, make sure you have a plan for the missed failures. If you don’t, the first time a developer encounters something that was missed, they’re going to lose trust in the system very quickly. So make sure they understand how they can use Gradle Enterprise to identify why it happened, look at all the diagnostics available there, and have a triage process for it. And once you have actually adopted it, you can use the dashboard in Gradle Enterprise to monitor how useful and effective it was, ensure that you’re getting what the simulator told you you’d get, and look for other opportunities to adopt.
Yeah. So thank you very much. I think we’re running a little bit late, so we’ve only got a few minutes for questions. But if you would like to talk with us more, particularly about where we’re going with predictive test selection, what’s next, or how you can use it on your project, please come and see Eric and myself at the Gradle Lounge. Thank you very much for coming today. And any questions, I think...