Testing is one of the main contributors to long build times, representing up to 90% of the total. Multiple factors contribute to test times. These include the number of tests, sequential versus parallel test execution, and testing dependencies on expensive external resources/services. In fact, covering this wide range of inputs with tests is one of the main reasons teams are starting to run tests only on CI, which pushes tests further “to the right” and considerably lengthens the feedback loop.
In this joint presentation by Gradle and Netflix, we take a behind-the-scenes look at the journey and many challenges of building a world-class Test Distribution solution. You’ll learn which specific issues came up when starting to distribute existing test suites and how these challenges were overcome.
“If we can deliver [Test Distribution] to more people, we are definitely going to change the lives of engineers at Netflix.”
– Roberto Perez Alcolea, Sr. Software Engineer at Netflix
Application testing is a necessary part of the SDLC, and even elite development organizations cannot avoid spending a vast majority of build time on testing alone. Speeding up the testing phase is thus a major priority for DPE practitioners, but it’s no easy feat–distributing tests precisely across parallel infrastructure requires the ability to increase file transfer efficiency, handle unstable network connections, enable intelligent scheduling and auto-scaling, and much more.
This talk explores the challenges and solutions behind making Test Distribution a real solution to be used by one of the world’s most advanced technology companies. Hear first-hand how Netflix reduced their build and test times from 62 minutes to just 5 minutes for hundreds of projects–built both locally and on CI–using Gradle Enterprise Test Distribution.
Marc Philipp is a software engineer with extensive experience in developing business and consumer applications, as well as training and coaching other developers. At Gradle Inc. he’s working on innovative products like Test Distribution and Predictive Test Selection in order to improve developer productivity. He is a long-time core committer and maintainer of JUnit and initiator of the JUnit Lambda crowdfunding campaign that started what has become JUnit 5.
Roberto Perez Alcolea is an experienced software engineer focused on microservices, cloud, developer productivity and continuous delivery. He’s a self-motivated, success-driven, and detail-oriented professional interested in solving unique and challenging problems.
Gradle Enterprise customers use Test Distribution to reduce test times by parallelizing tests across all available infrastructure. They often combine this with Predictive Test Selection, a feature which saves significant time by using machine learning to predict and run only tests that are likely to provide useful feedback, for a force multiplier effect that accelerates the build and test process by 50-90%. Learn about these features by beginning with a free Build Scan™ for your Maven and Gradle Build Tool projects, and watching the short, informative videos.
Check out these resources on keeping builds fast with Gradle Enterprise.
See how Test Distribution works to speed up tests in this short video.
Watch a 5-min video on how to speed up build and test times with Predictive Test Selection.
Watch our Build Scan™ Getting Started playlist of short videos to learn how to better optimize your builds and tests and to facilitate troubleshooting.
MARC PHILIPP: All right. Yeah, welcome. So I’m very happy to be here to talk about what basically I’ve been working on the last three years, more or less, which is namely a platform for running tests in a distributed manner. And it’s not just me today but we also have a special guest, which is Roberto from Netflix. He’s still backstage, so we’re gonna call him to the stage a little later and he’s gonna share the experience that Netflix had using test distribution. But before we get to that, I will basically show you what we built and walk you through our journey, our learnings along the way, our mistakes obviously as well. And yeah, I hope it’s gonna be a fun talk. All right, so first of all, the initial question you might have is, why would we even want to distribute tests in the first place? And the main reason for that is that testing usually dominates the overall build time.
So what we have seen at many customers is that testing takes up to 90% of the build time, and that includes like local builds, CI builds, and yeah, anything you can do to reduce that time will help you tremendously. The second question you might have is, why are tests even so slow, right? I think most of us are probably developers and know that tests can be slow for a variety of reasons. One reason is that they have dependencies, like they need databases to run, or a web server, some directories or virtual machines. And if you add network latency to the mix, for example, it might be a shared database that’s running on some other server somewhere, then it’s even going to get slower. But that’s one reason. The other reason is it’s just the sheer amount of tests. We encourage developers to write tests for new features, tests for the… If they find a bug to first reproduce it in a test. So it just adds up over time and we want to make sure that things are still working after every change. So we just have a lot of tests, which is a good thing, but we also want to run them as quickly as possible. And I don’t need to stress this point too much since we are at a developer productivity conference here.
So everything you can do to reduce test time will pay off in terms of developer productivity for a few reasons. One reason is if you have faster tests, you will run them more frequently. Yeah. So you will not wait to run them on CI, you’ll run them locally. You can actually shift-left the testing from just running most of the tests on CI to running them locally and cut down the feedback loop enormously. And that has a number of benefits. Hans has mentioned some of them in his keynote this morning, so I’m not gonna reiterate everything, but basically all this context-switching that you need to do because of the wait time is greatly reduced and you’re much more productive. So that’s a very short motivation about why we want to distribute tests. So let’s next look at some existing solutions for running tests in parallel.
One obvious solution is to just run them in parallel locally. You can use your build tool to fork multiple JVMs or you could use a testing framework feature such as JUnit Jupiter’s or Spock’s to spawn multiple threads, to run tests in threads. And that works to some extent. One problem is that it lacks historical information, so it doesn’t know like, okay, what are the usual execution times of these tests that I’m distributing here across these different threads or forks? So it often is suboptimal in terms of runtime. If you don’t start with the slowest test classes first, you will not ever reach the optimal performance that you might have had. And even more importantly, if you only run things on a single machine, there are limits. It doesn’t scale beyond the limits of the CPU. So if you have like n cores and you run n forks or n threads and you have CPU intensive tests, that’s it. Or memory is another constraint that you need to kind of pay attention to. So you can only spawn that many JVMs on the same machine and it doesn’t scale beyond certain limits. So one common technique that we’ve seen and we’ve also used ourselves in the past is to basically do this on CI, yeah.
Have multiple CI agents run different subsets of tests in parallel on CI. There’s a couple problems with that. So the first is you need to somehow split the set of tests into kind of groups. And that is often something that you need to think really hard about. Sometimes it’s made of naming patterns or something like that. And you need to pay attention that each test set has almost the same length, approximately at least, so the parallelization is actually good in the end. And you need to revisit that constantly after tests are added and removed and changed and all of that. And another downside is you have a lot of overhead because you need to run the build on each of the CI agents.
So everything up to the test task needs to be repeated. Even if you have a build cache, like the Gradle or Maven Build Cache, even then you need to download all these cache artifacts on every CI agent and you need to start the Gradle Daemon, go through configuration phase and everything. So it does quickly add up and it’s a lot of overhead. And another problem is that if you wanna look at the test results for a specific test class, you need to know in which of these CI jobs that test class actually was executed. It’s not straightforward and you need to find that particular CI job, navigate there and see it. So that’s a problem. Or you need to build your custom solution that somehow aggregates all of these results into one report somehow. And that’s also something you need to maintain, and it’s a lot of effort.
But I think the most important reason why this is not such a great thing is that it only works on CI. You can never use this from your local builds. So basically every time you want to have feedback, you need to push this to some feature branch and run it on CI. You have to wait for the CI queue to pick up your job, run it, and then eventually you get the results. So it’s definitely a very long feedback loop. So those are motivations why we set out to build test distribution in the first place. And this section is about our journey, how we got there incrementally, and what challenges we faced. So this is the initial prototype. This video is actually something I recorded for Hans for some conference back at the time. It was based on Jenkins. So you can see Jenkins in the background, running two test agents, running two tests. And there’s also on the terminal on the right, you can see that two of them, two sets of 25 test classes were actually also executed locally. So this was the first prototype.
It was built as a Jenkins plugin. And the idea was that we could reuse much of the infrastructure that Jenkins already had, like Jenkins nodes connected to the Jenkins server. We wouldn’t have to build this ourselves. And another pro was, that time, many of our largest customers already used Jenkins. So it seemed like a good choice to maybe build on top of that. However, as it turned out, of course, not all customers used Jenkins and betting on one CI server being the future of something is always questionable. And even those customers that use Jenkins, after we had talked to them, we learned it’s not something that is so simple as we thought. Jenkins plugins need to be kind of interoperable, need to work together. And testing that often involves testing the integration of all these plugins in a staging environment. And it’s not something that is done very quickly, especially in larger customer deployments. So it wouldn’t have allowed us to move as quickly as we wanted to.
So we pivoted and decided to do this differently and build our own solution in the end. And this picture shows the high-level, very high-level architecture. So you have Gradle Enterprise in the middle, and you have Test Distribution Agents on the left. And these connect to a broker component inside the Gradle Enterprise server. And on the right, you have builds, local builds and CI builds. And they also connect to the Gradle Enterprise server. They can request agents from the broker. And that assigns agents to the builds. And they communicate with the agents to run tests. And very importantly, this works for local and CI builds, there’s no difference. Doesn’t matter where these builds are running. It’s actually totally unimportant. And the same that was already true for Gradle Enterprise, it runs on your own infrastructure, however you want to deploy it, on-premises in your network. And the same is true for Test Distribution Agents. So this is not some Cloud service. You still have full control over how you run these agents.
And the way you do this is by pointing them to the Gradle Enterprise server and giving them some API key for authentication and potentially some additional capabilities that they provide for test tasks. And this comes in two flavors. We can see here it comes as a Jar file and a Docker container. It runs on Java 11. It runs on Windows, macOS, Linux and it does some environment detection during startup, like detects installed JDKs, for example. And then there’s also some requirement capability, kind of label model that lets you add additional functionality to this. And this was released in 2020.2, the initial release, where we released test distribution to the public. Another very important thing is we decided this should integrate with Gradle’s default test task. It’s not some special task. So you don’t have to change your build that much. You have to opt in on the test task. You have to enable test distribution. There’s a few additional configuration options I’m not showing here. But that’s all you have to do to try it out.
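The requirement and capability label model mentioned above can be pictured as a subset check: an agent is eligible for a test task only if it advertises every label the task requires. The sketch below is an illustration under that assumption, not the product's implementation; the `Agent` class, the `eligible_agents` function, and the label strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    """Hypothetical agent record; the label strings are illustrative only."""
    name: str
    capabilities: frozenset


def eligible_agents(requirements, agents):
    """Return the agents whose capabilities include every required label."""
    required = frozenset(requirements)
    return [agent for agent in agents if required <= agent.capabilities]


agents = [
    Agent("agent-1", frozenset({"os=linux", "jdk=11"})),
    Agent("agent-2", frozenset({"os=linux", "jdk=11", "docker"})),
    Agent("agent-3", frozenset({"os=windows", "jdk=11"})),
]

# A test task that needs Linux and Docker can only go to agent-2.
matches = eligible_agents({"os=linux", "docker"}, agents)
print([agent.name for agent in matches])
```

A task with no requirements matches every agent, which is why adding labels only narrows the set of candidates.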
Just take your existing test task and opt in. And a nice thing about Gradle is that it tracks inputs, like all the files that kind of affect the outcome of a test task or any task in Gradle are tracked. So that gave us already the set of files that we had to transfer to the agents so they could set up a temporary workspace. And the same is true for outputs. So all the outputs are also tracked for every Gradle task. So we knew which files that we have to send back from the agent to the build and its local workspace. And for some files, we already knew how to merge them. For example, JaCoCo coverage files are transferred back automatically and merged automatically. So this is kind of totally transparent in that sense that you get one single JaCoCo coverage in the end of your test task, regardless of how many agents were used. This requires that your test task uses JUnit Platform for the reason that JUnit Platform is the only kind of model that has a separate test discovery and a test execution phase. And this allowed us to, applying all these filters, to know, okay, which test classes do we actually have that we can distribute ’cause not all the class files in the test classes directory are actually test classes. Might be some test fixture code in there, some abstract test classes. So that’s really important to know which are the things that we actually need to distribute so we can create a good partitioning.
And that means out of the box, many test engines are supported, of course. JUnit 5, JUnit 3/4, Spock. For TestNG, we actually wrote a custom TestNG engine, which is now maintained by the JUnit team. And for ScalaTest, we also recently submitted a PR to make it work with test distribution. There’s ArchUnit, jqwik and probably many other engines that already work just out of the box, because they are well behaved in that sense. Unfortunately, at least two kind of are currently not supported, which is Kotest and Spek. They are two Kotlin test engines, but we’ll work on that and try to add support wherever possible. And this also applies to Cucumber. The other problem is a little bit different. Cucumber is not a class-based test framework, but Gradle and Maven are very class-focused on how you configure things. So there’s some complications there. Basically that means if you have a different test engine, yeah, just ask us basically if you wanna try this out, whether it works and we would be happy to help out. And there’s also a script that you can add to your build that checks for compatibility of all the known test engines basically, and would add custom values to your Build Scan. As here you can see it’s supported or unsupported and you can actually see which test engines are used by this test task.
All right, another interesting thing that we learned and yeah, we wanted to of course have some insights into what’s going on during the build, during a test task when we run tests, which tests are running in parallel, which agents are involved and all of that. And we opted for a very simple solution in the end, for now, which is Trace files. So for Trace files, there’s an open specification. It’s not a standard, but it’s a JSON format that’s event based. And the nice thing about this, I bet most of you here in the audience have a visualization tool already installed because it comes with Chrome and it’s a really nice tool. It’s actually really simple but really powerful. So for example, here on the right, it’s a bit small to see here.
We have focused on one of these input file uploads on the top and you can see how long it took, 530 milliseconds in this case. You can see three other file uploads were running at the same time. And you can also see that the first two local executors were already running tests, while input files were still being uploaded. And the agents, the last and the second to last lane here in this chart, are Test Distribution Agents. They run remotely and they had to wait for the file to be uploaded. So you can see a lot of information in this and yeah, I can definitely recommend it: if you build anything like this, this is definitely something that’s really easy to write and really powerful to use.
All right. What this does is orchestrate the distribution within the Gradle test task or the Maven test execution. And that means that the build is only run once and we don’t need all this overhead of, yeah, running all the tasks up to the test task before we actually get to the distribution part, which is interesting. But we only distribute the tests, we only send the files around that we need for running the tests. And what we also did from the beginning was to use previous execution times in order to create more optimal partitions, so we could use… Like here, we could start with the slowest test classes first and this usually results in a more optimal performance in terms of parallelism.
Of course the initial release wasn’t perfect, otherwise I wouldn’t be speaking of a journey here. So the biggest shortcoming by far was the file transfer solution that it had. So the way this worked was each agent would basically check in its local file cache, does it have this file that it needs to run the test, some Jar file for example. And if not, it would reach out to the build that it’s connected to and request it from that build, and the build would upload it. And to make matters worse, the upload would go through the same WebSocket connection we were using for messaging. And if you imagine the worst case, you have n new agents and m input files, you would have to send n times m files over this WebSocket connection.
And depending on how slow the build’s connection to Gradle Enterprise is, that’s a really yeah, bad problem to have. So this is a huge bottleneck and we knew that we definitely had to fix this and improve this. So we did in the next release, 2020.3: we added a file cache that’s on the Gradle Enterprise server in the middle. And now this changed so that agents were not requesting the files from the build directly, but rather going through the Gradle Enterprise server first. And so the server could check its own file cache and only if it didn’t have the file would it request it from the build. It would only be uploaded once by the build and it would use separate HTTP connections to do so. A couple in parallel as well. And then the agents would eventually be notified by the Gradle Enterprise server so that they can download the file and store it in their local file caches.
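For readers who want to try the trace-file approach, here is a minimal sketch that emits "complete" events (phase `"X"`) in the Trace Event Format that Chrome's tracing viewer reads. The event names, lanes, and timings are made up for illustration, loosely echoing the 530 millisecond upload described above; they are not taken from the actual tool's output.

```python
import json


def complete_event(name, lane, start_ms, duration_ms):
    """Build one 'complete' event (phase "X") in the Trace Event Format."""
    return {
        "name": name,
        "ph": "X",                      # complete event: has a start and a duration
        "ts": int(start_ms * 1000),     # timestamps are in microseconds
        "dur": int(duration_ms * 1000),
        "pid": 1,                       # one process: the build
        "tid": lane,                    # one lane per executor, agent, or upload
    }


# Illustrative events only: one upload, one local executor, one remote agent.
events = [
    complete_event("upload input file", lane=1, start_ms=0, duration_ms=530),
    complete_event("run FooTest locally", lane=2, start_ms=50, duration_ms=400),
    complete_event("run BarTest on agent", lane=3, start_ms=530, duration_ms=300),
]

trace_json = json.dumps({"traceEvents": events}, indent=2)
print(trace_json)
```

Saving `trace_json` to a `.json` file and loading it in Chrome's tracing viewer should show three lanes, with the remote agent's test starting only after the upload finishes.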
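The idea of starting with the slowest test classes first can be sketched as a greedy longest-first assignment: sort classes by their historical duration and always give the next one to the partition with the least work so far. This is a generic scheduling heuristic shown for illustration; the talk does not describe the exact algorithm used, and the class names and durations below are invented.

```python
import heapq


def partition_by_history(durations, partition_count):
    """Greedy longest-first assignment of test classes to partitions."""
    # Min-heap of (total seconds assigned so far, partition index).
    heap = [(0.0, index) for index in range(partition_count)]
    heapq.heapify(heap)
    partitions = [[] for _ in range(partition_count)]
    # Slowest classes first; ties broken by name for a deterministic result.
    ordered = sorted(durations.items(), key=lambda item: (-item[1], item[0]))
    for test_class, seconds in ordered:
        total, index = heapq.heappop(heap)
        partitions[index].append(test_class)
        heapq.heappush(heap, (total + seconds, index))
    return partitions


# Hypothetical execution times from previous builds, in seconds.
history = {"SlowIT": 60.0, "MediumTest": 30.0, "QuickTest": 20.0, "TinyTest": 10.0}
partitions = partition_by_history(history, 2)
print(partitions)
```

With these numbers both partitions end up with 60 seconds of work, whereas an alphabetical split would put the 60 second class together with other tests and finish later.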
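The effect of the server-side file cache can be seen in a toy model: agents check their local cache, then ask the server, and only the server asks the build to upload. The classes below are hypothetical stand-ins (files are keyed by name for simplicity), meant only to count uploads, not to mirror the real protocol.

```python
class Build:
    """Toy build that counts how many files it has to upload."""

    def __init__(self, files):
        self.files = files
        self.uploads = 0

    def upload(self, name):
        self.uploads += 1
        return self.files[name]


class Server:
    """Toy server with a shared file cache in front of the build."""

    def __init__(self, build):
        self.build = build
        self.cache = {}

    def fetch(self, name):
        if name not in self.cache:          # only the first request reaches the build
            self.cache[name] = self.build.upload(name)
        return self.cache[name]


class Agent:
    """Toy agent with its own local file cache."""

    def __init__(self, server):
        self.server = server
        self.cache = {}

    def fetch(self, name):
        if name not in self.cache:
            self.cache[name] = self.server.fetch(name)
        return self.cache[name]


files = {f"lib-{i}.jar": b"bytes" for i in range(5)}   # m = 5 input files
build = Build(files)
server = Server(build)
agents = [Agent(server) for _ in range(24)]             # n = 24 fresh agents

for agent in agents:
    for name in files:
        agent.fetch(name)

print(build.uploads)                 # uploads with the server cache
print(len(agents) * len(files))      # worst case without it: n times m
```

In this model the build uploads each of the 5 files once, compared with 120 transfers in the n times m worst case where every fresh agent asks the build directly.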
Yeah, so we did some measurements for this and as you can see, it’s quite a drastic improvement. So what we did here is we used the spring-framework open source project. We used 24 just started Test Distribution Agents, so that means they have empty local file cache and we simulated a slow network connection, LTE, 10 MBit upload. And basically it was yeah, basically ridiculously slow before, three hours, 11 minutes, and it went down to like eight minutes in the first run. And in the second run it went… Which also had to upload some files because, I think some spring artifacts contained timestamps. So they had to upload it again, but it was only one minute and 27 seconds then. So that was a huge improvement in the next release.
Okay, one problem solved. Next problem. Of course things are not always stable, and the more agents you add to the mix, the more likely it is that you have some network problem at some point. So the agents and the builds have kind of persistent connections to the Gradle Enterprise server, and any of these can fail at any time. In some networks it’s more likely than in others, but Murphy’s law applies here, anything that can go wrong will go wrong, at least eventually, not every time. So we had to do something about this. And in 2020.4, we introduced active management of WebSocket connections using WebSocket pings. And as soon as we detected a connection to be unreliable, unresponsive, or even a dropped connection, we would reconnect and the work would be rescheduled. In case an agent disappeared in the middle of a build, middle of a test task, the tests would be executed on a different agent. And the biggest problem that we solved with this obviously was preventing builds from breaking. So with all these network failures, initially they caused a lot of disruption because builds broke due to them and we didn’t want that.
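The rescheduling behavior described above can be sketched as a loop that puts a partition back on the queue when its agent's connection fails, instead of failing the build. This is a simplified illustration: the function names are hypothetical, the real system also tries to reconnect, and here a failed agent is simply dropped.

```python
def run_with_rescheduling(partitions, agents, run_on_agent):
    """Run every partition, moving work to another agent on connection loss."""
    results = {}
    pending = list(partitions)
    available = list(agents)
    while pending:
        if not available:
            raise RuntimeError("no agents left to run the remaining partitions")
        partition = pending.pop(0)
        agent = available[0]
        try:
            results[partition] = run_on_agent(agent, partition)
        except ConnectionError:
            available.remove(agent)      # treat the agent as gone
            pending.append(partition)    # reschedule instead of breaking the build
    return results


def fake_run(agent, partition):
    """Stand-in for remote execution; one agent always drops its connection."""
    if agent == "flaky-agent":
        raise ConnectionError("connection dropped")
    return f"{partition} passed on {agent}"


results = run_with_rescheduling(["p1", "p2"], ["flaky-agent", "stable-agent"], fake_run)
print(results)
```

Both partitions still produce results even though the first agent disappears, which is the property that keeps network flakiness from breaking builds.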
So even if network flakiness occurs, the builds will now still continue on and retry and eventually have executed all the tests, and have all the test results in the end. All right, the next release brought Maven support. So that was added in 2020.5. So far we only had Gradle support and this required the Gradle Enterprise Maven extension as an additional feature. Test distribution was now supported. And this integrates with the Surefire and the Failsafe plugin in the same way that it integrates with Gradle’s default test task. And there’s also a configuration option that looks similar. It’s an XML rather than a build script, but it’s on the configuration of the Surefire or Failsafe plugin. All right, that was a quick side thing that we did to add Maven support. The next thing was working towards auto-scaling, and the first thing we did was make the scheduling more adaptive. So in this very fancy GIF animation that I created for the release notes, you can see on the top right, there’s two test tasks running. The first one is pretty quick and the second one takes longer. Here, the tests take longer and the way the partitioning and the scheduling worked initially was that we would get an estimate of how many agents we were going to get from the Gradle Enterprise server.
We would partition the test set accordingly into the same number of partitions, and we would just then run those partitions on the agents. The downside was as like here, if a test task finished earlier than another one, agents were idle and the task time for test task B in this case was not optimal. So we really wanted to do something about this and make it more responsive and adaptive to agents becoming available. So what we ended up doing was instead of creating these large-ish partitions as in the top picture, we created many more smaller partitions, but not too small ones. Otherwise the overhead of sending them over the network, run this test class and then run this test class, would have been too large. JVMs are also reused across partitions, which also required that we added some lifecycle support. And we also worked with the JUnit team to add a launcher session concept that lets you specify, tear down or set up code once per test JVM. So you could prepare something before the first partition ran on a particular agent, and clean it up afterwards. And with that, agent utilization already increased. So in the scenario here, you can see in the bottom, things are much faster and of course the testing time went down. So this is like the first step because if you want to auto-scale and then you have new agents and you don’t use them, it’s kind of pointless.
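To see why many smaller partitions help when agents free up mid-build, consider a small simulation in which each agent pulls the next partition as soon as it is free. The numbers are invented, and per-partition overhead (the reason partitions should not be too small) is deliberately ignored.

```python
import heapq


def makespan(partition_seconds, agent_ready_at):
    """Finish time when each agent pulls the next partition once it is free."""
    free_at = list(agent_ready_at)
    heapq.heapify(free_at)
    finish = 0
    for seconds in partition_seconds:
        start = heapq.heappop(free_at)   # earliest available agent
        end = start + seconds
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


# Hypothetical scenario: 120 seconds of tests in task B. Two agents are free
# at t=0; two more become free at t=10 when another test task finishes.
agent_ready_at = [0, 0, 10, 10]

static_plan = makespan([60, 60], agent_ready_at)       # sized for 2 agents up front
adaptive_plan = makespan([10] * 12, agent_ready_at)    # many smaller partitions

print(static_plan, adaptive_plan)
```

In this scenario the static plan leaves the two late agents idle and finishes at 60 seconds, while the adaptive plan uses them and finishes at 40 seconds.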
So that’s like the first step towards that goal. And then in the next release, in 2021.2, we worked on auto-scaling. We introduced the concept of an agent pool. So an agent pool is basically a name for a group of agents that has a min and max size and a list of capabilities that let us decide which agent pool to use for scaling up, and it provides an endpoint that returns a JSON response. It looks like this, and it includes the number of desired agents based on the current demand. So you could use that for scaling. And since one of the preferred deployment mechanisms that we advocate for is Kubernetes, we also have step by step instructions for setting this up. There’s a small third-party component called KEDA that’s involved there. It basically queries this endpoint and then configures a Horizontal Pod Autoscaler for you to do the scaling and it’s really simple to set up. But at the same time this can also be used for any kind of scaling mechanism, like AWS Auto Scaling Groups, with just regular EC2 instances, this also works. And you can see real-time usage and historical usage in Gradle Enterprise as can be seen on that screenshot on the right, like the chart of minimum, maximum, utilized, desired, connected agents over time.
And with this release, we first really considered test distribution to be production-ready. Because now it was pretty stable. It contained all the features you wanted to have, you know. If you wanted to use this for real without spinning up a huge number of agents all the time, you could use auto-scaling and all of that. That meant that we kind of… We still worked on test distribution and fixed things, but we could focus a little bit on something else, which is another acceleration feature of Gradle Enterprise. Not going into this very deeply, predictive test selection. It’s about selecting the right tests, given a certain change set, and running those. And it’s the first step that we have in this cycle of a solution. And after you have selected the right tests, you want to run them as quickly as possible, so it complements test distribution very nicely. And this was released just earlier this year in 2022.2. And if you’re interested in that, there’s also a talk tomorrow by my colleagues Eric and Luke about this. I think it’s at 10:00 AM. Yeah, so just join them if you want to learn more about this.
Yeah. Last but not least, just recently, we made a nice usability improvement. You no longer have to apply test distribution as a separate Gradle plugin. It’s now integrated into the Gradle Enterprise plugin. So you only have to kind of maintain that one version. It just makes things much simpler. And that was just released in 2022.3. All right. And with that, yeah, I wanted to do a live demo, but then it was said no live demos, since they might not work. So, I thought about recording one, but then I talked to Doug, my colleague, and he just recently recorded one and he really did a great job. And it’s really concise and short and given the limited amount of time we have here, I decided to use that. So I’m just gonna play that. It takes about three minutes and I hope there will be sound.
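The agent pool idea, a min size, a max size, and a desired count derived from demand, can be sketched as a simple clamp. The JSON shown on the slide is not part of this transcript, so the field names below are hypothetical and should not be read as the real endpoint's response format; the pool limits reuse the DemoPool values (minimum 1, maximum 10) from the demo later in the talk.

```python
import json


def desired_agents(requested, minimum, maximum):
    """Clamp the current demand to the pool's configured size limits."""
    return max(minimum, min(requested, maximum))


def pool_status(name, requested, minimum, maximum):
    """Hypothetical status payload; field names are illustrative only."""
    return json.dumps({
        "name": name,
        "minSize": minimum,
        "maxSize": maximum,
        "desiredAgents": desired_agents(requested, minimum, maximum),
    })


# An autoscaler would poll something like this and scale to "desiredAgents".
print(pool_status("DemoPool", requested=25, minimum=1, maximum=10))
print(pool_status("DemoPool", requested=0, minimum=1, maximum=10))
```

Demand above the maximum is capped at 10 agents, and with no demand the pool still keeps its minimum of one agent running.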
DOUG TIDWELL: To keep things simple, we’re going to start the build from the command line, but test distribution works however you kick off a Maven or Gradle build. Before we start though, here’s the test distribution panel in Gradle Enterprise. The agent pool we’re going to use is called DemoPool and it has a minimum of one agent and a maximum of 10. Test distribution automatically scales the number of agents up and down. Notice that our agent pools have certain characteristics. The agents in the demo pool are running Java 11 on Linux. Our test cases can specify their requirements, and test distribution will make sure those tests are only sent to agents that can actually run them. And by the way, we’re managing our agents with Kubernetes here, but an agent can be a VM, a container, even a physical machine. It’s up to you. So now we’ll kick off the build, we’ll show the important parts of the build in real time, but we’ll skip through most of the action here.
First, we’ll run the build without parallelism and without test distribution. This never runs more than one test at a time and it takes just over a minute and a half. Next, we’ll enable parallelism but not test distribution and try it again. Now we can run two tests at a time. That cuts our test time roughly in half, down to 46 seconds. Now let’s turn on test distribution. When we run the build, it starts out running two tests at a time locally then it runs more tests in parallel as remote agents become available. This cut our test time down to 26 seconds. Looking at the Kubernetes cluster that manages our agents, we have five agents already up and running. If we run the build again, agents continue to be spun up by the cluster so that at its peak, we’re running 12 tests at once. That gets our test time down to 19 seconds, more than 4 1/2 times faster than our original time.
Let’s take a look at the Gradle Enterprise build scan for this build. This gives us a detailed look at all aspects of the build. From the summary tab, we can see the overall statistics. As we noted, it took about 19 seconds. If we go to the test tab, we see a list of the test cases that ran and we can drill down into any test case that looks interesting. All of the results from all of the agents are displayed here in one place, no matter how many agents were involved and no matter how many test tasks were run in parallel. Back on our machine, this example also generates the standard Gradle build report and the JaCoCo code coverage report in the build directory. We can look at those to get different perspectives on the build. To sum up, our total test workload was distributed intelligently across multiple agents, and we got more than a 4x improvement without changing any code whatsoever. That’s a fantastic result with our small example. With a more typical build that has hundreds or even thousands of test cases and dozens of agents, the performance gain could be even greater.
MARC PHILIPP: All right, that was Doug’s demo. Thanks again to Doug for recording it. There’s also a full video containing some more explanations on things on the Gradle Enterprise website if you’re interested. If you wanna check this out later, feel free to do so. So basically this is how we built test distribution, and yeah, I think it’s now very pretty much complete feature. We still, of course, are going to improve it and, yeah, I have more observability to it, add like build scan integration to it probably in one of the next releases, so there’s more to do for sure but it’s already something that works pretty well. And you don’t have to take my word for it, because we have a nice experience report also here from Netflix. So with that, I would like you to welcome Roberto to the stage. Yeah, he will talk about their experience at Netflix. So let’s give him a round of applause.
ROBERTO PEREZ ALCOLEA: Awesome, thank you. All right. Well, that was actually a fantastic timeline on how these things have been panning out. Super excited to be here today. I’m Roberto Perez. I work at Netflix as part of the JVM ecosystem. What we do is we own the whole build package public experience within Netflix if you have a Java application library or any kind of Java stuff other than Android. And we have been partnering with Gradle for several years. And the last couple of years we adopted Gradle Enterprise and many of their features. So today I’m here to talk about how we are using test distribution, what are the things that we are facing, and how we picture this to happen or be in the future. So quickly, to go through our current setup, Netflix historically adopted the culture of microservices. We ended up at this point with more than 3000 repositories that are powered by Gradle builds. So this could be Java libraries, Java applications. It could be other languages like Kotlin, Scala, Groovy, or even Clojure. And we have more than 190,000 builds, Gradle-based builds per week across different Jenkins controllers, up to thousands of agents, which each one of them have multiple executors.
And as you can imagine, to maintain all of these, we have hundreds or even thousands of engineers. So why did we start looking into test distribution? Well, we found that close to 88% of the build time of every single engineer at the company was spent just running tests. So we were like, okay, if you are really spending over 80% of your time there and you think that the build is slow, there are definitely opportunities to be better. Another pain point was the consistency across local and CI. The classic “It works on my machine” or “It only works on Jenkins”. So sometimes engineers will just avoid running tests locally. They will first write some code, push to the CI server, and see what happens. And most of the time this produces slower feedback for people. You might have tests that require Docker to start up a container that has a database. You talk to the database. This could be different between local and CI if you don’t have proper Docker operability in your company.
Another one is better compute resource usage. Performance will vary based on where you run the build. That’s the reality. So if you have an old machine and you want to run a parallelized build, you are constrained by the number of CPUs, and Marc gave a lot of great context around that. We suffer from this. And whatever you have locally is different from what you get on CI. Sometimes the local machine is even more powerful than the CI one. And giving test feedback fast, to allow engineers to write code and also understand what’s happening, is very important. So speeding up the local development feedback loop is the most important thing for us as a productivity engineering team in the JVM space. It sounds easy to roll out test distribution. In reality, it’s kind of challenging. There are things that need to be in place. One of them is, okay, how can I make sure that all my 3000 repositories are using Gradle Enterprise and have test distribution enabled on the test task? So we built something called Nebula years ago, which is a set of plugins on top of Gradle, and we distribute these to every single engineer in the company automatically.
So in Nebula, we have the ability to turn on or turn off certain features of Gradle through a settings plugin. You can think of Nebula as a custom Gradle distribution, but it’s basically a settings Gradle file that applies some conventions for people, and we just ship it to them as a personalized zip file. So we manage things like what capabilities you should be running with in test distribution, what the timeouts are, whether you want to use remote executors or not, things like that. In addition, as Marc pointed out, test distribution requires the JUnit Platform. So we found ourselves facing an interesting challenge, which was, okay, engineers have built these projects for years. They might have JUnit 4, they might have JUnit 5, TestNG, and so on, but we really didn’t want to run a massive migration in terms of code. We decided, you know what, we are just going to introduce the JUnit Platform engines and go from there. This way we can enable as many projects as possible without having to rewrite a single test for the company. So far this has been playing out very well. Engineers are definitely not excited about, oh, I need to rewrite my JUnit 4 tests to JUnit 5, or I need to go from TestNG, which I use because I like the features, to JUnit 4, for example.
So when we started looking at test distribution, in what we call Phase 1, we were basically dogfooding at that point. We just deployed a small pool of 20 agents, Docker agents, and we enrolled our own repositories into it. Most of the tests there are really Gradle TestKit based tests, so they are heavy in terms of shipping classpaths and such. So it was a really nice way to get a sense of what this thing could do. This was one of our builds at some point, the linting project that allows you to rewrite builds for people. It used to take almost one hour, one hour and two minutes, just to run the tests. And imagine having a failure there. It’s like time wasted.
We enabled test distribution with these 20 agents and we got it down to four minutes, 59 seconds. So that was more than a 10x performance improvement for us and obviously for our team. It’s mind-blowing in a sense, like, okay, if we can deliver this to more people, we are definitely gonna change engineers’ lives here at Netflix. Well, work lives. So we decided to go and do a second rollout, which we call the current Phase 2. We started creating more agent pools, really standard pools, in the sense that you will have VM-based and also container-based ones. So you can accommodate things like Docker in Docker. I will get into that in a bit. And then we have multiple regions. So if you were running, let’s say, a Jenkins build on AWS East 1 or AWS West 2, we will try to put it on an agent that’s closer to you. That has been very convenient for us. The thing with this is that we run oversized, in the sense that we have just one type of agent, and they all have the same CPU, the same disk, the same memory, so it could be a waste of resources when it comes to compute and also money. But we were able to automatically enroll more than 350 projects, which is 10% of our fleet. And we are looking forward to getting more and more enrolled based on what we learn from the tool. So we expect to do a big push probably early next year.
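As a quick check on the Phase 1 numbers quoted above (one hour and two minutes before, four minutes and 59 seconds after), here is a minimal Java sketch of the arithmetic behind “more than 10x”. The class and method names are made up for illustration and are not part of any Gradle or Netflix tooling.

```java
import java.time.Duration;
import java.util.Locale;

public class SpeedupEstimate {
    // Ratio of the original test duration to the distributed test duration.
    static double speedup(Duration before, Duration after) {
        return (double) before.getSeconds() / after.getSeconds();
    }

    public static void main(String[] args) {
        Duration before = Duration.ofHours(1).plusMinutes(2);   // 1 h 2 min, as quoted in the talk
        Duration after = Duration.ofMinutes(4).plusSeconds(59); // 4 min 59 s, as quoted in the talk

        // 3720 s / 299 s is roughly 12.4
        System.out.println(String.format(Locale.ROOT, "speedup: %.1fx", speedup(before, after)));
        // Whole minutes saved on each run of this test suite
        System.out.println("saved per run: " + before.minus(after).toMinutes() + " min");
    }
}
```

With these inputs the ratio comes out at roughly 12.4, and each run of that suite gets back about 57 minutes.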
So I really wanna share more of the learnings so far. Having two types of agents, for example, one VM-based with Docker and one container-based, really adds complexity when it comes to operations. We need to maintain Docker, we need to have opinions on Docker. We need to make sure that we do proper cleanup and such, because the agent is a long-lived process at the very end. Agent selection can be hard because now you need to ask the person, “Hey, do you need Docker or not? Do you need this type of machine or not?” We wanna make those decisions for the engineers, but we don’t know what they’re doing in their tests. So the simpler, the better. We also learned that when you’re writing tests, you might be talking to external services, or you might need some specific security policies to encrypt or decrypt a secret. So you need capabilities that only your application can have.
So this one doesn’t fit into a common agent pool. And obviously every build is unique. Of course, people might build on similar frameworks, with kind of similar operations, but at the very end, people add and remove dependencies. People write different kinds of tests. So at the very end, when you have all these 3000 repos, we can treat them as unique with commonalities, but still we cannot just go and say, you know what? If you enable this, you’re gonna save 10% of your time, because that’s not reality. And some tests might require more compute resources. We have projects that load data in memory on the Java heap. So those ones are more suitable for things like, okay, I need an agent with 16 gigs of RAM, versus a small project that might just require a small agent with four gigs of RAM and that’s it. So we need to get better at compute resource usage.
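To make the sizing and capability problem above concrete, here is a minimal Java sketch of one way to choose a pool: take the smallest pool that offers at least the requested memory and all of the requested capabilities. This is an illustrative model only, not how Gradle Enterprise or Netflix implement it; `AgentPool`, `select`, the pool names, and the `secrets-access` capability are all hypothetical.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class PoolSelector {
    // Hypothetical description of an agent pool: RAM per agent plus named capabilities.
    static final class AgentPool {
        final String name;
        final int memoryGb;
        final Set<String> capabilities;

        AgentPool(String name, int memoryGb, Set<String> capabilities) {
            this.name = name;
            this.memoryGb = memoryGb;
            this.capabilities = capabilities;
        }
    }

    // Smallest pool that has enough memory and every required capability, if any.
    static Optional<AgentPool> select(List<AgentPool> pools, int requiredGb, Set<String> required) {
        return pools.stream()
                .filter(p -> p.memoryGb >= requiredGb)
                .filter(p -> p.capabilities.containsAll(required))
                .min(Comparator.comparingInt((AgentPool p) -> p.memoryGb));
    }

    static String nameOf(Optional<AgentPool> pool) {
        return pool.map(p -> p.name).orElse("none");
    }

    public static void main(String[] args) {
        List<AgentPool> pools = List.of(
                new AgentPool("small", 4, Set.of()),
                new AgentPool("secure", 8, Set.of("secrets-access")),
                new AgentPool("large", 16, Set.of()));

        System.out.println(nameOf(select(pools, 2, Set.of())));                  // small
        System.out.println(nameOf(select(pools, 16, Set.of())));                 // large
        System.out.println(nameOf(select(pools, 2, Set.of("secrets-access")))); // secure
        System.out.println(nameOf(select(pools, 32, Set.of())));                 // none
    }
}
```

Preferring the smallest fitting pool is what keeps a four-gig test suite off a 16-gig agent, which is the waste described for the single oversized agent type.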
So to close it up, where do we wanna see ourselves using test distribution? Today we have this massive agent pool with multiple types of agents, containers versus VMs, but it’s gonna be difficult to scale. So we are taking a look at things like Testcontainers Cloud to reduce this operational complexity. If you’re familiar with Testcontainers, it’s a Java library that allows you to run integration tests and spin up a container that has a real database, let’s say. But at the very end, for us, it will be like, okay, now we need to run Docker inside of a VM so we can schedule a container. So Testcontainers Cloud helps with that, in the sense that I can just run one type of agent that has a Testcontainers agent that talks to their cloud and schedules the container somewhere close to my region.
So I don’t need to manage the life cycle of a container, which is a big win for our team. In addition, we need to get into more fine-grained pools in terms of compute resources. There are ways that we can map requirements from people. If we know that you’re using a test task that requires 16 gigs of RAM, we can definitely schedule you in a pool that has that capacity. If you don’t need something specific, we can just schedule you in a smaller one. So that’s something we are constantly looking into, and most likely we will deliver it relatively soon. And the most important part is application security. Pretty sure when you work in your company, not every test can be executed anywhere.
Sometimes your application might have to talk to a service that, I don’t know, manages financial information, either in a test environment or a production environment. Opinions are here. So for this case, we wanna get to a point where every single team or application can bootstrap an agent pool with the capabilities that they require, so we will have the proper security policies embedded into those agents, and only then we’ll be able to run tests on those agent pools. So they can just basically leverage Gradle Enterprise test distribution without needing to bootstrap anything from their side. So I guess to summarize, the journey has been great for us. We see huge performance improvements. Every single project that we have enrolled has gotten back a lot of time. We have calculated that engineers are saving at least several hours per month when using this feature. And when we have an outage, because things fail sometimes, engineers come to us to say, “Hey, please fix it because you are taking away the most important thing that you have delivered to me in the last six months.” So that speaks a lot for what we are accomplishing in conjunction with the Gradle folks. I guess that’s it. Thanks.