With over 350 engineers continuously developing Apple Siri, the slow Maven build times (up to 30 minutes) could no longer be ignored. This talk traces the transformations of the Siri build that started with the migration from Maven to the Gradle Build Tool. It then covers the approaches taken to not only improve the Apple developer experience but also significantly increase feedback-loop velocity, address common pain points, and identify the right KPIs for evaluating developer productivity across the team.
“Hey Siri…” are words spoken by millions of people each day. To keep Siri scalable and meet growing demand, the team at Apple is always looking for areas to optimize the developer experience. In this presentation, you will learn how Apple achieved a 35-75% drop in build times by migrating from Maven to Gradle. You will also learn how Apple leveraged Build Cache, Build Scan™, and other Gradle Enterprise features to address developer pain points, such as 20+ minute Maven builds, and troubleshoot local config challenges, saving a remarkable 16,500 developer hours per year.
Ankit Srivastava is a software engineer based in San Francisco, working at Apple. He has a background in solving challenging problems associated with development environment frameworks, CI/CD, scalability, and software build systems. His goal is to enable engineers to deliver products at high velocity and scale without compromising quality. Ankit’s passions include Buckeyes football, hiking, photography, and good beer.
Denis Cabasson is a DevOps engineering manager with a strong focus on automation and streamlining the user experience as it relates to CI/CD processes and build topics. Based in Ottawa, Canada, Denis enjoys playing board games with friends and family, as well as soccer and badminton.
Gradle Enterprise customers use the Gradle Enterprise Build Scan, Performance, and Trends dashboards to identify and monitor performance bottlenecks in the build/test process, including build cache misses, code generation, and annotation processors. You can learn more about these features by running your own free Build Scan™ for Maven and Gradle Build Tool, as well as in our free instructor-led Build Cache deep-dive training.
Check out these resources on keeping builds fast with Gradle Enterprise.
Watch a 5-min video on how to speed up build and test times with Predictive Test Selection.
How to leverage a free Build Scan™ to keep builds fast by identifying opportunities for build caching.
Sign up for our free training class, Build Cache Deep Dive, to learn how you can optimize your build performance.
Ankit: All right. Welcome, everyone. Hope you guys are having a great time; so many amazing talks here today. And yeah, I'm Ankit. I was part of the Siri team from 2014 to 2020, and right now I'm part of the Apple Cloud infrastructure team. And this is Denis.
Denis: Hi, guys.
Ankit: So, as the agenda goes, we'll cover our migration from Maven to Gradle and the issues and lessons learned as part of that process. We'll also cover how we partnered with Gradle on the distributed cache. Then Denis is gonna cover how we manage the "works on my machine" issues that come up from time to time for all of us, and then we'll zoom out a bit and look at developer productivity with more of a user experience focus. Let's jump right in. Just to give you an idea of the landscape: we had around 315 engineers in a monorepo, and build times were around 25-30 minutes at that time. Everybody here is familiar with the challenges with Maven. We had to do a lot of workarounds to get artifacts packaged, and just working with XML was very painful. For us, the biggest problem was the 20-minute local builds.
We explored the incremental build support that Maven had, and also looked into its parallel build support, but it wasn't good enough to move the needle for us. So we thought, okay, it's time to start looking at alternatives. We looked at the roadmap for Maven too, and there was nothing compelling that said we should stick with Maven at that time. And Jason van Zyl, the original creator of Maven, had moved on to work on a tool called Takari, which was supposed to be the next version of Maven, but we didn't see too much come out of it either. So we started our journey of evaluating different build tools. In early 2016, we created a small POC of our code base, which has Java, C++, Scala, and a bunch of code generators and internal plugins, and we validated three build tools: Gradle; Pants, which is a build tool from Twitter; and Bazel.
Bazel at that time had just been open sourced, and not all of its features were open sourced yet. We also weren't sure how the community was gonna adopt Bazel at that point. We did our analysis, we did our POCs, and then we decided to go with Gradle, mainly because it had good community support and good IntelliJ integration, plus a bunch of third-party plugins and plenty of resources available for developers to learn from. Being a small team, we wanted to make sure we were not a bottleneck for the other teams.
We started our POC by converting our internal plugins to Gradle in the May to August timeframe. After that, our goal was to get the code compiled and deployed; once we could deploy the code, that would be a big win. We did run into a lot of issues there, especially because dependency resolution is different between Maven and Gradle. Just figuring out which dependencies to include and exclude took a while. Once that was resolved, we focused on creating the IntelliJ configurations so that people could start using Gradle for their local development. We also added PR checks to our Gradle builds at that time.
Then we created a beta team of Gradle users who would actually use Gradle for their day-to-day work. We learned a lot during that phase. During that process, we ran into a few intermittent test failures as well; I'll talk about that later. Then in December, we had a big training session for two days, switched our CI builds over to Gradle, deleted all the POM files, and went on a long vacation. [laughter] It didn't really end up that way, but yeah. Build time comparisons: we had a good gain. In terms of compilation, I would say most of the gains we got were from the Gradle Daemon and also the test parallelization that we get out of Gradle.
A lesson learned for us on that journey was that how you write Gradle plugins is different from how Maven does it. In Maven, you just extend the default goals that exist, but in Gradle you have buildSrc, where you write all your tasks and plugins, and then you have to make sure that it doesn't impact the configuration time of Gradle itself. Those were the issues we were running into at that time. You'll also run into a bunch of intermittent unit test failures or deployment failures just because the test execution order is different, or the test itself was not written the right way. And a bunch of deployment failures too; just be prepared for all of that. For us, the Gradle Daemon was using up a lot of memory at that time. All those issues have since been resolved, and we had a very open communication channel with everybody at Gradle, and they really helped us resolve those issues as well.
I would say one of the biggest issues we had was that the repository keeps moving ahead while you're working on the migration. Yes, in your branch you can keep pulling and updating, but as soon as you get your compilation and deployment working, add a PR check so that other developers submitting code can see what's happening. We had both Maven and Gradle working in the repository at the same time, so you have to make sure developers can take a look at any failures on the Gradle build side and fix them on their end. We spent a month or so trying to catch up on a few things that were added later on during the migration. And be ready to plan for support for approximately 3-4 months afterward.
People will attend the training sessions, but it's once they actually start working on a plugin, or on something they need, that they'll reach out for support, or they might see failures on their local workstation which you haven't run into before. So be ready for all those things when you do the migration.
Let's dive into the distributed cache feature. We knew that we'd get the biggest bang for the buck from the distributed cache feature in Gradle, and we were in talks with Hans all the time about when this feature was gonna land so we could test it out. As soon as there was a beta, we set up a Hazelcast cache.
Then we figured out which tasks were cacheable for us and which were not, just to get an idea of what we needed to convert to make tasks cacheable. On Linux, things looked pretty good; build times were down to six minutes or so. We also integrated Build Scans with local builds and CI builds at that time, so that we could understand the exact state of the system when we ran a build. Denis will dive deeper into all that later in the talk.
We focused on converting our internal plugins to be cacheable, and then we started our beta testing. Things looked really fine on Linux, but we ran into a bunch of issues on macOS in general, and also intermittent test failures or build failures that used to happen when there was a cache miss, etcetera.
We spent a lot of time fixing that. Gradle Enterprise also came out around that time, so we were one of its first customers, and we worked very closely with the core Gradle dev team throughout this entire process. In the June 2017 timeframe we went live: we migrated to Gradle 4.0 and used Gradle Enterprise for our build cache.
In terms of speed, we gained significantly overall. A fully cached build took a minute twenty on a CI agent, which was really, really huge. But it all depends on where the changes are in the code base, right? That's what determines how useful the cache is. On average, I would say we gained around 30-35% in terms of speed with the cache.
A lesson learned in this process: there's a fine balance in the inputs and outputs that you declare for the cache. If you define too many inputs for a given task, it might trigger downstream tasks that you don't want to be triggered. Or the opposite: if you don't declare the right inputs, you might reuse a version from the cache even though the task should be re-triggered. For example, if I have a proto file updated, I wanna make sure its code gets regenerated instead of downloaded from the cache.
We ran into a few issues like that as part of our migration, so making sure that you have the right inputs and outputs defined for each task is very, very important. Similarly, for the outputs, you wanna make sure that you remove any dates, git SHAs, and build numbers, and track them separately. Code generation order was a big issue for us too; for example, with JNI, we had to remove certain symbols because they would cause cache misses for us. Figuring all these things out took time. Similarly, macOS is case-insensitive, so we were getting build cache misses because the file name casing was different; we actually had to check out the file again on the local workstation to get caching to work.
Similarly, do not share any output directories between IntelliJ and Gradle. And we had situations where you had the same git SHA, the same Java version, the same Gradle version, everything the same, but you were still seeing cache misses. It turned out the Xcode version or the clang version differed between environments and was causing the misses, because sometimes people were working on a newer version of Xcode, or they needed something else.
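The input-declaration story above boils down to one idea: the cache key must be a stable hash over exactly the inputs that affect the output, including toolchain versions, and nothing volatile. Here is a minimal, hypothetical Java sketch of that principle (the input names and the skip list are made up for illustration; Gradle's real implementation is far more involved):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

// Sketch: a cache key is a stable hash over the *declared* inputs.
// Volatile inputs (timestamps, build numbers) must be excluded, or every
// build produces a new key and the cache never hits. Conversely, real
// inputs like the tool version must be included, or stale results get reused.
public class CacheKeyDemo {
    public static String cacheKey(Map<String, String> declaredInputs) {
        try {
            // Sort inputs so the key is independent of declaration order.
            TreeMap<String, String> sorted = new TreeMap<>(declaredInputs);
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (Map.Entry<String, String> e : sorted.entrySet()) {
                // Skip inputs known to change on every build (hypothetical names).
                if (e.getKey().equals("buildNumber") || e.getKey().equals("timestamp")) continue;
                md.update((e.getKey() + "=" + e.getValue() + "\n").getBytes(StandardCharsets.UTF_8));
            }
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest()) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

Two builds that differ only in build number produce the same key and share a cache entry, while bumping the tool version changes the key, which is exactly the Xcode/clang behavior described above.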
You'll also run into challenges where a third-party plugin is not cacheable, so we brought those in-house and made their tasks cacheable. So yeah, in summary, we went from 20-minute builds to four minutes for our monorepo, got really good incremental build support, and gained the ability to manage the tool itself for everyone just by updating a file. Gradle Enterprise was a very, very good addition for us; we could really see the differences between local builds and what's different in a given build environment. And Denis will jump right into that.
Denis: Perfect. Thank you, Ankit. Continuing from where Ankit left off: after this journey of going from Maven to Gradle, we had additional questions, such as troubleshooting local configuration. It's the famous story of, "It doesn't work on my machine." Who hasn't heard that? Obviously, that was happening to us as well, and it was something we wanted to fix. Here, Build Scans proved very valuable. We enabled Build Scan by default for all our builds: CI builds, local builds, anybody building anything, we wanted a Build Scan to know what was going on. We worked to enhance those Build Scans to include information about the local state of what you're doing. Here is an example of a Build Scan where you can see that we have added some information about what's going on with git.
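As a rough illustration of how such git state could be gathered for a Build Scan, here is a hypothetical Java sketch. The command invocations are standard git; the wiring into the scan itself, which would call the Build Scan API (for example `buildScan.value(name, value)`) from a shared init script or plugin, is deliberately omitted, and the value names are made up:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: collect local git state so it can be attached to a Build Scan
// as custom values. A local build must never break because of this, so
// every command failure falls back to a placeholder.
public class GitScanValues {

    // Run a command and return its first line of output, or a fallback if
    // the command is missing or fails.
    static String runOrDefault(String[] cmd, String fallback) {
        try {
            Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
            try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                String line = r.readLine();
                return (p.waitFor() == 0 && line != null) ? line.trim() : fallback;
            }
        } catch (Exception e) {
            return fallback;
        }
    }

    public static Map<String, String> collect() {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("Git branch", runOrDefault(new String[]{"git", "rev-parse", "--abbrev-ref", "HEAD"}, "unknown"));
        values.put("Git SHA", runOrDefault(new String[]{"git", "rev-parse", "HEAD"}, "unknown"));
        // Any output from "git status --porcelain" means uncommitted local changes.
        String status = runOrDefault(new String[]{"git", "status", "--porcelain"}, "");
        values.put("Git dirty", status.isEmpty() ? "false" : "true");
        return values;
    }
}
```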
What was your original git branch? What was the original git SHA? What were the local changes that you made to those files? That proved a big help in understanding the famous, "It doesn't work on my machine, and yes, I have the latest master, I haven't made any change." As soon as you start digging into the Build Scan, you could find the information about what was going on, and that helped us troubleshoot issues for developers. Another thing we did to keep the local configuration working for users was to be fairly aggressive about keeping Gradle and Java up to date. That means that as of today, we're running on Java 18 with Gradle 7.5.1, which was the latest version of Gradle when I created this slide yesterday.
So hopefully, it's still the case today. We made sure that we were aggressive about removing deprecated usages in the Gradle files so that we were ready to upgrade to the next Gradle version when it came around. And finally, the other thing we did to make the Gradle configuration a little more manageable was move all the configuration from Groovy to Kotlin. Originally, our Gradle files were in Groovy, because that was what was available back then, but as soon as Kotlin gained first-class support, we decided to move to Kotlin, and we haven't looked back since; we're very happy with having our configuration files in Kotlin. Moving on, I'll talk a little bit about the user experience focus that we tried to have: how we tried to identify the pain points for developers and how we tried to address them as part of our migration to Gradle.
Just to give you a little background, Siri is a big monorepo. We run 20-plus PR statuses to make sure that your PR isn't broken and things are still working. We rely heavily on end-to-end tests. I'm not really proud to say that, but it's a fact of life. Lots of end-to-end tests, and they can be flaky at times. Back in the day, it was taking about an hour and 20 minutes to run all the PR checks, which was a bit longer than what we wanted. There were a lot of dependent subsystems; Siri was calling other subsystems, and of course, those systems can have their own problems. We identified three main challenges at this point. We wanted to work on flaky test detection, especially in these end-to-end tests.
How can we identify flaky tests? What can we do about them? Identify the root cause of a failure, classify all the failures so that we can automate the resolution and help the user understand what's going on, and overall reduce our PR check runtime. First, flaky test detection. As I mentioned, we were relying on second and third parties. The first thing we implemented was recording and replaying all the second-party responses, so that the tests would always get the same response independently of what's going on with the external system. We also implemented automated snoozing of flaky tests. What do I mean by that? We tried to identify tests that were failing in many different PRs that looked completely unrelated. We defined unrelated as working on different parts of the code, from different authors.
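The unrelated-failure heuristic just described can be sketched in a few lines. This is a hypothetical simplification (the class and field names are made up, and the real system surely used richer signals), but it captures the core rule: a test failing on two PRs with different authors and disjoint changed paths is treated as flaky and snoozed:

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Sketch of the snooze heuristic: a test that fails on two "unrelated" PRs
// (different authors, no overlapping changed paths) is unlikely to be broken
// by either change, so mark it flaky and snooze it centrally.
public class FlakySnooze {
    public static class PrFailure {
        final String author;
        final Set<String> changedPaths;
        final Set<String> failedTests;
        public PrFailure(String author, Set<String> changedPaths, Set<String> failedTests) {
            this.author = author;
            this.changedPaths = changedPaths;
            this.failedTests = failedTests;
        }
    }

    static boolean unrelated(PrFailure a, PrFailure b) {
        if (a.author.equals(b.author)) return false;
        for (String p : a.changedPaths) if (b.changedPaths.contains(p)) return false;
        return true;
    }

    public static Set<String> testsToSnooze(List<PrFailure> failures) {
        Set<String> snoozed = new TreeSet<>();
        for (int i = 0; i < failures.size(); i++) {
            for (int j = i + 1; j < failures.size(); j++) {
                PrFailure a = failures.get(i), b = failures.get(j);
                if (!unrelated(a, b)) continue;
                // Same test failing on two unrelated PRs: snooze it everywhere.
                for (String t : a.failedTests) if (b.failedTests.contains(t)) snoozed.add(t);
            }
        }
        return snoozed;
    }
}
```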
If we could identify an end-to-end test failing on two PRs that really shouldn't share anything, then we would mark this test as snoozed in our central system, so that neither of those PRs, nor any other PR, would be blocked by this test. Once we had that in place and were able to snooze those flaky tests, we went one step further: going back to the PRs that had failed because of those tests and flipping their status to green, so that users could proceed with those PRs. That took care of flaky tests, or at least part of it. Next, identifying root causes of failures. We noticed that we had a lot of data in Splunk and Graphite, but it wasn't easy to figure things out. So we created a new application that we call Dr. Watson to identify what was going on in Splunk and Graphite and automate a lot of the queries.
During the lifetime of the cluster, what went on with the cluster? What errors happened? Did it run out of memory? How is the cluster doing overall? We used that as well to detect third-party API failures, the ones we hadn't stubbed, and to detect how the dependent systems were doing, so that we could bring all of that back and expose it to the user, giving them a good idea of how the infrastructure around the PR was behaving. Then, reducing PR check runtime. As Ankit mentioned, the Gradle cache and moving to Gradle had already helped a lot, but we wanted to go even further. What we did is implement selective testing.
Because a lot of our tests were functional-level, end-to-end tests, we decided: okay, Siri has many domains of knowledge, and really, if you impact only one domain, there isn't any reason to run the end-to-end tests for all domains. We tried our best to identify, for each PR, which functional area of the product it changes, and where we should be checking that nothing is broken. We also implemented parallel testing for unit and end-to-end tests so that we could run more tests in a given timeframe. Today, we'd probably use Test Distribution, but back then, that wasn't available yet.
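Selective testing of this kind can be sketched as a mapping from changed paths to domains. This is a hypothetical illustration (the directory prefixes and domain names are invented, and a production version would need a real module graph), with the conservative default of running everything when a change falls outside any known domain:

```java
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch of selective testing: map each changed file to a functional domain
// by path prefix, then run end-to-end suites only for the domains a PR
// actually touches.
public class SelectiveTesting {
    static final Map<String, String> DOMAIN_BY_PREFIX = Map.of(
            "weather/", "weather",
            "music/", "music",
            "timers/", "timers");

    public static Set<String> domainsToTest(Collection<String> changedFiles) {
        Set<String> domains = new TreeSet<>();
        for (String file : changedFiles) {
            String matched = null;
            for (Map.Entry<String, String> e : DOMAIN_BY_PREFIX.entrySet()) {
                if (file.startsWith(e.getKey())) { matched = e.getValue(); break; }
            }
            // A file outside any known domain is a shared change:
            // be safe and run every domain's suite.
            if (matched == null) return new TreeSet<>(DOMAIN_BY_PREFIX.values());
            domains.add(matched);
        }
        return domains;
    }
}
```

The fallback branch is the important design choice: selective testing only saves time when it is safe to skip, so anything ambiguous has to fall back to the full suite.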
We had a big monorepo, but we didn't necessarily want a monobuild, so we started breaking down our builds. Although a big monorepo made our life really easy, because we had consistency at all times, we started implementing components for slow-moving modules. We detected that out of the 400 modules, some 20 modules hardly ever change; when they do change, it's once a year, and it's a big event anyway. So let's pre-build them, and use them as a jar dependency rather than a project dependency. That helped us speed up the process. That included the JNI modules, which were a constant source of headaches, because they rely so heavily on your local configuration: which GCC compiler you're running, and all those kinds of variables that we were glad not to have to track.
The rest of the repo would have a binary dependency on those components, so you wouldn't have to rebuild them locally; you would just use a version. It made it a little more complicated to change a component, but it's a tradeoff we were willing to take. Now, to change a component, you have to make two PRs: the first one to change the component and create a new version, and the second one to consume it. Then, to wrap up this user experience focus, we wanted to make sure that we gave feedback to developers. Originally, we developed another system that had its own view of the information, but what we found is that developers are used to using GitHub and looking at all the information there. So it made more sense to push all the information into GitHub.
That way, developers had access to the information in the context of GitHub, which they were using already. We're using a lot of GitHub statuses. As soon as the GitHub Checks API became available, we started migrating from statuses to checks, just because it gave us more real estate to get richer information to the developer about what was going on. Instead of a one-liner, of which you usually only see the first 20 characters, we can now have a whole page explaining what went on, what failed, and why the CI decided that the PR wasn't good. We want to give rich and timely information to developers, but we want to keep it relevant. We got feedback from developers that some of the comments we were putting on PRs were too noisy, especially all the comments about things that succeeded; nobody was interested in what succeeded.
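The status-to-checks migration mentioned above hinges on the extra room a check run gives you. As a rough sketch, here is what a check-run payload might look like when built by hand in Java; the endpoint (`POST /repos/{owner}/{repo}/check-runs`) and field names follow the public GitHub Checks API, but the escaping is minimal and the values are invented, so treat this as illustrative rather than a client implementation:

```java
// Sketch: a check run carries a title plus a full summary, versus a status's
// single truncated description line. In practice you would use a JSON library
// and an HTTP client; this only shows the payload shape.
public class CheckRunPayload {
    // Minimal JSON string escaping, illustrative only.
    static String esc(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    public static String payload(String name, String headSha, String conclusion,
                                 String title, String summary) {
        return "{"
                + "\"name\":\"" + esc(name) + "\","
                + "\"head_sha\":\"" + esc(headSha) + "\","
                + "\"status\":\"completed\","
                + "\"conclusion\":\"" + esc(conclusion) + "\","
                + "\"output\":{\"title\":\"" + esc(title) + "\",\"summary\":\"" + esc(summary) + "\"}"
                + "}";
    }
}
```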
People are only interested in what they need to action, so we scaled down those comments and made sure that we only kept the ones where action was needed. Next, KPIs to track. The next step after all that was: how do we measure success? We had to find some measures, because it was a real effort and we wanted to make sure we were moving in the right direction. As you all know, as soon as a measure becomes a target, it stops being a good measure. But we had to do something, so we decided that these three measures were the ones we were most interested in. First, last commit to merge time. Usually a PR has a life cycle of its own: developers try changes.
There is feedback, they make changes again. But what we're interested in is how much time elapsed between when the last commit was pushed to the PR, which means the PR was in its final state, and when it actually merged into master. That was the first measure we decided to track. Second, we decided to track the number of triggers required after the last commit. We had a push system where the developer would ask the CI system to verify that the PR was good. In an ideal world, of course, you would only need one trigger, and the CI system would tell you everything is good with your PR and you're able to merge it. But what we saw is that because of flakiness of infrastructure and other problems, developers had to re-trigger.
So we kept a close eye on this measure, just to see if there was some infrastructure issue, and whether things on the infrastructure side were improving or getting worse. Finally, the third way we measured whether things were working was our empty-PR success rate. We decided that every hour, we'd throw an empty PR, a PR with no code change, at the system; that PR should pass every time if the CI system is stable and working. Well, unfortunately, that's not really what we observed. We started at around a 40-50% success rate, but as soon as we had that visible, it became easier to identify what was blocking the empty PR and actively work on resolving it. We do have to watch out for the pitfall of only looking at these measures, because of course, they're only specific measures.
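The canary metric itself is simple arithmetic, which is part of its appeal: since the hourly PR carries no code change, the fraction of green runs isolates infrastructure stability from code quality. A minimal sketch (class and method names are made up):

```java
import java.util.List;

// Sketch of the empty-PR canary metric: schedule a PR with no code change
// every hour; with healthy CI it should always go green, so the fraction of
// green runs measures infrastructure stability, not code quality.
public class CanaryMetric {
    public static double successRate(List<Boolean> hourlyResults) {
        if (hourlyResults.isEmpty()) return 0.0;
        long green = hourlyResults.stream().filter(ok -> ok).count();
        return (double) green / hourlyResults.size();
    }
}
```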
But increasing the empty-PR success rate led to more PRs from developers going through. So that was our user experience focus. The last piece, which from my point of view was fairly interesting and valuable for our developers, was to work specifically on code owners for the monorepo. There are a couple of people from my team in the room who worked on that. We decided that we wanted a specific code owners feature for our monorepo because we wanted to alert the right people based on a code change. Remember, this was before GitHub made its feature available, so we were bound to develop our own tool. We wanted to give clear information to reviewers about what they needed to review. Some PRs get 40 or 50 files changed.
And as an owner, you're only asked to validate a couple of those files. So it made sense for us to break it down, give this information to the reviewers, and make sure the expectations for each reviewer were clear. Finally, we wanted to validate the owners information. All too often, we see owner information getting stale or malformed, so we wanted a way to validate all of that so it stayed as healthy as possible. So what does it look like? We implemented it as a check, obviously, and we made a couple of changes from what you would expect in the regular GitHub code owners. First off, we implemented it as a hierarchy of files.
We found that on a big monorepo, it's easier to have a hierarchy of owners files rather than one top-level owners file; otherwise, everybody tends to modify the same file, which is not really good in git. We also wanted the ability to express fairly richly who needed to review a PR: whether it was an AND, where we wanted team A and team B to review, or an OR, where we wanted team A or team B to review. So we built all that. We built a timeout as well, in the sense that owners is a way to tell the people who should be responsible for a code change that now is the time to review it. But if they don't go ahead and actually review the code change, we didn't want the author to be stuck for too long.
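The hierarchy walk and the and/or rules just described can be sketched as follows. This is a hypothetical simplification (the rule syntax with `&` and `|`, the directory-keyed map, and all names are invented; the real system was regex-based, as mentioned later), but it shows the nearest-owners-file lookup and how a rule is checked against approvals:

```java
import java.util.Map;
import java.util.Set;

// Sketch of hierarchical owners resolution: walk up from the changed file
// toward the repo root and use the nearest OWNERS entry. A rule like
// "teamA&teamB" means both must approve; "teamA|teamB" means either may.
public class OwnersResolver {
    // Directory (with trailing slash; "" is the repo root) mapped to a rule.
    final Map<String, String> rulesByDir;

    public OwnersResolver(Map<String, String> rulesByDir) {
        this.rulesByDir = rulesByDir;
    }

    // Find the rule in the nearest ancestor directory that declares one.
    public String ruleFor(String file) {
        String dir = file;
        while (true) {
            int slash = dir.lastIndexOf('/');
            dir = (slash < 0) ? "" : dir.substring(0, slash);
            String key = dir.isEmpty() ? "" : dir + "/";
            if (rulesByDir.containsKey(key)) return rulesByDir.get(key);
            if (dir.isEmpty()) return null; // no owner anywhere: a validation error
        }
    }

    // Has the rule been satisfied by the teams that approved so far?
    public boolean satisfied(String rule, Set<String> approvedTeams) {
        if (rule.contains("&")) {
            for (String t : rule.split("&")) if (!approvedTeams.contains(t)) return false;
            return true;
        }
        for (String t : rule.split("\\|")) if (approvedTeams.contains(t)) return true;
        return false;
    }
}
```

Keeping ownership in per-directory files, as the resolver assumes, is what avoids the single hot top-level file that everyone edits.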
We decided two business days was the right limit for us, so we gave the owner two business days to react. We developed some metrics in the back to find the good owners and the bad owners, the ones that needed to be taken out of the ownership of certain files. We also wanted more detailed information about which files to review. As you can see from this UI, it's fairly clear which files we expect which team to review: we have a team here, Operation Siri Apps Tool, that needs to review these changed files. And finally, we have the code owners validator check, right above our code owners review check, which is the linting of the owners files that I mentioned.
We wanted to make sure that those files stayed up to date, because we had too many issues with group permissions, like adding as an owner a group that doesn't have access to the repository, which GitHub doesn't like, or people messing up their regexes, because everything is regex-based to express which files you want to review. So this is what we did for code owners. To this day, we're still using it. I think our users like it; it gives them the information they need about what to review, in a timely way. And that was all of it. So I will open up the floor to questions and answers. Thank you.