With over 4000 developers, Uber thinks a lot about developer productivity. When that many engineers build and deploy from a single codebase, some of the unpleasant aspects of software development at scale are magnified. Learn how Uber keeps things moving with a focus on faster builds, as well as modern developer tools, techniques, and practices that reduce pain points like migrations, refactoring, and new technology learning curves.
Watch how Uber—with over 100,000 deployments per month—utilizes Developer Productivity Engineering (DPE) concepts to accelerate time-consuming steps like the wait period before merging code changes. By incorporating tools and practices to support build caching, intelligent merge automation, and leveraging under-utilized CI capacity, see how Uber made serious reductions in both merge queue time and overall merge conflicts with a DPE-focused toolchain.
Gautam Korlam is a Senior Staff Software Engineer on Uber’s Java Developer Experience Team. He is passionate about tools to improve developer experience in large codebases. He has spoken about a wide variety of topics around monorepos, build systems, CI & CD at scale etc.
Gradle Enterprise customers use the Gradle Enterprise Build Scan™, Performance, and Trends dashboards to identify and monitor performance bottlenecks in the build/test process, associated with build cache misses, code generation, and annotation processors. You can learn more about these features by running your own free Build Scan for Maven and Gradle Build Tool, as well as by attending our free instructor-led Build Cache Deep-Dive training.
To keep builds fast with Gradle Enterprise, check out these resources:
- Watch our Build Scan Getting Started playlist to learn how to better optimize your builds and tests and to facilitate troubleshooting.
- Watch this DPE Lowdown with Gautam and his colleague Ty Smith from Uber to learn more about DPE best practices at scale.
- Sign up for our free training class, Build Cache Deep Dive, to learn more about how you can monitor the impact of code generation on build performance.
Gautam Korlam: Good morning, everyone. Thank you for coming to the talk. I haven't given a talk in almost three years since RideCo NSF, so it's very exciting to see people face to face and just feel energy. Cool. So, my name's Gautam. If you find it hard to pronounce you can think of it like Gotham City from Batman. That's the easiest way you can think about my name. All right, so, I work at Uber, I've been there for a really long time. Worked on mobile originally, then recently I've been working a lot on the Java developer experience. But I also work with the other teams to kind of generalize the best practices we have across the company. And today I'm gonna talk about fast, not furious developer experience.
So, what does the crux of this mean? I'm sure you've been in the development flow, moving really quick, and you have interrupts that come in from time to time. Like, oh, you get a notification saying, "Hey, you got to go do this migration." Or, "Hey, your build failed." Or, "Go take a look at this flaky test." That takes developers out of the flow when they're trying to get things done. Raise your hand if you have ever experienced this.
Okay. A fair bit. So, my talk is mostly focused on, how do we make sure the developers are in their flow? And how do we make sure that we are not taking 'em out of their core development loops, so they are productive and they feel empowered to make change. And I'm gonna give some examples as we walk through it. Quickly, the agenda here will be just talking about Uber's scale a little bit to help you understand the problems we're dealing with at this scale. And as I mentioned, developers love to work on fast modern tools and frameworks. So, we'll cover a little bit of that as well. And a bit about how we prevent people from going out of their loop by using safety nets and standardization. And a quick overview of the results in summary.
All right. So, let's talk a little bit about Uber's scale. We hear this scale word thrown around a lot. So, just what does that really mean? Working at Uber, like if you see the dependency graph here of all the different microservices that are calling each other, we have thousands of them. We have hundreds of thousands of deployments happening every month. Millions of config changes. A few dozen mobile apps, 4000 plus engineers. And for this many engineers, the developer productivity engineering team is actually very, very small. I think it's two orders of magnitude smaller than the number of engineers we have in the company.
We support officially seven programming languages, and we have tens of thousands of commits happening per month. And lots of lines of code etcetera, etcetera. What's interesting is we do use a monorepo for development for the most part. So, we see there are five platform monorepos, so platform essentially means the things like backend, frontend, mobile and we'll kind of cover a little bit about that as well.
And that also means that a lot of times when we have to do migrations, it's easier to do it in a centralized space so we don't interrupt people and take them out of their flow. All right, so in terms of the strategy that we follow, we have fast, modern tools and powerful frameworks to make sure developers are doing work in a good velocity. We have safety nets to maintain quality, and we have standardization to take care of efficiency.
All right, so let's talk a little bit about the fast modern tools and frameworks. On the tool side, we invest quite a bit in making sure our developers are getting best-in-class tools at any given point of time. I'm not gonna go too much into details on some of the more intricate ones. For example, we used to do a lot of our development on laptops. We moved on mostly to develop on cloud IDEs. We have this product internally called Devpods. My colleague Ty has a talk later today, and he's gonna talk a little bit more about what that looks like. And we also made sure that a lot of our developers are on M1 laptops to be able to compile fast. We have moved from Phabricator, which is a legacy code review tool, and we are moving towards GitHub, which is a more modern UI.
We have migrated from Jenkins over to Buildkite, and we support the latest version of IntelliJ at any given point of time. We also are mostly on Bazel, but there are some parts of the code base, like Java and Android, that are still on the Buck build system. And I'm gonna talk a little bit about how we are migrating towards Bazel as well. And since I work on the Java team, we are very focused on making sure we are using the latest Java versions for our developers. So, we also had a journey from Java 8 over to 11, and then again to 17. I'm gonna skip this one.
So, let's talk a little bit of the need for speed. I'm sure everyone loves to write code quickly, and then just get their job done. And one of the facets of this I'm gonna focus on today is the rate of commits. So, quick survey or like a question of sorts. What do you think slows down the rate of commits? How many folks think it's the CI build that slows them down today? How many of you think it's the code review process? Okay.
Cool. So, that's a really interesting aspect because we actually looked at this data internally, and we saw that code review is the number one contributor to the amount of time a commit takes to go into the main branch. Because usually it requires a lot of iterations, and there's a human component involved, which means it's much harder to control. Every time you have to do a code review, you have to context switch from what you're doing to go review someone else's code, and then go back to what you're doing. So, that definitely adds a lot of latency. And it's also a little bit harder to control or do much about without investing into better cultural practices about code review.
The second largest factor is actually pretty interesting. So I don't know if people realize this, but a lot of times when your code is fully built and also approved, it can just wait there before it merges. Because maybe you haven't seen that your code's been approved, or maybe you put up a change at the end of the day and then you come back tomorrow and say, "Oh, it's ready to go, so I'm gonna go merge it now." Right? We see about like 10 to 20 hours from the time that the change is ready to go, before it actually goes and merges. So we at Uber call the process of merging code, landing, just so that we have some nomenclature in place. Landing is just essentially similar to like git merge. And we are gonna talk about why we use that term as well.
So what happens when changes wait before they're able to land in our case? They are more likely to conflict. So as I mentioned, we have hundreds of thousands of commits happening every month. So on a daily basis there's several hundred or thousand commits happening. So if you just went back home for the day and you came in and you tried to merge, a few things could happen. Like it could conflict with someone else's changes 'cause they landed ahead of you, or it could be slow when you merge in the queue. A lot of monorepos, for example, have a queue ahead of them to make sure that when changes come in they're still able to rebase and build on top of the latest main before they're able to merge. And this ensures that the main branch is always green.
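To make the merge queue idea concrete, here is a minimal sketch in Python. It is not Uber's implementation. The `Change` type, the `run_merge_queue` function, and the file-level conflict rule are all assumptions made for illustration. The only point is that each change is checked against the latest main and rebuilt before it merges, so main stays green, and that a change built against an old revision is more likely to conflict.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Change:
    change_id: str
    base_revision: int        # main revision the author last built against
    touched_files: frozenset  # files modified by this change


def run_merge_queue(
    head: int,
    last_modified: Dict[str, int],
    queue: List[Change],
    build_is_green: Callable[[Change], bool],
) -> Dict[str, str]:
    """Process queued changes in order and return an outcome per change.

    Hypothetical conflict rule: a change conflicts if any file it touches
    was modified on main after the revision the change was based on.
    """
    last_modified = dict(last_modified)  # do not mutate the caller's dict
    outcomes: Dict[str, str] = {}
    for change in queue:
        stale = any(
            last_modified.get(path, 0) > change.base_revision
            for path in change.touched_files
        )
        if stale:
            outcomes[change.change_id] = "conflict"
            continue
        # The change rebases cleanly onto head; rebuild before merging.
        if not build_is_green(change):
            outcomes[change.change_id] = "build_failed"
            continue
        head += 1
        for path in change.touched_files:
            last_modified[path] = head
        outcomes[change.change_id] = "merged"
    return outcomes


if __name__ == "__main__":
    queue = [
        Change("c1", 5, frozenset({"a.java"})),
        Change("c2", 5, frozenset({"a.java"})),  # same file as c1
    ]
    print(run_merge_queue(5, {"a.java": 5}, queue, lambda c: True))
```

In this toy model the second change conflicts only because the first one landed ahead of it, which is the situation described above for a change that waited overnight.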
So when they go into the queue, the caches could be out of date 'cause you built it last night and then in the morning maybe especially at the scale of Uber, for example, we have teams working in India, in the EU time zones. So there's never really a lull in the development activity. So you can imagine like the cache being out of date, it takes longer to build in the queue, slows everything down for developers. And also just context switching. So if you're writing code and you have to go and merge your change, you have to maybe check out that branch again and then check to see if everything rebases. It's not ideal. So we've worked on this thing called Autoland, which is essentially a way to tag your changes as safe to land in this case or merge when they're ready to go.
So, what we do is on the client, or on your laptops, essentially, when you make a change, we ask, "Would you like to Autoland this change?" And in those cases that you want to Autoland, maybe you're making a small refactor that doesn't really need to go through a lot of iterations. A lot of changes are like this, by the way. We found that a lot of times developers wanna make small changes because it's easier to revert and also to debug. So for these changes you could tag them as Autoland. So when they go into code review and people basically take some time to review the change and then approve it, we are then able to intercept that. We listen for changes on the code review system and check if the build is green, if the revision is approved, and if it's also tagged as Autoland.
In those cases, we actually do some additional stuff that I'll get into, but essentially we'll try to land it to the queue. But before we do that, since I mentioned like code review can take a really long amount of time. So imagine you put up a change maybe in the morning and it built fine. All the tests passed. Someone took a look at it in the afternoon and said, "Okay, everything looks good. It's okay to go merge." We don't want to immediately merge the change because if you did, again, the same problem of the caches being out of date, just because of scale of commits that we deal with, can slow things down. So what we do is in the background, when we are ready to land, we will rebase the change onto the latest main branch. We will rerun all the tests and make sure the caches are warm and then we push it to the queue.
Since this is a fire and forget operation for the developer, we don't really need to be very precise when we actually merge the change. We can have a little bit of flexibility in doing this additional work to make sure when the change is in the queue, it does not slow everyone else down and it can quickly merge. And in the cases that it is, for example, failing to rebase, we will send a notification back to the users via Slack saying, "Hey, you tried to Autoland this change but it's not able to because it's no longer able to rebase with main. So please go and fix that." And this is interesting also because even for changes that are not tagged as Autoland, since we can do the same process, we can actually tell people when they're ready to go that, "Hey, your diff is actually ready to land. If you want to, you can do it right now."
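The Autoland flow described above can be sketched as a small decision function. This is a hedged illustration in Python, not Uber's code. The `Diff` fields, the `Action` names, and the `rebase` and `rerun_tests` callbacks are hypothetical. The talk only mentions a notification when the rebase fails, so the separate outcome for tests failing after the rebase is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Action(Enum):
    WAIT = "wait"                                  # not built or not approved yet
    NOTIFY_REBASE_FAILED = "notify_rebase_failed"  # message the author to fix it
    NOTIFY_TESTS_FAILED = "notify_tests_failed"    # assumed outcome, not in the talk
    NOTIFY_READY = "notify_ready"                  # not Autoland: offer a land button
    ENQUEUE = "enqueue"                            # Autoland: push to the merge queue


@dataclass(frozen=True)
class Diff:
    diff_id: str
    build_green: bool
    approved: bool
    autoland: bool


def next_action(
    diff: Diff,
    rebase: Callable[[Diff], bool],
    rerun_tests: Callable[[Diff], bool],
) -> Action:
    """Decide what to do with a diff after a code review event."""
    if not (diff.build_green and diff.approved):
        return Action.WAIT
    # Rebase onto the latest main in the background so caches are warm
    # by the time the change reaches the queue.
    if not rebase(diff):
        return Action.NOTIFY_REBASE_FAILED
    if not rerun_tests(diff):
        return Action.NOTIFY_TESTS_FAILED
    return Action.ENQUEUE if diff.autoland else Action.NOTIFY_READY


if __name__ == "__main__":
    diff = Diff("D1", build_green=True, approved=True, autoland=True)
    print(next_action(diff, rebase=lambda d: True, rerun_tests=lambda d: True))
```

Running the same rebase and retest step for diffs that are not tagged is what makes the "your diff is ready to land" notification possible. The only difference is whether the final step is automatic.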
So instead of them going back and checking if it's ready, we still get some benefit of people just clicking a button on Slack and then deciding to land that change. I can skip this one. All right, so when we rolled this out, what was the impact that we saw? So we, for example, saw the merge queue time improve significantly. You can see, for example, the median and P95 times over here, they fell pretty well. I mean, this is for making sure that the SLOs for the merge queue are within manageable numbers, but they're doing pretty well right now given all these improvements that we have made. We also saw that the merge conflicts went down quite a bit. So before, when you would put something into the queue, there was a 3% chance it would conflict. Now it's less than 1%.
And that's really good because that avoids people going back and doing additional iterations and context switching. So that avoids the frustration of having to deal with this, without having to worry about all those operations. And we also saw that this is actually very loved by our developers. So about 60% of our changes are tagged as Autoland. So a lot of people see the value in not having to go back and manage the changes manually. There are of course some changes that you will want to babysit a little bit more carefully, like if you're making a really large platform change, for example. And we'll talk about that next. Generally the impact has been really, really good.
So as I mentioned, not every change can Autoland, and maybe it should not either, because there will be cases where you might want multiple teams to take a look at your change or you want maybe more reviews, or you may want to do like more thorough testing with the testing team or something like that. This could still cause problems for other folks. So if you have changes that do not Autoland, or people who do not care about these Slack notifications, they will still slow things down and they will cause our metrics to go up. So how do we deal with this? Now, this was a very interesting problem for us to solve. So we follow this thing called a diff refresh cycle where we looked at our data and saw, "Okay, how many of these changes do we actually have to deal with?"
It was actually very interesting. Most of the changes that we see that don't land, are within like, I believe a week old at most on the P95 case. And there are changes that take much longer to land, but those are the minority. So for the most part, as long as we're looking for changes that are like a week old, what we could do is, periodically, we could check to see what are the open code reviews, and we could do the same thing we were doing with Autoland where we background rebase and rebuild everything to keep the caches fresh. And we were able to do this because we had spare CI capacity that would fit the amount of diffs that we were trying to actually keep fresh.
So, you do have to calculate that to make sure you're not overwhelming your CI capacity. But once we did this, we would basically see that a lot of changes would be able to stay fresh, and whenever they do decide to land, it's no longer going to cause a spike in the build time metrics. As an interesting example, we recently had some issues where we had to pause our queues for a couple of days. And we had like 150 something changes kind of just pile up during that time. And when we un-paused the queue, we actually didn't see any alerts, so we were surprised, like, "Why am I not getting paged that everything's just waiting such a long time in the queue?" It used to be the case before, but since we are constantly refreshing these things in the background, when they went into the queue, we were able to clear the entire queue within an hour and we were kind of just within our SLO of 20 minutes for the merge time.
So it's just pretty amazing to see our efforts kind of pay off here. So it is also very simple for you to implement if you haven't done something like this. If you use a build cache for example, and if you have spare CI capacity, you could just do this act of rebasing and rebuilding in the background to use the spare CI capacity. And I think when we looked at CI utilization rates, apart from the active development time, we saw the utilization actually is very low. So you can do this when developers are not actually working to make sure your changes are fresh. So, it's quite easy to replicate.
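A minimal sketch of the diff refresh cycle, assuming a periodic job that picks which open diffs to rebase and rebuild. The `OpenDiff` type, the `pick_diffs_to_refresh` function, and the "stalest first" ordering are assumptions for illustration. The one-week window comes from the P95 figure mentioned in the talk, and the slot budget stands in for whatever spare CI capacity you have calculated.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class OpenDiff:
    diff_id: str
    created_at: datetime
    last_refreshed_at: datetime  # last background rebase and rebuild


def pick_diffs_to_refresh(
    open_diffs: List[OpenDiff],
    now: datetime,
    spare_ci_slots: int,
    max_age: timedelta = timedelta(days=7),
) -> List[str]:
    """Return ids of diffs to rebase and rebuild in this cycle.

    Only diffs younger than max_age are considered, and the number picked
    never exceeds the spare CI slots, so regular builds are not starved.
    """
    recent = [d for d in open_diffs if now - d.created_at <= max_age]
    recent.sort(key=lambda d: d.last_refreshed_at)  # stalest caches first
    return [d.diff_id for d in recent[:max(spare_ci_slots, 0)]]


if __name__ == "__main__":
    now = datetime(2023, 1, 10)
    diffs = [
        OpenDiff("d1", datetime(2023, 1, 9), datetime(2023, 1, 9, 12)),
        OpenDiff("d2", datetime(2023, 1, 1), datetime(2023, 1, 2)),  # too old
        OpenDiff("d3", datetime(2023, 1, 5), datetime(2023, 1, 6)),
    ]
    print(pick_diffs_to_refresh(diffs, now, spare_ci_slots=2))
```

Scheduling this outside active development hours, when CI utilization is low, is what keeps the extra builds from competing with developers.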
Okay, so we talked a lot about fast tools. Let's talk a little bit about the modernness. So we like to keep things fresh, as I think I mentioned before, we have a lot of platforms that we support and each of these platforms we try to stay on the latest versions of tools. So on mobile, on the backend, also on the front end. And we're also always planning for what's next. How do we support the newest version of the SDKs, the JDKs, things like that. How do we support the latest IDEs? And it is very interesting to see this because a lot of times this does come with the cost of migration. So I think it's very important to quantify this because if you have a centralized team, you can kind of blunt the effect a little bit, but not in all cases it'll be all centralized. That's just a fairytale.
You will definitely want teams to maybe test something or deploy something to make sure the changes are still valid on the latest versions of the SDKs. So how do we deal with the cost of migration? So one example that I want to talk about is we, as a JVM team, we had this discussion a while back about, "Hey, how do we move from 8 over to 11?" And we didn't really have a process before this at Uber. When I started working on JVM, Java 8 was mostly being used across the company. There were a few folks using Java 6 for some reason. I don't even know why. And there were some folks using 7. So it was a mix of everything and we wanted to make sure that everyone is using at least 11 somehow. So we had to formulate a plan.
So roughly, if you look at the whole timeline, it took almost two years over the whole period of time, but we were kind of able to break it down into a small number of chunks to reduce the amount of decentralized work as much as possible. So we looked at first making sure everyone was on OpenJDK 8 as a baseline. And that would make sure that at least if you go from JDK 8 to JDK 11, you're only dealing with one JDK if you had to deal with breakages. So you're not dealing with multiple different versions. And after that, we made sure we could compile with the new JDK, which is always the least risky one 'cause you're not deploying with it yet. And then we would be able to run tests, also much less risky because it's all localized to the development environment and not shipped to customers.
And then the production was interesting. So we had a flag in the deployment setting where folks could opt in and say, "Hey, I would like to run my service on 11." And we also had to do a lot of fixes for this. We had to make sure the VM flags are compatible, and we are able to quickly switch between 11 and 8 if things broke. So we gave this option to our developers and said, "Hey, try out 11, let us know if something breaks, and that'll be an opt-in." And after I believe around 20% of the company moved over, we said, "Okay, we are gonna move everyone over to 11 by default if you're a lower tier service, and if you have to opt out, you can do so by just flipping the flag back."
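The opt-in, then default-on with opt-out rollout can be sketched as one resolution function. This is an illustration in Python under stated assumptions: the function name, the numeric tier scheme (a larger number means a lower, less critical tier), and the `default_target_min_tier` parameter are all hypothetical and do not describe Uber's deployment settings.

```python
from typing import Optional

LEGACY_JAVA = 8
TARGET_JAVA = 11


def resolve_java_runtime(
    explicit_version: Optional[int],
    service_tier: int,
    default_target_min_tier: Optional[int],
) -> int:
    """Pick the Java runtime for a service.

    explicit_version: the service's own flag (8 or 11), or None if unset.
    service_tier: assumed scheme where a larger number is a lower tier.
    default_target_min_tier: None during the opt-in phase; later, services
        at this tier or lower get the target version by default.
    """
    if explicit_version is not None:
        return explicit_version  # opt-in or opt-out always wins
    if default_target_min_tier is not None and service_tier >= default_target_min_tier:
        return TARGET_JAVA       # default-on phase for lower tier services
    return LEGACY_JAVA


if __name__ == "__main__":
    # Opt-in phase: nothing changes unless the service sets the flag.
    print(resolve_java_runtime(None, service_tier=3, default_target_min_tier=None))
    # Default-on phase: the same service moves over without any action.
    print(resolve_java_runtime(None, service_tier=3, default_target_min_tier=3))
```

Keeping the explicit flag as the highest priority input is what makes rollback a single flag flip, whichever phase the rollout is in.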
And that way what we were able to do is make the transition really gentle for our developers. And after the runtime was proven out, we started working on enabling the language features. And also we are still working on enabling the Java 11 APIs. And that's a bit tricky because there are some legacy parts of the code base that cannot move to 11 for some reasons. And for them, if you use the 11 APIs, they will not be able to work with those. So we're still figuring out how to work with those, but generally, migration can take a really long amount of time, and during this time we want to focus on making sure we are not breaking the developers out of the loop until we are ready to say that, "Hey, this thing is ready to go." I think one of our engineers at Uber had this interesting quote, that when we do migrations, usually the new tool is not ready, but we're already killing off the old tool. And that is not good for any migration because people get frustrated that, "Hey, you're deprecating this old thing, but the new tool is not something I can use because it doesn't support what I really want." So we try to be really careful in migrations specifically.
Let's look at how we do migrations at scale. So I talked about the process, but how do we do things like refactorings at such a large scale? So you definitely have to invest in a lot of good tooling. I believe there were other refactoring talks also happening today. So it might be interesting if you're more interested in deeper areas of this topic. At scale, for example, we have tools internally that can do refactoring. There is a tool that we use internally called Polyglot Piranha. It's also open source. There's a link down there if you want to explore it. The interesting thing about this tool is, it's declarative. You can say, "I want to take this code and transform it," and you don't need to tell it how to do the transformation itself. You don't need to do any AST, like syntax tree, operations.
You can actually tell it like this is the operation that I want to do and it'll go figure out how to do it. And you can guide it in some ways. And one of the interesting things it does is, it supports many different languages. Since I mentioned we have multiple programming languages across multiple monorepos, we wanted a tool that could work across the code bases. Just creating these migration changes is not enough. Sometimes, people still have to review them and make sure they're okay to go in. So we have another tool called Kameleon, which is not open source, which keeps track of these migration diffs and then creates Jira tickets, so teams can burn these things down. And together, we call the system GodWit, which is kind of still naming in progress. It's not externalized yet.
So the overview is basically you create your specifications that synthesize the rules, the tool runs on the monorepo build targets, and eventually it will create a code review, and then folks can comment on the review and then iterate on it. So you can do manual changes once the automated tool works, and then you have a dashboard to track the migration progress. 'Cause even at monorepo scale, sometimes you cannot do large-scale migrations in one go. You have to do things piecemeal. Some examples of this are things like cleaning up stale feature flags, which is the original purpose of Piranha, but now we're repurposing it to do other things as well. So like we do a stale string cleanup in iOS, or we did the Kotlin Symbol Processing migration, and recently as part of the Spring Boot upgrade on Java, we were able to migrate from PowerMockito to Mockito.
A lot of this stuff was driven through the same tool and we're continuing to invest more into this. So feel free to check it out and give us feedback.
So let's talk a bit about safety nets. So we're doing a lot of these changes. So how do we make sure our code quality is not compromised when we're trying to move so fast with so many changes? One aspect I think a lot of folks struggle with is dependency management. In the Java world, it's particularly tricky because a lot of the times we deal with some interesting problems. For example, in the Java ecosystem, every third-party dependency comes to us as bytecode. So imagine, let's say, you're trying to upgrade something like Guava and it removes a method, and there could be a bunch of other open source libraries that are using Guava that may or may not use that method.
We won't know until maybe your unit test fires and says, "Hey, something's... " Class not found exception or method not found exception. Or maybe you may not have good coverage because unit tests are typically testing just your code. They're not designed to test third-party code. So they might just fail at runtime because when you run your application and some endpoint gets hit, suddenly the application just stops working. Very hard to debug unless you actually have really good coverage and that can be a challenge. So our developers really struggled with this and this made central upgrades really difficult. So we talked a lot about how we want to do a lot of these upgrades internally. For example, if there's a security breach issue and we want to upgrade everyone, we need our engineers to feel comfortable that, "Hey, this change is safe to the best of our knowledge."
And how do we help our engineers feel safe about these changes? So we built this tool called the Runtime Compatibility Checker. I don't think I've seen this in use anywhere else so far, but if there is, please come chat with me afterwards. We're very interested in this area. So what we do is very interesting. So a naive way to detect that a dependency is broken is you could just do an ABI check to see if the method signatures have changed or methods or classes have been removed. But the reality is your code may or may not actually call these methods. So you might be just creating a bunch of noise for the developer, who may not really get a lot of valuable information from that report. So what we did instead is, for every dependency change, since our build system can tell us which services are impacted by that change, we look at their entry points.
So these are typically things like the application startup or, let's say, the RPC endpoints that you use or HTTP endpoints that get hit. They usually have a call graph. So once you hit that endpoint, you're able to follow that call graph through static analysis and we record the whole reachable call graph. And then we look at the ABI change report and see whether any APIs that were removed are part of the call graph. If they are, it could potentially mean that if that call path were actually exercised at runtime, there might be a runtime issue. So we were able to statically analyze runtime issues, which is pretty interesting. It is a bit expensive to run because it runs on binary code, so we don't run it on local builds but we do run it on CI. So it gives you an extra layer of protection. So in this case, for example, we found there might be a missing call path for a particular class from a third-party library.
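The core idea, intersecting the removed APIs with what is reachable from a service's entry points, can be sketched in a few lines of Python. This is an illustration only. The real checker works on JVM bytecode, while this sketch assumes the call graph and the ABI diff have already been extracted into plain dictionaries and sets. All method names are made up.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def reachable_methods(
    call_graph: Dict[str, List[str]], entry_points: Iterable[str]
) -> Set[str]:
    """Breadth-first walk of the static call graph from the entry points."""
    seen: Set[str] = set(entry_points)
    pending = deque(seen)
    while pending:
        caller = pending.popleft()
        for callee in call_graph.get(caller, ()):
            if callee not in seen:
                seen.add(callee)
                pending.append(callee)
    return seen


def risky_removals(
    call_graph: Dict[str, List[str]],
    entry_points: Iterable[str],
    removed_apis: Iterable[str],
) -> List[str]:
    """Removed APIs that a service could actually hit at runtime."""
    return sorted(reachable_methods(call_graph, entry_points) & set(removed_apis))


if __name__ == "__main__":
    # Hypothetical graph: an HTTP handler calls a library that calls a
    # method the upgraded dependency no longer provides.
    graph = {
        "app.Handler.get": ["lib.Client.fetch"],
        "lib.Client.fetch": ["dep.Util.oldMethod"],
        "app.Unused.helper": ["dep.Util.otherRemoved"],
    }
    removed = {"dep.Util.oldMethod", "dep.Util.otherRemoved"}
    print(risky_removals(graph, ["app.Handler.get"], removed))
```

Only the removal on a reachable path is reported. The other removed method is called from code that no entry point reaches, which is the noise a plain ABI check would have surfaced.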
And there are, of course, false positives in some cases but it did lead to our developers having a lot more safety. We did see a lot of migrations either change course to do something else or in some cases, they had to upgrade the transitive third-party dependency to make sure the one that we're trying to upgrade is still compatible. So that reduced a lot of outages with regards to just missing symbols and things like that. So that had a lot of impact. Let's talk a bit about standardization as well. So I think a lot of these tools are really great but how do you make sure the whole company can actually use this? We talk about standardization quite a bit. I'm not sure why we have old slides. Okay. I'm just going to talk through it because I don't have the slides here.
For some reason, the slides are not up-to-date. But one of the things that we're trying to standardize on right now is our Bazel migration, Buck to Bazel migration. And the thing that's really interesting about that is whenever you do migrations of large scale like this and you're changing the core tool that developers use on a day-to-day basis, it's important to make sure we minimize the learning curve, right? So we, I think at Uber, migrated from Maven to Buck, from Pants to Buck, from Ant... I think we used Gradle to Buck. I think we used pretty much every build system under the sun for the JVM ecosystem. So it's not our first rodeo. But the thing that we did really well with the Buck to Bazel migration, and hopefully we'll get the updated slides later after the presentation, is that we made sure that when you look at the build file, when you look at the build definition itself, you would specify what is needed to compile your code in terms of dependencies and sources.
We made it so that the build file itself is exactly the same between Buck and Bazel. Which means the developer does not have to learn a new syntax or a new nomenclature for how they define their builds. And that's very powerful because all they have to do to try Bazel is invoke a Bazel wrapper instead of the Buck wrapper. And under the hood, we take care of translating the build definition to either Buck or Bazel. And that makes learning curves much easier and adoption much quicker, and also means that we get feedback much faster so that if there are issues with it, we can react to them much quicker. That also means less documentation, which is good whenever you're making large changes like this.
And as part of that, you can see in our results a lot of this work that we've been doing over many, many years. We've been measuring developer NPS, or satisfaction, since 2017. We've worked on stuff even before that but we didn't actually measure it before. So we do a survey quarterly. And we're doing more frequent feedback sessions with them as well. But generally at a high level, our NPS has been going up and we finally got to a positive NPS.
So we're very proud of this. After a really long time we were able to get a positive NPS, which means developers are actually using the tools and then they're recommending them to other people to actually use as well, which is great. So, a quick summary of all the things that we covered since it's a lot. So speed alone is not sufficient for a great developer experience. Speed is definitely good, but you want to make sure when things are moving fast, you're also taking care of not breaking things and also not taking developers out of that fast loop. And invest in tooling to reduce the developer tax. So when you're doing migrations or refactorings or just doing mandates, be very careful in when you want to put that in place and make sure that you have the right tools in place to aid the developers to have that agency to make that change.
And focus time is very valuable. I think a lot of people know this as well, but minimizing the learning curve for a new system can make a huge difference in how quickly that system is adopted. And context switches and overhead for developers are not really great for developer productivity. So as long as you focus on minimizing the learning curves and minimizing the context switches, the solutions that you put in place will be much more joyful and cause less frustration while you're trying to move fast. Cool, that's all for my talk. Thank you so much, and we have time for questions. There are mics on each side, so feel free to ask questions.