What’s inside?
Get an overview of Airbnb’s multilayered, data-driven approach to optimizing developer satisfaction and productivity. You’ll see first-hand how Airbnb built a staging environment, enabled remote on-demand developer environments, achieved a unified build process, and incorporated their IDE and cloud assets to engineer a better developer experience.
Summit Producer’s Highlight
Airbnb has created a Kubernetes-driven platform for creating on-demand development environments called AirDev. Instances are made available through simple CLI commands via yak, and full build and test cycle functionality is available. Airbnb has also integrated their IDEs with a cloud build environment, offering several advantages including improved cycle and deployment times.
About Janusz
Janusz Kudelka is a Staff Software Engineer working on Developer Infrastructure at Airbnb. Before Airbnb he worked on building key-value stores and p2p systems at Facebook. He is passionate about efficiency with a background in distributed systems and high performance computing.
About Cory
Cory Mead is writing software for people who write software so they can spend more time not writing software.

Janusz: Perfect. All right. So let's start. Hi everybody, I am Janusz. I'm a software engineer at Airbnb. I've been with the company for around four years now. I mostly focus on builds and CI these days.

Cory: And I'm Cory. I work on the Development Tools team at Airbnb. I've been there for about five and a half years. I'm also a chronic stage wanderer, so if I wander in front of the screen and start waving my arms and you can't see it just sneeze obnoxiously or something, and I will. I will get out of the way.

Janusz: Yeah, we'll try our best. So how do you know you're at a developer conference? Well, we changed a bunch of slides, including the title, last minute. So you're at the right talk. It just has a different name. And today, we'll talk to you about how we are transforming the developer experience at Airbnb. So first, we'll start with something that is probably dear to all of us. Basically, the vision of developer productivity at Airbnb is that we want to equip all Airbnb developers to engineer the best software of their careers.

And now I'll take you a little bit back in time to circa 2019, and this is sort of an average experience of a backend developer at Airbnb. So first you check out the monorepo. You enable sparse checkouts. You wait quite a bit because we have to figure out what actually is part of your sparse checkout. Then you open your editor and you wait even longer. You probably have your third coffee of the day or something. Then you start running some tests. You also have to wait to do that. So you might consider another coffee. Then you run yak. yak is our internal tool that allows you to deploy your service locally. What does yak do internally? It invokes a local Gradle build and also invokes a local Docker Compose. At this point, your fingertips are probably getting very hot from all the heat that the laptop is generating while it takes off to the sky to extend the life of consciousness.

And this was in alignment with what we saw in our developer satisfaction survey. Basically, darker red means worse. Darker green means better. You might notice that the deploy column at the end is quite a bit greener. This is because we migrated to Spinnaker, which is our continuous delivery platform. We basically automated and codified our deployments and made them really fast. But this is not what we wanted to talk to you about today. We want to focus on this red rectangle here, and I'll give it to Cory to walk you through some things we'll be talking about today.

Cory: So one of the things that we realized as we were going through each one of these squares to try to make them green is that doing small incremental things just wasn't going to cut it. What we really needed was a complete transformation of the entire development environment. This is also because of Airbnb's work from anywhere policy. Work itself was being transformed, and so our development environment needed to be transformed, our technology needed to be transformed. We needed to stop doing these small quarter-long projects that try to improve just one of these squares, and start doing more strategic things to improve all of these squares at once. These are the five strategic initiatives that we've done in the past couple of years that have been working out really well for us. And that's what we'll be talking about today. The first is a really robust staging environment. Next, a remote on-demand dev environment, which we call AirDev. Next is unified builds for all of our polyglot language repos. We have a whole slew of IDE improvements. And then finally, taking all of these things that we've done and moving them into the cloud so that your laptop doesn't catch on fire.

Let's talk about the staging environment first. When you're building a dev environment, one of the first questions that you need to ask is, where are your development dependencies? So you have a service in development. Where is it going to connect to? What service is it going to call out to? What database is it going to connect to? And there are so many different answers to all of these questions, and at Airbnb over the past several years, it feels like we've gone through just about all of them. The first thing that we tried was an isolated development sandbox. So you have your development service that you deploy and you also deploy all of the transitive dependencies for that service. This works really well when you have just a couple of services at the company. You can deploy an entire Airbnb in a box and that's great. We eventually outscaled that. You can't have Airbnb in a box when your box is really, really, really big and you need to give one to every single developer, so that doesn't scale. And it's also not reliable, because you have all of these services. Chances are one is going to break. So isolated development sandbox, that's out.

The next thing that we tried was a shared development environment. This was a brand new environment that we asked all of these teams to create just out of nowhere. And it existed for one purpose, and that was for other teams to connect their development services to your shared development service. So this solved the scalability problem, right? You don't have Airbnb in a box. We have Airbnb dev and all of these other services trying to connect to it. So we got really good scalability, but the reliability was just so bad. The thing about this sort of shared environment is that nobody cared. Nobody cared at all about the shared environment because it didn't give them any value. If you maintain your service in the shared dev environment, you get absolutely no benefit. It only exists for other people to connect to. So that didn't work. Why don't we just connect all of our development services to production? Because security says that we can't. So we're not going to do that right now for security and privacy reasons. But we have something that's very, very similar to stage, to production, which is our staging environment, which already exists. Engineers are already using staging. It's already an isolated environment that's separate from production. So all of these staging services are talking only to other staging services, staging services are talking to their own staging databases, which is completely isolated from production, which solves a lot of data privacy and security issues. Another great thing about using the staging environment to connect your dev services to, is that people are already running a lot of automated tests on their staging services already. This is what gives them the confidence to deploy their changes to production. They run it on staging, see what happens, and then they can deploy. This means that teams actually have an incentive. 
They actually have an incentive to keep their staging environment up-to-date and reliable, because it benefits them, because it gives them the confidence they need to deploy to production. So their staging environment is good. That has a great side effect. That means that all of these other developers who are doing their development services can connect to the service that already exists, which is already stable, which people are already used to maintaining.

So now that we have a really solid staging environment that we can build on top of for a development environment, let's talk about this dev environment. At Airbnb, we have an on-demand Kubernetes-based dev environment, which we call AirDev. There are a lot of metrics that we could use for success of this project. My personal metric for success is whether or not our CEO, Brian Chesky, is using the tools that I've built. He's not. He's not right now. I slipped on that. Okay. This quarter, but next quarter, next quarter, we'll get him. We want to make it so easy that even a CEO can use it. So we have our yak command line tool. Let's say that he wants to work on the homes service. He'll do yak add homes using our CLI, and then he'll deploy it using the same CLI. What this is going to do is use the staging version of the service and do some manipulations to it to put it into this AirDev context. Some of the manipulations that we need to do: scaling down the replicas. You don't need five or six replicas for dev like you do in staging. So we'll change that. We'll also inject some logging, we'll inject some observability, we'll fiddle around with the routing. But at the end of the day, Brian's personal dev homes service is basically homes staging, just in the development environment. So whenever Brian wants to talk to his private version of homes, he'll talk to it. Whenever it needs to call out to another service, let's say that it needs to talk to reviews over here, it will actually call out to the staging version of that service, which will then talk to the staging version of all of the other services. Now, let's say that Brian is trying to make a change to homes. Let's say he's fixing a bug, and he wants to see if it actually works. He'll do yak reload homes to get a really, really tight iteration loop. It's going to do a remote build. It's going to restart the container really, really quickly.
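The "take the staging version and do some manipulations to it" step can be pictured roughly as follows. This is only an illustrative sketch: the talk does not show the actual configuration format or what yak does internally, so the dictionary schema, the field names, and the `to_dev_config` helper are all assumptions.

```python
from copy import deepcopy


def to_dev_config(staging: dict, owner: str) -> dict:
    """Derive a private dev config from a staging config.

    The schema (replicas, logging, observability, routing) is hypothetical.
    """
    dev = deepcopy(staging)                    # never mutate the staging config
    dev["environment"] = f"dev-{owner}"
    dev["replicas"] = 1                        # dev does not need staging's replica count
    dev["logging"] = {"level": "DEBUG"}        # injected logging (assumed setting)
    dev["observability"] = {"tracing": True}   # injected observability (assumed setting)
    dev["routing"] = {"context": owner}        # ties the service to one dev environment
    return dev


staging_homes = {
    "service": "homes",
    "environment": "staging",
    "replicas": 6,
    "image": "homes:staging",
}
dev_homes = to_dev_config(staging_homes, owner="brian")
print(dev_homes["environment"], dev_homes["replicas"])
```

Starting from a copy of the staging config keeps everything that is not explicitly overridden identical to staging, which is the property the talk relies on when it says the dev service is "basically homes staging".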
We'll do some tricks there so that he can reload the service and see his changes almost immediately. Fun fact about Airbnb usage patterns: about 50% of the commits in our main repos actually span multiple projects. So let's see how that works. Let's say that Brian wants to modify homes and the reviews service at the exact same time. He'll do yak add reviews, and he'll deploy his changes for reviews. This is going to do the exact same thing. It'll take the staging configuration for reviews, do some devy modifications, and then put it in his private development context. So whenever his homes service needs to call reviews, instead of going out to the staging version of reviews, our service mesh is smart enough to know, ah, he actually has the reviews service enabled in his dev environment. So let's talk to that version instead. And then, of course, we'll go out to the staging environment if we need to. Whenever we leave our dev environment, we're going to attach a header. And this header just says this request is brought to you by the Brian Chesky dev environment. So whenever this other service in the staging environment, or any other service in the call chain, needs to call back to one of these two services, our service mesh again is smart enough to do this cross-cluster routing and get it back to Brian's personal homes service. This is really cool for end-to-end tests as well, because the same header logic also happens at our edge level. So for each AirDev environment, you get a private URL. We call this a shareable URL. So that way, Brian can share this with his best friend Janusz and say, hey, does this actually work? And Janusz says, this is awful, boss. Try it again.
So he goes to this private, shareable URL, which adds this header, which says this is a request that is for Brian Chesky's dev environment, which then calls back to the service that he wants, which means that he gets to see a version of the Airbnb staging website with only his changes applied. Like most CEOs, we can replace him with some robots, so we can do some automated tests on this AirDev environment as well. Since AirDev is essentially the same as staging, testing this gives us a pretty good idea about what staging is going to do. This means that before we deploy to staging, we can actually test within this environment, either from your laptop or through our CI/CD system, and get high confidence that this change isn't going to break staging, which means it's not going to break production, which means it's not going to break all of the other developers who are using this environment. So we're shifting this testing even further to the left. At this point, AirDev is the most used dev environment at Airbnb, and there are quite a few of them. This is all organic growth that's happened within the past year or so. It's giving us a lot of really good signal that we're moving in the right direction. We still have a long way to go, but the initial response to this has been overwhelmingly positive. So this is focused on the actual tight development loop. But if the builds are slow, we haven't really done anything. We've just put a slick UI and CLI on top of an already really bad system. So, Janusz, how are we going to make the build faster?
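The header-based routing Cory describes can be sketched in a few lines. This is a simplified illustration, not Airbnb's service mesh: the header name, the example domain, the in-memory registry, and the `env_from_host` and `route` functions are all hypothetical.

```python
from typing import Optional

DEV_HEADER = "x-dev-environment"  # hypothetical header name

# Services each private dev environment has deployed (illustrative data).
DEV_ENVIRONMENTS = {"brian": {"homes", "reviews"}}

SHAREABLE_SUFFIX = ".dev.staging.example.com"  # hypothetical domain


def env_from_host(host: str) -> Optional[str]:
    """Edge step: map a shareable URL host to a dev environment name, if any."""
    if host.endswith(SHAREABLE_SUFFIX):
        return host[: -len(SHAREABLE_SUFFIX)]
    return None


def route(service: str, headers: dict) -> str:
    """Mesh step: prefer the caller's private copy of a service, else staging."""
    env = headers.get(DEV_HEADER)
    if env and service in DEV_ENVIRONMENTS.get(env, set()):
        return f"{service}.dev-{env}"
    return f"{service}.staging"


# A request arriving through a shareable URL gets tagged at the edge...
headers = {DEV_HEADER: env_from_host("brian.dev.staging.example.com")}
# ...and every hop in the call chain consults the same header.
print(route("homes", headers))     # private copy
print(route("payments", headers))  # falls back to staging
```

Because the decision depends only on the header and the set of deployed services, a staging service deep in the call chain can route a callback to the private copy without knowing anything else about the dev environment, as long as the header is propagated on every hop.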

Janusz: All right. Well, thanks, Cory. That was really good and a lot of technical details. So before I talk about builds, I want to go one step back. Basically, the way Airbnb tries to scale its engineering is through monorepos, and we don't have the typical single monorepo. We actually have multiple monorepos, one per platform. But over time the size of these monorepos grew, and with the growth, the builds and CI also got slower. There are relatively few of us supporting thousands of engineers, and regressions happen very often. So if every one of these repositories has to fight these regressions on its own, it becomes very time consuming. And we cannot really focus on making these big improvement steps. So as each repo is working on its own issues, it often feels like a constant game of whack-a-mole. And these repositories, these different platforms, ask pretty much the same questions over and over, like how to speed up builds, how to do caching, how to do code coverage, how to shard tests and builds across machines in CI, how to build analytics, how to report and disable flaky tests. And this just keeps going on and on and on. And we want to ask ourselves a question: what if there was a way to solve all of those together once and then share the results across all of them? So how can we do that? Our approach is that we want to bring them all together and unify them under the same build system. And for us, that build system is Bazel. We chose Bazel because it's a polyglot build system that supports all the existing languages across the platforms we currently use, and it has some features that are very critical to us, like hermeticity and remote execution. Now, that makes the elephant in the room pretty stark, because we are moving away from Gradle. But the...

Cory: Sorry.

Janusz: The lucky part for us is that Gradle Enterprise Build Scans actually support Bazel. So in that sense, Build Scans is something that is indispensable at Airbnb. It's really fundamental to what we use. So as we migrate and merge everything onto Bazel, engineers that weren't exposed to Build Scans before can be, so in a way it's still a win-win scenario. To talk a little bit more about this Bazel strategy, I want to talk a little bit about our org chart. So Cory and I live here in developer platform, which is sandwiched between the product and infrastructure orgs. And we split ourselves into two sister teams, application frameworks and developer infrastructure. And the way we roughly split these is by where you are on the platform-specific to platform-agnostic spectrum. So agnostic things would be in developer infrastructure, more platform-specific things in application frameworks. To give an example of that, if you have code coverage for Kotlin, Kotlin is a specific language, so somebody in application frameworks will actually be responsible for getting that code coverage. But all the aggregation, reporting, displays, alerts, and whatnot would actually belong in developer infrastructure as the shared infra behind it. Now, how Bazel fits into this is that we're hoping that we'll be able to move more platform-specific things into platform-agnostic, like we said, unifying them. And because we unified them, we can actually focus on making them the best possible versions of themselves. And we also hope that Bazel will be this interface layer between the two sister teams through which we can integrate. To give an example, pretty much every CI build at Airbnb starts with figuring out what we have to test and build, depending on the changes you've made in git. Every repository will implement its own solution, and it will have its own quirks, its own awesome features, and its own awesome bugs. Right. And it will again be implemented four times.
People will not have enough time to really make it perfect. So with Bazel, we hope we can unite under a single solution and give it all the time it needs, and we already do that. Another example is a much more complicated one. So the other one was just, you know, we unify things. Things get better because we get more time to spend on them. But there are also other features that are important. So historically, again, across all the repositories, we always struggled to parallelize our builds across multiple machines. We had to do this by statically sharding the build graph, which has a lot of problems: bad utilization, duplication of work, hot shards, and I could go on and on. There are really a lot of problems there and I could spend another 30 minutes, but I'll not bore you.
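The "what do we have to test and build, given the changes in git" step mentioned above is usually a reverse-dependency walk over the build graph. The sketch below is a generic illustration of that idea, not Airbnb's implementation: the project graph is toy data, and mapping a changed file to its owning target by top-level directory is an assumption made for brevity.

```python
from collections import defaultdict, deque

# target -> direct dependencies (toy graph for illustration)
DEPS = {
    "homes": ["reviews"],
    "reviews": ["math"],
    "users": ["date"],
    "date": ["math"],
    "math": [],
}


def affected_targets(changed_files, deps=DEPS):
    """Return every target that must be rebuilt and retested for a change."""
    # Invert the graph: dependency -> targets that depend on it.
    rdeps = defaultdict(set)
    for target, direct in deps.items():
        for dep in direct:
            rdeps[dep].add(target)

    # Assumption: the top-level directory of a path names its target.
    start = {path.split("/", 1)[0] for path in changed_files} & set(deps)

    # Breadth-first walk up the reverse edges.
    seen = set(start)
    queue = deque(start)
    while queue:
        current = queue.popleft()
        for dependent in rdeps[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(seen)


print(affected_targets(["reviews/src/Review.kt"]))
print(affected_targets(["math/src/Sum.kt"]))
```

With a single build system, one implementation of this walk can serve every repository instead of being rewritten per platform, which is the point Janusz makes about implementing it four times.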

Bazel is developed from the ground up with remote execution in mind. So every individual action, like a very small unit of work, can be executed remotely. So this really simplifies how we scale our builds. Additionally, this doesn't work just in CI, but it also works locally, and the problem of making builds faster changes from, okay, how do I parallelize this, to, how do I optimize my critical path? If we can reduce the critical path, the build will be faster. So that's about builds. So now let's talk about some IDE improvements we've done. At Airbnb, we believe in using the best tool for the job. And in terms of editors, that's VS Code and IntelliJ. Here, I'll mostly talk about our story of making IntelliJ work with our large monorepo. So here there's a simplified directory view. We have some projects, homes, reviews, users, the stuff that Brian was working on, and some common libraries like a date or math library. On top of that, we overlay some build system topology. So these edges are dependency edges between projects. So homes depends on reviews, and reviews depends on the math library. And again, this is very simplified. We have thousands of those. Now, if you're a developer and you want to work on the users project, you cannot just load all of the backend repo into IntelliJ. It will just choke. So the next best thing is to try to load as little as necessary. So users depends on the date library, date depends on math, but we don't load homes or reviews. But even this smaller subset can still be hundreds or thousands of modules. And what we noticed with IntelliJ is that it degrades sort of quadratically with the number of modules you load into it. So the next observation we had is that you really want to only modify users. You don't really want to modify the fundamental libraries that you depend on.
So for this, we'll just replace them with pre-built jars that we can fetch from the remote cache. This basically reduces it such that we only have a handful of modules inside of IntelliJ. Another improvement that we've done is that historically, when you click build or test, we would delegate to whatever underlying build system there is. With Gradle, that's good because you get incremental builds, but there's a high overhead to doing that.
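The module-reduction idea above, where only the project being edited is loaded as source and everything it depends on comes in as a pre-built jar, can be sketched like this. It is a toy illustration under stated assumptions: the graph mirrors the simplified example from the talk, and `ide_plan` is a hypothetical helper, not part of IntelliJ or any Bazel plugin.

```python
# project -> direct dependencies (simplified graph from the talk's example)
DEPS = {
    "homes": ["reviews"],
    "reviews": ["math"],
    "users": ["date"],
    "date": ["math"],
    "math": [],
}


def ide_plan(focus, deps=DEPS):
    """Split the transitive closure of `focus` into source modules and jars.

    Returns (source_modules, prebuilt_jars): only the focus project is loaded
    as an editable module; every transitive dependency is treated as a jar
    that would be fetched from a remote cache.
    """
    closure = set()
    stack = list(deps[focus])
    while stack:
        current = stack.pop()
        if current not in closure:
            closure.add(current)
            stack.extend(deps[current])
    return [focus], sorted(closure)


sources, jars = ide_plan("users")
print("load as source:", sources)
print("load as pre-built jars:", jars)
```

Projects outside the closure, such as homes and reviews when working on users, are never loaded at all, and the dependencies inside the closure stop counting as modules, which is what keeps the IDE's module count to a handful.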

With Bazel, the overhead is much smaller, but it's still nontrivial and there are no incremental builds. And what we care about is this inner loop. So you're modifying an individual unit test and you really want to quickly see if what you're doing is right. And ideally, in that loop, I make the change in the unit test or the code under test and I get the result in under 1 second. So what we're doing is we're actually going back and letting IntelliJ do the job. IntelliJ has a pretty good incremental compiler and it has very low overhead. With this, we're able to go from around a minute for our large services to around 5 seconds. So it's not the 1 second we are dreaming about, but we're getting much, much closer. So let's bring it all together and basically go to the cloud. That's our strategy: move everything to the cloud. So on the left here, you see a laptop and IntelliJ running there. You have your source control checkout. But on the right, you see a new entity. You see this Kubernetes build pod. That's your own private remote machine that can vertically scale. That means we are no longer scared of the laptop cliff, which is basically hoping that Apple will come out with the next generation of hardware every two years so we can scale our development. So the way this works is, we send a delta update, basically the changes you've made locally. This is a very optimized process. The build pod gets that and it initiates a build. Like we talked about before, the build will actually execute against the Bazel remote cluster. It will execute all the tasks and then it will spit back some artifacts. Now, it's important that all these network hops happen in the same datacenter. At Airbnb, we have the work from anywhere policy and people really do utilize that. Often that means they have the worst networks imaginable.
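The "delta update" step, where only local changes are sent to the remote build machine, can be illustrated with a content-hash comparison. The talk does not describe the actual sync mechanism beyond calling it very optimized, so this is a naive sketch of the general idea; the manifest format and the `compute_delta` helper are assumptions.

```python
import hashlib


def digest(content: bytes) -> str:
    """Content hash used to detect changed files."""
    return hashlib.sha256(content).hexdigest()


def compute_delta(local_files: dict, remote_manifest: dict) -> dict:
    """Work out what must be sent so the remote checkout matches the local one.

    local_files:     path -> file bytes on the laptop
    remote_manifest: path -> sha256 hex digest of what the remote already has
    """
    upload = {
        path: content
        for path, content in local_files.items()
        if remote_manifest.get(path) != digest(content)  # new or modified
    }
    delete = sorted(path for path in remote_manifest if path not in local_files)
    return {"upload": upload, "delete": delete}


local = {"homes/Home.kt": b"fixed bug", "homes/Util.kt": b"unchanged"}
remote = {
    "homes/Home.kt": digest(b"has bug"),
    "homes/Util.kt": digest(b"unchanged"),
    "homes/Old.kt": digest(b"removed locally"),
}
delta = compute_delta(local, remote)
print(sorted(delta["upload"]), delta["delete"])
```

Sending only the changed files keeps the laptop-to-datacenter hop small, which matters when that hop is the slow, unreliable part of the loop and everything after it stays inside one datacenter.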

Cory: I have a coworker who's in a hammock in Thailand somewhere doing his stuff, just sent off a coworker to Mexico today.

Janusz: So yeah, some people are.

Cory: Actually doing this.

Janusz: They're using Starlink to connect. I mean, it's everywhere. So network is really important. But because we keep everything remote, it's not really a problem. We are able to bypass that. So then we have some Docker image and we put that in the container registry, and that ties back to what Cory was talking about with AirDev. It's also very, very important to have this entire loop be very quick because, again, Brian wants to see the results quickly. So now talking about some numbers. If you look at test time for one of our largest services, if you run it on a laptop, a pretty modern laptop, with Gradle, it takes around 12 minutes. If we do remote plus Gradle, so basically we vertically scale, we cut that time in half and we go to 6 minutes. And if we do remote plus Bazel, we go to 3 minutes, so another 2x. So basically we see a 4x speedup, and these benefits sort of stack. Another number to share is that deploys take a third of the time they did before, again with Bazel. But we don't want to leave IntelliJ behind. So we're looking at adding the shared indexes that JetBrains is working on, and we do want to put it in the cloud, also using the JetBrains Gateway. We have a working prototype, but it's not quite ready for production. We do have a working version of that for VS Code. And with that, I'll give it back to Cory and he will actually demo this stuff.

Cory: And you all are saying, ooh. You're actually not going to see any of this stuff in this demo. And that's because all of this stuff is happening behind the scenes. It's all transparent. You're not going to see build pods spinning up. You're not going to see any of this stuff about Bazel. Just a really simple development interface for engineers to use and to get their job done really, really quickly. So all of this stuff is happening behind the scenes. We definitely didn't put in a video that doesn't match what we were doing because we were doing slides last minute. That's definitely not the case. So let's see what's happening over here. We have a shareable URL. I don't know if you can see it with this pixelated screen, but this is one of our coworkers. He's going to bend.dev.staging.airbnb.com and loading his version of the staging environment, plus all of the changes that he's already made. So he's going to make a change. He's going to use our command line tool to invoke the change, to do that really quick build using the Bazel environment, and then see his change almost in real time. So let's see how that works. Wow. So this is IntelliJ. This pixelation, I was not expecting this. So he's in IntelliJ. He's making a really quick change. Now he's going to go to his other terminal. This is all in the remote JetBrains Gateway workspace. He's going to say yak reload. This is going to sync all of his changes to the remote pod. It's going to do a really, really fast Bazel build. And then it's going to reload the services. It's going to get the Docker container, it's going to get the jar. It's going to reset the container. And then he can see his change all the way down here take place really, really quickly. It's a really simple interface for all of the developers. And it's really, really fast. So just to summarize some things, we talked about a lot of stuff. This was our developer satisfaction.
We wanted to figure out how do we fix this big red rectangle right here? We can't do it just by doing small things. We need to do a few really big things and do them really, really well. So this is what we did. We did the staging environment. On top of that staging environment, we have this really world class, I think, development environment called AirDev. We have these unified builds across all of our monorepos. We have a whole bunch of IDE improvements and then we take all of these things and we get even better success by shipping all of them into the cloud. So this is a multi-year journey. You can't just do these things overnight. We estimate that we're probably about 50% or 60% of the way through this entire project. But we've already started to see some really, really encouraging signs. Anecdotally, people are really loving this. We're seeing some extreme initial signals that we're on the right path and hopefully that this is the last dev environment and the last build system and the last major restructuring that we have to do for a very long time. And this all comes back to the engineers, right? We want to equip Airbnb developers to engineer the best software of their entire careers. And I think we're well on our way to hitting our goal. Thank you very much. I'm Cory. This is Janusz, and we would like to open it up for some questions later, guys.