Get an overview of Airbnb’s multilayered, data-driven approach to optimizing developer satisfaction and productivity. You’ll see first-hand how Airbnb built a staging environment, enabled remote on-demand developer environments, achieved a unified build process, and incorporated their IDE and cloud assets to engineer a better developer experience.
Airbnb has created a Kubernetes-driven platform for creating on-demand development environments called AirDev. Instances are made available through simple CLI commands via yak, and full build and test cycle functionality is available. Airbnb has also integrated their IDEs with a cloud build environment, offering several advantages including improved cycle and deployment times.
Janusz Kudelka is a Staff Software Engineer working on Developer Infrastructure at Airbnb. Before Airbnb he worked on building key-value stores and p2p systems at Facebook. He is passionate about efficiency with a background in distributed systems and high performance computing.
Cory Mead is writing software for people who write software so they can spend more time not writing software.

Janusz: Perfect. All right. So let's start. Hi
everybody, I am Janusz. I'm a software engineer at Airbnb. I've been with the
company for around four years now. I mostly focus on builds and CI these days.
Cory: And I'm Cory. I work on the Development Tools team at Airbnb. I've
been there for about five and a half years. I'm also a chronic stage wanderer,
so if I wander in front of the screen and start waving my arms and you can't see
it, just sneeze obnoxiously or something, and I will get out of the way.
Janusz: Yeah, we'll try our best. So how do you know you're at a
developer conference? Well, we changed a bunch of slides, including the title
last minute. So you're at the right talk. It just has a different name. And
today, we'll talk to you about how we were transforming the developer experience
at Airbnb. So first, we'll start with something that is probably dear to all of
us, but basically the vision of developer productivity at Airbnb, is that we
want to equip all Airbnb developers to engineer the best software of their
careers. And now I'll take you a little bit back in time to circa 2019, and this
is sort of an average experience of a backend developer at Airbnb. So first you
check out the monorepo. You enable sparse checkouts. You wait quite a
bit because we have to figure out what actually is part of your sparse checkout.
Then you open your editor and you wait even longer. You probably have like your
third coffee of the day or something. Then you start running some tests. You also
have to wait to do that. So you might consider another coffee. Then you run yak.
yak is our internal tool that allows you to deploy your service locally. What
does yak do internally? It invokes a local Gradle build and also invokes a local
Docker Compose. At this point, your fingertips are probably getting very hot
from all the heat that the laptop is generating while it takes off to the sky
to extend the light of consciousness. And this was in alignment with what we
saw in our developer satisfaction survey. Basically, darker red means worse.
Darker green means better. You might notice that the deploy column at the end is
quite a bit greener. This is because we migrated to Spinnaker, which is our
continuous delivery platform. We basically automated, codified, and made our
deployments really fast. But this is not what we wanted to talk to you about
today. We want to focus on this red rectangle here, and I'll give it to Cory to
walk you through some things we'll be talking about today.
Cory: So one of the things that we realized as we were going through each
one of these squares to try to make them green, is that doing small incremental
things just wasn't going to cut it. What we really needed was a complete
transformation of the entire development environment. This is also because, with
Airbnb's work-from-anywhere policy, work itself was being transformed, and so our
development environment needed to be transformed, our technology needed to be
transformed. We needed to move from these small, quarter-long projects that
try to improve just one of these squares to more strategic things that
improve all of these squares at once. These are the five strategic
initiatives that we've done in the past couple of years that have been working
out really well for us. And that's what we'll be talking about today. The first
is a really robust staging environment, next a remote on-demand dev environment,
which we call AirDev. Next is unified builds for all of our polyglot language
repos. We have a whole slew of IDE improvements. And then finally taking all of
these things that we've done and moving them into the cloud so that your laptop
doesn't catch on fire. Let's talk about the staging environment first. When
you're building a dev environment, one of the first questions that you need to
ask is where are your development dependencies? So you have a service in
development. Where is it going to connect to? What service is it going to call
out to? What database is it going to connect to? And there's so many different
answers to all of these questions, and at Airbnb over the past several years,
it feels like we've gone through just about all of them. The first thing that we
tried was an isolated development sandbox. So you have your development service
that you deploy and you also deploy all of the transitive dependencies for that
service. This works really well when you have just a couple of services at the
company that you need: you can deploy an entire Airbnb in a box, and that's
great. We eventually outscaled that. You can't have Airbnb in a box when your
box is really, really, really big and you need to give one to every single
developer so that doesn't scale. And it's also not reliable because you have all
of these services. Chances are one is going to break. So isolated development
sandbox, that's out.
The next thing that we tried was a shared development environment. This was a
brand new environment that we asked all of these teams to create just out of
nowhere. And it existed for one purpose, and that was for other teams to connect
their development services to your shared development service. So this solved
the scalability problem, right? You don't have Airbnb in a box. We have Airbnb
dev and all of these other services trying to connect to it. So we got really
good scalability, but the reliability was just so bad. The thing about this sort
of shared environment is that nobody cared. Nobody cared at all about the shared
environment because it didn't give them any value. If you maintain your service
in the shared dev environment, you get absolutely no benefit. It only exists for
other people to connect to. So that didn't work. Why don't we just connect all
of our development services to production? Because security says that we can't.
So we're not going to do that right now for security and privacy reasons. But we
have something that's very, very similar to production, which is our
staging environment, which already exists. Engineers are already using staging.
It's already an isolated environment that's separate from production. So all of
these staging services are talking only to other staging services, staging
services are talking to their own staging databases, which are completely
isolated from production, which solves a lot of data privacy and security
issues. Another great thing about using the staging environment to connect your
dev services to, is that people are already running a lot of automated tests on
their staging services already. This is what gives them the confidence to deploy
their changes to production. They run it on staging, see what happens, and then
they can deploy. This means that teams actually have an incentive. They actually
have an incentive to keep their staging environment up-to-date and reliable,
because it benefits them, because it gives them the confidence they need to
deploy to production. So their staging environment is good. That has a great
side effect. That means that all of these other developers who are doing their
development services can connect to the service that already exists, which is
already stable, which people are already used to maintaining.
So now that we have a really solid staging environment that we can build on top
of for a development environment, let's talk about this dev environment. At
Airbnb, we have an on-demand, Kubernetes-based dev environment, which we call
AirDev. There's a lot of metrics that we could use for success of this project.
My personal metric for success is whether or not our CEO, Brian Chesky, is using
the tools that I've built. He's not. He's not right now. I slipped on that
this quarter, but next quarter, next quarter, we'll get him. We want to
make it so easy that even a CEO can use it. So we have our yak command line
tool. Let's say that he wants to work on the homes service. He'll do yak add homes
using our CLI, and then he'll deploy it using the same CLI. What this is going
to do is it's going to use the staging version of the service and do some
manipulations to it to put it into this AirDev context. One of the
manipulations that we need to do is scaling down the replicas. You don't need the
five or six replicas in dev that you're running in staging. So we'll change that. We'll
also inject some logging, we'll inject some observability, we'll fiddle around
with the routing. But at the end of the day, Brian's personal dev homes service
is basically homes staging, just in the development environment. So whenever Brian
wants to talk to his private version of homes, he'll talk to it. Whenever it
needs to call out to another service, let's say that it needs to talk to reviews
over here, it will actually call out to the staging version of that service,
which will then talk to the staging version of all of the other services. Now,
let's say that Brian is trying to do a change to homes. Let's say he's fixing a
bug, and he wants to see if it actually works. He'll do yak reload homes to get
a really, really tight iteration loop. It's going to do a remote build. It's
going to restart the container really, really quickly. We'll do some tricks
there so that he can reload the service and see his changes almost immediately.
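
The staging-to-dev manipulations described above could be sketched roughly as follows. This is only an illustration: the config fields, the `LOG_LEVEL` and `TRACING_ENABLED` variables, and the `to_dev_config` helper are all hypothetical names, not yak's actual implementation.

```python
from copy import deepcopy


def to_dev_config(staging: dict, owner: str) -> dict:
    """Derive a personal dev config from a staging config without mutating it."""
    dev = deepcopy(staging)
    dev["environment"] = f"dev-{owner}"
    # Dev does not need the replica count that staging runs with.
    dev["replicas"] = 1
    # Hypothetical stand-ins for the injected logging and observability.
    dev.setdefault("env", {})
    dev["env"]["LOG_LEVEL"] = "debug"
    dev["env"]["TRACING_ENABLED"] = "true"
    # Label the deployment so routing can find the owner's private copy.
    dev.setdefault("labels", {})
    dev["labels"]["dev-owner"] = owner
    return dev


staging_homes = {
    "service": "homes",
    "environment": "staging",
    "replicas": 6,
    "image": "homes:abc123",
    "env": {"LOG_LEVEL": "info"},
}
dev_homes = to_dev_config(staging_homes, "brian")
```

The point of starting from the staging config rather than a separate dev config is that everything not explicitly overridden (the image, the dependencies, the settings) stays identical to staging.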
Fun fact about Airbnb usage patterns. About 50% of the commits in our main repos
actually span multiple projects. So let's see how that works. Let's say that
Brian wants to modify homes and the review service at the exact same time. He'll
do yak add reviews, he'll deploy his changes for the reviews. This is going to
do the exact same thing. It'll take the staging configuration for reviews, do
some devy modifications, and then put it in his private development context. So
whenever he talks to his homes service, instead of it going out to
reviews staging, our service mesh is smart enough to know, ah, he actually has
the review service enabled in his dev environment. So let's talk to that version
instead. And then, of course, we'll go out to the staging environment if we need
to. Whenever we leave our dev environment, we're going to attach a header. And
this header just says this request is brought to you by the Brian Chesky dev
environment. So whenever this other service in the staging environment or any
other services in the call chain need to call back to one of these two services,
our service mesh again is smart enough to do this cross-cluster routing and get it
back to Brian's personal homes service. This is really cool for end-to-end tests
as well because the same header logic also happens at our edge level. So for
each dev environment or for each AirDev environment, you get a private URL. We
call this a shareable URL. So that way, Brian, he can share this with his best
friend Janusz and say, Hey, does this actually work? And Janusz says, this is
awful, boss. Try it again. So he goes to this private, shareable URL which adds
this header, which says this is a request that is for Brian Chesky's dev
environment which then calls back to the service that he wants, which means that
he gets to see a version of the Airbnb staging website only with his changes
applied. Like most CEOs, we can replace him with some robots, so we can do some
automated tests on this AirDev environment as well. Since AirDev is essentially
the same as staging, testing this gives us a pretty good idea about what staging
is going to do. This means that before we deploy to staging, we can actually
test within this environment either from your laptop or through our CI/CD
system, and get high confidence that this change isn't going to break staging,
which means it's not going to break production, which means it's not going to
break all of the other developers who are using this environment. So we're
shifting this testing even further to the left. At this point, AirDev is the
most used dev environment at Airbnb and there are quite a few of them. This is
all organic growth that's happened within the past year or so. Yeah. It's giving
us a lot of really good signal that we're moving in the right direction. We
still have a long ways to go, but the initial response to this has been
overwhelmingly positive. So this is focused on the actual tight development loop.
But if the builds are slow, we haven't really done anything. We've just put a
slick UI and CLI on top of an already really bad system. So, Janusz, how are we
going to make the build faster?
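
The header-based routing Cory describes can be pictured with a small sketch. It assumes a made-up header name (`x-dev-environment`), a made-up registry of which services each dev environment has deployed, and made-up upstream names; the real service mesh configuration is not shown in the talk.

```python
DEV_HEADER = "x-dev-environment"  # hypothetical header name

# Hypothetical registry: which services each dev environment runs privately.
DEV_DEPLOYMENTS = {"brian": {"homes", "reviews"}}


def resolve_upstream(service: str, headers: dict) -> str:
    """Pick the private dev copy if the caller's environment has one, else staging."""
    env = headers.get(DEV_HEADER)
    if env and service in DEV_DEPLOYMENTS.get(env, set()):
        return f"{service}.dev-{env}"
    return f"{service}.staging"


def outgoing_headers(incoming: dict) -> dict:
    """Propagate the dev header so later hops can still route back."""
    if DEV_HEADER in incoming:
        return {DEV_HEADER: incoming[DEV_HEADER]}
    return {}


request = {DEV_HEADER: "brian"}
to_reviews = resolve_upstream("reviews", request)    # private copy exists
to_payments = resolve_upstream("payments", request)  # falls back to staging
```

Because every hop forwards the header, a staging service deep in the call chain that calls homes again would still resolve to the private copy, which is the call-back behavior described above.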
Janusz: All right. Well, thanks, Cory. That was really good and a lot of
technical details. So before I talk about builds, I want to go one step back.
And so basically the way Airbnb tries to scale its engineering is through
monorepos, and we don't have the typical single monorepo. We actually have
multiple monorepos, one per platform. But over time the size of these monorepos
grew and with the growth, the builds and CI also got slower. There's relatively
few of us supporting thousands of engineers and regressions happen very often.
So if every one of these repositories has to fight these regressions on its
own, it becomes very time consuming. And we cannot really focus on making
big improvement steps. As each repo is working through its own
issues, it often feels like just a constant game of whack-a-mole. And these
repositories, these different platforms ask pretty much the same questions over
and over, like how to speed up builds, how to do caching, how to do code
coverage, how to shard across machines in CI tests and builds, how to build
analytics, how to report and disable flaky tests. And this just keeps going
on and on and on. And we wanted to ask ourselves a question like, what if there was
a way to, like, solve all of those together once and then share the results
across all of them? So how can we do that? And our approach to that is that we
want to bring them all together and unify them under the same build system. And
for us, that build system is Bazel. We chose Bazel because it's a polyglot build
system that supports all the existing languages across the platforms we
currently use, as well as it has some features that are very critical to us,
like hermeticity and remote execution. Now, this makes the elephant in
the room pretty stark, because we are moving away from Gradle. But the...
Cory: Sorry.
Janusz: The lucky part for us is that Gradle Enterprise Build Scans
actually support Bazel. So in that sense, Build Scans are something that
is indispensable at Airbnb, really fundamental to what we use. So
as we migrate and merge everything onto Bazel, engineers who before
weren't exposed to Build Scans could be, so in a way it's still a win-win
scenario. To talk a little bit more about this Bazel strategy, I want to talk a
little bit about our org chart. So Cory and I live here in developer platform,
which is sandwiched between product and infrastructure orgs. And we split
ourselves into two sister teams, application frameworks and developer
infrastructure. And the way we roughly split these is by where you are on the
platform-specific to platform-agnostic spectrum. So agnostic things would be in developer
infrastructure, more platform-specific ones in application frameworks. So to give an
example of that, if you have code coverage for Kotlin, Kotlin is a specific
language, so somebody in application frameworks will be responsible for
getting that code coverage. But all the aggregation, reporting, displays, alerts, and
whatnot would actually belong in developer infrastructure, as the shared
infra behind it. Now, how Bazel fits into this is that we're hoping that we'll be
able to move more platform specific things into platform agnostic. Like we said,
unifying them. And because we unified them, we can actually focus on making them
the best possible versions of themselves. And we also hope that Bazel will be
this interface layer between the two sister teams through which we can integrate. To
give an example: pretty much every CI build at Airbnb starts with figuring
out what we have to test and build, depending on the changes you've made in git.
Every repository will implement its own solution and it will have its own
quirks, its own awesome features, and its own awesome bugs. Right. And it will
again be implemented four times. People will not have enough time to really
make it perfect. So with Bazel, we hope we can unite under a single solution and
give it all the time it needs, and we already do that. Another example is a
much more complicated one. So the other one was just like, you know, we unify
things. Things get better because we get more time to spend on them. But there's
also other features that are important. So historically, again, across all the
repositories, we always struggled to parallelize our builds across multiple
machines. We had to do this by statically sharding the build graph, which
has a lot of problems: bad utilization, duplication of work, hot shards, and I
could go on and on. There's really a lot of problems there and I could spend
another 30 minutes, but I'll not bore you.
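
The first example above, working out what has to be built and tested from the files changed in git, amounts to a reverse-dependency walk over the build graph. The sketch below is a toy version under stated assumptions: the `DEPS` graph borrows project names used elsewhere in the talk, and mapping a file to a target by its top-level directory is a simplification. Bazel's query language offers an `rdeps` function for this kind of question, though the talk does not say which mechanism Airbnb uses.

```python
from collections import defaultdict, deque
from typing import Optional

# Hypothetical dependency edges: target -> targets it depends on.
DEPS = {
    "homes": ["reviews"],
    "reviews": ["math"],
    "users": ["date"],
    "date": ["math"],
    "math": [],
}


def owning_target(path: str) -> Optional[str]:
    """Assume the top-level directory names the target (a simplification)."""
    top = path.split("/", 1)[0]
    return top if top in DEPS else None


def affected_targets(changed_files: list) -> set:
    """Return changed targets plus everything that transitively depends on them."""
    reverse = defaultdict(set)
    for target, deps in DEPS.items():
        for dep in deps:
            reverse[dep].add(target)

    seeds = {t for t in map(owning_target, changed_files) if t is not None}
    affected, queue = set(seeds), deque(seeds)
    while queue:
        current = queue.popleft()
        for dependant in reverse[current]:
            if dependant not in affected:
                affected.add(dependant)
                queue.append(dependant)
    return affected


date_change = affected_targets(["date/src/parse.kt"])
```

A change under `date/` only touches `date` and `users`, so `homes` and `reviews` can be skipped in CI, while a change to `math` would fan out to everything.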
Bazel is developed from the ground up with remote execution in
mind. So every individual action, like a very small unit of work, can be
executed remotely. So this really simplifies how we scale our builds.
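
To make the hot-shard problem concrete, here is a toy comparison of deciding shards up front versus handing each action to whichever worker frees up first. It is a simplified model, not Bazel's scheduler: the durations are invented, dependencies between actions are ignored, and both function names are hypothetical.

```python
import heapq


def static_makespan(durations: list, workers: int) -> int:
    """Round-robin assignment decided up front, as with static shards."""
    shards = [0] * workers
    for index, duration in enumerate(durations):
        shards[index % workers] += duration
    return max(shards)


def dynamic_makespan(durations: list, workers: int) -> int:
    """Each action goes to whichever worker becomes free first."""
    free_at = [0] * workers
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)


durations = [10, 1, 10, 1, 10, 1]  # invented action times
static_time = static_makespan(durations, workers=2)
dynamic_time = dynamic_makespan(durations, workers=2)
```

With these numbers the static split puts all three slow actions on one shard (30 time units, while the other shard finishes after 3), whereas the per-action scheduling finishes in 21.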
Additionally, this doesn't work just in CI, but it also works locally, and the
problem of making builds faster changes from, okay, how do I parallelize
this, to how do I optimize my critical path? If we can reduce the
critical path, the build will be faster. So that's about builds. So now let's
talk about some IDE improvements we've done. So at Airbnb, we believe in using the
best tool for the job. And in terms of editors, that's VS Code and IntelliJ.
Here, I'll mostly talk about our story of making IntelliJ work with our large
monorepo. So here there's a simplified directory view. We have some
projects: homes, reviews, users, the stuff that Brian was working on, and some
common libraries like a date or math library. On top of that, we overlay some
build system topology. So these edges are dependency edges between
projects. So homes depends on reviews, and reviews depends on the math
library. And again, this is very simplified. We have like thousands of those, if
you would. Now, if you're a developer and you want to work on the users
project, you cannot just load all of the backend repo into IntelliJ. It
will just choke. So the next best thing is to try to load as little as
necessary. So users depends on date library, date depends on math, but we don't
load homes or reviews. But even this smaller subset can still be like
hundreds or thousands of modules. And what we noticed with IntelliJ is that it
degrades sort of quadratically with the number of modules you load into it. So
the next observation we had is that you really want to only modify users. You
don't really want to modify the fundamental libraries that you're requesting. So
for this, we'll just replace them with pre-built jars that we can fetch from the
remote cache. This basically reduces it such that we only have a handful of
modules inside of IntelliJ. Another improvement that we've done is that
historically when you click build or test, we would delegate to whatever
underlying build system there is. So with Gradle, it's good because it's just
incremental builds, but there's high overhead of doing that.
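
The module-trimming idea described above (load only the project you are editing as source, and swap its dependencies for pre-built jars) can be sketched like this. The dependency graph reuses the simplified example from the talk; the `ide_plan` helper and its output shape are hypothetical, and fetching jars from the remote cache is not modeled.

```python
# Simplified dependency graph from the example: project -> direct dependencies.
DEPS = {
    "homes": ["reviews"],
    "reviews": ["math"],
    "users": ["date"],
    "date": ["math"],
    "math": [],
}


def transitive_deps(target: str, deps: dict) -> set:
    """Collect everything the target depends on, directly or indirectly."""
    seen = set()
    stack = list(deps[target])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(deps[current])
    return seen


def ide_plan(focus: set, deps: dict) -> dict:
    """Edited projects load as source modules; their dependencies become jars."""
    needed = set()
    for target in focus:
        needed |= transitive_deps(target, deps)
    return {
        "source_modules": sorted(focus),
        "prebuilt_jars": sorted(needed - focus),
    }


plan = ide_plan({"users"}, DEPS)
```

For a developer working on `users`, only one source module ends up in the IDE, `date` and `math` arrive as jars, and `homes` and `reviews` are never loaded.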
With Bazel, the overhead is much smaller, but it's still nontrivial and there are
no incremental builds. And what we care about is this inner loop. So, like,
you're modifying an individual unit test and you really want to quickly see if
what you're doing is right. And ideally, in that loop, I make a change in
the unit test or the code under test and I get the result in under 1
second. So what we're doing is we're actually going back and letting IntelliJ do
the job. IntelliJ has a pretty good incremental compiler and it has very low
overhead. With this, we're able to go from around a minute for our large
services to around 5 seconds. So it's not the 1 second we are dreaming
about, but we're getting much, much closer. So let's bring it all together and
basically go to the cloud. That's our strategy: move everything to the
cloud. So on the left here, you see a laptop and IntelliJ running there. You
have your source control check out. But on the right, you see like a new entity.
You see this Kubernetes build pod. That's your own private remote machine
that can vertically scale. That means we are no longer scared of the laptop cliff,
which is basically hoping that Apple will come out with the next generation of
hardware every two years so we can scale our development. So the way this works
is, we send a delta update, basically the changes you've made locally. This
is a very optimized process. The build pod gets that and it initiates a
build. As we discussed before, the build will actually execute against the Bazel
remote cluster. It will execute all the tasks and then spit back some
artifacts. Now it's important that all these network hops happen in the same
datacenter. At Airbnb, we have the work-from-anywhere policy and people do really
utilize that. Often that means they have the worst networks imaginable.
Cory: I have a coworker who's in a hammock in Thailand somewhere doing
his stuff, just sent off a coworker to Mexico today.
Janusz: So yeah, some people are.
Cory: Actually doing this.
Janusz: Climbing, even. They're using Starlink to connect. I mean, it's
everywhere. So, so network is really important. But because we keep everything
remote, it's not really a problem. We are able to bypass that. So then we
have some Docker image and we put that in the container registry, and that
feeds back into what Cory was talking about with AirDev. It's also very, very important
to have this entire loop be very quick because, again, Brian wants
to see the results quickly. So now talking about some numbers. If you look at the
test time for one of our largest services, if you run it on a laptop, a pretty
modern laptop, with Gradle, it takes around 12 minutes. If we do remote plus
Gradle, so basically we vertically scale, we cut that time in half, we go to 6
minutes, and if we do remote plus Bazel, we go to 3 minutes. So another 2x. So
basically we see a 4x speedup, and these benefits sort of stack. Another
number to share is that deploys take a third of the time they did before,
again with Bazel. But we don't want to leave IntelliJ behind. So we're
looking at adding the shared indexes that JetBrains is working on, and we do want to
put it in the cloud also, using the JetBrains Gateway. We have a working
prototype, but it's not quite ready for production. We do have a working version
of that for VS Code. And with that, I'll give it back to Cory and he will
actually demo this stuff.
Cory: And you all are saying, ooh, you're actually not going to see any
of this stuff in this demo. And that's because all of this stuff is happening
behind the scenes. It's all transparent. You're not going to see build pods
spinning up. You're not going to see any of this stuff about Bazel. Just a
really simple development interface for engineers to use and to get their job
done really, really quickly. So all of this stuff is happening behind the
scenes. It's not... we definitely didn't put a video in that doesn't match what
we were doing because we were doing slides last minute. That's definitely not
the case. So let's see what's happening over here. We have a shareable URL. I
don't know if you can see it with this pixelated screen, but this is one of our
coworkers. He's going to bend.dev.staging.airbnb.com and loading his version of
the staging environment, plus all of the changes that he's already done. So he's
going to make a change. He's going to use our command line tool to invoke the
change, to do that really quick build using the Bazel environment, and then see
his change almost in real time. So let's see how that works. Wow. So this is
IntelliJ. This pixelation, I was not expecting this. So he's in IntelliJ.
He's making a really quick change. Now he's going to go to his other terminal.
This is all in the remote JetBrains Gateway workspace. He's going to say yak
reload. This is going to sync all of his changes to the remote pod. It's going
to do a really, really fast Bazel build. And then it's going to reload the
services. It's going to get the Docker container, it's going to get the jar.
It's going to reset the container. And then he can see his change all the way
down here take place really, really quickly. A really super simple
interface for all of the developers. And it's really, really fast.
So just to summarize some things, we talked about a lot of stuff. This was our
developer satisfaction. We wanted to figure out how do we fix this big red
rectangle right here? We can't do it just by doing small things. We need to do a
few really big things and do them really, really well. So this is what we did.
We did the staging environment. On top of that staging environment, we have this
really world class, I think, development environment called AirDev. We have
these unified builds across all of our monorepos. We have a whole bunch of IDE
improvements and then we take all of these things and we get even better success
by shipping all of them into the cloud. So this is a multi-year journey. You
can't just do these things overnight. We estimate that we're probably about 50%
or 60% of the way through this entire project. But we've already started to see
some really, really encouraging signs. Anecdotally, people are really loving
this. We're seeing some extreme initial signals that we're on the right path and
hopefully that this is the last dev environment and the last build system and
the last major restructuring that we have to do for a very long time. And this
all comes back to the engineers, right? We want to equip Airbnb developers to
engineer the best software of their entire careers. And I think we're well on
our way to hitting our goal. Thank you very much. I'm Cory. This is Janusz, and
we would like to open it up for some questions later, guys.