Simona Bateman discusses the signals she and her team observed in captured test telemetry data, which indicated that PR (pull request) builds were failing due to flaky tests and explained why their releases were being blocked.
Google Fitbit’s developer productivity team shares its successful strategies for improving developer experience over the last year. Learn about Google’s best practices around flaky test management. Also, discover lessons learned from the migration from Artifactory to Google Cloud’s Artifact Registry.
Simona Bateman is a software engineer passionate about making developers’ lives easier by identifying pain points and building tools to alleviate them. She has worked in a variety of roles across startups and at larger companies like LinkedIn and Fitbit/Google, with a recent focus on Gradle and Jenkins performance. In her spare time, you’ll find her hiking and camping with her two young boys, training for Spartan races, video gaming, and reading as many books as she can.
Managing non-deterministic tests is a multi-step process. Not only do flaky tests need to be discovered, but there must be practices in place to proactively deal with them once they’ve been detected. Gradle Enterprise’s Test Failure Analytics provides a comprehensive solution that helps detect, prioritize, and fix flaky tests.
Interested in Flaky Test Management? Try these next steps:
- Read how the Gradle Build Tool core team manages flaky tests through "flaky test days" prior to a major release.
- Check out this short video tutorial on how you can use Gradle Enterprise to mitigate, manage, and proactively eliminate flaky tests.
- Sign up for the next DPE online workshop that dedicates a full section to flaky test management.
Simona Bateman: Well, okay, awesome. So it's more than flaky tests today. I'm actually going to be talking about two stories, some of it spanning the past five and a half years and some just the past year. But let me go through a quick agenda first. I'll introduce myself and my team and what we do at Fitbit/Google. I'll give you some background info about our infrastructure and the tools that we support. And then we're going to go into our two stories, the flaky tests and the migrations. Everybody's favorite, right? All right. So my name's Simona. I've been on the Fitbit Developer Productivity Team for the past five and a half years. For those who don't know, at the beginning of 2021 Fitbit was acquired by Google, so now we are part of the Google Devices team. Our team still supports Fitbit, so everything from here on refers specifically to the Fitbit code and workflows.
So what do we own? Everything. It sounds like an exaggeration, but we own a lot of tools covering basically the entire workflow between writing your code, committing your code, and all the way to deployment. Some of these tools are infrastructure: we actually own the infrastructure for Jenkins and for binary repositories. Right now we're using Artifactory, and we're moving to Artifact Registry, which is what I'm going to talk about later. And we own the source control, Bitbucket; we're recently moving to Google's Gerrit. And then I've got more things: we own our own tools, some built internally, some third-party. We've got code quality, which is SonarQube. We have build analysis, like Gradle Enterprise, and you'll see it comes in really handy. Deployment automation and standardization, that's an internal tool we've built, trying to make it very easy for our engineers to go from that code to actually having things deployed in production. Flaky test reporting, which is a little bit integrated with Gradle Enterprise. We own a lot of Gradle plugins to help make things easier for our developers, and then also lots of Jenkins libraries. Who here has heard of Jenkins and uses Jenkins? Excellent, pretty much everybody. So with Jenkins pipelines, you can write them on your own, or you can have some common code, some common libraries, that make it a little bit easier for folks to write their jobs.

All right. So we told you what we do, what we own, all the infrastructure, all of that stuff. Let's talk in a little more detail. We've got a lot of code at Fitbit: we've got Java, we've got Python, we've got Go, Node. We are, in theory, a monorepo company, and yet we have over 2,000 repositories. Yes, not exactly monorepo. But let's talk a little bit about the monorepo. The monorepo is focused on our Java backend services, all of our REST APIs, and, as I just mentioned, lots of Java-based microservices. We have about 200 microservices in this monorepo, but we also have a monolith, which has been around for, well, I've been here for five and a half years. When I joined, we were saying we're getting rid of it, we're starting with microservices, we'll be done in two to three years. We're not done. I honestly don't know when we will be done. So hopefully soon.
But let's go into more detail about our monorepo. We have over 2,000 Gradle projects in it, and when I say Gradle projects, I mean build.gradle files. In terms of actual projects and services, I already told you there are about 200 microservices, and quite a lot of other libraries go into it. In terms of tests, we have lots of tests, over 20,000: unit tests, integration tests, component tests, automation tests, everything that you want. We have over 40 custom-built Gradle plugins that teams use to make it a little bit easier to build their projects. And of course, I've mentioned we have so many projects in this huge monorepo that when it comes to building, every time you make a change, you would have to build all of these projects. That sounds pretty scary, and I think if we were going to do that every single time, it would probably take over 3 hours to build everything, which is not very reasonable. So we built our own internal tool called the build orchestrator, which basically determines, based on project dependencies, what should be built when you change your code. Do I really need to build everything, or do I just need to build these two little projects? It saves us a lot of time on builds. But we still have a few things that result in the whole monorepo having to build, and when that happens, here come the flaky tests.
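As an aside, here is a minimal sketch of the kind of dependency-based selection a build orchestrator like the one above performs; the graph shape, project names, and change detection are hypothetical illustrations, not our actual tool.

```kotlin
// Hypothetical sketch: given a project dependency graph and the set of projects
// whose files changed, compute which projects actually need to be rebuilt
// (the changed projects plus everything that depends on them, transitively).
fun affectedProjects(
    dependsOn: Map<String, Set<String>>,   // project -> projects it depends on
    changed: Set<String>
): Set<String> {
    // Invert the graph: project -> projects that depend on it (its consumers).
    val consumers = mutableMapOf<String, MutableSet<String>>()
    for ((project, deps) in dependsOn) {
        for (dep in deps) consumers.getOrPut(dep) { mutableSetOf() }.add(project)
    }
    // Walk outward from the changed projects through their consumers.
    val toBuild = changed.toMutableSet()
    val queue = ArrayDeque(changed)
    while (queue.isNotEmpty()) {
        val current = queue.removeFirst()
        for (consumer in consumers[current].orEmpty()) {
            if (toBuild.add(consumer)) queue.add(consumer)
        }
    }
    return toBuild
}

fun main() {
    val graph = mapOf(
        "service-activity" to setOf("lib-auth", "lib-metrics"),
        "service-sleep" to setOf("lib-metrics"),
        "lib-auth" to emptySet<String>(),
        "lib-metrics" to emptySet<String>(),
    )
    // A change to lib-metrics rebuilds it plus both services, but not lib-auth.
    println(affectedProjects(graph, setOf("lib-metrics")))
}
```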
So I don't know if you've attended the previous talk, but what are flaky tests? Flaky tests are tests that sometimes pass and sometimes fail, and you don't actually know why. You would expect them to always pass, but they don't. So we've had this issue; it's probably everybody's bane of existence. We've had this issue for a long time. Ever since I joined, we had this nice little chat where everybody could come into a Slack channel and say, hey, my build is failing. Well, okay, great, how can I help? Well, these tests are failing, but the code that I just changed has absolutely nothing to do with these tests. So why is my build failing? Well, go talk to the team who owns the tests, and the team says, well, I don't know why my test is failing, it should be passing. So we try the build again: okay, build your pull request, let's see what happens. Oh, it passed this time. Oh, well, that's not good. We had so many requests every single week, and that was basically what we spent most of our time on: trying to debug, figure out whether these tests are flaky or not, and talk to the teams to fix them. It was very time consuming. We did not have a lot of time to work on our actual projects. So what are we going to do about it? Releases are blocked, everybody's asking us, and we're all spending time trying to figure things out. How many flaky tests do we have right now? I mean, it's been five and a half years, we've tried so many things, tried to reach out to the teams, and we still have over 200 flaky tests. So. We really don't like it when builds are blocked.
So what have we done? Before Gradle Enterprise, we built our own internal flaky test detector. Here's how we did it. Our monorepo, as I said, has over 2,000 projects, and okay, tests are failing. We need to somehow rerun these tests, but we're not sure if it's really the tests or something else in the infrastructure. So what we did was rerun the whole build three times and then find out how those tests fared. But notice that I said we were rebuilding the whole build three times, so everything took three times as long. While we were finally getting some reports and we knew which tests were flaky, that didn't really help us much when it came to developer productivity, because everything took three times longer. And here comes Gradle Enterprise. Since then, we have been using the test retry plugin, so now whenever a test fails, we retry it up to three times. But instead of running the entire build again, we're just rerunning the tests, which speeds things up considerably, and we have really good data. All right, great, we have data. That's wonderful. Do you think people go and look at this data? Do they ever fix these tests? Nope. We've had this tool for about a year and a half, and nobody ever fixed flaky tests. There was still a problem. And yes, we helped a little bit by doing the test retries, but we still had issues.
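To make that retry setup concrete, this is roughly what per-test retries look like with the Gradle test-retry plugin in a build.gradle.kts; the plugin version and thresholds here are illustrative assumptions rather than our exact settings.

```kotlin
// build.gradle.kts -- sketch of per-test retries via the Gradle test-retry plugin.
// Plugin version and threshold values below are illustrative assumptions.
plugins {
    java
    id("org.gradle.test-retry") version "1.5.8"
}

tasks.test {
    retry {
        maxRetries.set(3)                  // retry a failed test up to 3 times
        maxFailures.set(20)                // stop retrying if many tests fail (likely a real breakage)
        failOnPassedAfterRetry.set(false)  // a pass on retry keeps the build green (flaky, not failed)
    }
}
```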
So, what are we going to do about it? Here's that nice little dashboard; this slide should have been removed. Anyway, how are we going to tell people? Well, here are some issues. The Gradle Enterprise tool is amazing. It really helps you explore the tests and figure out: when did this start being flaky? How long has it had a problem? What other tests are flaky, or flaking in coordination with this test? Awesome tool. But again, people don't look at it. And also, people don't know: wait, is this my test or is this some other team's test? So we had to make it very clear to our engineers: it's not enough for us to tell them, each team needs to know that they have flaky tests and that they need to fix them. So we created a small application that generates a report, integrated with our ownership system. We took the data that Gradle Enterprise provides and, based on the test class names, determined the owners (a rough sketch of that mapping follows below). Then every week we send out an email, and every two weeks we have an engineering presentation that, in theory, all engineers should attend, and we put up a slide and say, hey, you three teams, you've got a lot of flaky tests, please fix them. And of course, you hope that actually helps somewhat. What we did see is an improvement in people's awareness. We've seen people saying, hey, I see that I have a problem, I'm going to create a ticket, here is the ticket. So now we have communication: somebody claimed responsibility, and they have tickets that they're going to work on eventually. But at least there's awareness.
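To give a flavor of that ownership mapping, here is a minimal sketch. The package prefixes, team names, and the idea of feeding it an exported list of flaky test classes are assumptions for illustration, not the actual internal tool; fetching the data out of Gradle Enterprise is left out.

```kotlin
// Hypothetical sketch: map flaky test classes (e.g. exported from Gradle Enterprise
// test data) to owning teams via package-prefix ownership rules, then group them
// into a per-team report that can be mailed out weekly.
data class FlakyTest(val className: String, val flakyRuns: Int)

// Longest-prefix match against an ownership table (package prefix -> team).
fun ownerOf(className: String, owners: Map<String, String>): String =
    owners.entries
        .filter { className.startsWith(it.key) }
        .maxByOrNull { it.key.length }
        ?.value ?: "unowned"

fun weeklyReport(flaky: List<FlakyTest>, owners: Map<String, String>): String =
    flaky.groupBy { ownerOf(it.className, owners) }
        .entries
        .joinToString("\n") { (team, tests) ->
            "$team: ${tests.size} flaky test(s)\n" +
                tests.sortedByDescending { it.flakyRuns }
                    .joinToString("\n") { "  ${it.className} (flaked ${it.flakyRuns}x)" }
        }

fun main() {
    // Hypothetical ownership table and flaky-test export.
    val owners = mapOf(
        "com.fitbit.sleep" to "team-sleep",
        "com.fitbit.activity" to "team-activity",
    )
    val flaky = listOf(
        FlakyTest("com.fitbit.sleep.SleepStageTest", 14),
        FlakyTest("com.fitbit.activity.StepCountTest", 3),
    )
    println(weeklyReport(flaky, owners))
}
```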
So now what? People have tickets, it's great, they're aware, but they still don't actually focus on it. So now, being part of Google, we like to look at Google's processes, and one of the things they have is these priority codes.
Does anybody remember log4j? Okay, funny story. I was on call, log4j comes in, and I get a notification on my watch. I'm also in the middle of a parkour class. I'm about to jump when I get the notification on my watch, and it says code red? I'm like, I don't know what code red is, I've never heard of it. I jump, I break my ankle. That is actually a very true story. I still haven't fully recovered, but I've learned about the codes. So what are these code things, code red, code yellow? This is something the organization does, usually with support from directors, saying: we have a problem and we need to fix it, depending on the gravity of the issue. Code red: everybody works nonstop until log4j is resolved, a huge security issue. Code yellow: we have a problem, and we're going to keep working on it during working hours until it's resolved and complete. So for our next steps, this is what we're doing. We're currently creating the code yellow and working on getting the directors' support; we're almost there. For several weeks, every single team will spend their working hours fixing those flaky tests. So hopefully, maybe by next year, I will have an answer on how that went. We hope that it works well.
All right, that was my first story. Second story: repository migration. And I'm talking about binary repositories, whatever it is you're using to store your build artifacts: Docker images, jars, Python packages, you name it. Right now we're using Artifactory. But of course, the question is, why are you migrating? How many people here use Artifactory? Okay, and those who didn't raise their hands, you use something else, maybe internal or external. Does JetBrains have their own artifact repository, something like that? Okay, cool. All right. So, in general, when it comes to migrations, what are the reasons you may want to migrate? It can be that the current tool you're using is not really good, it can be because developers are complaining, or sometimes it comes from decisions made by the higher-ups. For us, the reasons were twofold. One, we're part of Google now, so one of the requirements is that we should try as much as possible to use internal or first-party tools, and if you really want to use a third-party tool, you have to build your case. Now, how many people have heard of GCP Artifact Registry? Okay, a few, that's good. It's a pretty new product; I think it's been around for about a year. So last year when we joined Google, we weren't really considering it because it wasn't quite there yet. The other reason was integration: everything that we own is already in GCP, so this makes it easy from a security standpoint, we use the same IAM roles, very easy access.
But there were some other things that we wanted to do with this. Since we're moving, as we're thinking about designing and planning the migration, what should we take into account? Okay, how many artifacts do we have? I'm not even going to do a count; it's probably in the millions. But let's just look at repositories. We've got 21 local repositories in Artifactory; basically, that's where we store our internal things. 57 remote, which are the repositories that connect to some external source, usually external dependencies. And 13 virtual; virtual repositories combine local and remote, which makes things a little bit easier to use. What else do you need to think about when it comes to migration? Well, security. I don't know about your employers, but Google is very, very strict about security. We cannot just do anything we want. We need to go through a review process, because we want to make sure that Google's data is protected and that our users have confidence that all the data we hold about them is safe and secure. So it's usually about a two-month process to get approvals whenever you design something. Okay, so those are the first two things. Now we need to talk about existing features: features that we use in Artifactory right now, and features that we would want to use in Artifact Registry but that maybe don't exist yet. And there are still quite a few that don't exist yet. They will. But what do you do when you have a hard deadline and you need to move to a new product, but it doesn't have the features? Challenges. We'll talk about that.

I put a little note here: making it self-service. Right now, the way we run our Artifactory is that our team does everything; we are basically the administrators of it, we own the infrastructure. When somebody asks, hey, I need a new repository in Artifactory, we say, okay, we'll create it for you. What do you need? What do you want? Who do you want to have access to it? It's a lot of work on us. So we decided that with this wonderful opportunity of being integrated with GCP, we're going to make everything self-service. You want a repo? Great, go click on this Jenkins job. It'll create it for you, it will give you all the permissions you want, and our team doesn't have to do anything. So we made it easy for you, and we made it easy for us.
But with every migration, there are challenges. So, I mentioned Artifactory. As many of you know, it's been around for a while, it's a really good product, it's very feature-rich, and we are moving to Artifact Registry, which is also awesome, but it's new. It doesn't have quite every single feature that we need or that we're used to from Artifactory. So those are some challenges. Another challenge, related to that feature problem, is that we have to work closely with the Artifact Registry team itself and base our migration roadmap on their roadmap of feature inclusion, and if the Artifact Registry roadmap is delayed, we have to get creative. So I'm going to cover a few challenges. First, remote repositories. Artifactory already has them, and Artifact Registry does as well, but currently, at least for Maven (we're going to talk just about Maven right now), they only support Maven Central. I mentioned before that we have lots of external or remote repositories; we are actually accessing 18 different ones, so Maven Central is not enough for us. We can either wait until everything's available, which is a possibility, but if the timeline gets constricted, we need to get creative, and that's when we start mirroring. So we're basically working on scripts to mirror what's currently in Artifactory or externally and put it into Artifact Registry, at least temporarily, until everything that we want is there.
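The mirroring itself is conceptually simple; here is a rough sketch of the loop, with the repository client left as a placeholder interface rather than any real Artifactory or Artifact Registry API.

```kotlin
// Hypothetical sketch of the mirroring idea: copy any Maven coordinates present in the
// source repository (Artifactory or an external remote) but missing from the target
// (Artifact Registry). The helper functions are placeholders, not a real client API.
data class MavenCoordinate(val group: String, val artifact: String, val version: String)

interface MavenRepo {
    fun list(): Set<MavenCoordinate>                 // enumerate published coordinates
    fun download(c: MavenCoordinate): ByteArray      // fetch the artifact files
    fun publish(c: MavenCoordinate, bytes: ByteArray)
}

fun mirror(source: MavenRepo, target: MavenRepo) {
    // Only copy what the target doesn't already have.
    val missing = source.list() - target.list()
    for (coordinate in missing) {
        target.publish(coordinate, source.download(coordinate))
    }
}
```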
All right, challenge number two: we use a lot of metadata on our artifacts. Here are some examples. We like to know who owns what. We know who owns the code, but you create artifacts and Docker images, and when you realize there's something wrong with one, or it's missing, or it's too big, or you've been storing them for the past two years and haven't been deleting them, who do you reach out to? Well, our owner labels tell us that. Okay, I mentioned that you've been keeping images for the past two years. That's a lot of storage, so we need to be able to do some cleanups from time to time. But some teams really want to keep certain things; even though they are two years old, they don't want them removed. So we also need to tag those as do-not-ever-delete. So now we have this challenge: we don't have tagging, or rather labeling, of metadata in Artifact Registry yet. We've requested the feature, and it will get done, but in the meantime we still need to support this. So our plan, and of course we're hoping that Artifact Registry will ship it before we actually have to do this, is to create our own service that provides this custom metadata: just a REST API with a simple database. Here's my image, here's its path in Artifact Registry, and here's the metadata I need associated with it.
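Here is a minimal sketch of what that custom metadata lookup could look like; the data model, example paths, and in-memory storage are assumptions standing in for the real REST API and database.

```kotlin
// Hypothetical sketch of the custom metadata idea: key metadata by the artifact's path in
// Artifact Registry and consult it before cleanups. A real version would sit behind a small
// REST API with a database; an in-memory map stands in for that storage here.
data class ArtifactMetadata(
    val owner: String,          // who to contact about this artifact
    val doNotDelete: Boolean,   // opt-out flag for retention cleanups
    val labels: Map<String, String> = emptyMap(),
)

class MetadataStore {
    private val byPath = mutableMapOf<String, ArtifactMetadata>()

    fun put(artifactPath: String, metadata: ArtifactMetadata) {
        byPath[artifactPath] = metadata
    }

    fun get(artifactPath: String): ArtifactMetadata? = byPath[artifactPath]

    // A cleanup job would skip anything marked do-not-delete.
    fun deletable(paths: List<String>): List<String> =
        paths.filter { byPath[it]?.doNotDelete != true }
}

fun main() {
    val store = MetadataStore()
    // Path format below is only illustrative.
    val path = "us-docker.pkg.dev/example-project/images/sync-service:2021-10-01"
    store.put(path, ArtifactMetadata(owner = "team-sync", doNotDelete = true))
    println(store.deletable(listOf(path)))   // empty: this image is protected from cleanup
}
```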
Challenge number three. Well, that's all well and good when it's just you and your team migrating stuff, but everybody has code: 2,000 repositories publishing or downloading things from Artifactory. We have to get everybody to move. How easy do you think that is? Not easy. Now, people do want to do it, but people don't read their emails; they seriously never read their emails. So we start easy. We document everything. We write scripts to automate as much as possible. We give code examples. All people have to do is test: just pick your projects and, for everything you do, test to make sure everything runs smoothly and correctly. So we did that. We set a hard deadline, we sent out emails, we had announcements in our weekly or biweekly meetings. I was actually super impressed with the teams on the monorepo: 75% actually tested, which is really impressive. But out of the teams owning all the other repos, the other 1,999, how many tested? Zero. So I don't know what's happening there, but they're going to have to hurry up, because Artifactory is going to be shut down soon. And speaking of Artifactory being shut down, we're going to do some brownouts: we're going to shut down Artifactory on purpose, for a short period of time, just to remind folks that you need to do this. It's a strategy, and it actually works when people don't read their email; it forces the issue.
All right. So, with any migration, things never go smoothly. There are multiple types of repositories, right? We started out with Maven. We started migrating, we pushed the code, the monorepo is there, it's set. I did mention people don't read their emails, right? Well, in order for your local machine to work with this, you need to authenticate with Google Cloud. There's a little script that you have to run, pretty easy. But again, people do not read their emails. The day after we merged the code: somewhere between three and ten requests. My build isn't working. I can't download this. Why am I getting 404s? Did you read your email? Here, go run the script, here's the troubleshooting page. So that actually went pretty smoothly.
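For a flavor of what each build had to change to, here is a rough sketch of pointing Gradle at a Maven repository in Artifact Registry using Google's Artifact Registry Gradle plugin; the plugin version, project ID, and repository name are placeholders, and on a laptop you first authenticate with gcloud for it to work.

```kotlin
// build.gradle.kts -- sketch of resolving dependencies from an Artifact Registry Maven repo.
// Project ID, repository name, and plugin version below are placeholders.
plugins {
    java
    // Google's helper plugin resolves artifactregistry:// URLs using your Google Cloud
    // application-default credentials (run `gcloud auth application-default login` locally).
    id("com.google.cloud.artifactregistry.gradle-plugin") version "2.2.1"
}

repositories {
    maven {
        url = uri("artifactregistry://us-maven.pkg.dev/example-project/maven-virtual")
    }
    mavenCentral()
}
```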
But remember, I said earlier that we have a monorepo, and in the monorepo we have a monolith. The monolith is very, very complex, and even though I technically asked the team to test it, they tested only a little bit; they didn't go through their whole workflow. So the day after, they were trying to do a release of the monolith and it was failing. Okay, well, let's see what's going on. We'll help you out, we'll fix the issues. So we worked with them closely, we submitted more PRs to fix their issues. It looks like it's going well, and then it fails again. And then again. And then we're about two weeks in. Do we want to roll back? Do we want to continue trying to fix this? Because we don't know where this ends; we're playing Whac-A-Mole. I'm a little nervous at this point because, you know, we have a monorepo, so the master branch has not only our changes but everybody else's changes. So now we have to cherry-pick and figure out who's done anything related to this. And we have other repos that depend on the changes that we've made, and we really, really don't want to revert. So anyway, this created a major incident. The monolith was not released for, how long? Can you guess? Somebody guess? Sorry, what? No, not five years, that's a different story. It was a month. It was a month-long incident. We were actually ready on that last day of the incident to roll back; we had it ready. And somehow, miraculously, everything passed, so we didn't have to roll back.

Oh, I was going to mention one thing: there were mystery compilation errors. It's an interesting story. One of the first failures was this schema update job failing with a compilation error. What is going on? This worked yesterday, why is it failing? Usually when we have mystery compilation issues, generally related to third-party dependencies, it means that something changed and we don't pin our versions, which is both good and bad. It turns out that when we introduced Artifact Registry, and some of the Gradle plugins that come with it, we also introduced a new version of Guava. We went from Guava maybe 18, something like that, to something like 30-android. That broke everything. Now that I'm thinking about it, I wish I had a screenshot: Gradle Enterprise came in really handy here. You can actually compare two different builds, so we compared the build from a month ago to the current build, and you can compare the dependency versions. That's when we noticed the Guava change; it's like, oh, I bet you this breaks everything. So we pinned it, and everything works. Just letting you know about useful tools.
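For completeness, pinning a transitive dependency like that in Gradle can be done with a dependency constraint; the Guava version below is illustrative rather than the exact one we ended up on.

```kotlin
// build.gradle.kts -- sketch of pinning a transitive dependency whose version drifted.
plugins { java }

dependencies {
    constraints {
        implementation("com.google.guava:guava") {
            version { strictly("18.0") }  // illustrative version; rejects transitive upgrades
            because("a Guava bump that arrived with the migration broke compilation")
        }
    }
}
```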
All right, so we were sort of done with Maven. We had some interesting incidents, and hopefully we'll learn some things from them, but the show must go on. We've only migrated Maven; we still have Docker, Python, Go, Debian, and 2,000 repositories. We still have to implement the workarounds that we mentioned to get things going. And the one big lesson out of all of this is: do not underestimate legacy code. You think it's easy, but it just nips you in the backside. Thank you. And if anybody has questions...