What’s inside?
Simona discusses the signals she and her team observed by capturing test telemetry data, which indicated that PR (pull request) builds were failing due to flaky tests and explained why their releases were being blocked.
Description
Google Fitbit's developer productivity team shares their successful strategies for improving developer experience over the last year. Learn about Google's best practices around flaky test management, and discover lessons learned from the migration from Artifactory to Google Cloud's Artifact Registry.
About Simona
Simona Bateman is a software engineer passionate about making developers’ lives easier by identifying pain points and building tools to alleviate them. She has worked in a variety of roles across startups and at larger companies like LinkedIn and Fitbit/Google, with a recent focus on Gradle and Jenkins performance. In her spare time, you’ll find her hiking and camping with her two young boys, training for Spartan races, video gaming, and reading as many books as she can.
Gradle Enterprise Solutions for Flaky Test Management
Managing non-deterministic tests is a two-step process: not only do flaky tests need to be detected, but there must be practices in place to proactively deal with them once they have been detected. Gradle Enterprise's Test Failure Analytics provides a comprehensive solution that detects flaky tests and surfaces them in dashboards. This increased observability allows teams to prioritize and eliminate flaky tests more effectively.

Interested in Flaky Test Management? Try these next steps:

Simona Bateman: Well, okay, awesome. So it's more than flaky tests today. I'm actually going to be talking about two stories: some of what they cover spans the past five and a half years, some just the past year. But let me go through a quick agenda first. I'll introduce myself, my team, and what we do at Fitbit/Google. I'll give you some background info about our infrastructure and the tools that we support. And then we're going to go into our two stories: the flaky tests and the migration. Everybody's favorite, right? All right. So my name's Simona. I've been on the Fitbit Developer Productivity Team for the past five and a half years. For those who don't know, at the beginning of 2021, Fitbit was acquired by Google, so now we are part of the Google Devices team. Our team still supports Fitbit, so all of the talk from now on is really referring specifically to the Fitbit code and workflows.

So what do we own? Everything. It sounds like an exaggeration, but we own a lot of tools around basically the entire workflow that goes from writing your code and committing your code all the way to deployments. Some of these tools include infrastructure: we actually own the infrastructure for Jenkins and for our binary repositories. Right now we're using Artifactory, and we're moving to Artifact Registry, which is what I'm going to talk about later. And we own the source control, Bitbucket; we're currently moving to Google's Gerrit. And then I've got more things: we own our own tools, some built internally, some third-party. We've got code quality, which is SonarQube. We have build analysis, like Gradle Enterprise, and you'll see it comes in really handy. Deployment automation and standardization: that's an internal tool we've built, trying to make it very easy for our engineers to go from code to actually having things deployed in production. Flaky test reporting, which is a little bit integrated with Gradle Enterprise. We own a lot of Gradle plugins to help make things easier for our developers, and then also lots of Jenkins libraries. Jenkins: who here has heard of Jenkins and uses Jenkins? Excellent, pretty much everybody. So with Jenkins pipelines, you can write them on your own, or you can have some common code, some common libraries, that make it a little bit easier for folks to write their jobs. All right. So I told you what we do and what we own, all the infrastructure, all of that stuff. Let's talk about a few more details. We've got a lot of code at Fitbit: Java, Python, Go, Node. We are, in theory, a monorepo company, and yet we have over 2,000 repositories. Yes, not exactly monorepo. But let's talk a little bit about the monorepo. The monorepo is supposed to house our Java backend services, all of our REST APIs, and, as I just mentioned, lots of Java-based microservices. We have about 200 microservices in this monorepo, but we also have a monolith, which has been around for... well, I've been here for five and a half years. When I joined, we were saying we're getting rid of it, we're starting with microservices, we'll be done in two to three years. We're not done. I honestly don't know when we will be done. Hopefully soon.

But let's talk about but more details about our monorepo. We have over 2000 Gradle projects in it. And when I say Gradle projects, I mean build.Gradle files. In terms of actual project, services, I already told you there are about 200 microservices. Quite a lot of other libraries go into it. In terms of tests, we have lots of tests, over 20 K tests, unit tests, integration tests, component tests, automation tests, everything that you want. Custom build Gradle plugins over 40, that are teams you use to make things a little bit easier to build there. There little projects. And of course I've mentioned we have so many projects and this huge monorepo when it comes to building, every time you have to make a change, you have to build all of these projects. That sounds pretty scary. And I think if we were going to do that every single time, it would probably take over 3 hours to build everything, which is not very reasonable. So we build our own internal tool called a build orchestrator, basically determines based on project dependencies. When you change your code, what should I be building? Really? Do I really need to build everything or do I just need to build these two little projects? It saves us a lot of time on builds. But we still have a few things that result in the whole monorepo having to build. And then when that happens, here comes flaky tests. So I don't know if you've attended the previous talk, but what are flaky tests?

Flaky tests are tests that sometimes pass, sometimes fail, but you don't actually know why. You would expect them to always pass, but they don't. We've had this issue for a long time; it's probably everybody's bane of existence. Ever since I joined, we had this nice little Slack channel where everybody could come in and say, hey, my build is failing. Well, okay, great, how can I help? Well, these tests are failing, but the code that I just changed has absolutely nothing to do with these tests. Why is my build failing? Well, go talk to the team who owns the tests, and the team says, well, I don't know why my test is failing, it should be passing. So we try the build again: okay, build your pull request, let's see what happens. Oh, it passed this time. Well, that's not good. We had so many requests every single week, and that was basically what we spent most of our time on: trying to debug, find out whether these tests are flaky or not, and talk to the teams to fix them. It was very time consuming, and we did not have a lot of time to work on our actual projects. So what are we going to do about it? Releases are blocked, everybody's asking us, we're all spending time trying to figure things out. How many flaky tests do we have right now? I mean, it's been five and a half years, we've tried so many things, tried to reach out to the teams, and we still have over 200 flaky tests. And we really don't like it when builds are blocked. So what have we done? Before Gradle Enterprise, we built our own internal flaky test detector. Here's how we did it. Our monorepo, as I said, is over 2,000 projects, and okay, tests are failing; we need to somehow rerun these tests, but we're not sure if it's really the tests or something else in the infrastructure. So what we did was rerun the whole build three times and then find out how those tests fared. But notice that I said we were rerunning the whole build three times, so everything took three times as long. So while we were finally getting some reports and we knew which tests were flaky, that didn't really help us much when it came to developer productivity, because everything took three times longer. And here comes Gradle Enterprise. Since then, we have been using the test retry plugin. So now whenever a test fails, we retry it, up to three times, but instead of running the entire build again, we're just rerunning the tests. That speeds things up considerably, and we have really good data. All right, great, we have data. That's wonderful. Do you think people go and look at this data? Do they ever fix these tests? Nope. So we've had this tool for about a year and a half, and nobody ever fixed flaky tests. There was still a problem. And yes, we helped a little bit by doing the test retries, but we still had issues.
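For reference, retrying only the failed tests looks roughly like this with the open-source Gradle Test Retry plugin in a build.gradle.kts; the plugin version and the exact settings below are illustrative, not necessarily Fitbit's configuration.

```kotlin
// build.gradle.kts — a sketch of retrying failed tests instead of re-running the whole build.
// Plugin version and settings are illustrative.
plugins {
    java
    id("org.gradle.test-retry") version "1.5.8"
}

tasks.test {
    retry {
        maxRetries.set(3)                  // re-run a failing test up to three times
        maxFailures.set(20)                // too many failures likely means a real breakage, so stop retrying
        failOnPassedAfterRetry.set(false)  // a test that passes on retry is reported flaky, not failed
    }
}
```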

So, what are we going to do about it? Here's that nice little dashboard. (That slide should have been removed.) Anyway, how are we going to tell people? Well, here are some issues. The Gradle Enterprise tool is amazing. It really helps you explore a test to figure out: when did this start being flaky? How long has it had a problem? What other tests are flaky, or flaking in coordination with this test? Awesome tool. But again, people don't look at it. And also, people don't know: wait, is this my test or is this some other team's test? So we had to actually make it very clear to our engineers; it is not enough for the data to exist. Your team needs to know that you have flaky tests and that you need to fix them. So we created a small application that generates a report, integrated with our ownership system. We took the data that Gradle Enterprise provides and, based on the test class names, we determined the owners. Then every week we sent out an email, and every two weeks we have basically an engineering presentation that in theory all engineers should attend, and we put out a slide saying, hey, you three teams, you've got a lot of flaky tests there, please fix them. And of course, you want that to actually help somewhat. What we did see is an improvement in people's awareness. We've seen people saying, hey, I see that I have a problem, I'm going to create a ticket, here is the ticket. So now we have communication: yes, somebody claimed responsibility, and they have tickets they're going to work on eventually. At least there's awareness. So now what? People have tickets, it's great, they're aware, but they still don't actually focus on it. So now, being part of Google, we like to look at Google's processes, and one of the things that they have is these priority codes.
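As an aside on the ownership report itself, here is a toy sketch of the kind of mapping described above: route flaky test classes (the data would come from Gradle Enterprise) to owning teams by package prefix, then group them for the weekly email. The prefixes, team names, and data shape are made up for illustration.

```kotlin
// A toy sketch of routing flaky tests to owning teams by test-class package prefix.
// Prefixes, team names, and the data shape are made up for illustration.
data class FlakyTest(val className: String, val flakyRuns: Int)

fun ownerOf(className: String, ownersByPrefix: Map<String, String>): String =
    ownersByPrefix.entries
        .filter { className.startsWith(it.key) }
        .maxByOrNull { it.key.length }   // the most specific prefix wins
        ?.value ?: "unowned"

fun main() {
    val ownersByPrefix = mapOf(
        "com.example.sleep" to "Sleep Team",
        "com.example.activity" to "Activity Team",
    )
    val flakyTests = listOf(
        FlakyTest("com.example.sleep.SleepStageCalculatorTest", 7),
        FlakyTest("com.example.activity.StepCounterTest", 3),
    )
    // Group per team, ready to be dropped into the weekly email.
    flakyTests.groupBy { ownerOf(it.className, ownersByPrefix) }
        .forEach { (team, tests) -> println("$team -> ${tests.map { it.className }}") }
}
```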

Does anybody remember log4j? Okay, funny story. I was on call, log4j comes in, and I get a notification on my watch. I'm also in the middle of a parkour class. I'm about to jump, I get the notification on my watch, and it says code red? I'm like, I don't know what code red is, I've never heard of it. I jump, I break my ankle. Actually, that is a very true story. I still haven't fully recovered, but I've learned about the codes. So what are these code things? Code red, code yellow: this is something that the organization does, usually with support from directors, saying we have a problem, we need to fix this, depending on the gravity of the issue. Code red: everybody works nonstop until log4j is resolved, a huge security issue. Code yellow: we have a problem, and we're going to continue working on this during our working hours until it's resolved and complete. So for our next steps, this is what we're doing. We're currently creating the code yellow and working on getting the directors' support; we're almost there. The plan is to spend several weeks where every single team spends their working hours fixing those flaky tests. So hopefully, maybe by next year, I will have an answer on how that went. We hope that it works well.

All right. That was my first story. Second story: repository migration. And I'm talking about binary repositories, whatever it is you're using to store your build artifacts: Docker images, jars, Python packages, you name it. Right now we're using Artifactory. But of course, the question is, why are you migrating? How many people here use Artifactory? Okay, and those who didn't raise their hands use something else, maybe internal or external. Does JetBrains have their own, like, artifact repository thing? Okay. All right, cool. So, in general, when it comes to migrations, what are the reasons you may want to migrate? It can be that the current tool you're using is not really good. It can be because developers are complaining, or sometimes it comes from decisions made by the higher-ups. For us, the reasons were twofold. One, we're part of Google now, so one of the requirements is that we should try as much as possible to use internal or first-party tools, and if you really want to use a third-party tool, you have to build your case. Now, how many people have heard of GCP Artifact Registry? Okay, a few. That's good. It's a pretty new product; I think it's about a year old. So last year when we joined Google, we weren't really considering it because it wasn't quite there yet. The other reason was that we need to integrate: everything that we own is already in GCP, so this makes it easy from a security standpoint, using the same IAM roles, very easy access. But there were some other things that we wanted to do with this. So since we're moving, as we're thinking about designing and planning the migration, what should we take into account? Okay, how many artifacts do we have? I'm not even going to count; it's probably in the millions. But let's just look at repositories. We've got 21 local repositories in Artifactory; basically, that's where we store our internal things. 57 remote repositories, the ones that connect to some external source, usually external dependencies. And 13 virtual repositories, which basically combine local and remote and make things a little bit easier to use. What else do you need to think about when it comes to migration? Well, security. I don't know about your employers, but Google is very, very strict about security. We cannot just do anything we want. We need to go through a review process, because we want to make sure that Google's data is protected and that our users have confidence in us, that all the data we hold about them is safe and secure. It's usually about a two-month process to get approvals whenever you design something. Okay, so those are the first two things. Now we need to talk about existing features: features that we use in Artifactory right now and would want in Artifact Registry, but maybe they don't exist yet. And there are still quite a few that don't exist yet. They will, but what do you do when you have a hard deadline and you need to move to a new product that doesn't have the features yet? Those are the challenges; we'll talk about that. I put a little note here: making it self-service. Right now, the way we manage our Artifactory is that our team does everything; we are basically the administrators of it. We own the infrastructure. When somebody asks, hey, I need a new repository in Artifactory, we say, okay, we'll create it for you. What do you need? What do you want?
Who do you want to have access to it? It's a lot of work for us. So we decided that, with this wonderful opportunity of being integrated with GCP, we're going to make everything self-service. You want a repo? Great. Go click on this Jenkins job. It'll create it for you and give you all the permissions you want, and our team doesn't have to do anything. So we made it easy for you, and we made it easy for us.
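As a hedged sketch of what such a self-service job might run under the hood: the gcloud Artifact Registry commands themselves are real, but the project ID, location, group-naming convention, and overall wiring below are illustrative assumptions, not Fitbit's actual job.

```kotlin
// A sketch of what a self-service "create me a repository" job could run.
// The gcloud Artifact Registry commands are real; the project ID, location,
// group naming convention, and overall wiring are illustrative assumptions.
fun createSelfServiceRepo(repoName: String, team: String) {
    val project = "example-gcp-project"
    val commands = listOf(
        // Create a Maven-format repository in Artifact Registry.
        listOf(
            "gcloud", "artifacts", "repositories", "create", repoName,
            "--repository-format=maven", "--location=us", "--project=$project"
        ),
        // Grant the requesting team's group write access to the new repository.
        listOf(
            "gcloud", "artifacts", "repositories", "add-iam-policy-binding", repoName,
            "--location=us", "--project=$project",
            "--member=group:$team@example.com", "--role=roles/artifactregistry.writer"
        )
    )
    for (command in commands) {
        val exitCode = ProcessBuilder(command).inheritIO().start().waitFor()
        check(exitCode == 0) { "Command failed: ${command.joinToString(" ")}" }
    }
}
```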

But with every migration there are challenges. I mentioned Artifactory; as many of you know, it's been around for a while. It's a really good product, very feature-rich, and we are moving to Artifact Registry, which is also awesome, but it's new. It doesn't have quite every single feature that we need or that we're used to from Artifactory. So those are some challenges. Another challenge we've had, I've mentioned, is the feature problem: we have to work closely with the Artifact Registry team itself and align our migration roadmap with their roadmap of feature inclusion, and if the Artifact Registry roadmap is delayed, we have to get creative. So I'm going to cover a few challenges. Remote repositories: Artifactory already has them, and Artifact Registry does as well, but, at least for Maven (we're going to talk just about Maven right now), they currently only support Maven Central. I've mentioned before that we have lots of external or remote repositories; we are actually accessing 18 different repositories, so Maven Central is not enough for us. We can either wait until everything's available, which is a possibility, but if the timeline gets constrained, we need to get creative, and that's when we start mirroring. So we're basically working on scripts to mirror what's currently in Artifactory, or in external sources, and put it into Artifact Registry, at least temporarily, until everything that we want is there.
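On the consuming side, here is a sketch of what pointing Gradle at Artifact Registry can look like with Google's Artifact Registry Gradle plugin; the plugin version, project ID, and repository names are illustrative, not Fitbit's actual setup.

```kotlin
// build.gradle.kts — a sketch of resolving dependencies from Artifact Registry.
// The plugin handles authentication via gcloud / application-default credentials.
// Plugin version, project ID, and repository names are illustrative.
plugins {
    id("com.google.cloud.artifactregistry.gradle-plugin") version "2.2.1"
}

repositories {
    // Internal artifacts, roughly what the Artifactory "local" repositories held.
    maven { url = uri("artifactregistry://us-maven.pkg.dev/example-gcp-project/internal-maven") }
    // Mirrored third-party artifacts, until more remote sources are supported natively.
    maven { url = uri("artifactregistry://us-maven.pkg.dev/example-gcp-project/third-party-mirror") }
    // Maven Central, which Artifact Registry remote repositories already support.
    mavenCentral()
}
```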

All right, challenge number two: we use a lot of metadata on our artifacts. Here are some examples. We like to know who owns what. We know who owns the code, but you create artifacts and Docker images, and when you realize there's something wrong with one, or it's missing, or it's too big, or you've been storing them for the past two years and haven't been deleting them, who do you reach out to? Well, our owner labels tell us that. Okay, I've mentioned that we've been keeping images for the past two years; that's a lot of storage. So we need to be able to actually do some cleanups from time to time. But some teams really want to keep certain things: even though they are two years old, they don't want them removed. So we also need to tag them as do-not-ever-delete. And now we have this challenge: we don't have this tagging, or rather labeling with metadata, in Artifact Registry yet. We've requested the feature, and it will get done, but in the meantime we still need to support this. So our plan, and of course we're hoping that Artifact Registry will deliver it before we have to actually do this, is to create our own custom service that provides this metadata, just a REST API with a simple database: here's my image, the path in Artifact Registry, and the metadata that I need associated with it.
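A minimal sketch of the kind of record such a service might keep, keyed by the artifact's path in Artifact Registry; the field names and the in-memory store are stand-ins for the real REST API and database, not Fitbit's actual design.

```kotlin
// A minimal sketch of the metadata record such a service could store, keyed by the
// artifact's path in Artifact Registry. Field names and the in-memory store are
// stand-ins for the real REST API and database.
data class ArtifactMetadata(
    val artifactPath: String,              // e.g. "us-docker.pkg.dev/example-gcp-project/images/api:1.2.3"
    val ownerTeam: String,                 // who to reach out to about this artifact
    val doNotDelete: Boolean = false,      // opt this artifact out of automated cleanup
    val labels: Map<String, String> = emptyMap()
)

class MetadataStore {
    private val byPath = mutableMapOf<String, ArtifactMetadata>()

    fun put(metadata: ArtifactMetadata) { byPath[metadata.artifactPath] = metadata }
    fun get(path: String): ArtifactMetadata? = byPath[path]

    // A cleanup job would consult this before deleting anything.
    fun isDeletable(path: String): Boolean = byPath[path]?.doNotDelete != true
}
```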

Challenge number three. Well, that's all well and good when it's just you and your team migrating stuff. But everybody has code, 2,000 repositories publishing or downloading things from Artifactory, and we have to get everybody to move. How easy do you think that is? Not easy. Now, people do want to do it, but people don't read their emails; they seriously never read their emails. So we start easy. We document everything. We write scripts to automate as much as possible. We give code examples. All people have to do is test: just pick your projects and, whatever you do, test to make sure everything runs smoothly and correctly. So we did that. We set a hard deadline, we sent out emails, we had announcements in our weekly or biweekly meetings. I was actually super impressed with the teams on the monorepo: 75% actually tested. That's really impressive. But out of the teams owning all the other repos, the 1,999, how many tested? Zero. So I don't know what's happening there, but they're going to have to hurry up, because Artifactory is going to be shut down soon. And speaking of Artifactory being shut down, we're going to do some brownouts: we're going to shut down Artifactory on purpose, for a short period of time, just to remind folks that you need to do this. It's a strategy, and it actually works when people don't read their email: forcing the issue.

All right. So with any migration, things don't ever go smoothly. There are multiple types of repositories, right? We started out with Maven. We started migrating, we pushed the code, the monorepo is there, it's set. I did mention people don't read their emails, right? Well, in order for your local machine to work with this, you need to authenticate with Google Cloud. There's a little script that you have to run, pretty easy. But again, people do not read their emails. The day after we merged the code: somewhere between three and ten requests. My build isn't working. I can't download this. It's not working. Why am I getting 404s? Did you read your email? Here, go run the script. Here's the troubleshooting page. So that actually went pretty smoothly. But remember I said earlier, we have a monorepo, and in the monorepo we have a monolith. The monolith is very, very complex, and even though technically I asked the team to test it, they tested only a little bit; they didn't go through their whole workflow. So the day after, they were trying to do a release of the monolith and it was failing. Okay, well, let's see what's going on. We'll help you out, we'll fix the issues. So we work with them closely, we submit more PRs to fix their issues. It looks like it's going well, and then it fails again. And then again. And then we're about two weeks in. Do we want to roll back? Do we want to continue to try to fix this? Because we don't know; we're playing Whac-A-Mole. I'm a little nervous at this point because, you know, we have a monorepo, so the master branch has not only our changes but everybody else's changes. So now we have to cherry-pick and figure out who's done anything related to this. And we have other repos that depend on the changes that we've made, and we really, really don't want to revert. So anyway, this created a major incident: the monolith was not released for, how long? Can you guess? Somebody guess? Sorry, what? No, not five years; that's a different story. It was a month. It was a month-long incident. We were actually ready on that last day of the incident to roll back. We had it ready. And somehow, miraculously, everything passed, so we didn't have to roll back. Oh, I was going to mention one thing: there were mystery compilation errors. It's an interesting story. One of the first failures was this schema update job failing with a compilation error. What is going on? This worked yesterday. Why is it failing? And usually when we have mystery compilation issues, they're generally tied to third-party dependencies; it means that something changed, and we don't pin our versions, which is both good and bad. It turns out that when we introduced Artifact Registry and some of the Gradle plugins that come with it, we also introduced a new version of Guava. We went from Guava maybe 18, something like that, to something like 30-android. That broke everything. Now that I'm thinking about it, I wish I had a screenshot: Gradle Enterprise came in really handy. You can actually compare two different builds, so we compared the build from a month ago to the current build, and you can compare the dependency versions. That's when we noticed the Guava change: oh, I bet this breaks everything. So we pinned it, and everything works. Just letting you know about useful tools.
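For completeness, a sketch of the kind of pin that resolves this in a Gradle Kotlin DSL build; the exact Guava version shown is illustrative, not necessarily the one Fitbit pinned.

```kotlin
// build.gradle.kts — a sketch of pinning Guava so a transitively introduced upgrade
// (for example, a jump to an -android variant) can't silently change the build.
// The version shown is illustrative.
configurations.all {
    resolutionStrategy {
        force("com.google.guava:guava:18.0")
    }
}
```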

All right, so we're sort of done with Maven. We had some interesting incidents, and hopefully we'll learn some things from them, but the show must go on. We've only migrated Maven; we still have Docker, Python, Go, and Debian, and 2,000 repositories. We still have to implement the workarounds that we mentioned to get things going. And the one big lesson out of all of this is: do not underestimate legacy code. You think it's easy, but it just bites you in the backside. Thank you. And if anybody has questions.