Speaker Landing - Developer Productivity Engineering Summit 2024

What’s inside?
Valera talks about how addressing tech debt to maintain a healthy codebase enables their software engineers to ship software faster to customers. Learn how to accumulate code analysis data from sources like static code analysis, style checks, and linting into a single, usable metric.

Description
All codebases have technical debt. Sometimes developers plan to clean it up later. Other times they act on the urge to go on a refactoring binge. Addressing technical debt can be rewarding and useful. But how can we make sure that it is not simply left to the whim of a good samaritan? And can such work be consistently seen and encouraged by engineering leadership?

Following the trope of “you can’t improve what you don’t measure,” Slack employs a code health score: an empirical measure that provides visibility into cumulative and team-based technical debt to address these questions. But measurement alone isn’t enough. It is equally important to integrate these measurements into the day-to-day engineering culture of the team.

This talk will provide the formula for building a code health score for any codebase, touch on the nuances of its implementation on Android/iOS, and share lessons learned about integrating this tool into our engineering process.

About Valera
Valera Zakharov leads the mobile developer experience team at Slack. Prior to Slack, he led the development of Espresso at Google and contributed to the infrastructure that runs hundreds of Android tests per second. He is passionate about building (and presenting about!) infrastructure that makes the lives of developers more pleasant and productive.

Valera Zakharov: Rooz is doing an amazing job here, and he just, like, knows everyone here. Like, I don't know. Raise your hand if you haven't met Rooz. See, no one is raising their hand. Anyway. Um. All right. It loaded up. We had a little glitch, actually, on my way here. I was getting a little nervous about whether I was going to make it on time because, you know, Muni, San Francisco Muni, is not the most reliable thing. I ended up taking an Uber anyway. But one of my friends, Nick, who may be in the audience here, volunteered to give the presentation instead of me if I didn't make it. He's never seen the slides. I thought it was a great idea. But eventually I did make it. I'm here to talk to you about Code Health Score, which, by the way, is a concept, but it's also an implementation. I'm going to talk about how we implemented this concept. So both things are true. All right, let's get started. Before we dive in, though, let's talk about technical debt. Raise your hand if you've worked in a code base that has technical debt. All right. See those people that are not raising their hand? They've never worked on a code base, or they just don't want to raise their hand, which is totally fine, too. I don't think I would. But yeah, technical debt exists. Some people say that whenever code is merged into the main branch, it just instantly becomes technical debt. Best repo: no code. Right? But yeah, unfortunately, though, you know, we're not here to clean up technical debt and have a perfect code base, right? At the end of the day, we're here to ship features to users. And I started putting this at the beginning of a lot of the presentations that I give internally at Slack. I work on the mobile developer experience team, and I think sometimes these kinds of teams, like developer productivity and infrastructure teams, kind of forget that at the end of the day, everything that we do is in service of this.
You know, we ship, I mean, our developers ship features to customers. These customers pay us money. Or, you know, we mine their data and sell it to advertisers, but you know what I mean. But it's really important to have features, like, everything that we do is in service of that. Fortunately, though, for us as developers, this is also true: a healthy code base allows us to ship these features to customers faster and safer. There are many takes on this. I think Facebook, you know, what is it, move fast with good infrastructure? It used to be break things, but they finally, you know, understood that it's really important to do it fast and safe. Yeah. So both things are true. I just wanted to show you more pictures of messy kitchens. I was thinking, like, what kind of analogy could I make here? And just, like, imagine yourself cooking in a kitchen like this, right? And doing it quickly. Right? Like, I don't know, trying to cook breakfast for your kids in the morning or something like that. It would be absolutely impossible. Unfortunately, though, code bases are not exactly kitchens, and it's not always evident. In the kitchen, you walk in, you can see the mess right away. In a code base, it's not really, you know, I mean, you see it when you work in it, right? When you're trying to build a feature and you're trying to ship it and, you know, you encounter some scary class file and you go on a two-week refactoring binge, and then your manager is, like, asking you, where's that feature that you were supposed to ship? And you're like, well, I'm sorry, but here's a really cool abstraction layer that I built. Not that this ever happened to me, but. Anyway. Where was I going with this?
It's not evident, especially for people who are not regularly in the code base, for example, engineering leadership, what the scale of the technical debt problem is and, really, whether we're getting worse or better. It's not like a bunch of dirty dishes on a counter. So what if? Oh, yeah. This is the first time I'm giving this presentation, by the way. First time I'm giving a presentation since the pandemic. Really awesome to actually have a live audience. Amazing feeling, like, the energy. But anyway, unfortunately, though, ad hoc cleanup doesn't really work. There are good Samaritans who once in a while will go and address some technical debt and do a refactoring. There are maybe whole teams dedicated to this. Maybe this would work at a scale of, like, two or three developers, but at a certain scale, it just doesn't work. We cannot rely on ad hoc cleanup. And that's why, you know, for example, professional kitchens, right? They clean as they go. If they didn't clean as they go, they just wouldn't be able to do what they do. So what if we could quantify this problem with a single score? I don't know why we have a decimal point there. It's really weird. We like to be really precise at Slack. And we could break the score down into sub-scores that actually kind of tell us what the problems are in our code base. And we could show trends over time, as changes are merged into the main branch, of how the score is changing. Is our technical debt getting worse or better? And we could also show those trends for the sub-scores. Then we could show you the worst files in our code base. Everyone knows those files, right? I don't know. Ours was called Messages Fragment for a long time. Because we do messages in Slack. And what if, you know, since we don't have the luxury of Elon Musk reviewing all of our code. Yes. I'm so sorry for my Twitter friends. Anyway.
What if we could show you, on your PR, a comment that said, hey, you might actually be making the technical debt problem worse, or you might be improving it. Yay. And by the way, I did not invent this concept. I wish I did. It was brought to Slack by, I believe, an ex-Twitter engineer, Ryan Greenberg, who is also a very funny person. You can follow him on Twitter. He does not talk about health score. He mostly makes dad jokes, just like me. He's a dad. And I have to also say that there are third party solutions out there, and at Slack, actually, we're very big on using third party software, including Gradle Enterprise. If, you know, we can pay money to solve the problem, we usually prefer that way for some reason. With Health Score, though, we decided to implement our own solution. But if some of these work well for you, you know, that's great. So now let's talk about the implementation, what health score might look like in a very simple kind of nutshell. You count the problems. You weigh them by how bad they are. And then basically the score is zero minus the sum of the weighted problems. So the score is negative, and the best you could ever do is zero. And a naive implementation of this might look something like this. You examine every file in the code base, and then for each check, you know, the sub-scores, you return the weighted score. You aggregate these things by file and then sum it all up, upload it to a back end and display it in a pretty dashboard. Data pipelines. I mean, I feel like 90% of what I do is this, actually. Clicker. Here we go. So I have to show a little bit of code. You know, there are developers here. Very simple interface, right? We define a check that just has a calculate score method. It returns a negative integer given a file path. And here's a very simple implementation of an actual health score check that we have in our code base that basically checks for a file limit.
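The formula and check interface described above can be sketched roughly as follows. This is a minimal illustration, not Slack's actual code: the names `Check`, `FileLengthCheck`, and `health_score`, the 500-line default limit, and the weight of 1 point per extra line are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from pathlib import Path


class Check(ABC):
    """One health score check. Scores are always zero or negative."""

    name: str

    @abstractmethod
    def calculate_score(self, path: Path) -> int:
        """Return the (non-positive) weighted score for a single file."""


class FileLengthCheck(Check):
    """Penalizes each line over a limit. Limit and weight are assumed values."""

    name = "file_length"

    def __init__(self, max_lines: int = 500, weight: int = 1):
        self.max_lines = max_lines
        self.weight = weight

    def calculate_score(self, path: Path) -> int:
        with path.open(encoding="utf-8", errors="ignore") as handle:
            line_count = sum(1 for _ in handle)
        lines_over = max(0, line_count - self.max_lines)
        return -self.weight * lines_over


def health_score(root: Path, checks: list[Check]) -> tuple[int, dict[str, int]]:
    """Naive approach: run every check on every file, aggregate by file, sum."""
    per_file: dict[str, int] = defaultdict(int)
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        for check in checks:
            per_file[str(path)] += check.calculate_score(path)
    return sum(per_file.values()), dict(per_file)
```

The per-file dictionary is what would be uploaded to a back end and charted. The total is simply its sum, so the best possible score is zero.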
So if you're over the limit by a number of lines, it basically returns the penalty for how many lines you're over because, you know, we found that usually very large files correlate with unhealthy code. But there's a slight problem, you know, with this very naive take. By the way, the naive take is fine. Nothing wrong with it. We actually had it for a while. Even if you just have that and you start with that, it's really worth doing it. But we found over time that, um, you know, there are certain issues. For example, here, you know, developers, iOS developers, Android developers, would have to go and edit some Python code and figure out how to deploy the new health score. So there was a bit of a barrier to maintaining health score and also introducing new checks. So of course, the solution for this is to make a declarative, you know, configuration file. I don't know, YAML or JSON or whatever you prefer. And we found a lot of our scores are actually basically just checking for usage of certain regex-like code patterns in our code base and doing so by line. So we pulled that out into a configuration file and, you know, implemented it like this. Basically, you can just go through every line and apply the regex, and then you get the count and you weigh it. Super simple stuff. And I have to say, if you did just this, and this is really easy to implement, this will take you a really long way. In fact, on iOS, you know, we're still solely using this method right now, and there hasn't been a strong need to actually do anything else. So this works really well. But can we do better than regex matching? Because some of, you know, the astute listeners might notice that. Well, you know, string matching. Come on. That's a little bit, you know, simplistic. There are static analysis tools out there that work on the abstract syntax tree.
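A config-driven usage check like the one just described might look like the sketch below. The configuration keys (`usage_checks`, `name`, `pattern`, `weight`), the two example patterns, and their weights are hypothetical. JSON is used here only because it parses with the Python standard library; YAML would work the same way.

```python
import json
import re

# Hypothetical declarative config: each usage check is a regex plus a weight.
CONFIG_JSON = r"""
{
  "usage_checks": [
    {"name": "disabled_test", "pattern": "@Ignore\\b", "weight": 5},
    {"name": "todo_comment", "pattern": "\\bTODO\\b", "weight": 1}
  ]
}
"""


def load_usage_checks(config_text):
    """Compile each configured pattern once, up front."""
    config = json.loads(config_text)
    return [
        (entry["name"], re.compile(entry["pattern"]), entry["weight"])
        for entry in config["usage_checks"]
    ]


def score_lines(lines, checks):
    """Apply every regex to every line; each matching line costs `weight` points."""
    scores = {name: 0 for name, _, _ in checks}
    for line in lines:
        for name, pattern, weight in checks:
            if pattern.search(line):
                scores[name] -= weight
    return scores
```

With this shape, adding a new check means adding one entry to the config file rather than writing and deploying new Python.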
So you can work with concepts like methods and classes and, you know, basically compiled code. And an example of this is Android Lint. Android Lint ultimately produces its output in either an XML file or a SARIF file, which is basically just a blob of JSON that tells you what the violation is and tells you the file path that the violation is on. And our first version of health score on Android, actually, unlike iOS, was built entirely on top of Lint, so we would basically take that XML or SARIF output. And we had a Gradle task, essentially, that, you know, read that in and produced the health score. The advantage of this is that it's really easy to wire up dependencies, you know, that Lint has to run and then your health score task has to run. And there's a really strong ecosystem around Lint. For example, Lint warnings show up in the IDE. The problem, though, for us was that introducing new checks for Android was even harder than, you know, writing new Python. You actually had to write a Lint rule. And raise your hand here if you have written a Lint rule. Okay. A few people. I'm sure it's not that bad, but somehow it's still a barrier to entry that most developers didn't really want to overcome. And also, you know, Gradle happens to not work on iOS and, you know, other tools out there. So that was also an issue. As a centralized team, we were maintaining this tool, and we now had to maintain two different versions for Android and iOS. So our second take on including Lint in health score essentially integrated the output into our existing model. We wrote a new health score check called Better Check that, basically, instead of opening up every file in the code base, actually just took the SARIF file, along with some sort of configuration that told you which Lint violations to look for and how to weigh them.
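As a rough sketch of that second approach, the function below reads a SARIF report and turns selected Lint violations into per-file penalties. It relies only on the standard SARIF layout (`runs[].results[]`, each with a `ruleId` and `locations[].physicalLocation.artifactLocation.uri`). The function name `score_sarif` and the rule-to-weight mapping are assumptions for illustration, not the check from the talk.

```python
import json
from collections import defaultdict


def score_sarif(sarif_text, rule_weights):
    """Turn SARIF results into per-file penalties.

    `rule_weights` maps a Lint rule id to its weight; rules that are not
    listed are ignored, so the set of tracked violations stays explicit.
    """
    sarif = json.loads(sarif_text)
    per_file = defaultdict(int)
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            weight = rule_weights.get(result.get("ruleId"))
            if weight is None:
                continue  # not a violation this recipe tracks
            for location in result.get("locations", []):
                uri = (
                    location.get("physicalLocation", {})
                    .get("artifactLocation", {})
                    .get("uri")
                )
                if uri:
                    per_file[uri] -= weight
    return dict(per_file)
```

Because the output has the same shape as the file-based checks (a file path mapped to a non-positive score), it can be merged into the same aggregation and dashboard.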
I don't expect you to read this block of code, but essentially we can now produce the same kind of output that our normal health score expects, just by parsing the SARIF file. And this is actually really cool because, you know, with Lint, you can turn certain warnings into errors and prevent them from entering the system. But most of the time, you know, you don't want to sort of make that noise on PRs. So you usually have a baseline file, right, where you kind of ignore all the existing violations and then you sort of try to not make things worse. The cool thing about health score and Lint integrating together is that now you can actually track and burn down that baseline file and eventually burn down Lint violations, Lint warnings, to zero and then turn them into errors. This system works really well. Another problem with the simplistic approach is that in a large codebase, and I don't claim that the Slack code base is the largest in the world by any means, examining each file is costly. This is a pretty easy problem to solve. In our configuration for health score, we essentially tell it which directories to look at and which directories to avoid, and also each health score check basically has a filter applied to these files. All right. And yeah, again, another block of text of how this is implemented. As I said, you know, this makes it much faster to crawl the code base. Oh, by the way, it also would work well in a monorepo situation where you can basically say, oh, look at only this path for Android and this one for iOS. With this approach, it takes about 44 seconds for our health score to run on our, you know, medium-sized Android code base with 16,000 Kotlin files and 1.3 million lines of code. 44 seconds is maybe not the fastest in the world, but it's also pretty decent. So far we haven't felt a strong need to optimize this further and, like, go and rewrite things in Rust or something like that. Next problem with the naive approach.
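The directory and per-check filtering described above could be sketched like this. The helper names, the example directory names, and the use of a glob as the per-check filter are assumptions. The talk only says that the configuration lists directories to look at and avoid, and that each check has its own file filter.

```python
from fnmatch import fnmatch


def is_under(path, directory):
    """True if `path` (a POSIX-style relative path) lives inside `directory`."""
    return path.startswith(directory.rstrip("/") + "/")


def select_files(paths, include_dirs, exclude_dirs):
    """Keep files under an included directory and not under an excluded one."""
    return [
        path
        for path in paths
        if any(is_under(path, d) for d in include_dirs)
        and not any(is_under(path, d) for d in exclude_dirs)
    ]


def files_for_check(paths, glob):
    """Per-check filter, e.g. only hand Kotlin sources to a Kotlin-specific check."""
    return [path for path in paths if fnmatch(path, glob)]
```

In a monorepo, the same functions would be called once with the Android path in `include_dirs` and once with the iOS path, so each platform's score only crawls its own tree.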
But, Dad, I didn't make this mess. Why do I have to clean it up? I don't know if there are any parents out there. They will know. And here we apply a concept of code ownership. So, you know, you could kind of say, you know, health score is everyone's problem, and don't make a fuss, clean it up. You can be like that strict parent, but in a sufficiently large organization, you know, teams will prefer to burn down their own technical debt first before maybe going out and helping others. And it's also really useful to see how health score kind of aggregates by different teams. So our solution to this is essentially that we have a mapping. Every single file in our code base must be mapped by a regex that basically tells us which team it belongs to. And by the way, a valid mapping is that it's unowned. You know, we do want all of our code to be owned eventually, but we don't have to start with that. We can basically say we have unowned code, and no one is going to solve that problem. But, you know, the vast majority of it is owned. And what that gives us, as you can see on the right side, is basically that now we can have tech debt by team. This is also very useful for other tooling, by the way, like auto-assigning bugs or tickets to teams. But it's really cool to see. This is powerful because, you know, there may be teams out there who just have a lot of legacy code that they have to deal with. And they may be understaffed. They may not have enough people. This gives you the sort of organizational visibility to say, okay, this team is understaffed. Look at all the stuff they have to deal with. With the code ownership filter, that metadata that we add in our health score to every file, we can now show you this dashboard but broken down by team. And so engineering managers for that team can go and basically examine it as they wish. As I alluded to, we can't prevent the mess in the first place.
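A minimal sketch of the ownership mapping follows, assuming an ordered list of regex-to-team rules where the first match wins. The paths and team names are made up for the example. The explicit catch-all rule models the point that "unowned" is a valid mapping, while a file that matches nothing at all is treated as a configuration error.

```python
import re
from collections import defaultdict

# Hypothetical ownership rules, checked in order; the last rule is the
# explicit "unowned" catch-all so that every file maps to something.
OWNERSHIP_RULES = [
    (r"^app/messages/", "messaging"),
    (r"^app/calls/", "calls"),
    (r".*", "unowned"),
]


def owner_of(path, rules):
    """Return the team for the first matching rule; unmapped files are an error."""
    for pattern, team in rules:
        if re.search(pattern, path):
            return team
    raise ValueError(f"no ownership rule matches {path}")


def score_by_team(per_file_scores, rules):
    """Roll per-file scores up into one tech debt total per team."""
    totals = defaultdict(int)
    for path, score in per_file_scores.items():
        totals[owner_of(path, rules)] += score
    return dict(totals)
```

The per-team totals are what a team-level dashboard would chart, and the same `owner_of` lookup could be reused by other tooling that routes work to teams.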
Code review is not going to catch everything. But, you know, we can help developers by essentially showing them a diff on the PR, and we had two different versions of this. We actually are in the process of rewriting it right now. Lynne, who is in the audience, may actually be working on it as I speak. But with the first version of this, we were kind of concerned with runtime, and we only applied it to files that were changed. So basically getting a git diff, so modified, added, or deleted files, and we would run the health score on the local version of the file and on the main branch version of the file and then post a comment saying that there are diffs. The problem with this approach is that, first of all, it didn't work for our Lint checks. It works well in that model where you only have usage checks and everything is operating on files that are checked into source control. But our SARIF file, as you remember, is not checked into source control. So this model did not work well for that at all. And it also wouldn't show diffs when you're changing the health score recipe itself. Let's say you change some weights or you introduce a new health score check. You wouldn't get a PR comment for that. So in our second approach, on every main branch build, we now store the health score output, a JSON blob. And the nice thing about, you know, our CI tool, Buildkite, is that it makes it really easy to retrieve artifacts from other builds. So on our dev build, we can download the baseline, which is the latest main build, run the health score locally, run a differ on the two things, and then post a comment to GitHub. So same output, just a slightly different approach. We hope this is going to work better for us. And the next two problems are actually not technical at all. They're more organizational, if you will. So, first of all, how do we get started? How do we align the health score with our current priorities?
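The "differ" step in the second approach can be sketched as below, assuming both the stored baseline and the local run are JSON objects that map file paths to non-positive scores. The function names and the comment wording are illustrative. Downloading the artifact and posting to GitHub are left out, since the talk does not describe those calls.

```python
def diff_scores(baseline, current):
    """Per-file score deltas between the main-branch baseline and this build.

    Positive deltas are improvements (less debt); negative ones are regressions.
    Files missing on one side are treated as having a score of zero.
    """
    changes = {}
    for path in set(baseline) | set(current):
        delta = current.get(path, 0) - baseline.get(path, 0)
        if delta != 0:
            changes[path] = delta
    return changes


def format_comment(changes):
    """Render the deltas as the text of a pull request comment."""
    if not changes:
        return "Health score: no change."
    total = sum(changes.values())
    if total > 0:
        verdict = "improved"
    elif total < 0:
        verdict = "regressed"
    else:
        verdict = "unchanged overall"
    lines = [f"Health score {verdict} ({total:+d} points)."]
    for path in sorted(changes):
        lines.append(f"- {path}: {changes[path]:+d}")
    return "\n".join(lines)
```

Because the comparison runs on full score outputs rather than on changed files, it also picks up Lint-derived scores and changes to the recipe itself, which were the two gaps in the first version.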
The first thing you'll want to do is gather with a group of engineers who work on that code base and try to come up with a V1 of what kind of things you think are problematic. And it may take time to align. I wouldn't overindex too much on getting consensus from everyone. The important thing is to get started, get a dashboard out there, start looking at it actively, and then you'll notice things that are off and adjust them as needed. So, some ideas of what we track. At Slack, in our mobile code bases: disabled tests, Lint warnings, deprecated API usage, accessibility violations. The last two there come from Lint. The number of files in our app module, which is like this monolithic sort of module that always compiles at the end, so it's a problem for build times. And then, for example, large migrations: when we were migrating away from Objective-C to Swift, thankfully we're 100% Swift now, um, we used Health Score to track that migration. Some thoughts on weighing: how should we weigh things? This is something that I talked to Ryan Greenberg about, and he gave me this advice. So, first of all, you want to consider how bad this thing is right now, like, how much do you care about it? The second thing, and this is not very intuitive, is how hard is it to fix? So imagine if you assign a lot of weight to something that is, like, super trivial and a small weight to something that may take days or weeks to fix. Well, guess what? Naturally, engineers are going to gravitate to burning down the easy thing first, which is not necessarily a bad thing, but you have to think about that. And then the last one, yeah, it has to do with the current priorities: how much do you want people to fix it right now? And ultimately, whatever formula you come up with, you're going to want to put it through this litmus test. You're going to get the 50 files that are considered to be the worst by your recipe and engineers in your company.
You should be able to look at those files and be like, yeah, I know that file. It is bad. If you're not getting that reaction, consider going back to the drawing board. Re-weighing things, maybe introducing new checks. And the last sort of tricky bit, maybe, is the moving target problem. When we launched Health Score on Android, we actually didn't really lock down the Lint violations that we took into consideration for the health score. So as new Lint violations were added and warnings, you know, came in, the target kept moving. And this is kind of, you know, an issue organizationally because, I don't know how many of your companies, but probably a lot of them operate on this kind of quarterly planning cadence. And so teams at Slack will commit to saying, like, oh, I'll burn down 10,000 points this quarter. If you keep moving the target, they're going to get really annoyed because, guess what? They may have burned things down, but the score didn't improve. So our take on this is that we maintain two versions of health score. There's the current one, and that one is frozen for the quarter. And then we have the next one running at the same time, just going through a different data pipeline, and that one we can experiment with. We can tune it during the quarter for the next one. And then once the new quarter begins, which was very recent, we sort of switch things over. That new version becomes the new dashboard. So at the end of the day, a health score is just a tool. I borrowed this from Ryan Greenberg's presentation, and at the time when he was putting together his presentation, maybe like three or four years ago, that hammer cost $10 at Home Depot. Now it's $15. Inflation. But yeah, if we don't use it, then it's kind of worthless. So here are some tips on how to actually integrate this into your organization. First of all, weekly reports really help here.
You know, we do sort of show PR diffs, but as an organization, we might be interested in, like, what were the top contributors to improving health score during the week, or, you know, we might have some conversations: oh, this PR regressed health score. What's going on there? You know, and the managers or other people who care can actually peek in there and ask some questions. At the beginning, it really helps to have engineering leadership engaged here. This is very simple. It probably takes about 5 minutes. This is one of our VPs of engineering, who would just on a regular basis kind of drop in to the dashboard and, you know, maybe notice a few things and ask questions, like saying, oh, the score changed from this to this. Do you feel that this accurately reflects the progress? And just this little bit of attention already, you know, signals to the organization that this is important, that, you know, you should allocate cycles to tackle this problem. You can also lean into existing programs. So at Slack, actually, that same person, Peter, started a better engineering program, which is just a word for saying that during the quarter you should do some work to pay back technical debt, to make your team more productive, you know, automate something. Um, everyone in the company needs to do this. It's not just up to the tooling teams. And the cool thing is that you can kind of use health score as a vehicle for this. You know, you can tell people, well, if you don't have any ideas of how to improve your productivity, you can always improve health score. And so you could plug that in there and get things going that way. I already alluded to this, but you could also use health score to track large migrations. We had a project called Duplo, which modernized and modularized our mobile code bases, and health score was a big vehicle for tracking progress there.
And actually, one of the engineers on iOS had a pretty cool idea. Sometimes you just want to track things, but you don't want it to impact the score. I forgot to make a slide about this, but he created a thing called Insights. So essentially you're using the health score data pipeline to collect data but not actually add it to the score. And we haven't fully gotten there, but it's a score and there are points, and a lot of people here like to play games. What if we turn it into a game? You know, what if we have a leaderboard at our all-hands meetings? What if we reward people by basically mentioning their names and saying, hey, they're the top contributor to burning down tech debt this quarter? What if we give them swag? These are all concepts that I have in my head, but we haven't actually implemented them yet. But I think it should be pretty easy to do so. At the end of the day, I want to leave you with this. I think we all, and most developers out there, care about code health because we want to ship features faster. And this is not just a statement. We run a quarterly survey on developer experience. We've been doing it since 2018, and in 2020 we started asking a question: Are you slowed down by technical debt? And you can see this callout here. It is consistently our lowest scoring question. Yeah. I mean, engineers feel the pain, but also, through our investments, you can see that things are actually improving over time. So I believe that, you know, we all want to live in a healthy code base and we all want to have a pleasant and productive experience. We all want to work in a kitchen like this, not the kitchen that I showed at the beginning of the presentation. Well, thank you.

Code Health Score: How Slack Tracks and Manages Code Tech Debt at Scale

Transcription