Valera Zakharov talks about how addressing tech debt to maintain a healthy codebase enables their software engineers to ship software faster to customers. Learn how to aggregate code analysis data from sources like static code analysis, style checks, and linting into a single, usable metric.
All codebases have technical debt. Sometimes developers plan to clean it up later. Other times they act on the urge to go on a refactoring binge. Addressing technical debt can be rewarding and useful. But how can we make sure that it is not simply left to the whim of a good Samaritan? And can such work be consistently seen and encouraged by engineering leadership?
This talk provides the formula for building a code health score for any codebase, touches on the nuances of its implementation on Android/iOS, and shares lessons learned about integrating this tool into an engineering process.
Valera Zakharov leads the mobile developer experience team at Slack. Prior to Slack, he led the development of Espresso at Google and contributed to the infrastructure that runs hundreds of Android tests per second. He is passionate about building infrastructure that makes the lives of developers more pleasant and productive and loves to present on this topic.
Generating a health score for your codebase provides a valuable addition to a comprehensive Developer Productivity Engineering practice. In combination with Gradle Enterprise’s Performance Acceleration, Performance Continuity, and Failure Analytics technologies, code health scores promote consistent, faster, and more reliable builds.
Interested in addressing tech debt and improving build quality? Try these next steps:
- Read how the Micrometer OSS project used Gradle Enterprise to reduce tech debt by speeding up builds with the Build Cache and improving build reliability.
- Check out this short video on how Gradle Enterprise’s Failure Analytics can keep your builds healthy and reliable.
- Sign up for a live, hands-on, instructor-led training on build/test performance acceleration to tackle tech debt. The Gradle and Maven Build Cache Deep Dive training talks about how to accelerate builds and tests.
Valera Zakharov: Rooz is doing an amazing job here, and he just knows everyone
here. Raise your hand if you haven't met Rooz. See, no one is raising their
hand. All right, it loaded up. We actually had a little glitch on my way here.
I was getting nervous about whether I was going to make it on time because, you
know, Muni, San Francisco Muni, is not the most reliable thing; I ended up
taking an Uber anyway. One of my friends, Nick, who may be in the audience,
volunteered to give the presentation instead of me if I didn't make it. He's
never seen the slides. I thought it was a great idea. But eventually I did
make it. I'm here to talk to you about Code Health Score, which, by the way,
is a concept, but it's also an implementation; I'm going to talk about how we
implemented this concept. So both things are true. All right, let's get
started.
Before we dive in, though, let's talk about technical debt. Raise your hand if
you've worked in a code base that has technical debt. All right. See those
people who are not raising their hands? They've never worked on a code base, or
they just don't want to raise their hand, which is totally fine too. I don't
think I would. But yeah, technical debt exists. Some people say that whenever
code is merged into the main branch, it instantly becomes technical debt. The
best repo? No code. Right. But unfortunately, we're not here to clean up
technical debt and have a perfect code base. At the end of the day, we're here
to ship features to users. I started putting this at the beginning of a lot of
my presentations that I
give internally at Slack. I work on the mobile developer experience team, and
I think teams like ours, developer productivity and infrastructure teams,
sometimes forget that at the end of the day, everything we do is in service of
this: our developers ship features to customers. These customers pay us money.
Or, you know, we mine their data and sell it to advertisers, but you know what
I mean. It's really important to remember that everything we do is in service
of that. Fortunately for us as developers, this is also true: a healthy code
base allows us to ship features to customers faster and safer. There are many
takes on this. I think Facebook's is now "move fast with good infrastructure";
it used to be "move fast and break things," but they finally understood that
it's really important to go fast and safe. So both things are true. I just
wanted to show you more pictures of
messy kitchens. I was thinking about what kind of analogy I could make here.
Imagine yourself cooking in a kitchen like this, and doing it quickly, like
you're trying to cook breakfast for your kids in the morning. It would be
absolutely impossible. Unfortunately, though, code bases are not exactly
kitchens, and the mess is not always evident. In a kitchen, you walk in and
see the mess right away. In a code base, you only really see it when you work
in it: you're trying to build and ship a feature, you encounter some scary
class file, you go on a two-week refactoring binge, and then your manager asks
you, where's that feature you were supposed to ship? And you're like, well,
I'm sorry, but here's a really cool abstraction layer that I built. Not that
this ever happened to me. Anyway, where was I going with this? The point is:
it's not evident, especially for people who are not regularly in the code
base, for example engineering leadership, what the scale of the technical debt
problem is, and whether we're getting worse or better. It's not like a bunch
of dirty dishes on a counter.
So what if? Oh yeah, this is the first time I'm giving this presentation, by
the way. The first presentation I'm giving since the pandemic. It's really
awesome to actually have a live audience; the energy is an amazing feeling.
Anyway, unfortunately, ad hoc cleanup doesn't really work. There are good
Samaritans who once in a while will go address some technical debt and do a
refactoring, and there are maybe whole teams dedicated to this. Maybe this
would work at a scale of two or three developers, but at a certain scale it
just doesn't work. We cannot rely on ad hoc cleanup. That's why professional
kitchens clean as they go: if they didn't, they just wouldn't be able to do
what they do. So what if we could quantify this problem with a single score?
I don't know why we have a decimal point there. It's really weird; we like to
be really precise at Slack. We could break the score down into sub-scores that
tell us what the problems in our code base are. We could show trends over time
as changes are merged into the main branch: how is the score changing, is our
technical debt getting worse or better? And we could show those trends for the
sub-scores too. Then we could show you the worst files in our code base.
Everyone knows those files, right? Ours was called MessagesFragment for a long
time, because we do messages in Slack. And, since we don't have the luxury of
Elon Musk reviewing all of our code.
Yes, I'm so sorry for my Twitter friends. Anyway, what if we could show you on
your PR a comment that said, hey, you might actually be making the technical
debt problem worse, or you might be improving it. Yay. And by the way, this
concept (I did not invent it, I wish I did) was brought to Slack by, I
believe, an ex-Twitter engineer, Ryan Greenberg, who is also a very funny
person. You can follow him on Twitter. He does not talk about health score; he
mostly makes dad jokes, just like me. He's a dad. I also have to say that
there are third-party solutions out there, and we're actually very big on
using third-party software, including Gradle Enterprise. If we can pay money
to solve a problem, we usually prefer to go that way. With Health Score,
though, we decided to implement our own solution. But
if some of these work well for you, you know, that's great.
So now let's talk about the implementation, what health score might look like
in a nutshell. You count the problems, you weigh them by how bad they are, and
then the score is zero minus the sum of the weighted problems. So the score is
negative, and the best you could ever do is zero. A naive implementation might
look something like this: you examine every file in the code base, and for
each check, the sub-scores, you return the weighted score. You aggregate these
by file, sum it all up, upload it to a back end, and display it in a pretty
dashboard. Data pipelines: I feel like 90% of what I do is this, actually.
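A loop like that can be sketched in a few lines of Python. This is a hypothetical reconstruction based on the description above, not Slack's actual tooling; the class names, the 500-line default limit, and the `*.kt` glob are all assumptions.

```python
from pathlib import Path

class Check:
    """A health score check: given a file path, return a penalty <= 0."""
    def calculate_score(self, path: Path) -> int:
        raise NotImplementedError

class FileLengthCheck(Check):
    """Penalize files over a line limit: one weighted point per extra line."""
    def __init__(self, limit: int = 500, weight: int = 1):
        self.limit = limit
        self.weight = weight

    def calculate_score(self, path: Path) -> int:
        num_lines = len(path.read_text().splitlines())
        return -max(0, num_lines - self.limit) * self.weight

def health_score(root: Path, checks: list[Check]) -> dict[str, int]:
    """Crawl every file and aggregate weighted penalties per file.
    The overall score is 0 minus the sum of weighted problems,
    i.e. simply sum(result.values())."""
    return {
        str(path): sum(check.calculate_score(path) for check in checks)
        for path in root.rglob("*.kt")
    }
```

Note that every check can only subtract, so the best possible total is zero, matching the formula above.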
Clicker. Here we go. So I have to show a little bit of code; there are
developers here. It's a very simple interface: we define a check that just has
a calculate score method, which returns a negative integer given a file path.
And here's a very simple implementation of an actual health score check that
we have in our code base, one that checks for a file length limit. If you're
over the limit, it returns a penalty for how many lines you're over, because
we found that very large files usually correlate with unhealthy code. There's
a slight problem, though, with this very naive take. By the way, the naive
take is fine; nothing wrong with it. We actually had it for a while, and even
if you just start with that, it's really worth doing. But we found over time
that there were certain issues. For example, iOS developers and Android
developers would have to go edit some Python code and figure out how to deploy
the new health score. So there was a bit of a barrier to maintaining health
score and introducing new checks. The solution for this, of course, is a
declarative configuration file: YAML or JSON or whatever you prefer. And
we found a lot of our scores are basically just checking for usage of certain
code patterns in our code base, regexes applied line by line. So we pulled
that out into a configuration file and implemented it like this: you go
through every line, apply the regex, get the count, and weigh it. Super simple
stuff. And I have to say, if you did just this (and this is really easy to
implement), it will take you a really long way. In fact, on iOS we're still
solely using this method right now, and there hasn't been a strong need to do
anything else. It works really well.
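Here is a minimal sketch of what a declarative, regex-driven usage check could look like. The config keys (`name`, `pattern`, `weight`) and the example patterns are invented for illustration; the talk doesn't show Slack's actual schema.

```python
import re

# Hypothetical declarative config; in practice this would live in YAML or JSON
# so that Android/iOS developers can add checks without touching Python.
USAGE_CHECKS = [
    {"name": "disabled_test", "pattern": r"@Ignore\b", "weight": 50},
    {"name": "suppressed_lint", "pattern": r"@Suppress\(", "weight": 10},
]

def score_file(text: str) -> dict[str, int]:
    """Go through every line, apply each regex, and weigh the match count."""
    scores = {}
    for check in USAGE_CHECKS:
        regex = re.compile(check["pattern"])
        count = sum(1 for line in text.splitlines() if regex.search(line))
        scores[check["name"]] = -count * check["weight"]
    return scores
```

Because the checks are plain data, adding a new one is a config change rather than a code deployment, which is exactly the barrier the talk describes removing.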
But can we do better than regex matching? Some astute listeners might notice
that string matching is, come on, a little bit simplistic. There are static
analysis tools out there that work on the abstract syntax tree, so you can
work with concepts like methods and classes, basically compiled code. An
example of this is Android Lint. Android Lint ultimately produces its output
as either an XML file or a SARIF file, which is basically just a blob of JSON
that tells you what the violation is and the file path the violation is in.
Our first version of health score on Android, unlike iOS, was built entirely
on top of Lint: we would take that XML or SARIF output, and we had a Gradle
task that read it in and produced the health score. The advantage of this is
that it's really easy to wire up dependencies (Lint has to run, and then your
health score task has to run), and there's a really strong ecosystem around
Lint; for example, Lint warnings show up in the IDE. The problem for us,
though, was that introducing new checks for Android was even harder than
writing new Python. You actually had to write a Lint rule. Raise your hand
here if you have written a Lint rule. Okay, few people. I'm sure it's not that
bad, but somehow it's still a barrier to entry that most developers didn't
really want to overcome. Also, Gradle happens to not work on iOS, so that was
an issue too: as a centralized team maintaining this tool, we now had to
maintain two different versions for Android and iOS.
So our second take on including Lint in health score integrated the output
into our existing model. We wrote a new health score check (called Better
Check) that, instead of opening up every file in the code base, just took the
SARIF file, plus some configuration that told it which Lint violations to look
for and how to weigh them. I don't expect you to read this block of code, but
essentially we can now produce the same kind of output that our normal health
score expects, just by parsing the SARIF file. And this is actually really
cool, because with Lint you can turn certain warnings into errors and prevent
them from entering the system. But most of the time you don't want to make
that noise on PRs, so you usually have a baseline file where you ignore all
the existing violations and then try to not make things worse. The cool thing
about health score and Lint integrating together is that now you can actually
track and burn down that baseline file, eventually burn Lint warnings down to
zero, and then turn them into errors. This system works really well.
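For the SARIF side, a parser along these lines would do. The `runs`/`results`/`ruleId`/`physicalLocation` fields come from the SARIF JSON schema, but the rule names and weights here are made up, and Slack's real check surely handles more edge cases.

```python
import json
from collections import defaultdict

# Hypothetical weights for the Lint rules health score cares about;
# any rule not listed here is ignored.
RULE_WEIGHTS = {"DeprecatedApi": 5, "ContentDescription": 20}

def scores_from_sarif(sarif_path: str) -> dict[str, int]:
    """Turn Lint's SARIF output into the same per-file penalties
    that the regular file-crawling checks produce."""
    with open(sarif_path) as f:
        sarif = json.load(f)
    scores: dict[str, int] = defaultdict(int)
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            weight = RULE_WEIGHTS.get(result.get("ruleId"))
            if weight is None:
                continue  # Lint violation not tracked by health score
            location = result["locations"][0]["physicalLocation"]
            scores[location["artifactLocation"]["uri"]] -= weight
    return dict(scores)
```

Because the output has the same per-file shape as the other checks, it can be merged straight into the existing aggregation and dashboards.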
Another problem with the simplistic approach: in a large codebase (I don't
claim that Slack's code base is the largest in the world by any means),
examining each file is costly. This is a pretty easy problem to solve. In our
configuration for health score, we tell it which directories to look at and
which to avoid, and each health score check also has a filter for which files
it applies to. And yeah, again, another block of text of how this is
implemented. As I said, this makes it much faster to crawl the code base. By
the way, it also works well in a monorepo situation, where you can say: look
at only this path for Android and this one for iOS. With this approach, it
takes about 44 seconds for our health score to run on our medium-sized Android
code base with 16,000 Kotlin files and 1.3 million lines of code. Maybe not
the fastest in the world, but pretty decent. So far we haven't felt a strong
need to optimize this further and go rewrite things in Rust or something like
that. Next problem with the naive approach. But, Dad, I didn't
make this mess. Why do I have to clean it up? If there are any parents out
there, they will know. Here we apply the concept of code ownership. You could
tell everyone that health score is everyone's problem, don't make a fuss,
clean it up; you can be that strict parent. But in a sufficiently large
organization, teams will prefer to burn down their own technical debt first
before going out and helping others. It's also really useful to see how health
score aggregates by team. So our solution is essentially a mapping: every
single file in our code base must be matched by a regex that tells us which
team it belongs to. And by the way, unowned is a valid mapping. We do want all
of our code to be owned eventually, but we don't have to start with that; we
can begin with some code marked unowned, as long as the vast majority of it is
owned. What that gives us, as you can see on the right side, is tech debt by
team. This is also very useful for other tooling, by the way, like
auto-assigning bugs or tickets to teams. And this is powerful because there
may be teams out there who just have a lot of legacy code to deal with, and
they may be understaffed. This gives you the organizational visibility to say:
okay, this team is understaffed, look at all the stuff they have to deal with.
With that code ownership metadata added to every file in health score, we can
now show the same dashboard broken down by team, and engineering managers for
that team can go examine it as they wish.
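The ownership mapping might be sketched like this. The path patterns and team names are invented; the talk only specifies that every file must map to a team or to unowned, so the real mapping at Slack is presumably much larger.

```python
import re
from collections import defaultdict

# Hypothetical regex -> team mapping; first match wins.
OWNERSHIP = [
    (r"^features/messaging/", "messaging"),
    (r"^features/search/", "search"),
    (r"^infra/", "mobile-devx"),
]

def owner_of(path: str) -> str:
    """Map a file path to its owning team via the first matching pattern."""
    for pattern, team in OWNERSHIP:
        if re.match(pattern, path):
            return team
    return "unowned"  # a valid mapping, but one we want to shrink over time

def debt_by_team(file_scores: dict[str, int]) -> dict[str, int]:
    """Aggregate per-file penalties into per-team totals for the dashboard."""
    totals: dict[str, int] = defaultdict(int)
    for path, score in file_scores.items():
        totals[owner_of(path)] += score
    return dict(totals)
```

The same `owner_of` lookup can be reused for the other tooling mentioned, like auto-assigning bugs or tickets to teams.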
As I alluded to, we can't prevent the mess in the first place; code review is
not going to catch everything. But we can help developers by showing them a
diff on the PR, and we had two different versions of this. We are actually in
the process of rewriting it right now; Lynne, who is in the audience, may
actually be working on it as I speak. In the first version, we were concerned
with runtime, so we only applied it to files that were changed: basically,
getting a git diff of modified, added, and deleted files, running the health
score on the local version of each file and on the main branch version, and
then posting a comment saying that there are diffs. The problem with this
approach is that, first of all, it didn't work for our Lint checks. It works
well in the model where you only have usage checks and everything operates on
files checked into source control, but our SARIF file, as you remember, is not
checked into source control, so this model did not work for that at all. It
also wouldn't show diffs when you're changing the health score recipe itself;
say you change some weights or introduce a new health score check, you
wouldn't get a PR comment for that. So in our second approach, on every main
branch build we store the health score output as a JSON blob. A nice thing
about our CI tool, BuildKite, is that it makes it really easy to retrieve
artifacts from other builds. So on a dev build, we can download the baseline
from the latest main build, run the health score locally, run a differ on the
two things, and then post a comment to GitHub. Same output, just a slightly
different approach. We hope this is going to work better for us.
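The diffing step of that second approach might look like this. Downloading the baseline artifact from BuildKite and posting the comment to GitHub are elided, and the flat path-to-score JSON shape is an assumption.

```python
def diff_scores(baseline: dict[str, int], local: dict[str, int]) -> dict[str, int]:
    """Per-file deltas between the main-branch baseline and the local run.
    Positive delta means the PR improved health score; negative means it
    regressed. Files present on only one side count from zero."""
    deltas = {}
    for path in set(baseline) | set(local):
        delta = local.get(path, 0) - baseline.get(path, 0)
        if delta != 0:
            deltas[path] = delta
    return deltas

def format_comment(deltas: dict[str, int]) -> str:
    """Render the body of the PR comment summarizing the change."""
    total = sum(deltas.values())
    verdict = "improves" if total >= 0 else "regresses"
    lines = [f"This PR {verdict} health score by {total:+d} points:"]
    lines += [f"- `{path}`: {delta:+d}" for path, delta in sorted(deltas.items())]
    return "\n".join(lines)
```

Because both sides are just score blobs, this scheme handles Lint-derived scores and recipe changes equally well, which is what the file-by-file diff could not do.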
The next two problems are not technical at all; they're organizational, if you
will. First of all, how do we get started? How do we align the health score
with our current priorities? The first thing you'll want to do is gather the
group of engineers who work on that code base and try to come up with a V1 of
what kinds of things you think are problematic. It may take time to align; I
wouldn't overindex on getting consensus from everyone. The important thing is
to get started, get a dashboard out there, start looking at it actively, and
then you'll notice things that are off and adjust them as needed. Some ideas
of what we track in our mobile code bases: disabled tests and Lint warnings,
deprecated API usage, accessibility violations (the last two come from Lint),
and the number of files in our app module, which is this monolithic module
that always compiles at the end, so it's a problem for build times. And for
large migrations, when we were migrating away from Objective-C to Swift
(thankfully we're 100% Swift now), we used Health Score to track that
migration. Some thoughts on weighing: how should we weigh things? This is
something I talked to Ryan Greenberg about, and he gave me this advice. First,
you want to consider how bad this thing is: how much do you care about it?
Second, and this is not very intuitive, how hard is it to fix? Imagine you
assign a lot of weight to something that is super trivial and a small weight
to something that may take days or weeks to fix. Well, guess what: naturally,
engineers are going to gravitate to burning down the easy thing first, which
is not necessarily a bad thing, but you have to think about it. And the last
one has to do with current priorities: how much do you want people to fix it
right now? Ultimately, whatever formula you come up with, you're going to want
to put it through this litmus test: get the 50 files that are considered the
worst by your recipe, and engineers in your company should be able to look at
those files and go, yeah, I know that file, it is bad. If you're not getting
that reaction, consider going back to the drawing board, re-weighing things,
maybe introducing new checks.
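That litmus test is easy to automate once you have per-file scores; a trivial sketch:

```python
def worst_files(file_scores: dict[str, int], n: int = 50) -> list[tuple[str, int]]:
    """The n lowest-scoring (most indebted) files, worst first. If engineers
    don't recognize these as the genuinely bad files, re-weigh the recipe."""
    return sorted(file_scores.items(), key=lambda item: item[1])[:n]
```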
And the last tricky bit is the moving target problem. When we launched Health
Score on Android, we didn't really lock down the set of Lint violations we
took into consideration for the health score. So as new Lint violations and
warnings came in, the target kept moving. This is an issue organizationally
because a lot of companies, probably including yours, operate on a quarterly
planning cadence. Teams at Slack will commit to saying, oh, I'll burn down
10,000 points this quarter. If you keep moving the target, they're going to
get really annoyed, because guess what: they may have burned things down, but
the score didn't improve. Our take on this is that we maintain two versions of
health score. There's the current one, which is frozen for the quarter. And
then we have the next one running at the same time, going through a different
data pipeline; that one we can experiment with and tune during the quarter.
Once the new quarter begins, which was very recently, we switch things over,
and the new version becomes the new dashboard. At the end of the day, a health
score is just a tool. I borrowed this slide from Ryan Greenberg's
presentation; at the time he was putting it together, maybe three or four
years ago, that hammer cost $10 at Home Depot. Now it's $15. Inflation. But
yeah, if we don't use the tool, it's kind of worthless.
So here are some tips on how to actually integrate this into your
organization. First of all, weekly reports really help. We do show PR diffs,
but as an organization we might be interested in who the top contributors to
improving health score were during the week, or we might have some
conversations: oh, this PR regressed health score, what's going on there?
Managers, or other people who care, can peek in and ask questions. At the
beginning, it really helps to have engineering leadership engaged. This is
very simple and probably takes about 5 minutes: one of our VPs of engineering
would just drop into the dashboard on a regular basis, maybe notice a few
things, and ask questions like, oh, the score changed from this to this, do
you feel that accurately reflects the progress? Just that little bit of
attention signals to the organization that this is important, that you should
allocate cycles to tackle this problem. You can also lean into existing
programs. That same person, Peter, started a better engineering program, which
is just a way of saying that during the quarter you should do some work to pay
back technical debt and make your team more productive, automate something.
Everyone in the company needs to do this; it's not just up to the tooling
teams. And the cool thing is that you can use health score as a vehicle for
this: you can tell people, if you don't have any ideas for how to improve your
productivity, you can always improve health score. So you could plug it in
there and get things going that way. I already alluded to this, but you can
also use health score to track large migrations. We had a project called
Duplo, which modernized and modularized our mobile code bases, and health
score was a big vehicle for tracking progress there. And one of the engineers
on iOS had a pretty cool idea: sometimes you just want to track things, but
you don't want them to impact the score. I forgot to make a slide about this,
but he created a thing called Insights: essentially, you use the health score
data pipeline to collect data without actually adding it to the score. And we
haven't fully gotten there, but it's a score, and there are points, and a lot
of people like to play games. What if we turned it into a game? What if we had
a leaderboard? What if, at our all-hands meetings, we rewarded people by
mentioning their names and saying, hey, they're the top contributor to burning
down tech debt this quarter? What if we gave them swag? These are all concepts
I have in my head that we haven't actually implemented yet, but I think it
should be pretty easy to do so.
At the end of the day, I want to leave you with this: I think we, and most
developers out there, care about code health because we want to ship features
faster. And this is not just a statement. We run a quarterly survey on
developer experience; we've been doing it since 2018, and in 2020 we started
asking a question: are you slowed down by technical debt? As you can see in
this callout, it is consistently our lowest-scoring question. Engineers feel
the pain, but also, through our investments, you can see that things are
actually improving over time. We all want to live in a healthy code base and
have a pleasant and productive experience. We all want to work in a kitchen
like this, not the kitchen I showed at the beginning of the presentation.
Thank you.