A major priority for Meta is to help their development teams move quickly and safely. To do this, they use a standardized metric that quantifies developer velocity and enables them to expose issues like regressions, flaky tests, and problematic areas of the codebase. Learn how Meta leverages DPE with practices like categorizing “diffs” to show code that is most often affected by bugs and incidents, tracking knowledge dispersal within a team, and exposing a team dashboard that helps teams visualize the state of their code, processes, and velocity.
Measuring developer productivity can be hard. Actually, improving developer productivity can be even harder. Meta tackles this challenge at scale with their Productivity Organization, whose focus is on uncovering inefficient processes and removing bottlenecks.
This talk explains how adopting DPE practices enabled Meta to increase developer velocity using a Code Gestation Time (CGT) metric to track, observe and optimize a growing code base across teams. You will also hear real-life stories about the challenges and pitfalls of overly focusing on small metrics while broader developer productivity challenges across teams remain unaddressed.
Karim Nakad is a Software Engineer at Meta, where he works to empower developers to move effortlessly through their codebases. His group focuses on improving developer productivity and experience through code quality and tooling.
Gradle Enterprise customers utilize Gradle Enterprise Build Scan™ to rapidly get deep insights into metrics like build and test performance, flaky tests, external dependencies, failures, and regressions. From there, features like Build Cache reduce build and test times by avoiding re-running code that hasn't changed since the last successful build, and Test Distribution further improves test times (often 90% of the entire build process) by parallelizing tests across all available infrastructure. You can learn more about these features by running a free Build Scan™ for Maven and Gradle Build Tool, watching videos, and registering for our free instructor-led Build Cache deep-dive training.
Check out these resources on keeping builds fast with Gradle Enterprise.
Watch our Build Scan™ Getting Started playlist of short videos to learn how to better optimize your builds and tests and to facilitate troubleshooting.
See how Test Distribution works to speed up tests in this short video.
Sign up for our free training class, Build Cache Deep Dive, to learn more about how you can monitor the impact of code generation on build performance.
Karim Nakad: Bruce is awesome, by the way. I told him I was cold this morning, and he gave me his jacket. And I've been wearing it for three hours. He's fantastic. So, is everyone excited? No, it's not good enough. Let's try again. Everyone excited?
Yeah. All right. There we go. That's what I'm talking about. So my name's Karim. I'm from Meta, from the Productivity Org specifically. We help people move fast and safe, and it's great. Fun quick note about my avatar over there: thank you, Peter, for cropping out the beautiful macaw that was on my shoulder. I feel bad about that one. Also, he nixed a bit of my hair. I don't have a lot of that, so no, I'm not very happy about that one. [chuckle] Alright, big team. Lots of folks. Now I'm gonna go through everyone. Akshay is here. Hi, Akshay. All right. So before we go through any slides, why are we here? We're here for two reasons: A, measuring productivity is hard, and B, finding ways to improve that productivity is equally hard. So today we're gonna talk about measuring productivity. We're gonna talk about a use case with Debbie the developer. We're gonna talk about her velocity, her reliability, her knowledge issues.
We're gonna talk about a few other metrics. We're gonna show you our dashboard, 'cause everyone has a dashboard these days. And then we're gonna talk about what the future holds. So how do we measure productivity? I'm a developer. I like simple solutions. I build software, I design software. I need them to be simple. So I figured, what if we had a single metric? Yes, that's what I was waiting for: all the grimaces in the audience. It's the same face. You're like, oh no. Same face you make when you see an athlete run into a wall. I predicted the grimaces, by the way, and I put that one up there. Yeah, metrics are hard, and there are a lot of them. I did a bunch of research. I found all these; some of them are terrible, some of them are okay. I see some more grimaces that I don't have an athlete this time to show you, but these are not good. Some of them can be dangerous. Let's talk about why and how. So, a good friend of mine told me a story about a language server. A language server, if you're not familiar with it, is an API that sits behind an IDE and feeds your IDE with information about what the function definitions are and all that jazz. So here's a quick example with a call to a function, and it shows you a little popup. It shows you the function itself, and you can see that same function below it.
So my friend told me this story, and he talks about the latency of this. The latency of this call was 10 seconds. Now that's pretty horrible. You can't really do anything with 10 seconds. So their goal was to reduce it, and they did. They reduced it down to 10 milliseconds, which is great. So I asked him, how's productivity after you reduced it? He told me people weren't much more productive. And I was like, "Why?" It turns out it was 10 milliseconds, but it was actually just wrong now. And that's completely pointless. The reason I tell this story is that it's important to make sure you don't over-index on the wrong things. Again, if you focus on latency, you might miss accuracy. Now, this isn't a story from Meta. I say this both because it isn't a story from Meta, and because legal wouldn't approve the deck if it were. [chuckle] Let's talk about DORA. DORA is a productivity framework. You have deployment frequency, lead time for changes, mean time to recovery, and change failure rate. This is basically a productivity framework meant to measure team efficiency. Now, the first thing you should notice, the first thing I noticed, is the letters don't match. That's okay. That's 'cause it actually stands for DevOps Research and Assessment.
And so this team was created by Nicole Forsgren, I believe in 2014, and the metrics came out in 2017. And it was great. But she figured we can do better. So she set up another team at GitHub. Or excuse me, Microsoft Research, I believe. And they developed the SPACE framework. SPACE has a number of different metrics. The idea here is to measure more holistically how productivity is going. And so they published this in March 2021. It was Nicole Forsgren, Margaret-Anne Storey, Chandra Maddila, Thomas Zimmermann, Brian Houck, Jenna Butler, a bunch of folks. It was great. So what's Meta doing? Well, it wouldn't be a proper deck without an xkcd. Hopefully everyone's seen this one. The idea is you have 14 standards, and folks are like, oh, it's ridiculous, we gotta combine them, have a single unified universal standard. And then you get 15 standards, because that's how that works. So instead of adding to the pile, we decided to take a step back and look at the problem. So let's do that. Let's talk about Debbie the developer. Debbie's a dev. She works on the .com website, otherwise known as the Blue App internally. Not that one. There we go. [chuckle]
So she wants to fix tech debt. She sets aside an entire week, blocks everyone's calendar, and decides, let's fix all the code. So what can she pick from? She can pick from a dozen things. Let's do some audience participation here. Who migrated a framework in the last six months? All right, now my favorite part: who is still migrating their framework? All right. Yep. Yeah. All right, there we go. So lots of things to work on. The possibilities are absolutely endless. So how do we help her? How do we tell her what code is problematic so she can fix it? Let's talk about code. Can anyone tell me what this is? What language? What does it do? What does it solve?
It's fizz-buzz. It's just fizz-buzz. So it's a nightmare. Trying to understand that is a pain. What about this code? A lot simpler. So you can optimize for a bunch of different things. You can optimize for readability, you can optimize for speed, you can optimize for... I don't know what that one's optimizing for. [chuckle] So let's talk about speed. Let's start with speed. How would we go about measuring speed?
So I wanna solve this problem. I remember the words of a wise, very smart woman in my life, which were basically: trust your instincts. Well, my first instinct was to give up, so I assumed she meant my second instinct. My second instinct was to start with velocity and see what we can get. So before we talk about code gestation time, just some background. If you were here earlier, you heard these terms as well: a diff is a PR, and landing a diff is pushing a PR. I'm gonna use that terminology because that's what I'm used to. So, CGT, code gestation time, is meant to measure how many working hours it takes to produce and land a diff. And what we do is look at when a developer starts working on a diff, and then we measure up until that diff lands. Now, there's gonna be a lot of noise with that; calendar time is gonna be very noisy.
And what we do is remove all the time when the user's not there. So lunch, PTO, weekends, holidays: any time the user's not there, we just cut that out, and that creates a much more stable metric, which we call CGT. So, the big question is: is it correct? Well, it's time, so it's kind of hard to argue, but we wanted to argue for it anyway. So we looked at a few situations. We looked at CGT for the company around the holidays, and you can see that around company holidays and performance reviews, CGT tends to spike. This is not because people are out on vacation; it's because they're blocked by folks who are out on vacation, or because they are there and they're working, but they're working on performance reviews, on thoughtful feedback for their peers and managers, and on self-assessments.
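The off-hours filtering described above can be sketched in a few lines of Python. This is a minimal illustration, not Meta's implementation: the 9-to-5 workday, the half-hour resolution, and the function name are all assumptions, and the real system also strips lunch, PTO, and company holidays.

```python
from datetime import datetime, timedelta

# Hypothetical 9am-5pm workday; lunch, PTO, and holidays are omitted for brevity.
WORK_START, WORK_END = 9, 17

def working_hours(start: datetime, end: datetime) -> float:
    """Count the hours between start and end that fall inside the workday,
    skipping nights and weekends, in half-hour steps."""
    total = 0.0
    t = start
    step = timedelta(minutes=30)
    while t < end:
        # Monday=0 .. Friday=4; only count steps inside working hours.
        if t.weekday() < 5 and WORK_START <= t.hour < WORK_END:
            total += 0.5
        t += step
    return total

# A diff authored Friday 3pm and landed Monday 11am spans ~68 calendar hours,
# but only 4 working hours: Friday 3-5pm plus Monday 9-11am.
cgt = working_hours(datetime(2023, 9, 22, 15, 0), datetime(2023, 9, 25, 11, 0))
```

The point of the calendar-time subtraction is exactly what the example shows: a weekend in the middle of a diff's life no longer makes the diff look 17x slower.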
They're not shipping code. We also looked at tenure. So when you start at Meta, you join a bootcamp. And what you end up doing is you work on very small guided tasks meant to teach you what it's like to develop at Meta, teach you all the tools and all that jazz. And you can see CGT, the time it takes, is really low in the beginning, under six weeks. After six weeks, a bootcamper will graduate and start working on a team. And what they'll run into is they have to learn about all this new code and all these new processes, maybe with some tooling that they didn't experience in bootcamp. So you can see CGT spike from there. It goes down, and it kind of stabilizes around a year. After a year, folks have hit their stride. They know what they're doing, they're now familiar with their tools, et cetera. They know their XFN partners, things like that.
All right, let's talk about Wasabi. So, Wasabi is a Python language server. Not that one. Wasabi had a huge latency reduction while preserving correctness. Not that one either. And they launched to a subset of users, and they wanted to see: does this affect CGT? And it does. They noticed that in the time it takes to author code, between the group that experienced the improvements in this Python language server and the group that didn't, there was a 20% improvement. Now, there's a QR code. If you're interested in reading more about this, there's a blog post about it. I'm gonna give you about 10 seconds and then I'll move on. And make sure to take a photo, save this, and then be completely distracted for the rest of the talk. Don't listen to me at all. And time's up.
All right. Let's talk about one more story. We're a little bit short on time, so I'm gonna skip right through this. Basically, big complexity reduction, and we saw an improvement in CGT. Yay. Not a lot of time. So we know diff CGT, we know the time it takes to write a diff. How can we use that? Because it's not very valuable on its own. Let's take a look at an example. We have diff 1; diff 1's CGT is about 3 hours. We know the files in that diff, in this case file A and file B. Let's add a couple more diffs to the mix. We have diff 2 and diff 3; diff 2 took nine hours, diff 3 took one hour. We can start to weight the CGT and then apply that to the different files.
Now, I'm gonna use averages for this. Please don't use averages, but I'm gonna use averages for this talk just to simplify things. By this logic, let's take file edits into account. Let's say file B was edited twice, and the CGT for each was three hours and nine hours, or six hours on average. All of a sudden you have a sort of effort-spent score (I'm being very vague in these slides to simplify), and that's about 12. You can do the same thing for file A, that's three, and the same thing for file C, that's one. All of a sudden you can start to rank how much time folks are spending on specific files, and find the hotspots, effectively. And again, please don't use averages. Back to Debbie. Her team was planning on working on FriendRecommender.php. She noticed it was expensive to work on. But she opened her dashboard and realized, oh, actually there's a file that's more expensive: FriendGenerator, at about 300. Excuse me. Now, the reason for that is recency bias is hard. You work on a file, you haven't touched FriendGenerator in a month or so, and you've forgotten how painful it is to work on. So what this allows her to do is actually find the hotspots in a quantifiable, objective manner.
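The per-file attribution from the slide example can be written out directly. The file names and numbers below are the hypothetical ones from the talk, and, as Karim warns, the averages are for illustration only (average times edit count collapses to a plain sum, which is part of why averages hide things):

```python
from collections import defaultdict

# Hypothetical diffs from the slide: (CGT hours, files touched).
diffs = [
    (3.0, ["file_a", "file_b"]),  # diff 1
    (9.0, ["file_b"]),            # diff 2
    (1.0, ["file_c"]),            # diff 3
]

def file_effort(diffs):
    """Average CGT per edit of a file, multiplied by its edit count,
    giving a rough 'effort spent' score for ranking hotspot files."""
    per_file = defaultdict(list)
    for cgt, files in diffs:
        for f in files:
            per_file[f].append(cgt)
    return {f: sum(v) / len(v) * len(v) for f, v in per_file.items()}

# file_b: (3+9)/2 average * 2 edits = 12 hours of effort.
effort = file_effort(diffs)
```

Sorting files by this score is the quantifiable hotspot ranking Debbie uses instead of recency-biased memory.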
All right. So we can measure the time taken to write a diff and attribute it to files. That's great. But velocity is not everything. Let's talk about reliability. In fact, it's been a while, and Debbie's team is now underwater. She can move really fast, but she's stuck fixing a bunch of bugs, a bunch of stuff, and it's just painful. So how could we help her find the problematic code in terms of reliability? For that, we're gonna have to start categorizing diffs. What if we could find the intent of a diff, the reason it was created? Cue Diffbird. There'll be another QR code here in a minute. Diffbird is a model built around diffs. It can see more data; we'll talk about that later. For now, it can actually tell us what is a bug fix, what is a feature, what is a tech debt improvement, et cetera, et cetera. And we're using ML because nobody's gonna categorize their own diffs. Like, I don't do that. I don't think anyone will. Speaking of ML... no, QR code first. Ten seconds.
Now, speaking of ML, I have a fun story. I was on SageMaker for a bit, and there was a notebook there about how ML is actually better at figuring out which plants are which than humans are. So let's have a vote here. We're gonna look at parsley and cilantro. Who believes that that is parsley? Show of hands.
All right. By definition, half of you are wrong. Who believes that that is parsley? Okay, so everyone who raised their hands is correct. They’re both parsley. That’s cilantro.
Yeah, I couldn't tell either. All right, let's talk about CGT and reliability. So we have CGT for a diff, and now we have intent. Let's take a look at the same diffs from earlier. We can categorize them: these are bugs, and these are features. Now all of a sudden we can figure out, okay, file A has had, say, six hours of bug CGT (I'm simplifying this a lot as well), whereas file C has had one hour of feature CGT. All of a sudden you can actually track which files are the most expensive in terms of on-call load, in terms of bug fixes, things like that. And what's more, you can do the same thing with bug count. So we talked about CGT and velocity; you can actually start tracking that this file has had X problems in the last six months, for example.
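Combining the intent labels with per-file CGT is one more aggregation pass. The labels and numbers here are hypothetical (shaped to match the slide's "six hours of bug CGT on file A, one hour of feature CGT on file C"), and the classifier output format is an assumption, not Diffbird's actual schema:

```python
from collections import defaultdict

# Hypothetical diffs with an intent label, as a diff classifier might emit them.
labeled_diffs = [
    {"cgt": 3.0, "files": ["file_a"], "intent": "bugfix"},
    {"cgt": 3.0, "files": ["file_a"], "intent": "bugfix"},
    {"cgt": 1.0, "files": ["file_c"], "intent": "feature"},
]

def cgt_by_intent(diffs):
    """Sum CGT per file, split out by diff intent, so bug-heavy files
    surface separately from ordinary feature work."""
    out = defaultdict(lambda: defaultdict(float))
    for d in diffs:
        for f in d["files"]:
            out[f][d["intent"]] += d["cgt"]
    return out

by_intent = cgt_by_intent(labeled_diffs)
```

The same loop with a count instead of a CGT sum gives the per-file bug count mentioned at the end of the paragraph.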
Alright, back to Debbie. So Debbie used that information to stop the bleeding, which is great. She's shipping features, fixing bugs, designing legs for avatars. It's great. But she's worried. [laughter] She's worried because in getting out of drowning, the team ended up siloing the knowledge. Debbie is the expert on credit card processing, but her teammate Shrek is actually the expert on gift card processing. I'm gonna pause for a minute. That's actually Shrek; that's an early concept of Shrek. So I ruined my own childhood, and I'm gonna ruin everyone else's here. So she doesn't want the knowledge to be siloed. She asks the team to write two wikis each, and then nobody reads them. That didn't really help. So how can we help her? Let's talk about knowledge for a little bit here.
What if we could tell her exactly what files are starting to lose knowledge? Well, we can. Surprise, we can. If we take the authors and the reviewers of specific files and cross-reference them with the folks that are still on the team, we can actually start to build a data set of who has knowledge of specific files. And so, for example, in this case you have Debbie and Shrek. [laughter] Debbie and Shrek have knowledge of the UI layer. Debbie's an expert on credit cards. Shrek is an expert on gift card processing. Nobody knows bank processor; for some reason, that person left. And so we can actually tell Debbie: you know what, these files are at risk from folks leaving. So you can look at a chart of how many files are about to lose knowledge and how many of them are safe, if you will. So that's great. But what if we didn't have to wait, and Debbie didn't have to do any manual work? If we could just do it for her, spread the knowledge automatically, that'd be fantastic.
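The cross-referencing can be sketched as a set intersection. The file names, people, and risk buckets below are hypothetical stand-ins for the slide's example; the real data set is built from diff author and reviewer history:

```python
# Hypothetical edit history: file -> everyone who authored or reviewed it.
file_people = {
    "ui_layer.php": {"debbie", "shrek"},
    "credit_card_processor.php": {"debbie"},
    "gift_card_processor.php": {"shrek"},
    "bank_processor.php": {"departed_dev"},
}
current_team = {"debbie", "shrek"}

def knowledge_risk(file_people, team):
    """Bucket files by how many current teammates still hold context:
    two or more is 'safe', exactly one is 'at-risk', and zero means the
    knowledge has already walked out the door."""
    risk = {}
    for f, people in file_people.items():
        holders = people & team  # intersect history with current roster
        if not holders:
            risk[f] = "no-knowledge"
        elif len(holders) == 1:
            risk[f] = "at-risk"
        else:
            risk[f] = "safe"
    return risk

risk = knowledge_risk(file_people, current_team)
```

Counting the buckets over the whole codebase gives the "about to lose knowledge" versus "safe" chart described above.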
Let's say we actually recommended reviewers and had that automatically spread the knowledge within a team. That'd be fantastic. Now, gotcha: we're not actually doing this yet. We are recommending reviewers, but we're not using knowledge diffusion yet. That's one of our next steps. All right, let's talk about some other metrics, very briefly. We talked about velocity, reliability, knowledge. There are so many more. We can look at code quality metrics: all the static analyzers, test coverage, code complexity, things like that. You can look at the work type breakdown: how much time are folks spending in meetings versus coding versus gathering context, right? Focus time. Focus time is incredibly important. How many gaps of, say, two hours or more does a developer have? And more than that, how much time do they actually have uninterrupted? Because you could have a block of two hours and just interrupt yourself every five minutes, which, by the way, is what I do. It's like, oh, do I have any IMs? It's a problem.
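The "gaps of two hours or more" part of the focus-time metric can be computed from a calendar with a single sweep. This is a toy sketch under assumed conventions (hours as floats, a fixed 9-to-5 day, no overlapping meetings); it does not capture the self-interruption problem, which needs activity data rather than calendar data:

```python
# Hypothetical calendar: meetings as (start_hour, end_hour) within a 9-to-5 day.
meetings = [(9.0, 9.5), (11.0, 12.0), (15.0, 16.0)]

def focus_blocks(meetings, day_start=9.0, day_end=17.0, min_gap=2.0):
    """Return the meeting-free gaps of at least min_gap hours."""
    blocks = []
    cursor = day_start
    for start, end in sorted(meetings):
        if start - cursor >= min_gap:
            blocks.append((cursor, start))
        cursor = max(cursor, end)  # handles back-to-back meetings
    if day_end - cursor >= min_gap:
        blocks.append((cursor, day_end))
    return blocks

# Only the 12:00-15:00 stretch qualifies as a two-hour focus block here.
blocks = focus_blocks(meetings)
```

Note how the 9:30-11:00 and 16:00-17:00 gaps are real free time but too short to count, which is exactly why the metric tracks long gaps rather than total free hours.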
Then of course, you can look at tooling duration: how much is your build run time, your test run time, all that jazz. Very important. All right, so I don't actually have any data on this slide, but there was a really cool template for a Meta Quest headset, so I just put it on there. I literally have no data on this slide, so you'll just have to stare at a blank screen for a bit. Super important: don't tell your devs what to do. I've seen this before, not at Meta, but I've seen this before, where folks will just say, you know what, everyone has to reduce their code complexity by X amount or to X amount, and that just doesn't work. So here's an example. Let's say you're the cart team at rainforest.com. The problems that you have to deal with are not related to the complexity of your code. Generally speaking, the complexity lives in the infrastructure layer, right? And so that's where your pain is. So telling that team to reduce their code complexity is not really gonna get you anywhere; they already have low-complexity code.
Take a look at a different team, say tax. There are 195 countries in the world, and 50 states in this one alone, each one with hundreds of rules about tax. I made up the hundreds number; I don't actually know a lot of rules around tax. Telling them to reduce their code complexity is gonna be impossible. It's not gonna get you anywhere. So again, having a single person dictate and prescribe what everyone else needs to do is just not gonna get you anywhere. Instead, take advantage of everyone's brains. You're paying your devs a lot of money to come up with all these ideas; trust them. So my recommendation is to give them, say, 10 to 20% of their time to do whatever they think makes sense, 'cause they're gonna be the experts in their field, and they're gonna be able to figure out exactly what makes the most sense to go ahead and fix. It could be test coverage, it could be code complexity, it could be meetings, it could be whatever.
So we talked about Debbie, but really Debbie was just a placeholder. These were real teams at Meta. Now, when I say real teams, I mean many, many different types of real teams. I'm not even gonna talk about personas, which is actually what's on the slide, for a second here. In my org alone, there are dozens of personas. Every developer is different. A lot of developers in this area will start to think, oh, I'm a developer, I know how all my users are gonna act, because I'm a developer and they're developers. But that's not how that works. There are so many different personas. As a developer, you need to know your customer. You need to know exactly who you're building things for, and you need to put yourself in their shoes. That's incredibly important. One size does not fit all.
So we have all these metrics. How do we show them? I'm gonna talk about the team productivity dashboard. Again, everyone has one these days, so I might as well talk about ours for a bit. We wanted to build a tool where action was at the forefront. We wanted to make sure that we didn't just show you some data and have you think, oh, that's cool. Here's an analogy that I like to give. Let's say I could tell you exactly how many dishes you've tried over time, right? It would probably be an increasing graph, hopefully, because logic. And your first reaction would be like, okay, neat. Now what? Well, now what? Your first question should be: how do I compare to others? So I've had X dishes. I'm above friend A, I'm below friend B. That's great. Your next question should be: what do I do with it, right? Do I want to increase it? Do I care about it?
And the next question is: how do I increase it? How do I reduce it? Whatever it is, how do I keep it flat? Because you can go eat 24/7 for a little bit to get all the dishes up, or you can be smart about it and have a new dish every month or so. So we want action at the forefront. We decided to group our metrics into productivity metrics and investment area metrics. Productivity metrics are gonna be things like velocity, on-call load, sentiment, right? Things that indicate the overall health of a team, but not things that you want to change directly. 'Cause you can't just go, hey team, move faster, or hey team, feel better. I mean, you can, but you shouldn't. So instead, we wanted folks to invest in the investment area metrics. These are gonna be code quality metrics, tooling duration, documentation quality, in-code wikis, what have you, meeting hours, focus time, all that jazz. And then improving those is going to improve the overall team health, happiness, all that jazz.
So this is a heavily sanitized dashboard. It's a single page, so it looks a little bit boring, to be very honest. But we have a layer where we show all these metrics, and then you can deep-dive into every specific one to understand: how can I improve this one? How can I reduce my meeting hours? How many of my meetings are one-on-ones? How many of my meetings are interviews? How many of my meetings are just blockers? How much time do I actually have to work on code? All that jazz. All right, we're gonna talk about what's next. We're almost there. You all are troopers. So, what's next? We wanna predict things. We wanna be able to predict not just what code you should work on, but what kind of fixes you should make. We can highlight file A, file B, that's fine. But we wanna also be able to say, "Hey, removing this deprecated module is gonna allow you to move faster, 'cause it's blocked you in the past."
Or, refactoring this class out is gonna reduce the amount of code churn in this specific file, the number of authors working on it and conflicting. We don't wanna do that. We also want to predict exactly what kind of improvements to make. Not just tech debt, but, hey, you're in a lot of meetings. You don't need to be in this many meetings, for example. And then we want integrations. Oh, I forgot to... okay. And then we want integrations. So we also wanna bring these directly to the developers. Having someone go out to a different tool to understand what to improve, you know, it works, but it's not right there in their face. So what we wanna do is integrate with the IDEs and show: hey developer, you're working on this file; by the way, here's a problem, you should fix it, and if you click this button, it creates the diff for you and you can get started right away. We're gonna do the same thing with, say, calendar, right? You're about to block someone's focus time pretty badly. Surprise, we don't do this yet. You probably shouldn't. All right, so Q&A time. I have a hidden slide, so if folks ask the right question, that one will pop up. Thank you.