At LinkedIn, many diverse teams exist: backend, frontend, mobile, and more, each with patterns and nuances that no single team can fully understand. Leveraging four kinds of analytics (descriptive, diagnostic, predictive, and prescriptive), the Developer Insights team is building next-generation developer experience dashboards powered by modern data science and AI models.
The LinkedIn Developer Insights team shares how they capture observability and telemetry metrics from builds, CI jobs, merges, artifacts, and source repositories, across all developers, platforms, and projects, to build developer experience dashboards with actionable insights for application teams.
Grant Jenks is a technical leader with 15 years of experience in turning research and product ideas into high-performance software. For the last three years, he has worked at LinkedIn in the Developer Productivity and Happiness organization within the Developer Insights team. Developer Insights works like a “Fitbit for engineering teams” to identify and improve pain points in developer workflows. Prior to LinkedIn, Grant founded an adtech analytics company and applied his expertise in distributed systems and machine learning to predict search engine rankings. He pivoted from his initial role as a compiler engineer on the Midori OS research and incubation project at Microsoft to his current work in analytics.
Shailesh Jannu is a hands-on technology leader with more than 20 years of experience in engineering, architecture, product management, and product development. He has a passion for developing solutions using AI, machine learning, IoT, and big data. He is an expert in building and releasing world-class enterprise software applications, as well as platform, middleware, and infrastructure components for leading global enterprise brands.
At LinkedIn, the team uses Gradle Enterprise along with other tools to capture developer data and transform it into actionable goals that optimize developer productivity engineering efforts. You can also use the Gradle Enterprise API to collect structured data from a Build Scan™ and aggregate it across all builds and tests, whether from developer machines or CI. This data includes dependency management, system resources, and infrastructure, among other things.
Interested in Developer Productivity Engineering? Try these next steps:
- Learn more about how you can use the Gradle Enterprise API to capture data across all builds/tests run on both developer and CI machines.
- Watch DevProdEng Lowdown with Grant Jenks from LinkedIn which dives deeper into how the LinkedIn developer insights team practices DPE.
- Check out Gradle Enterprise for Developers training, to see what structured data a Build Scan captures that can be pulled out of Gradle Enterprise via the API to generate DPE/DevEx observability metrics.
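As a rough sketch of pulling build data out of an API like the one mentioned above, the following Python example walks a paginated build list. The host, endpoint path, `fromBuild` cursor parameter, and record fields here are assumptions modeled loosely on the Gradle Enterprise API, not its exact shape; consult the official API docs before relying on any of them.

```python
import json
import urllib.request
from typing import Callable, Iterator, List, Optional

# Hypothetical server and endpoint; real deployments differ.
API_URL = "https://ge.example.com/api/builds"

def http_page(from_build: Optional[str]) -> List[dict]:
    """Fetch one page of build metadata over HTTP (illustrative only)."""
    url = API_URL + (f"?fromBuild={from_build}" if from_build else "")
    req = urllib.request.Request(url, headers={"Authorization": "Bearer <token>"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def iter_builds(fetch_page: Callable[[Optional[str]], List[dict]]) -> Iterator[dict]:
    """Walk the paginated build list until an empty page comes back.

    `fetch_page` is injected so the pagination logic can be exercised
    without a live server; pass `http_page` in real use.
    """
    cursor = None
    while True:
        page = fetch_page(cursor)
        if not page:
            return
        yield from page
        cursor = page[-1]["id"]  # resume after the last build seen
```

Injecting the page fetcher keeps the aggregation logic testable offline, which matters when the data spans tens of millions of builds.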
Grant: Good morning, all. I guess I should say good afternoon, huh?
Everybody have a good lunch? Yeah. I thought it was fantastic. I'm excited to be
here. You are here for Descriptive to Prescriptive Analytics and Beyond for
developer productivity. So let's get started. As Rooz mentioned, we are part of
an org called Developer Productivity and Happiness. Shailesh and I are both
senior staff engineers on the team. He is the tech lead for the data science
portion of it. He'll be sharing later today. I am the tech lead kind of for the
team overall, engaging with partner teams and with teams across the company. And
I love this quote from Richard Hamming back in 1962. He said, The purpose of
computing is insights, not numbers. The purpose of computing is insights, not
numbers. And I want to remind us today that even as we do all this discussion of
developer productivity, this is not about reducing people to a set of numbers or
reducing work to a set of numbers. This is about insights. And for us, we call
that actionable insights. Here's the scale of LinkedIn engineering. We're
talking, you know, thousands of engineers across multiple product lines and
product orgs. We have tens of thousands of repositories. We do tens of millions
of builds. These are actually local builds that developers are doing. We have
tens of millions of CI jobs that are kicked off as part of the development
cycle. There's hundreds of millions of lines of code and so on. Now, I will
caveat some of these are snapshot numbers. We're not creating hundreds of
millions of lines of code every quarter. And some of these are per quarter. So
just keep that in mind. What's most important about scale is that not everything
grows linearly. Right. We know that the number of social connections within a
network actually grows with the square of the size of the network. And likewise,
software complexity often grows, sometimes with dependencies exponentially. And
so we're in this position of trying to measure a very complex ecosystem and
trying to make sense of it. And the way I think about these numbers is kind of
like if you think about the weather, you have thermometers and you have
barometers and you have altimeters and you have things that are measuring
salinity and all these different kinds of things. And together those paint a
picture for you of an ecosystem. And so that's kind of how these numbers are for
us. These are different kind of indicators. But by themselves, they don't tell
us much. This is how we see developer productivity. This is kind of the what
framework, if you will. This is what we're trying to measure.
Developer productivity is broken down at LinkedIn into three components. The first is happiness (I call this satisfaction, which works better in the acronym): the happiness of developers. How satisfied are they? This is fundamentally qualitative, right? We need to ask people: are you excited to come in to work in the morning? Is the tool working well for you? Are you glad that you're using it?

We also ask how effective our tools are, and sometimes we boil that down into a success rate. When you do a build, how often does the build succeed? When you submit a CI job, how often does the CI job succeed? We even think of that in terms of reliability. There are some parts of CI that teams are responsible for, like your tests, and there are some parts of CI that we're responsible for, like checking out the code as part of the CI job. When that checkout fails, we need to page someone who's on call to address it as quickly as possible. A lot of these metrics are both component and aggregate: we know the success rate of an individual test, and we know the success rate of the overall CI workflow.

That works also for the last one here: efficient. That's all about durations, how long did that take you, and it's fundamentally quantitative. We measure things like how long it took to get an individual response to a code review or to an update to a PR, and, in the larger context, how long it took for the whole PR to go through code review, get merged, go through CI, get ready for deployment, get deployed, and then go out into the world. So these get broken down into both component and aggregate metrics.

There's a whole other framework for the how, for how you come up with these metrics. We call that goals, signals, and metrics. And what you don't see here, which I think is really worth calling out, is volume.
The count of builds, the count of pull requests, the number of deploys: that's not something we put a ton of emphasis on. Do we know the numbers? Absolutely. We have to know them. But we don't go and put them in some big dashboard, and we don't email them to engineering managers across the company, because ultimately we're about velocity, not volume. And there's a whole set of quality metrics that are not really part of today's talk, but we could go through all the usual suspects: coverage, mean time to resolution, mean time to detection, and everything else.

This is one of my favorite charts, and it's part of descriptive analytics. It shows Gradle build download speed over time. On the x-axis here is weeks, and on the y-axis is internet bandwidth. What you see is that the company experienced a massive slowdown. You can maybe guess what that slowdown was: this was shelter in place. You all have terrible internet at home compared to the office. That's one of the things we had to compensate for and even invest in differently. I want to say a big thanks to Gradle Enterprise, because it was Build Scans that gave us this kind of data. If you're interested, tomorrow at 11 a.m. Shivani and Swati will be doing a talk about remote builds, and we'll talk about how we completely eliminated this problem. I don't just mean how we improved things. I mean how we completely eliminated this problem. We have tried to completely get rid of cold builds at LinkedIn.
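A descriptive chart like the download-speed one above boils down to grouping raw samples by week and taking a robust statistic. Here is a minimal sketch; the `(timestamp_ms, bytes_per_second)` sample shape is illustrative, not the actual Build Scan schema.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import median

def weekly_median_speed(samples):
    """Median download speed per ISO week.

    `samples` is an iterable of (timestamp_ms, bytes_per_second) pairs.
    Median rather than mean, so a few outlier machines don't skew the trend.
    """
    buckets = defaultdict(list)
    for ts_ms, speed in samples:
        dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
        iso_year, iso_week, _ = dt.isocalendar()
        buckets[(iso_year, iso_week)].append(speed)
    return {week: median(speeds) for week, speeds in sorted(buckets.items())}
```

Plotting the resulting per-week series is what makes a step change like the shelter-in-place slowdown jump out.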
This next slide is a little peek into our qualitative assessment, our qualitative measurement. What you're looking at here is a bunch of CSAT scores for different tools. We have a personas framework where we divide engineering into mobile engineers, backend engineers, SRE types, tools engineers, and so on. And then we assess: what is their opinion of this tool or this service or this workflow? We gather that feedback on a constant basis. What's important to us is that we gather it as quickly as possible and as relevantly as possible. You know, imagine if you went on an Uber ride and then three months later you were asked for your opinion about it. That is too often how these surveys work. So we have a framework in place so that just as you finish something, we can get your feedback on it, and all of that rolls up here. What's not pictured is the hundreds of dimensions that are also recorded: things like the size of a change, the size of a PR, the region or time zone that the author or the builder is in, and download bandwidth. As we've seen, that's all part of this framework.

Now, we have tried to move into predictive analytics. This was something that we called the developer journey, which meant that we tried to track what a developer is doing as they go through their entire day-to-day flow. It's kind of a mess, right? I mean, you look at this chart and you think, wow, what developers do is chaotic. They're moving through all different kinds of systems and things, and there's no huge pattern that emerges here. We had, honestly, kind of mixed results with this analysis. It taught us a lot of new things, but it didn't work the way we wanted it to. We had really hoped that we'd be able to create what's called a next-best-action type platform where we could predictably recommend to developers: this is the thing you should do next.
And as we analyzed this, we realized that's not going to be possible. It is interesting, though, from a referrer perspective. If you're familiar with the internet, you know that we track links, we track visits across different websites. If you start to think of these nodes, which are different tools and services, as web pages, we can understand what you were doing before you got here. In the upper corner you'll see one called Supportal. That's our support portal for internal tools, and all of the arrows feeding into it were incredibly interesting for us: what's the likelihood someone goes to X and then opens a support ticket, or what's the likelihood someone was doing Y and then had to go open a support ticket?

On the prescriptive side, we are developing something that we call Insights Hub, or iHub for short. This was really in response to a question we would get: how do you accumulate all the information at LinkedIn engineering in this analytics space, and how do you accommodate both the executive at 10,000 feet who's trying to understand the health of the overall ecosystem, and the senior IC or software engineer who wants the nitty-gritty details? One person has to be able to look at the overall build time across all of LinkedIn, and somebody else needs to be able to open the individual Build Scan or Bazel build report or whatever it is, to understand: oh, this pull request faced this issue, and that's how we can resolve a problem on a team or in a workflow. This combines the objective, scores like developer build time, with the subjective, like the CSAT score. So we're trying to blend and provide a holistic view of both objective and relative measures. We're also doing that on the left sidebar, where you'll notice something called the team's experience.
This is scored from one through five. It's what we call the experience index. It is not a performance measure in any regard. It doesn't say "here's the performance of the team"; it tries to gauge the overall experience of the team. That is an objective measure: we actually assign what those scores should be. We tell people, hey, your builds are slow. We combine that with a relative measure: lower on that sidebar, you'll see that we show where a manager ranks within their org relative to their peers, so we can see, hey, are you in the middle of the pack, or are you at the bottom or the top?
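The referrer-style analysis Grant describes, estimating how likely a developer is to land on the support portal from each tool, amounts to counting transitions between consecutive tool visits. A minimal sketch, with an illustrative event model (ordered per-developer visit sequences) rather than LinkedIn's actual schema:

```python
from collections import Counter, defaultdict

def transition_probs(sessions):
    """Estimate P(next tool | current tool) from ordered tool-visit sequences.

    `sessions` is a list of per-developer visit sequences, e.g.
    ["build", "ci", "supportal"]. Each adjacent pair counts as one transition.
    """
    counts = defaultdict(Counter)
    for seq in sessions:
        for src, dst in zip(seq, seq[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(c.values()) for dst, n in c.items()}
        for src, c in counts.items()
    }
```

Reading off `probs[tool]["supportal"]` then answers "what's the likelihood someone goes from this tool straight to a support ticket."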
Productivity, again, is not the same as performance, and we stay away from employee performance as much as possible. We really do coach managers and engineers not to look at these numbers and think that if a number is low, you're doing badly. What's really the case is that if the number is low, then you need to prioritize your developer productivity. And we do additional work to make sure that when it comes time for promotions and reviews, this information is not used. It's important not to make a dashboard for all to see that emphasizes something like lines of code, or number of PRs, or number of deploys, so we don't do that. That doesn't stop engineers from looking, right? You can't hide the number of commits happening to your repos, and engineers will look at that occasionally. But you don't want to broadcast it to engineering managers in emails; you don't want to put that emphasis on it. So I'm going to hand it over now. We've just gone through four analytics types: descriptive, diagnostic, predictive, and prescriptive. If analytics types are a progression, what comes next? Shailesh.
Shailesh: Thank you. Thanks a lot, Grant. Yeah. So these types of analytics have served us very well so far, and they will continue to do so. But what is next? We think it's augmented analytics. What is augmented analytics? Basically, we are using AI and ML models to assist us: to build deeper insights, to augment our tools and make them more intelligent, and to create new intelligent tools. Let's look at some of the use cases where we have been applying this.

The chart on the left is a normalized distribution of pull requests created by developers using an IDE versus not using an IDE, and the x-axis shows the tenure of the developer at LinkedIn. This was a hypothesis we wanted to test: does a person with a longer tenure use the IDE less and the command line more? The chart clearly shows that this hypothesis is disproved; when we looked at some statistical models, we figured out that it doesn't hold.

We use frequent-pattern mining algorithms, data mining algorithms like FP-Growth, to detect which dimensions frequently occur together. This is detecting patterns in the data, and it is especially useful when you are doing root cause analysis. For example, the chart here shows deployment failures. When you see the deployment failures, what are the error types that frequently appear together with other dimensions like location, the type of the data fabric, and so on? This is extremely important when there are lots and lots of dimensions and they keep changing, all at different frequencies.

We also use an anomaly detection engine, ThirdEye, which LinkedIn open-sourced a few years ago, to detect anomalies at scale. These anomalies are detected at the metrics level, in our pipelines, and every tool that a developer uses goes through this instrumentation.
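The frequent-pattern mining step described above can be approximated with plain itemset counting. The real pipeline uses a proper FP-Growth implementation, which avoids enumerating every combination; this brute-force sketch over small itemsets just illustrates the idea, with an illustrative "dimension=value" record shape:

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(records, min_support=0.5, max_size=3):
    """Find dimension-value sets co-occurring in >= min_support of records.

    Each record is a set of "dimension=value" strings from one failed
    deployment, e.g. {"error=timeout", "fabric=prod", "region=us-west"}.
    Returns {itemset tuple: support fraction}.
    """
    counts = Counter()
    for rec in records:
        items = sorted(rec)  # canonical order so identical sets match
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    n = len(records)
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}
```

An itemset like `("error=timeout", "fabric=prod")` with high support is exactly the kind of co-occurrence that points a root cause investigation at one fabric.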
If we find an anomaly in any metric or pipeline, we tag it and take action on it. These anomaly detection algorithms take care of seasonality and also the historical trends. We also use deep learning to create embeddings. One such example is on the chart here: we take all the repos and try to create a learned representation of each repo, to see which repos frequently build together, have similar dependencies, use the same language, and are trying to do the same thing. This helps us in doing causal experimentation.
One such example: say there are a few repos whose developers have moved to remote development. We can calculate the impact, and estimate how much a new, very similar repo would gain by moving to remote development too. Similarly, we build collaboration graphs. For example, we started building a graph to detect communities of code authors and code reviewers who frequently work together. We wanted to see the cohorts of people who work together, and this helps us find not only those cohorts but also the repos they collaborate on.

Recently we also started using code as data. So far we were only looking at metrics based on instrumentation of our tools, but now we treat code itself as a dataset, so that we can track code changes. Here is an example of one such tool we built recently: a semantic, natural language code search. Most developers who are new to LinkedIn struggle to find out how to do certain things. Documentation is great, but it's very difficult to keep documentation up to date. We already have a very successful keyword-based code search, but how can we use natural language to help new developers adopt the LinkedIn codebase? Here is an example, where we have a natural-language statement like "How would I send a bunch of messages to Kafka?" and you see the model tries to find all the semantically relevant code and surfaces it for the user. This has helped a lot, both for new engineers and for tenured engineers. I have a quotation from one of the testimonials, from a recent developer. He said he had joined the company just two months ago, and this tool had been extremely helpful for finding the right snippets. What we did here was take the entire LinkedIn source code and build a language model out of it.
We took every snippet of code, we extracted the comments and the function name, and we built a model that aligns the natural language part and the code part. The model is trained to find similar code based on a natural language query.

Taking it further: developers don't only use an IDE. They also use a lot of command line tools, and we have hundreds of command line tools at LinkedIn. It is extremely difficult to remember the commands, especially when there are so many options available. Here is an example of one such experiment we are doing with a CLI. In the command line, a developer wants to know how to add memory to a host, so we just asked our internal tool (this product is called Coda) how to add memory to the host. The examples here are real examples, from successful commands that were executed in the past by someone else at LinkedIn. How this works: as part of our quantitative data pipeline, every command line tool that we build at LinkedIn has an instrumentation pipeline, so we track which commands succeed and which commands fail. We took all of those commands and mapped them to the help text for each command, building a dataset of all the commands that exist and all the help that is available for them. And you can see very clearly that these two examples come out of that dataset. There is no mention anywhere of which tool it is; the model doesn't know. The model is generic enough that it can adapt to any new command line tool. Here, we just asked how to add memory to the host, and the model tries to find the most relevant help and the command that would be most relevant to you.
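Both the code search and the CLI recommender above are retrieval problems: embed the query and the candidates, then rank by similarity. As a toy stand-in for the learned embedding model (the production systems use language models trained on LinkedIn data; this sketch just uses bag-of-words vectors and cosine similarity), retrieval looks like:

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Toy bag-of-words 'embedding'; a trained model replaces this in practice."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, corpus, k=3):
    """Rank corpus entries (snippets, help texts, tickets) against the query."""
    q = vectorize(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, vectorize(doc)), reverse=True)
    return ranked[:k]
```

Swapping `vectorize` for a neural encoder is what turns this keyword-ish search into the semantic search described in the talk; the ranking machinery stays the same.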
Here we show the help text for the command. This command is used by a lot of our SRE engineers to increase or decrease the memory for particular hosts. This experiment recently won the LinkedIn hack day Dream Big award. So I'll move to the next one.

We know code is important, but our developers also spend a lot of time on knowledge, and it's extremely difficult to find knowledge in one place. We have an internal tool called Supportal, where we gather all the JIRA tickets and all the internal Stack Overflow questions, and we are also indexing wikis, so that if a developer experiences an error or is stuck somewhere, they can come and search here. It was earlier a keyword kind of search, but we revamped it to use semantic, natural language search. We indexed everything: all the wiki pages, the Stack Overflow questions, and the JIRA tickets, and we built a large language model on the LinkedIn internal dataset. In the example here, a developer comes in and searches for how to do a certain thing, and the model tries to find the JIRA ticket or the Stack Overflow question and answer that best matches, against LinkedIn's internal data. After we shipped this new model, we have seen an uptick in engagement. It is still early to call out the bigger impact because it just went into production a few weeks ago.

Then we started looking into code intelligence: how can we build models to help developers be more productive, and what new kinds of tools can we build on them? We started a collaboration with a code intelligence team at Microsoft Research, and here we are talking about two different models. One is a code similarity model, which, given a function, searches your codebase.
It tries to find the most similar function. How is this model trained? We take a function, we obfuscate it, and we tell the model that the two functions are the same. Eventually the model learns that it should not be looking at the variable names; it should be looking more at the function names, the shape of the code blocks, and how the data flows inside the code. We have close to 84% accuracy with this model in detecting similar code in our codebase.

The next model is a code reviewer model. Wouldn't it be great if an A.I. system could just look at the code reviews of the past and recommend reviews on new code? Here we took the code diffs happening at LinkedIn. We have more than ten years of code review data, and it's very high-quality data, so we looked at how we can build a model that recommends reviews based on those past years. This model is trained using a contrastive learning objective, and it's a generative model: it looks at the code diff and at the review that was given. It can do three tasks. One task is generating a review for you. The second is code refinement: for example, you have the diff from before the review, the developer accepted the review, and the review changed the diff; based on that, the model can recommend, OK, here is a review, and if you make this change, it will address the review as well. The third is quality estimation. If you have a code diff that is hundreds and hundreds of lines, it is very difficult for a developer to know which part of the code is most likely to get reviewed. The model can tell you the probability that you should be looking at certain lines, because lines like them were reviewed at a much higher rate in the past.
These models are open source and available on Hugging Face, with thousands of downloads so far. They are called UniXcoder and CodeReviewer. There is also a paper, which we presented at the ESEC/FSE 2022 conference, on automating code review activities with large-scale pre-training.
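The training-pair construction for the similarity model, obfuscating a function so the model must rely on structure rather than names, can be sketched in pure Python with the standard `tokenize` module. This is a simplified variant that renames every identifier (the talk notes the real model learns to ignore variable names in particular); it is an illustration, not the actual pipeline from the paper.

```python
import io
import keyword
import tokenize

def obfuscate(source):
    """Rename every identifier in `source` to a positional placeholder.

    Two functions that differ only in naming obfuscate to the same string,
    which is exactly the "these are the same" signal used as a training pair.
    """
    mapping = {}
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            # First-seen order gives stable placeholders: VAR0, VAR1, ...
            name = mapping.setdefault(tok.string, f"VAR{len(mapping)}")
            out.append((tokenize.NAME, name))
        else:
            out.append((tok.type, tok.string))
    return tokenize.untokenize(out)
```

For example, `def add(a, b)` and `def plus(x, y)` with identical bodies obfuscate to the same text, so a model trained on such pairs is pushed toward structural rather than lexical similarity.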
Now, sometimes things are difficult for a developer right inside the IDE. We've seen an example of a model working in a web browser and an example of a model running inside the command line; this one is within the IDE. Sometimes when you're coding, you want to look at examples of an API you are using, or at how some third-party API has been used internally. It is extremely useful to find the exact code snippet via code-to-code search, because it is often difficult to express code in natural language. So here, you have some code, you highlight it and say "show internal usages." The deep learning model consults our code similarity model, finds the similar code that is out there, and brings it in.

Our next experiment is on code review: we have built an A.I. bot called Casper. In deep learning there are two approaches here. One is extractive, information retrieval: you look for a similar thing and surface it, like a semantic search domain. The other is generative: you feed the model some piece of code and tell it to complete the rest. Casper uses the extractive approach. We have mined lots of code reviews and built a model where Casper acts as our A.I. reviewer: it looks at a given pull request, tries to find similar pull requests from the past and the reviews they received, and surfaces those to the user. We call it an A.I. assistant because it doesn't act on the pull request on your behalf, but it still helps you by surfacing the code reviews that happened. So yeah, we started with this.
And that is where we think this is going: augmented analytics. Thank you so much. This was the effort of a lot of teams and a lot of collaboration at LinkedIn, with many different teams involved, and I would definitely like to thank LinkedIn leadership for giving us the opportunity to present here. Also, tomorrow, please attend Shivani and Swati's talk about remote development at 11:00, right here.
Grant: Thanks for letting us share.