What’s inside?

Dumping all your code into a single shared mono-repository is all the rage these days, driven by the desire to lessen the pain of “dependency hell” and version control conflicts. However, monorepos present challenges to project modeling, performance, and the user experience that modern developer tooling is only just beginning to address.

Learn how large repositories (aka “Gigaprojects”) can threaten to break IDEs like IntelliJ IDEA. Discover the drawbacks of working with a monorepo and get some new ideas and approaches to solving these problems by looking at them through the DPE lens.

Summit Producer’s Highlight

Developers use an IDE to be more productive, relying on time-saving features like fast access to code, shortcuts, and plugins. Yet developer productivity is often compromised by working in a large monorepo. Factors like high memory consumption, protracted indexing, and the need to manage large file systems and thousands of fine-grained modules often cause your IDE to operate at a snail’s pace, if it doesn’t break down completely.

Learn how the IntelliJ IDEA team is exploring DPE concepts to mitigate some of the major disadvantages of monorepo environments. Approaches discussed include reducing the working scope of the codebase, remote development practices to speed up the environment, and parallelizing work to a CI server with pre-generated indexes.

About Justin

Justin believes in “Tools before Rules.” This means automating the development toolchain to remove the friction caused by manual processes. He works on this goal as team lead for IntelliJ IDEA Bazel support and as a customer success engineer at JetBrains.

More information related to this topic
Gradle Enterprise Solutions for Developer Productivity Engineering

Gradle Enterprise customers use Gradle Enterprise Build Scan® to rapidly get deep insights into the state of their monorepos via various DPE-based metrics and concerns, like build and test performance, flaky tests, dependencies, failures, and regressions. Additionally, build and test performance acceleration features, including Build Cache, Predictive Test Selection, and Test Distribution, address the pain of slow builds and excessive context switching caused by avoidable disruptions to the developer’s creative flow. Learn more about how these features can help you boost productivity by starting with a free Build Scan for Maven and Gradle Build Tool, by watching videos, and by checking out our free instructor-led Build Cache deep-dive training.

Check out these resources on keeping builds fast with Gradle Enterprise.

  1. Watch our Build Scan Getting Started playlist of short videos to learn how to better optimize your builds and tests and to facilitate troubleshooting.

  2. See how Test Distribution works to speed up tests in this short video.

  3. Sign up for our free training class, Build Cache Deep Dive, to learn more about how you can monitor the impact of code generation on build performance.

JUSTIN KAESER: All right. DPE Summit, that’s us. We need to talk about how your monorepo breaks the IDE. But first I want to know a bit better who I’m talking to. I’ve talked to a lot of you, but not everybody. So first of all, are you familiar with the monorepo concept? Yeah. I figured that’s most of you. That’s not the case everywhere. Who works with one? Okay. Wow. So many. Your fault. So I don’t think I need to explain much about what a monorepo is in this audience. First, a bit about me. I’m Justin. I work at JetBrains; I joined about six years ago. I worked on the Scala plugin for most of that time, mostly on build tool integrations, specifically SBT. When I think of what projects look like in SBT, or in other classical build tools like Maven and Gradle (I’ve mostly been working in the JVM world): you download some dependencies once, or when you add some; when you initially open the project, it does a little work, and in a minute or a few minutes you have your project and can start working with it. The projects are mostly just a few dozen modules that you defined yourself, usually dozens of megabytes, rarely more than a gigabyte of code and other assets.

So life was good, kind of, but projects and complexity keep growing. So another know-your-audience question: who here still works with small projects, just a few megabytes? No. Yeah. Wow. Okay, tens or maybe hundreds of megabytes? A few. Are you all on gigabytes? Tens of gigabytes? Too large for your laptop? Okay. Yeah. Wow. Your fault. That’s not the only relevant metric, but it’s one that’s fairly easy to measure. Now, more of my tragic backstory: I got into a bit of a rut over the pandemic; I wasn’t feeling super productive in the Scala and SBT world.

So I started something we call at JetBrains Customer Success Engineering. The idea is basically: I talk to important customers like you all, listen to your problems, and understand your concerns. And I say, “Yeah, I feel you.” Sometimes I also try to effect change internally: put you in touch with the right people, or poke someone to say, hey, we have this problem here that maybe isn’t affecting the standard user, but everyone in this large enterprise company is blocked by it, so please put a bit of effort into that. That’s part of what my role is. And talking to all these people, we found out many of you are now into Bazel: you’re already using it, or you’re at least interested in it, looking at it.

So did I ask who uses Bazel here? Again, quite a few, but not everybody who uses large monorepos. That’s also interesting. Who’s thinking of going to Bazel in their organization? Also a few. Okay, right. This also varies wildly by audience. I was at another conference recently where I had to explain to almost everybody I talked to what Bazel even is. So that’s kind of interesting, right? I mean, it’s not as if developers follow trends and fashions; we’re rational people, right?

Anyway, we made a Bazel team because there’s a need for it, and somehow I became the team lead. So that’s what I mostly focus on these days. Here I was going to explain what Bazel is, but I hope most of you already know; if not, ask your neighbor, or me later. So, monorepos. Maybe you already know why you’re using one. From what I understand, the promise of a monorepo is to ascend from dependency hell. Instead of publishing individual libraries’ artifacts to Artifactory or some other artifact repository, internally or publicly, you just depend on a bunch of other sources in the same repository, so you don’t have to deal with versions and version upgrades.

And because you have no versions, you don’t have cascading dependency upgrades, where you have to update each individual repository when you have a little bug fix in one deep, transitively nested dependency. And of course, no version conflicts. So yeah, from talking to you and from what I heard today, the trend is now that we put microservices in a monorepo, instead of building a monolith from micro-repos, right? Anyway, some of your problems are now solved. Awesome. Instead, we get new problems, because your monorepo broke my IDE.
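To make the source-versus-binary distinction concrete, here is a minimal Bazel BUILD sketch; all target and artifact names are hypothetical examples, not from the talk:

```starlark
# BUILD file sketch (Starlark). A target mixing both kinds of dependency.
java_library(
    name = "billing",
    srcs = glob(["src/main/java/**/*.java"]),
    deps = [
        # Source dependency: another target in the same repository,
        # always built from HEAD -- no version to pick or upgrade.
        "//libs/payments:payments",
        # Binary dependency: an external, versioned artifact
        # (here via rules_jvm_external's @maven repository).
        "@maven//:com_google_guava_guava",
    ],
)
```

When the `//libs/payments` sources change, every consumer picks up the change in the next build, which is exactly the cascading-upgrade problem the monorepo avoids.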

You broke longstanding assumptions about how a project works, or how you work with a project. I’ll contrast that a bit: the classical repository versus the complex, huge monorepo. In the classic project, you have a limited project scope. You typically build maybe one primary artifact that comes out of it; you publish it somewhere, you consume it somewhere else. You define a few chunky modules, big balls of code that are often highly interdependent, and you do this mostly manually. You have primarily binary dependencies that you consume, and a mostly static project structure. Of course, in Gradle, unlike in Maven, you often do program the project structure somewhat, and in SBT as well. In the monorepo, from talking to a bunch of you, I found there are different approaches to how you deal with it.

Some of you, if the repo isn’t too large, maybe 5, 10, 20 gigabytes, just download it on your laptop and try to open it in the IDE all at once. That sometimes doesn’t go so well. Others try to define an arbitrary subset, like a bunch of targets or subdirectories, that you open as your working set. Typically, or at least in Bazel, you’ll often have not dozens but hundreds, thousands, even tens or hundreds of thousands of individual build targets, each of which does a specific task or defines a small set of code and its dependencies. And you’ll have both source and binary dependencies that you depend on.

So I mentioned “ascend from dependency hell”: don’t use binary dependencies, just depend on some other part of your whole repository. And lastly, the project structure often has more dynamic aspects; it’s not fully defined at definition time, but has to be computed. So maybe a short bit about what a build tool actually does. In general, you could view it as a kind of data-flow engine: a little spider crawling across a directed acyclic graph, running little programs and feeding their outputs, usually saving them as a file or just feeding them directly to the next part of the graph, and running a little program on the outputs there.
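This data-flow view can be sketched in a few lines. The following is a toy model, not any real build tool: each target is a “little program” fed the outputs of its dependencies, with results memoized so shared nodes run once.

```python
# Minimal sketch of a build tool as a data-flow engine over a DAG.
# Each target's "action" is a function taking its dependencies' outputs.

def build(targets, cache=None):
    """Evaluate all targets in dependency order, memoizing outputs."""
    cache = {} if cache is None else cache

    def evaluate(name):
        if name in cache:                          # already produced: reuse
            return cache[name]
        rule = targets[name]
        dep_outputs = [evaluate(d) for d in rule["deps"]]
        cache[name] = rule["action"](dep_outputs)  # run the little program
        return cache[name]

    return {name: evaluate(name) for name in targets}

# Hypothetical three-node graph: two "compiles" feeding one "link" step.
graph = {
    "a.o": {"deps": [], "action": lambda deps: "compiled(a)"},
    "b.o": {"deps": [], "action": lambda deps: "compiled(b)"},
    "app": {"deps": ["a.o", "b.o"],
            "action": lambda deps: "linked(" + ", ".join(deps) + ")"},
}
```

Real tools add caching on disk, change detection, and parallelism on top, but the skeleton is this same walk.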

The idea of this is already embodied in the good old classic Unix make tool, and the build tools we have today are usually some kind of complex extension of that, built for more generalizability, scalability, and so on. Usually we have files as inputs and outputs, plus lots of specifics depending on the build tool, the ecosystem you’re working in, and so on. Now, what’s different about an IDE? Primarily, it’s interactive. You want fast feedback when you work with it.

You want to type and immediately get your error message. You want to navigate the whole code base at once, jump around; you’re not batching through the whole thing node by node, you want to follow links between pieces of code forward and backward. And this requires that you have at least part of this project structure in memory, or at least in very fast on-disk data structures. This necessitates that you import the build: you map the structure represented in the build tool to one that you can use in the IDE. This we call importing, loading, reloading, or sync. You’re probably familiar with it. It feels like this old-timey slow process, like this girl here transcribing maps manually to the iPad.

One problem you have here: the structure isn’t necessarily stored in a format that is easy to access from the IDE. So you have to do a partial build, maybe walk through the dependency tree, to map it to a flat structure that you can load into what is basically a map, an index, on the IDE side. And this query of the build structure is already kind of expensive. So that’s one major problem, but we’ve got more.
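The flattening step of a sync can be sketched as follows. The nested input shape is made up for illustration; real build models carry far more metadata (sources, compiler settings, libraries):

```python
# Sketch of the "sync" step: walk a build tool's nested dependency
# structure and flatten it into the flat module map an IDE can index.

def flatten(module, index=None):
    """Recursively register each module and its direct deps in a flat map."""
    index = {} if index is None else index
    if module["name"] in index:        # already visited: avoid re-walking
        return index
    index[module["name"]] = [d["name"] for d in module["deps"]]
    for dep in module["deps"]:
        flatten(dep, index)
    return index

# Hypothetical nested structure as a build tool might report it.
core = {"name": "core", "deps": []}
util = {"name": "util", "deps": [core]}
app  = {"name": "app",  "deps": [util, core]}
```

Even this toy version has to touch every node once, which is why the query gets expensive as the repository grows.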

With bigger repositories, you need to build larger indexes such as these, and generally you need more metadata in your IDE, so you get more memory overhead and you need more powerful computers. When you have hundreds or even thousands of developers working on the same code base, all checking in code at the same time, that requires more updates on the tooling side, more re-indexing. And the indexing, if you use IntelliJ IDEA, you’re probably also familiar with that: it can take a while.

Another thing we often run into with monorepos, because you have source dependencies (and not only in Bazel; we also see this with other tools like Gradle and SBT), is shared sources. You might run into this if you cross-compile the same bit of code to multiple platforms like JVM, JS, and Native. For the build tool, it’s not a problem. It just walks the dependency graph and says: okay, in this target, I take this code, I compile it to JS with this compiler version, fine. And then you do the same thing for JVM. Fine.

In the IDE, at least in IntelliJ IDEA, we have this longstanding assumption that one code file belongs to exactly one module, and this module is compiled with a specific set of dependencies, a specific compiler version, specific libraries, and no others. So you can only really edit it in one context at a time, even though you would want to do so in multiple contexts.
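The one-file-one-module assumption can be sketched as a map from source file to owning module; all file and module names here are invented. The build tool happily compiles the shared file in both contexts, but a single-owner map can’t represent that:

```python
# Sketch of the IDE's one-file-one-module assumption.
ide_module_of = {}   # the IDE's map: source file -> its single owning module

def register(module, files):
    """Assign files to a module; a file may have only one owner."""
    for f in files:
        if f in ide_module_of and ide_module_of[f] != module:
            # A shared source: two compilation contexts, one allowed owner.
            raise ValueError(f"{f} already belongs to {ide_module_of[f]}")
        ide_module_of[f] = module

register("core-jvm", ["shared/Parser.scala"])
# register("core-js", ["shared/Parser.scala"])  # conflicts: second owner
```

This is why the IDE has to pick one context for a cross-compiled file, even when you’d like to edit it against JVM and JS dependencies at once.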

Another interesting challenge is refactoring, or generally repository-wide actions. I mentioned the approach of only loading a small part of the repository. That works fine while you work on it, but when you have something many other parts depend on, like some library, and you want to change a signature there, in theory the monorepo is a great use case for this, because you can change all the code over the whole code base at once. In practice, though, it’s hard to do from the IDE if you can’t load the whole code base at once. Likewise, finding usages or running completions can cause similar problems. And let’s not forget the UX, especially navigation. If you have a huge code base and open it in the editor, you might have hundreds of directories, and it’s kind of hard to find the directories you’re actually interested in working on right now, or especially the right set of directories. So usability can become an issue for some.

And lastly, not leastly, configuration complexity balloons. You probably have someone, or even are someone, who configures the builds for many others in your organization. So not every individual developer is even able to understand the build configuration. I’ve already had this problem in a fairly small team of 10 people, with a build just for the Scala plugin, where only one or two of us really knew our way around the SBT build configuration. I imagine that must be even more interesting for very large repositories.

So honestly, I’m a bit overwhelmed by all these issues. Where should I even start? Well, nothing a little brute force can’t solve. So, performance. I heard some of you found that with the new M1 Apple Silicon MacBooks, things actually work fine now. But some aren’t quite at that point yet, or it’s not enough. So we are actually doing a lot of optimization work on the IDE, old-school optimization. We do benchmarking and performance tests. We report freezes: when a user’s IDE freezes or some action takes too long, that gets sent as an exception to our servers. So please do share your telemetry; it helps us identify performance problems, and we’re working to fix these performance bottlenecks all across the code base. It’s lots of individual little fixes.

And it’s a code base that is some 20 years old, so it takes a while. Beyond the little optimizations, there are more things we can do. One area is UI responsiveness in general. We have an old-school architecture with a UI thread, and historically many people have been doing all kinds of operations in that UI thread, which can cause freezes. So we’re freeing that up. The indexing is probably very interesting to many of you: we’re working on multi-core indexing, and on more incremental re-indexing of parts of the project.

And lastly, the project model. In the background, we’ve been working on a new project model for IntelliJ and migrating some of the build tool support to this more efficient API. We’ve already done this for Maven and got some measurable performance gains there. We’re currently also developing a new plugin for Bazel, which makes use of this API.

Now, that was what I could find out talking to our people internally. But as a customer success engineer, I’ve also been talking to many of you, many customers, and I asked them: how do you deal with your giga-projects? Some of the ideas I’ve collected are quite common. One is reducing the working scope. Only importing parts of the whole repository is a fairly obvious one, like only some directories. Or importing a limited depth of dependencies: you might not want to download, import into the IDE, and index every symbol in a fourth-level nested transitive dependency, because chances are you’re not going to look at it.
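Depth-limited importing can be sketched as a breadth-first walk with a cutoff. The graph and module names below are hypothetical; the point is only that everything beyond the cutoff stays unloaded and unindexed:

```python
# Sketch of scope reduction by dependency depth: load a module's
# dependencies only down to max_depth hops, not the full closure.
from collections import deque

def scoped_import(graph, roots, max_depth):
    """Return the set of modules within max_depth hops of the roots."""
    loaded = set(roots)
    queue = deque((r, 0) for r in roots)
    while queue:
        module, depth = queue.popleft()
        if depth == max_depth:
            continue                  # cutoff: deeper deps stay unloaded
        for dep in graph[module]:
            if dep not in loaded:
                loaded.add(dep)
                queue.append((dep, depth + 1))
    return loaded

# Hypothetical dependency chain, four levels deep.
deps = {"app": ["lib"], "lib": ["base"], "base": ["vendored"], "vendored": []}
```

With a cutoff of 1 you load only `app` and `lib`; the deeper `base` and `vendored` are skipped until you actually need them.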

In that vein, you might try to import only the interfaces of your dependencies, though that has some downsides: how do you debug this? How do you understand the code deeply if you’re trying to figure out some bug that might not even be in your code, but in the dependencies? A bit more on what we call fragments: selecting a limited working scope. The basic idea is you select a bunch of targets, modules, or directories to work with, you save that selection into a file, and you can share it with a team or use it just locally.

And the IDE will only load that defined subset. In the Bazel plugin, this is supported as the so-called project view. We’re also working on supporting this more generally in IntelliJ, so having a mechanism that also works for Maven or Gradle, because from what I hear, there are still some monorepos that are built with Maven, even.
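A project view is a small, shareable text file. Here is a sketch in the `.bazelproject` format used by the Bazel IDE plugins; the directories and targets are invented examples:

```
# .bazelproject sketch: tells the IDE which slice of the monorepo to import.
directories:
  services/billing
  libs/payments
  -libs/payments/testdata   # a '-' prefix excludes a subdirectory

targets:
  //services/billing/...
  //libs/payments:payments

derive_targets_from_directories: true
```

Checking a file like this into the repository lets a whole team open the same fragment instead of the full project.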

Another popular approach is moving more work to a server. So you have pre-generated indexes that you can generate on a server for a certain state of the repository and download from the IDE. This is a feature that’s supported in IntelliJ, but it does require some infrastructure investment for your own code or for general libraries, because you would need to generate these indexes basically for every state of your project code, for every dependency, and also, the way it currently works, for every version of IntelliJ and your plugins: anything that influences the indexes. So unfortunately, at the moment, these indexes are portable but not super portable. You might have some slight difference in your configuration that causes basically the equivalent of a cache miss.

But we’re working on making this more stable. Yet another approach: pre-warming the working environment. We heard a great talk from the LinkedIn folks on remote development. Have you seen that one? Yeah, quite a few of you. That’s exactly a use case we want to support, and we’re working on making it more stable at this moment. The basic concept is you have a headless IDE running on a server. You connect to the server, through SSH for instance, and you get a relatively thin client that handles most of the user interface.

Now, the server can be pre-warmed like they do at LinkedIn. But as you heard, this requires a lot of infrastructure investment; the way it is now, it took them, what, almost two years until it was at this state. But I hear people love it, and we’re working on making this easier in the future. We also have, with JetBrains Space, our own orchestration environment that can help with this. Now, more into the realm of, I hesitate to say science fiction, but vague ideas.

When we talk in the team about what else we could be doing in the future, once we get the basic problems solved, there are some ideas like lazy project loading. Why would you want to preload a whole huge monorepo, or even a fragment of it, if you only work in a small part of that fragment? Why can’t you just navigate to some piece of code and have it load the deeper metadata on demand? But it does seem to require some preloading just to understand the structure of the code in the first place. You could also further incrementalize the syncs, so the build tool understands which part of the project has actually been recently updated.

And then you only update that particular part. In the Bazel plugin, this is already kind of supported, but you have to do it manually: you can navigate anywhere, then manually trigger a partial sync, and you get the data and dependencies of that part of the project. I heard some companies work like that. Another interesting one: you can generate not only the indexes on the server, but also the project structure itself, the export part, the query of the project structure that we currently do locally.
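The incremental-sync idea, re-syncing only what a change affects, amounts to a reverse-dependency walk. A toy sketch, with an invented graph:

```python
# Sketch of incremental sync: given targets changed since the last sync,
# re-sync only them plus everything that depends on them, leaving the
# rest of the project model untouched.

def affected_targets(reverse_deps, changed):
    """Changed targets plus all their transitive reverse dependencies."""
    dirty = set()
    stack = list(changed)
    while stack:
        target = stack.pop()
        if target in dirty:
            continue
        dirty.add(target)
        stack.extend(reverse_deps.get(target, []))  # who depends on target?
    return dirty

# Hypothetical reverse edges: "base" is consumed by "lib",
# "lib" by "app" and "tool".
rdeps = {"base": ["lib"], "lib": ["app", "tool"], "app": [], "tool": []}
```

A change to a leaf like `app` dirties only `app` itself, while a change to `base` forces a re-sync of everything above it, which is why deep, widely-used targets are the expensive case.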

Then you dump that, for instance, into the Bazel cache or some other place where you can easily download and consume it. And another idea: server-side code actions, so some refactorings across the whole repository. Maybe you don’t want to load the whole monorepo into your IDE, but have a server that understands the repository and can apply a change in batch. As I understand it, Sourcegraph has some features like that.

Okay, so I’m through with what I wanted to say, pretty much. And now I have a confession: the true reason I’m giving this talk is not just to teach you, or scold you for breaking my IDE, but to learn a bit from you. So far I’ve given you my perspective. I’d like to hear yours even more: what kind of issues you’re having with monorepos, or IntelliJ and monorepos, how you’re working around these issues, what you would like to see. But also, of course, feel free to ask questions as well.