Monday 20 November 2017

Tooling for Monorepos

One argument against monorepos is that you need special tooling to make them work. This argument gets presented in a variety of ways, but the most frequent forms boil down to:


  1. Code size: a single repo would be too big for our source control system!
  2. Requirement for specialised tooling: we're happy with what we have!
  3. Reduces the ability of teams to move fast and independently
  4. Politics and fiefdoms


Let’s take each of these in turn.


Code size
Most teams these days are using some form of DVCS, with git being the most popular. Git was designed for use with the Linux kernel, so it initially scaled nicely for that use case, but it starts to get painful much beyond it. That means we start with some pretty generous limits: a fresh clone of the linux repo at depth 1 takes just shy of 1GB of code spread across over 60K files (here’s how they make it work!). Even without modifying stock git, Facebook was able to get their git repo up to 54GB (admittedly, with only 8GB of code). Microsoft have scaled Git to the entire Windows codebase: that’s 300GB spread across 3.5M files and hundreds of branches. Their git extensions are now coming to GitHub and non-Windows platforms.
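
To put a number like that in context, a shallow clone of the public kernel mirror is a single command (the exact size on disk will vary a little by platform and git version):

    git clone --depth 1 https://github.com/torvalds/linux.git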


Which is good news! Your source control system of choice can cope with the amount of code a monorepo contains. Hurrah!


But how long does that take to check out? I’ll be honest, checking out a repo that’s 1GB in size can take a while. That is, if you check out the whole 1GB. Git, Mercurial, Perforce, and Subversion all support “sparse” working copies, where you only clone the directories you need. The sparse checkout declarations can either be kept in files stored in source control, or they can be computed. They likely follow cell boundaries within the monorepo. It should be clear that in the ideal case, the end result is a working copy exactly the same size as a hand-crafted repository containing just what’s needed, and nothing more. As a developer moves from project to project, or area to area, they can expand or contract their current clone to exactly match their needs.
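
As a sketch of what that looks like with a reasonably recent Git (2.25 or newer), using made-up repository, branch, and directory names (and a server that supports partial clone, for the blob filter):

    # clone the history without checking out any files
    git clone --no-checkout --filter=blob:none https://example.com/monorepo.git
    cd monorepo
    # enable sparse checkout in "cone" mode and pick just the cells you need
    git sparse-checkout init --cone
    git sparse-checkout set services/search tools/build
    git checkout main
    # later, expand the working copy as your work moves around
    git sparse-checkout add libs/awesome-lib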


So your checkouts don’t necessarily get larger. They may even get smaller.


But, what if you do have everything checked out? Your source control tool needs to know which files have changed, and as the size of the repository grows these operations become slower, impacting developer productivity. Except both Git and Mercurial have support for filesystem-watching daemons (notably “watchman”). These allow file-checking operations to scale linearly with the number of files changed, rather than with the number of files in the repository (I’d hope that even those using a “normal” large checkout would consider using this).
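
Enabling this is typically a small configuration change. A sketch, assuming watchman is already installed (the exact knobs depend on your Git or Mercurial version):

    # Mercurial: enable the fsmonitor extension in .hg/hgrc (or ~/.hgrc)
    [extensions]
    fsmonitor =

    # Git 2.37+ ships a built-in filesystem monitor; older versions can instead
    # point core.fsmonitor at the sample fsmonitor-watchman hook
    git config core.fsmonitor true
    git config core.untrackedCache true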


So everything is fine with the raw tooling. But what about your IDE?


I mean, yeah, if you’ve checked out the entire source tree, surely your IDE will grind to a halt? First of all, don’t do that (use a sparse clone!), but if you insist on doing it, update your tooling. Facebook spent a chunk of resources helping make IntelliJ more efficient when dealing with large projects, and upstreamed those changes to JetBrains, who accepted the patches. It was possible to pull in the source code for every Facebook Android app at the same time in IntelliJ. You may have a lot of code, but it’s unlikely to be that much. Other editors can also happily work with large source trees.


So, code size isn’t the problem you might imagine it is.


Requirement for specialised tooling


Quite often when people talk about monorepos, they also talk about the exotic tooling they use, from custom build systems and tricked-out source control servers to custom CI infrastructure. Perhaps a giant company has the time and resources to build all that, but you’re too busy doing your own work.


Except a monorepo doesn’t require you to do any of those things. Want to use a recursive build tool you’re already familiar with? Go ahead. Paul Hammant has done some interesting work demonstrating how it’s possible to use maven (and, by extension, gradle and make) in a monorepo.
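
A hypothetical layout makes the point: each project keeps the build it already has, and nothing about sharing a repository forces them to share a build system.

    monorepo/
      libs/
        awesome-lib/
          pom.xml
      services/
        search/
          pom.xml
      apps/
        mobile-api/
          build.gradle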


Switching to a build tool such as buck or bazel does make using a monorepo simpler, because these tools provide mechanisms to query the build graph and can easily be configured to mark various parts of the tree as visible (or not) to particular rules, but using one of them isn’t required. One nice thing? You don’t need to write buck or bazel yourself: they’re both already out there and available for you to use.


Similarly, if you’re comfy with jenkins or travis, continue using them. Admittedly, you’ll need to configure the CI builds to watch not just a repo, but a subdirectory within a repo, but that’s not too hard to do. If you’re using a graph-based build tool, then you can even use jenkins or buildbot to identify the minimal set of items to rebuild and test, but, again, there’s no need to do that. Just keep on trucking the way you do now.
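
As an example of the kind of configuration involved, a declarative Jenkins pipeline can be told to run a stage only when files under a particular directory change (the paths and build command here are invented):

    // Jenkinsfile for the hypothetical services/search cell
    pipeline {
        agent any
        stages {
            stage('Build search service') {
                // only trigger when something under this cell changed
                when { changeset "services/search/**" }
                steps {
                    dir('services/search') {
                        sh './gradlew build'
                    }
                }
            }
        }
    }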


Reduces the ability of teams to move fast and independently


Having a repository per project or per team allows each team to operate entirely independently of the others. Except that’s not true unless you’re writing every single line of code yourself. It’s likely you have at least a few first- and third-party dependencies. At some point, those dependencies really should be updated. Having your own repo means that you can pick the timing, but it also means you have to do the work.


Monorepos naturally lead people to minimise the number of versions of third-party dependencies, pushing towards just one, if only to avoid nasty diamond dependency issues, but there’s no technical reason why there can’t be more than one version of a library in the tree. Of course, only a narcissist would check in a new version of a library without making an effort to remove the old ones. There are a pile of ways to do this, but my preferred way is to say that the person wanting the update manages the update, and asks for help from the teams that are impacted by the change. I’ll cover the process in a later post. No matter how it’s done, making the change as a single atomic commit amortises its cost across every affected project, reducing the cost of software development across the entire organisation by front-loading the cost of making the change.


But perhaps it’s not dependencies where you enjoy the freedom. Perhaps it’s the choice of language and tooling? There’s no reason a properly organised monorepo can’t support multiple languages (pioneers such as Google and Facebook have mixed-language repos). Reducing the number of choices may be an organisation-level goal, in order to allow individuals to cycle quickly and easily between teams (which is why we have code style guidelines, right?), but there’s nothing about using a monorepo that prevents you from using many different toolchains.


As a concrete example of this, consider Mozilla. They’re a remote-first, distributed team of iconoclasts and lovely folks (the two aren’t mutually exclusive :) ). Mozilla-central houses a huge amount of code, from the browser, through extensions, to testing tools and a subset of the web-platform-tests. A host of different languages are used within that tree, including Python, C/C++, Rust, JavaScript, Java, and Go, and I’m sure there are others too. Each team has picked what’s most appropriate and run with it.


Politics and fiefdoms


There’s no getting away from politics and fiefdoms. Sorry folks. Uber have stated that one of the reasons they prefer many separate repositories is to help reduce the amount of politics. However, hiding from things is seldom the best way to deal with them, and the technical benefits of using a monorepo can be compelling, as Uber have found.


If an organisation enthusiastically embraces the concept of collective code ownership, it’s possible to rely on nothing more than purely social constructs to prevent egos being bruised and fiefdoms being encroached upon. The only gateways to contribution become the technical ones put in place to ensure code quality, such as code review.


Sadly, not many companies embrace collective code ownership to that extent. The next logical step is to apply something like GitHub’s “code owners”, where owners are notified of changes, ideally before they are committed (using post-commit hooks for after-the-fact notification isn’t as effective). A step further along, OWNERS files (as seen in Chromium’s source tree) list the individuals and team aliases whose permission is required to land code.
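
For instance, a GitHub CODEOWNERS file maps paths to the people or teams who are asked to review changes under them (the names here are made up):

    # .github/CODEOWNERS -- matching paths automatically request these reviewers
    /libs/awesome-lib/    @example-org/platform-team
    /services/search/     @example-org/search-team @jane-doe

A Chromium-style OWNERS file placed inside a directory serves a similar purpose, listing the accounts allowed to approve changes in that part of the tree.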


If there is really strong ownership of code, then your source control system may be able to help. For example, Perforce allows protection levels to be set for individual directories within a tree, and pre-commit hooks can be used for a similar purpose with other source control systems.
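
As a sketch, an excerpt from a Perforce protections table (edited with p4 protect) might grant one group write access to its own corner of the tree while leaving the rest read-only (the group and depot paths are invented):

    read   user   *            *   //depot/monorepo/...
    write  group  search-team  *   //depot/monorepo/services/search/...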

Getting the most out of a monorepo


Having said that you don't need to change much to start using a monorepo, there are patterns that allow one to be used efficiently. These suggestions can also be applied to any large code repository: after all, as Chris Stevenson said, “any sufficiently complicated developer workspace contains an ad-hoc, informally specified, bug-ridden implementation of half a monorepo”.


Although it’s entirely possible to use recursive build tools with a monorepo (early versions of Google’s monorepo still used make), moving to a graph-based build tool is one of the best ways to take advantage of a monorepo.


The first reason is simply logistical. The two major graph-based build tools (Buck and Bazel) both support the concept of “visibility”. This makes it possible to segment the tree, marking public-facing APIs as such, whilst allowing teams to limit who can see the implementations. Who can depend on a particular target is defined by the target itself, not by its consumers, preventing uncontrolled growth in access to internal details. An OOP developer is already familiar with the concept of visibility, and the same ideas apply, scaled out to the entire tree of code.
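
A sketch of how that looks in Bazel BUILD files (the package and target names are invented):

    # services/search/internal/BUILD
    java_library(
        name = "index_impl",
        srcs = glob(["*.java"]),
        # only the search service's public API package may depend on the implementation
        visibility = ["//services/search/api:__pkg__"],
    )

    # services/search/api/BUILD
    java_library(
        name = "api",
        srcs = glob(["*.java"]),
        deps = ["//services/search/internal:index_impl"],
        # anyone in the tree may depend on the public API
        visibility = ["//visibility:public"],
    )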


The second reason is practical. The graph-based build tools frequently have a query language that can be used to quickly identify targets given certain criteria. One of those criteria might be “given this file has changed, identify the targets that need to be rebuilt”. This simplifies the process of building a sensible, scalable CI system from building blocks such as buildbot or GoCD.
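
A sketch of that kind of query with Bazel (the file and package names are invented):

    # which targets, transitively, depend on this changed file and so need rebuilding?
    bazel query 'rdeps(//..., //services/search/api:Search.java)'
    # narrow that down to just the tests that should be re-run
    bazel query 'tests(rdeps(//..., //services/search/api:Search.java))'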


Another pattern that’s important for any repository with many developers hacking on it simultaneously is having a mechanism to serialise commits to the tree. Facebook have spoken about this publicly, and do so with their own tooling, but something like gerrit, or even a continuous build, could handle this. Within a monorepo, this tooling doesn’t need to be in place from the very beginning, and may never be needed, but be aware that it eases the problem of commits failing to land in areas of high churn.


A final piece in the tooling puzzle is to have a continuous build tool that’s capable of watching individual directories rather than the entire repository. Alternatively, using a graph-based build tool allows a continuous build that watches the entire repository to at least target the minimal set of targets that need rebuilding. Of course, it’s entirely possible to place the continuous build before the tooling that serialises the commits, so you always have a green HEAD of master….


Thanks

My thanks to Nathan Fisher, Josh Graham, Paul Hammant, Will Robertson, and Chris Stevenson for their feedback and comments while writing and editing this post. Without their help, this would have rambled across many thousands of words.