Monday 26 November 2018

Many Months in Selenium: to November

Well, it's been a long time since I sat down and wrote a post about the adventures in Selenium-land. Time for an update!

Since I last wrote, the work of updating the JSON handling code in the java tree has been completed, and it appears to be stable. However, it would be a terrible waste of time if that was all that we had done, and fortunately it's not :)

The big thing is that we've now finished the 3.x release cycle, and we're getting ready for 4.0. Someone foolishly let me pick version numbers, so the last few releases of 3 tended to get ever closer to π. Judging by some of the bug reports, the initial jump from 3.14 to 3.141 appears to have confused some folk, but now that we've reached 3.141.59 I think the point has been made (and, just maybe, the joke is wearing thin).

The 4.0 release is going to be a lot of fun. My main focus has been the new Selenium Grid, which features a more modern design for use with things such as AWS and Kubernetes (and docker compose). Of course, maintaining a simple user experience has been high on our list of goals, so people used to the existing approach of "hub" and "node" will continue to be able to run the system like that. The biggest change is that, under the covers, the new standalone server and the new Grid are exactly the same software, which is a huge change from the current approach where we have two not-terribly-well-integrated codebases in the same binary.

Another big change is that we're exploring moving much of Selenium's build from Buck to Bazel. This isn't something I've been keen on doing, since I used to be the tech lead on Buck, and I think it has a number of useful properties. Despite this, Bazel has a comparatively huge amount of community support, which means that people wanting to hack on the project have a shallower learning curve to climb.

Saturday 31 March 2018

A Month in Selenium: March

The month from February to March has been a fun one. At the beginning of March, I attended SauceCon, and gave a keynote on "Lessons From a Decade in Selenium". While the original talk had focused on milestones such as when we first started shipping code, or when we switched to Git, or when someone joined the project, as I sat in the airport waiting to fly, I realised that this was an incredibly dull talk; surely the point of a keynote is to give people something to think about and consider?

This explains why I was busy rewriting the entire thing at 12km above the ground in a metal tube zipping along at a smidge over 900km/h.

In the end, I spoke about what makes working on Selenium so rewarding, focusing on the themes of "Joy", "Serendipity", "Thankfulness", "Community", "Growth", and "Striving". I've yet to see the official feedback, but I believe that the talk was well received, as people kept returning to the main themes throughout the conference.

SauceCon itself was a lot of fun. We were lucky to have some of the Selenium committers (old and new), and also supporters of the project from companies such as Sauce Labs itself, and Applitools (who are providing almost all the effort going into the new Selenium IDE). In addition, the Appium developers were well represented too. It was great to be surrounded by so many people who have spent so much time pouring energy into Open Source Software, and to catch up with some of my favourite people. There's a lovely photo of Jonathan Lipps and myself in matching bowling shirts, which I'm happy to see he tweeted.

Since we had so many people in the same place, we decided to release Selenium 3.10. The main highlights of this release were behind the scenes for most users, as we focused on the continued clean-up of the internals of Grid, and the continued use of our own abstractions to handle HTTP and JSON. Having said this, there were user-supplied patches, notably one moving us from Selenium's own "Duration" class to the one that ships with the JRE. Deleting code is a lot of fun.

One reason for shipping 3.10 was to lay the groundwork for a terrible dad-joke: releasing Selenium 3.11 on the 11th March (3/11 in US date format). Jim Evans and I had noticed that 3.11 was also one of the most famous of the Windows releases, so we decided to lean into the joke, and shipped "Selenium for Workgroups" as well in March. The Selenium server even reports this to users. 3.12 won't have this feature.

In a bid to help our Windows developers ship the Selenium jars, I merged a ton of upstream changes into our fork of Buck, and then spent some time attempting to resolve the issue where zips created on Windows create unreadable directories when unpacked. My fix didn't resolve the issue, so I filed an issue with the upstream Buck project in the hope that they'd fix it for me. If I get some free time, I'll take another run at it myself.

I've also been working on replacing GSON within our tree (though only on my local machine so far). By the end of the month, I had a forked version of Selenium that didn't use GSON at all for outputting JSON. Sadly, I was a little over-ambitious in attempting to finish the work by also deserialising from JSON to proper types. It turns out that there's a bunch of code in Grid that relies on the current semantics of GSON to function. Stepping back, it looks like most of this is because GSON isn't aware of our own types, and it should be relatively easy to replace some of it. At least I know what I should be working on next….

Well, that, and a new way of starting sessions that allows users to properly make use of all the features that the W3C New Session command offers.

Tuesday 13 March 2018

A Month in Selenium: February

January was a quiet month for Selenium hacking, but it laid the groundwork for February's efforts. These largely centred around code cleanup in the Grid server, and migrating the project to make better use of our own abstractions over JSON and HTTP.

Why do we have our own abstractions for these incredibly common tasks? There are two main reasons. The first is that we'd like the freedom to choose our underlying implementation for these things without needing to extensively rework our own first-party code. The second is that third party libraries offer generalised APIs that need to meet the needs of all users, whereas we have very specific needs, and may need to work around some of the libraries' sharp edges by writing adapters (for example, in the java code, lots of classes that need to be serialised to JSON have a toJson method that GSON knows absolutely nothing about).
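To make that concrete, here is a minimal sketch of the idea --- not Selenium's actual implementation, just an illustration of a first-party abstraction that honours a class's own toJson method and only falls back to a general-purpose library when one isn't present:

    import com.google.gson.Gson;
    import java.lang.reflect.Method;

    // A sketch of a first-party entry point for serialisation. If the object
    // knows how to convert itself to JSON, use that; otherwise delegate to GSON.
    class Json {
      private static final Gson GSON = new Gson();

      public static String write(Object toConvert) {
        try {
          Method toJson = toConvert.getClass().getMethod("toJson");
          Object result = toJson.invoke(toConvert);
          // The method might return a String, a Map, or something else entirely;
          // serialise whatever came back.
          return result instanceof String ? (String) result : GSON.toJson(result);
        } catch (NoSuchMethodException e) {
          // No toJson method: let the library do its usual reflective magic.
          return GSON.toJson(toConvert);
        } catch (ReflectiveOperationException e) {
          throw new RuntimeException("Unable to convert to JSON: " + toConvert, e);
        }
      }
    }

Swapping GSON for another library then only means changing this one class, rather than touching every call site.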

We started using the Apache HttpClient by default as it's the HTTP library used by HtmlUnit, which we used to ship as part of the core Selenium distribution. In keeping with the other drivers out there, the HtmlUnit team now work on the HtmlUnitDriver, so it's no longer kept in the main project source repo. The interesting thing is that since we made the choice a long, long time ago to use the HttpClient, the HTTP standard has moved forward. HTTP/2 is now a thing. HTTP/2 support is coming as part of HttpClient 5. In order to take advantage of the new options and capabilities, we'd have to rework our existing abstractions anyway, so why not take a look around for something else to use? Better yet, if we use an HTTP library that isn't a dependency of one of our dependencies, we're less likely to end up with clashing versions.

One of the reasons that Java has a terrible reputation for start-up speed is that people have massively bloated classpaths. As it stands, the Selenium standalone server weighs in at a portly 24MB. The Apache HttpClient accounts for about 1.4MB of this total before the update; after the update, the beta of HttpClient 5 is a touch under 1MB. In comparison, OkHttp (which already supports HTTP/2) and its dependencies come to approximately 500KB. In other words, OkHttp is smaller, already supports HTTP/2, and isn't a dependency of one of our dependencies.

So, we switched the project to use OkHttp instead of the Apache HttpClient.
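For a flavour of what the client side of our abstraction can look like when backed by OkHttp, here is a heavily simplified sketch. The class and method names are my own stand-ins rather than Selenium's real HttpClient interface:

    import java.io.IOException;
    import okhttp3.MediaType;
    import okhttp3.OkHttpClient;
    import okhttp3.Request;
    import okhttp3.RequestBody;
    import okhttp3.Response;

    // A stripped-down HTTP abstraction backed by OkHttp. The real interface is
    // richer than this, but the shape is the same: callers never see OkHttp types.
    class OkHttpBackedClient {
      private static final MediaType JSON = MediaType.parse("application/json; charset=utf-8");

      private final OkHttpClient client = new OkHttpClient();
      private final String baseUrl;

      OkHttpBackedClient(String baseUrl) {
        this.baseUrl = baseUrl;
      }

      String post(String path, String jsonPayload) throws IOException {
        Request request = new Request.Builder()
            .url(baseUrl + path)
            .post(RequestBody.create(JSON, jsonPayload))
            .build();
        try (Response response = client.newCall(request).execute()) {
          return response.body().string();
        }
      }
    }

Because nothing outside a class like this mentions OkHttp, switching the underlying library again later is a change to one file, not to every caller.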

Within the client code, making this change was relatively trivial. The problem is that the server-side code had leaked Apache's APIs into the code. Before we can replace the Apache HttpClient, we need to first of all replace all those usages. That's made somewhat harder by the fact that it's exposed as part of the public APIs of various classes that other libraries extend.

Fortunately, we have a process for deprecating and deleting APIs. First of all, we mark the methods to be deleted as "deprecated" for at least one release, and then we delete them. Of course, if you're going to deprecate a method, you really should provide an alternative and migrate as many uses as can be found over to the replacement. The bulk of my work this month was spent making these changes.
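As an illustration of the pattern (this class is invented for the example; it isn't actual Selenium code), the deprecated method delegates to its replacement so callers can migrate before the old method is deleted:

    import java.time.Duration;

    // An invented class showing the deprecate-then-delete pattern.
    class SessionTimeouts {
      /**
       * @deprecated Use {@link #getTimeout()} instead. This method will be
       *     removed after it has been deprecated for at least one release.
       */
      @Deprecated
      public long getTimeoutInMillis() {
        return getTimeout().toMillis();
      }

      // The replacement, expressed using the JRE's own Duration type.
      public Duration getTimeout() {
        return Duration.ofSeconds(30);
      }
    }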

Of course, we needed to do a release, so we lined up 3.9 to start the process. In order to do the release, we needed to actually build it. There had been reports of some issues building the release artefacts on Windows. To resolve this, I had to update our fork of Buck to pull in the latest changes from Facebook, and then try to work around those issues. Naturally, the Buck developers aren't aware of our fork, so merging in their changes was a somewhat time-consuming affair. Once that was done, I wrote what I thought was a fix and pushed a new version of our fork of Buck.

It didn't work. Oh well.

The final step in doing a release is trying to get our CI builds green. These take an incredible amount of time to run, and I wondered whether we could speed them up. Travis has support for caching, so it would be nice to use that. My attempts were foiled because the cache takes environment variables into account, and we use those to separate our builds. There's a request open in the Travis tracker to allow builds to be named, which would let us work around this, but it's still open. Ho hum. As a workaround, I wrote a simple wrapper around Buck that we can call from our CI servers. This makes better use of Buck's ability to parallelise work automatically, and it has helped bring our build times down. Hurrah!

Sunday 21 January 2018

Two Months in Selenium - November and December

You may have noticed a distinct lack of an update last month. It's because I was focused on client work, Christmas, and the New Year, and took some time away from the keyboard. But I'm back now!

The W3C WebDriver spec is now at the stage where we need to demonstrate multiple compatible implementations. Realistically, this means that we need each test in our test suite to pass against two implementations. The browser vendors are working hard to get things working, and progress is being made. There's not been a huge amount for me to do here, so this is more of a waiting game than anything else from my perspective. Having said that, I'm on the hook for some sections in Level 2, so I should really sit down and write those (and the matching tests).

The main thing I've been focused on has been the Selenium Grid. There are a couple of things that we really need to solve with Grid. The first is that the code is complex and hard to deal with. When we originally released it, it took a huge amount of work to review the code for thread-safety and to debug many of the issues. That code has not become easier to reason about, which makes it harder to foster Open Source contributions.

Of course, that'd be fine if we didn't care about making any changes, but we do. When Grid came into being, it was normal to have a physical server for each node in the grid. If you were lucky, you might have a massive server with VMWare running on it, which you'd cycle virtual machines on to keep the Grid healthy. The world has changed. Docker is now A Thing, and there are multiple "Selenium as a Service" (SaaS) cloud providers.

There are some projects out there that implement some of the functionality of Grid. For example, selenoid makes use of Docker, but it doesn't use the W3C dialect of the webdriver protocol, which means it doesn't do protocol conversion, and it doesn't natively support cloud providers. Zalenium builds on top of Grid, and provides support for Docker and SaaS, but they've had to work within the existing architecture, and there are obvious rough edges.

Finally, we've wanted the selenium server to be a "Grid of one". If you go into the code of the server, you'll see that there are two fairly separate trees that live side-by-side. When you start the server, it picks one and then goes with it. That has made things like supporting the W3C protocol harder than they should be, and it's not an elegant way to run things.

As a solution to this, it seems obvious that there should only be one code path. The problem is that the standalone server is too simplistic about how it assigns work, and (as discussed) the Grid code is too complex. Over the past few releases, I've been landing code to help resolve this:


  • The pass-through mode: this makes the server proxy requests without doing any parsing or modification unless necessary.
  • The ActiveSession abstraction has been added. This makes new types of provider (SaaS, Docker) far easier to write.
  • A "new session pipeline" has been added, and this is being used to handle things like multiple versions of the webdriver protocol.
The most recent thing I've been working on has been a new scheduler. This will be rolled into the new session pipeline, and is composed of a number of pieces:
  • The "Scheduler": this is responsible for queuing new session requests, handling retries, and waiting until nodes become available. This class is thread-safe and designed to be the main entry point.
  • The "Distributor", which is solely responsible for ranking and ordering available hosts, and the sessions that can run on them.
  • The "Host" abstraction, which represents a physical place where sessions can be run. Each of these has a number of....
  • The "Session Factory", which is responsible for creating a new session.
The scheduler will sit within the new session pipeline. For the standalone server, we just add a single host. For the grid, we can add an arbitrary number of hosts (and therefore session factories).
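To give a rough feel for how those pieces hang together, here is a sketch of the abstractions as Java interfaces. The names match the prose above, but the signatures (and the NewSessionRequest placeholder) are my own illustration rather than the actual code:

    import java.util.List;
    import java.util.Optional;

    // Placeholders for the sketch.
    interface NewSessionRequest {}
    interface ActiveSession {}

    // Thread-safe front door: queues new session requests, handles retries,
    // and waits until a node becomes available.
    interface Scheduler {
      ActiveSession createSession(NewSessionRequest request);
    }

    // Solely responsible for ranking and ordering the available hosts, and
    // the sessions that can run on them.
    interface Distributor {
      List<Host> rank(NewSessionRequest request);
    }

    // A physical place where sessions can be run. Each host owns a number
    // of session factories.
    interface Host {
      List<SessionFactory> getSessionFactories();
      boolean hasCapacity();
    }

    // Knows how to create one particular kind of session (a local browser,
    // a Docker container, a SaaS provider, and so on).
    interface SessionFactory {
      Optional<ActiveSession> createSession(NewSessionRequest request);
    }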

As well as the new scheduler, we're preparing the 3.9 release. It should be out next week, if everything goes according to plan. :)

Monday 15 January 2018

The Selenium Server & Creating New Sessions

I've had the pleasure of being a co-editor of the W3C's WebDriver spec, as well as the original author and one of the current maintainers of Selenium's Java bindings, and one of the main authors of the current Selenium Server, particularly the pieces to do with implementing the W3C spec. So, as one of the few people on the planet who knows how all the pieces fit together, and why they fit together that way, I thought it might be helpful to explain how and why the Selenium Server handles a request to create a new session.

For this discussion, I'll use the terminology from the spec. A "remote end" refers to the Selenium Server, and a "local end" refers to the language bindings you're probably familiar with --- there are some in all the major programming languages, and about a million of them in the JS space too.

First of all, it's advisable for the local end to send a single request for a new session that includes the expected payloads for both the W3C and the JSON Wire Protocol dialects at the same time.

Consider the case where you just send the w3c payload ({"capabilities": {"browserName": "chrome"}}). In this case, a w3c server would correctly attempt to start a chrome session. However, a server that only obeys the JSON Wire Protocol will see an empty payload, in which case it's free to do whatever it wants.

Sending just the JSON Wire Protocol payload ({"desiredCapabilities": {"browserName": "firefox"}}) will create a firefox session in a server that understands the JSON Wire Protocol, but will cause a "no session created" error in a W3C compliant server (since that expects at least {"capabilities": {}} to be set).
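This is why sending both dialects in a single request is the safe option: each flavour of server finds the payload it understands and ignores the rest. A combined request for a firefox session might look something like this (the capabilities themselves are just an example):

    {
      "capabilities": {"browserName": "firefox"},
      "desiredCapabilities": {"browserName": "firefox"}
    }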

So, we have the expected, legal behaviour of the remote end laid out.

For historical reasons, most bindings only accept a "desired capabilities" hash as the argument when creating a new driver instance. Converting the old-style payloads to legal W3C ones is a non-trivial exercise (for example, {"firefox_profile": "sdfgh"} is now {"moz:firefoxOptions": {"profile": "sdfgh"}}, but what happens if both are set? Also, "platform" has become "platformName", but do the values match? Probably only at the OS family level, according to the note in the spec).

Most local end bindings get this mapping wrong, but the user doesn't care why their session isn't as they'd expected it to be; they just know it's not right. What to do? What, my friends, do we do?

The answer is to be generous about what we receive from the user and attempt to do what they want. Knowing that most local ends have at least a few problems converting the old format to the new format, the selenium server creates an ordered list of capabilities to try, putting the OSS ones at the front of the list to ensure maximum compatibility.
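A much-simplified sketch of that ordering follows. The class, method, and parameter names are invented for the illustration, and the real merging rules for W3C capabilities are stricter than simply overlaying maps:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class CapabilityOrdering {
      // Build the list of capabilities to attempt, in order: the OSS
      // "desiredCapabilities" first for maximum compatibility, then each
      // W3C "firstMatch" candidate combined with "alwaysMatch".
      static List<Map<String, Object>> capabilitiesToTry(
          Map<String, Object> ossCaps,
          Map<String, Object> alwaysMatch,
          List<Map<String, Object>> firstMatch) {
        List<Map<String, Object>> toTry = new ArrayList<>();
        if (ossCaps != null && !ossCaps.isEmpty()) {
          toTry.add(ossCaps);
        }
        for (Map<String, Object> candidate : firstMatch) {
          Map<String, Object> merged = new HashMap<>(alwaysMatch);
          merged.putAll(candidate);
          toTry.add(merged);
        }
        return toTry;
      }
    }

The server can then walk a list like this, attempting to start a session with each entry until one succeeds.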

So, now you know.

Why Use a Monorepo?

A monorepo helps reduce the cost of software development. It does this in three different ways: by being simpler to use, by providing better discoverability, and by allowing atomicity of updates. Taking each of these in turn….

Simplicity

In the ideal world, all you’d need to do is clone your software repository, do a build, make an edit, put up a pull request, and then repeat the last three steps endlessly. Your CI system would watch the repository (possibly a directory or two within it), and kick off builds as necessary. Anything more is adding overhead and cost to the process.

That overhead starts being introduced when multiple separate codebases need to be coordinated in some way. Perhaps there's a protocol definition file that needs to be shared by more than one project. Perhaps there's utility code that's shared between more than one project.

In many organisations developers may not have the ability to set up a repo on demand, so there’s a time and political cost in creating one. Then there’s the ongoing cost of maintaining them, backing them up, and so on. Especially if data is being duplicated between repositories, the aggregate total space used by these repos will also be larger.

Multiple repositories are not necessarily “simple”.

One straw man solution to the problems of coordination is to copy all required dependencies into your own repo, but then we have a huge pile of duplicated code that opens up the possibility of parallel but incompatible changes being made at the same time.

A better solution is to build binary artefacts that are stored in some central location, and grab those when required. Bad experiences with storing binaries in the VCS make many people shy of just checking in the artefacts, so this storage solution seems attractive. But the alternatives introduce complexity. Where previously we only had to worry about maintaining the uptime of the source control system, there’s now the additional cost of maintaining this binary datastore, and ensuring its uptime too. Worse, in order to preserve historical builds, the binary datastore needs to be immutable after a write. In my experience, rather than being a directory served using nginx or similar, people turn to commercial solutions even when free alternatives are available. The cost of building and running this infrastructure raises the total cost of development.

Another area where monorepos bring simplicity is when a package or library needs to be extracted from existing code. This process is simple in a monorepo: just create the new directories, possibly after asking permission from someone, and check in. Every other user receives that change with their next update, without needing to re-run tooling to ensure that their patchwork clients are up to date. Outside of a monorepo, the process can be more painful, especially if a new repository is needed for the freshly extracted code.

Identifying every place that is impacted by such a code change is also easy in a monorepo, even if you're not using a graph-based build tool such as bazel or buck, but doing something like "maven in a monorepo". The graph-based build tools typically have a mechanism to query the build graph, but if the tree is in one place and you don't have code-insight tools, then even "grep" can get you quite far.

There are arguments about monorepos stressing source control software, requiring different tool chains, or not being compatible with CI systems. I addressed those concerns in an earlier post, but the TL;DR is “modern DVCS systems can cope with the large repos, you don’t need to change how you build code, and your CI pipelines can be left essentially ‘as is’.”

 

Discoverability

One of the ways that monorepos drive down the cost of software development is by reducing duplication of effort.

It’s a truism that the best code is the code that is never written. Every line of code that’s written imposes an ongoing cost of maintenance that needs to be paid until the code is retired from production (at the very earliest!). Fortunately, a good software engineer is a lazy software engineer --- if they’re aware of a library or utility that can be used, they’ll use that.

In order to function properly, a monorepo needs to be structured to ease discoverability of reusable components, as covered in the post about organising a monorepo. One of the key supporting mechanisms is to separate the tree into functional areas. However, just because a monorepo is structured to aid discoverability, it doesn’t do anything to prevent “spaghetti dependencies” from appearing. What it does do is help surface these dependencies, which would exist in any case, without fancy additional tooling.

Naturally, a monorepo isn’t the only way of solving the problem of discovering code. Good code insight tooling can fill the same role (go Kythe!), as do central directories where people can find the code repositories that house useful shared code. Even hearsay and guesswork can suffice; after all, the Java world has coped with Maven Central for an incredibly long time.

Discovering code has other benefits. As a concrete example, it becomes possible to accurately scope the size of refactorings to APIs within an organisation: simply traverse the graph of everything impacted by a change, and make the change. What used to be a finger-in-the-air guess, or would require coordination across multiple repositories, becomes a far simpler exercise to measure. To actually perform the change? Well, there’s still politics to deal with. Nothing stops that.

Being able to identify all the locations that are impacted by any change makes CI infrastructure easier to write and maintain. After all, we use CI to answer the questions "is our code safe to deploy? And if not, why is it not safe?" In a monorepo, the graph of dependencies is easier to discover, and that graph can (and should!) be used to drive minimally-sized but provably correct builds, running all the necessary builds and tests and not a single thing more. Needless to say, this means that less work is done by the CI servers, meaning tighter feedback loops and faster progress. Do you need a monorepo to build this graph? Of course not. Is building that infrastructure to replicate this something you've time to do? Probably not.

There is also nothing about using a monorepo that precludes putting useful metadata into the tree at appropriate points. Individual parts of the tree can include license information (particularly when importing third party dependencies), or READMEs that provide human-readable information about the purpose of a directory or package, and where to go for help. However, the need for some of this metadata (“how do I get the dependencies?”, “what’s the purpose of this package?”) can be significantly reduced by structuring the monorepo in a meaningful way.

 

Atomicity

Occasionally there are components that need to be shared between different parts of the system. Examples include IDL files, protobuf definitions, and other items that can be used to generate code, or must exist as a shared component between client and server.

Now, there’s reams to be written about how to actually manage updating message definitions in a world where there might be more than one version of that protocol in the wild, and having a monorepo doesn’t prevent you from needing to follow those rules and suggestions. What a monorepo allows is a definitive answer to the question of where these shared items should be. Traditionally, the answer has been:
  • In one of the clients
  • In the server
  • In a central location, referenced by everything
  • Gadzooks, we’ll copy the damn thing everywhere
Needless to say, the last approach is remarkably painful, since all changes to the definitions need to be tracked across all repositories. In the first two cases, you may end up with unwanted dependencies on either client or server-side code. So the seemingly sensible thing to do is to store the shared item in a repository of its own, which leads you straight back to the horror of juggling multiple repositories, or, if you're lucky, taking a dependency on a pre-built binary that someone else is responsible for building.

Interesting things happen when the shared item needs to be updated. Who is responsible for propagating the changes? Without a requirement to update, teams seldom update dependencies, so there’s out-of-band communication that needs to happen to enforce updates.

Using a monorepo resolves the problem. There's one place to store the definition, everyone can depend on it as necessary, and updates happen atomically across the entire codebase (though it may take a long time for those changes to be reflected in production). The same logic applies to making small refactorings --- the problem is easy to scope, and completion can be handled by an individual working alone.

 

Summary

Monorepos can reduce the cost of software development. They’re not a silver bullet, and they require an organisation to practice at least a minimal level of collective code ownership. The approach worked well at Google and Facebook because those companies fostered an attitude that the codebase was a shared resource, that anyone could contribute to and improve.

For a company which prevents people from viewing everything and having a global view of the source tree, for whatever reason (commercial? Social? Internally competing teams?), a monorepo is a non-starter. That's a pity, because there are considerable cost savings to be made as more and more teams share a monorepo. It's also possible to implement a monorepo where almost everything is public, with selected pieces being made available only as pre-compiled binaries or otherwise encrypted for most individuals.

Monorepos help reduce the cost of software over the lifetime of the code by simplifying the path to efficient CI, lowering the overhead of ensuring changes are propagated to dependent projects, and by reducing the effort required to extract new packages and components. As Will Robertson pointed out, they can also help reduce the cost of developing development support tooling by providing a single-point “API” to the VCS tool and the source code itself.

 

Complementary practices

Monorepos solve a whole host of problems, but, just as with any technical solution, there are tradeoffs to be made. Simply cargo-culting what Google, Facebook, or other public early adopters of the pattern have done won’t necessarily lead you to success. On the flip side of the coin, sticking with “business as usual” within a monorepo may not work either.

Although complex branching strategies might work in a monorepo, the sheer number of moving pieces means that the opportunity for merge conflicts increases dramatically. One of the practices that should be most strongly considered is adopting Trunk Based Development. This also suggests that developers work on short-lived feature branches, hiding work in progress behind feature flags.

Software development is a social activity. Merging many small commits without describing the logical change going in makes the shared resource of the repo's logs harder to understand. This points towards a model that is less common than it used to be --- squashing the individual steps that lead to a logical change into a single commit, which describes that logical change. This makes the commit log a useful resource too. Code review tools such as Phabricator help make this process simpler.

Most importantly: stop and think. It is unlikely your company is Google, Facebook, Twitter, Uber, or one of the other high-profile large companies that have already adopted monorepos (but if you’re reading from one of those places, “Hi!”). A monorepo makes a lot of sense, but simply aping the big beasts and hoping for the best won’t lead to happiness. Consider the advantages to your organisation for each step of the path towards a monorepo, and take those steps with your eyes open.

Thanks

Thank you to Nathan Fisher, Josh Graham, Paul Hammant, Felipe Lima, Dan North, Will Robertson, and Chris Stevenson for the suggestions and feedback while I was writing this post.

Wednesday 22 November 2017

Organising a Monorepo

How should a monorepo be organised? It only takes a moment to come up with many competing models, but the main ones to consider are “by language”, “by project”, “by functional area”, and “nix style”. Of course, it’s entirely possible to blend these approaches together. As an example, my preference is “primarily language-based, secondarily by functional area”, but perhaps I should explain the options.

Language-based monorepos
These repos contain a top-level directory per language. For languages that are typically organised into parallel test and source trees (I’m looking at you, Java) there might be two top-level directories.

Within the language specific tree, code is structured in a way that is unsurprising to “native speakers” of that language. For Java, that means a package structure based on fully-qualified domain names. For many other languages, it makes sense to have a directory per project or library.

Third party dependencies can either be stored within the language-specific directories, or in a separate top-level directory, segmented in the same language specific way.

This approach works well when there aren't too many languages in play. Organisation-wide standards, such as those present in Google, may limit the number of languages. Once there are too many languages, it becomes hard to determine where to start looking for the code you may want to depend on.

Project-based monorepos
One drawback with a language-based monorepo is that it’s increasingly common to use more than one language per project. Rather than spreading code across multiple locations, it’s nice to co-locate everything needed for a particular project in the same directory, with common code being stored “elsewhere”. In this model, therefore, there are multiple top-level directories representing each project.

The advantage with this approach is that creating a sparse checkout is incredibly simple: just clone the top-level directory that contains the project, et voila! Job done! It also makes removing dead code simple --- just delete the project directory once it’s no longer needed, and everything is gone. This same advantage means that it’s easy to export a cell as an Open Source project.

The disadvantage with project-based monorepos is that the top level can quickly become bloated as more and more projects are added. Worse, there's the question of what to do when projects are mostly retired, or have been refactored down to slivers of their former glory.

Functional area-based monorepos
A key advantage of monorepos is “discoverability”. It’s possible to organise a monorepo to enhance this, by grouping code into functional areas. For example, there might be a directory for “crypto” related code, another for “testing”, another for “networking” and so on. Now, when someone is looking for something they just need to consider the role it fulfills, and look at the tree to identify the target to depend on.

One way to make this approach fail miserably is to make extensive use of code names. "Loki" may seem like a cool project name (it's not), but I'll be damned if I can tell what it actually does without asking someone. Being developers, we need snazzy code names at all times, and by all means organise teams around those, but the output of those projects should be named in as vanilla a way as possible: the output of "loki" may be a "man in the middle ssl proxy", so stick that in "networking/ssl/proxy". Your source tree should be painted beige --- the least exciting colour in the world.

Another problem with the functional area-based monorepos is that considerable thought has to be put into their initial structure. Moving code around is possible (and possible atomically), but as the repo grows larger the structure tends to ossify, and considerable social pressure needs to be overcome to make those changes.

Nix-style monorepos
Nix is damn cool, and offers many capabilities that are desirable for a monorepo being run in a low-discipline (or high-individuality) engineering environment that can't manage to keep to (close to) a single version of each dependency. Specifically, a nix-based monorepo actively supports multiple versions of dependencies, with projects depending on specific versions, and making this clear in their build files.

This differs from a regular monorepo with a few alternate versions of dependencies that are particularly taxing to get onto a single version (*cough* ICU *cough*) because multiple versions of things are actively encouraged, and dependencies need to be more actively managed.

There are serious maintainability concerns when using the nix-style monorepo, especially for components that need to be shared between multiple projects. Clean up of unused cells, mechanisms for migrating projects as dependencies update, and stable and fast constraint solving all need to be in place. Without those, a nix-style monorepo will rapidly become an ungovernable mess.

The maintainability issue is enough to make this a particularly poor choice. Consider this the “anti-pattern” of monorepo organisation.

Blended monorepos
It’s unlikely that any monorepo would be purely organised along a single one of these lines; a hybrid approach is typically simpler to work with. These “blended monorepos” attempt to address the weaknesses of each approach with the strengths of another.

As an example, project-based monorepos rapidly develop a cluttered top-level directory. However, by splitting by functional area, or by language and then functional area, the top level becomes less cluttered and simultaneously easier to navigate.

For projects or dependencies that are primarily in one language, but with support libraries for other languages, take a case-by-case approach. For something like MySQL, it may make sense to just shovel everything into “c/database/mysql”, since the java library (for example) isn’t particularly large. For other tools, it may make more sense to separate the trees and stitch everything together using the build tool.

Third party dependencies
There is an interesting discussion to be had about where and how to store third party code. Do you take binary dependencies, or pull in the source? Do you store the third party code in a separate third party directory, or alongside first party code? Do you store the dependencies in your repository at all, or push them to something like a Maven artifact repository?

The danger when checking in the source is that it becomes very easy to accidentally start maintaining a fork of whichever dependency it is. After all, you find a bug, and it's sooo easy to fix it in place and then forget (or not be allowed) to upstream the fix. The advantage of checking in the source is that you can build from source, allowing you to optimise it along with the rest of the build. Depending on your build tool, it may be possible to only rely on those parts of the library that are actually necessary for your project.

Checking in the binary artifacts has the disadvantage that source control tools are seldom optimised for storing binaries, so any changes will cause the overall size of the repository to grow (though not the size of a snapshot at a single point in time). The advantage is that build times can be significantly shorter, as all that needs to be done is link the dependency in.

Binary dependencies pulled from third parties can be significantly easier to update. Tools such as maven, nuget, and cocoapods can describe a graph of dependencies, and these graphs can be reified by committing them to your monorepo (giving you stable, repeatable historical builds) or left where they lie and pulled in at build time. As one of the reviewers of this post pointed out, this requires the community the binaries are being pulled from to be well managed: releases must not be overwritten (which can be verified by simple hash checks), and snapshots should be avoided.

Putting labels on these, there are in-tree dependencies and externally managed dependencies, and both come in source and binary flavours.

Thanks

My thanks to Nathan Fisher, Josh Graham, Will Robertson, and Chris Stevenson for their feedback while writing this post. Some of the side conversations are worth a post all of their own!

Monday 20 November 2017

Tooling for Monorepos

One argument against monorepos is that you need special tooling to make them work. This argument commonly gets presented in a variety of ways, but the most frequent boil down to:


  1. Code size: a single repo would be too big for our source control system!
  2. Requirement for specialised tooling: we're happy with what we have!
  3. Reduces the ability of teams to move fast and independently
  4. Politics and fiefdoms


Let’s take each of these in turn.


Code size
Most teams these days are using some form of DVCS, with git being the most popular. Git was designed for use with the Linux kernel, so it initially scaled nicely for that use-case, but started to get painful beyond it. That means that we start with some pretty generous limits: a fresh clone of the linux repo at depth 1 takes just shy of 1GB of code spread across over 60K files (here's how they make it work!). Even without modifying stock git, Facebook was able to get their git repo up to 54GB (admittedly, with only 8GB of code). MS have scaled Git to the entire Windows codebase: that's 300GB spread across 3.5M files and hundreds of branches. Their git extensions are now coming to GitHub and non-Windows platforms.


Which is good news! Your source control system of choice can cope with the amount of code a monorepo contains. Hurrah!


But how long does that take to check out? I'll be honest, checking out a repo that's 1GB large can take a while. That is, if you check out the whole 1GB. Git, Mercurial, Perforce, and Subversion support "sparse" working copies, where you only clone those directories you need. The sparse checkout declarations can either be declared in files stored in source control, or they can be computed. They likely follow cell boundaries within the monorepo. It should be clear that in the ideal case, the end result is a working copy exactly the same size as a hand-crafted repository containing just what's needed, and nothing more. As a developer moves from project to project, or area to area, they can expand or contract their current clone to exactly match their needs.


So your checkouts don’t necessarily get larger. They may even get smaller.


But, what if you do have everything checked out? Your source control tool needs to know which files have changed. As the size of the repository grows, these operations become slower, impacting developer productivity. Except both Git and Mercurial have support for filesystem watching daemons (notably "watchman"). These allow file status operations to scale linearly with the number of files changed, rather than with the number of files in the repository (I'd hope that even those using a "normal" large checkout would consider using this).


So everything is fine with the raw tooling. But what about your IDE?


I mean, yeah, if you’ve checked out the entire source tree, surely your IDE will grind to a halt? First of all, don’t do that --- use a sparse clone --- but if you insist on doing it, update your tooling. Facebook spent a chunk of resources to help make IntelliJ more efficient when dealing with large projects, and upstreamed those changes to Jetbrains, who accepted the patches. It was possible to pull in the source code for every Facebook Android app at the same time in IntelliJ. You may have a lot of code, but it’s unlikely to be that much. Other editors can also happily work with large source trees.


So, code size isn’t the problem you might imagine it is.


Requirement for specialised tooling


Quite often when people talk about monorepos, they also talk about the exotic tooling they use: custom build systems, tricked-out source control servers, and custom CI infrastructure. Perhaps a giant company has the time and resources to build all that, but you're too busy doing your own work.


Except a monorepo doesn’t require you to do any of those things. Want to use a recursive build tool you’re already familiar with? Go ahead. Paul Hammant has done some interesting work demonstrating how it’s possible to use maven (and, by extension, gradle and make) in a monorepo.


Switching to a build tool such as buck or bazel does make using a monorepo simpler, because these tools provide mechanisms to query the build graph, and can be simply configured to mark various parts of the tree as being visible or not to particular rules, but using one of these isn’t required. One nice thing? You don’t need to write buck or bazel yourself --- they’re both already out there and available for you to use.


Similarly, if you’re comfy with jenkins or travis, continue using them. Admittedly, you’ll need to configure the CI builds to watch not just a repo, but a subdirectory within a repo, but that’s not too hard to do. If you’re using a graph-based build tool, then you can even use jenkins or buildbot to identify the minimal set of items to rebuild and test, but, again, there’s no need to do that. Just keep on trucking the way you do now.


Reduces the ability of teams to move fast and independently


Having a repository per project or per team allows teams to operate entirely independently of one another. Except that's not true unless you're writing every single line of code yourself. It's likely you have at least a few first and third party dependencies. At some point, those dependencies really should be updated. Having your own repo means that you can pick the timing, but it also means you have to do the work.


Monorepos naturally lead people to minimising the number of versions of third party dependencies towards one, if only to avoid nasty diamond dependency issues, but there's no technical reason why there can't be more than one version of a library in the tree. Of course, only a narcissist would check in a new version of a library without making an effort to remove the old ones. There are a pile of ways to do this, but my preferred way is to say that the person wanting the update manages the update, and asks for help from teams that are impacted by the change. I'll cover the process in a later post. No matter how it's done, the effect of having a single atomic change amortises the cost of the change over all the affected projects, reducing the cost of software development across the entire organisation by front-loading the cost of making the change.


But perhaps it’s not the dependencies you enjoy freedom on. Perhaps it’s the choice of language and tooling? There’s no reason a properly organised monorepo can’t support multiple languages (pioneers such as Google and Facebook have mixed language repos) Reducing the number of choices may be an organisation-level goal, in order to allow individuals to cycle quickly and easily between teams (which is why we have code style guidelines, right?), but there’s nothing about using a monorepo that prevents you from using many different tool chains.


As a concrete example of this, consider Mozilla. They're a remote-first, distributed team of iconoclasts and lovely folks (the two aren't mutually exclusive :) ). Mozilla-central houses a huge amount of code, from the browser, through extensions, to testing tools, and a subset of the web-platform-tests. A host of different languages are used within that tree, including Python, C/C++, Rust, Javascript, Java, and Go, and I'm sure there are others too. Each team has picked what's most appropriate and run with it.


Politics and fiefdoms


There’s no getting away from politics and fiefdoms. Sorry folks. Uber have stated that one of the reasons they prefer many separate repositories is to help reduce the amount of politics. However, hiding from things is seldom the best way to deal with them, and the technical benefits of using a monorepo can be compelling, as Uber have found.


If an organisation enthusiastically embraces the concept of collective code ownership, it's possible to rely on nothing more than purely social constructs to prevent egos being bruised and fiefdoms being encroached on. The only gateways to contribution become those technical gateways placed to ensure code quality, such as code review.


Sadly, not many companies embrace collective code ownership to that extent. The next logical step is to apply something like GitHub's "code owners", where owners are notified of changes, ideally before they are committed (using post-commit hooks for after-the-fact notification isn't as efficient). A step further along, OWNERS files (as seen in Chromium's source tree) list the individuals and team aliases that are required to give permission to land code.


If there is really strong ownership of code, then your source control system may be able to help. For example, perforce allows protection levels to be set for individual directories within a tree, and pre-commit hooks can be used for a similar purpose with other source control systems.

Getting the most out of a monorepo


Having said that you don't need to change much to start using a monorepo, there are patterns that allow one to be used efficiently. These suggestions can also be applied to any large code repository: after all, as Chris Stevenson said, "any sufficiently complicated developer workspace contains an ad-hoc, informally specified, bug-ridden implementation of half a monorepo".


Although it’s entirely possible to use recursive build tools with a monorepo (early versions of Google’s still used make), moving to a graph-based build tool is one of the best ways to take advantage of a monorepo.


The first reason is simply logistical. The two major graph-based build tools (Buck and Bazel) both support the concept of “visibility”. This makes it possible to segment the tree, marking public-facing APIs as such, whilst allowing teams to limit who can see the implementations. Who can depend on a particular target is defined by the target itself, not by its consumers, preventing uncontrolled growth in access to internal details. An OOP developer is already familiar with the concept of visibility, and the same ideas apply, scaled out to the entire tree of code.


The second reason is practical. The graph-based build tools frequently have a query language that can be used to quickly identify targets given certain criteria. One of those criteria might be “given this file has changed, identify the targets that need to be rebuilt”. This simplifies the process of building a sensible, scalable CI system from building blocks such as buildbot or GoCD.


Another pattern that’s important for any repository that has many developers hacking on it simultaneously is having a mechanism to serialise commits to the tree. Facebook have spoken about this publicly, and do so with their own tooling, but something like gerrit, or even a continuous build could handle this. Within a monorepo, this tooling doesn’t need to be in place from the very beginning, and may never be needed, but be aware that it eases the problem of commits not being able to land in areas of high churn.


A final piece in the tooling puzzle is to have a continuous build tool that’s capable of watching individual directories rather than the entire repository. Alternatively, using a graph-based build tool allows a continuous build that watches the entire repository to at least target the minimal set of targets that need rebuilding. Of course, it’s entirely possible to place the continuous build before the tooling that serialises the commits, so you always have a green HEAD of master….


Thanks

My thanks to Nathan Fisher, Josh Graham, Paul Hammant, Will Robertson, and Chris Stevenson for their feedback and comments while writing and editing this post. Without their help, this would have rambled across many thousands of words.