Sunday 4 November 2012

Ruminations on Code Bases I Have Known

Code bases, eh? Can't live with 'em. Can't live without 'em. Though that's not strictly true, as many generations have seen. Wait. Hang on. This isn't quite the articulate start I pictured in my head. Let's have another go at this.

*ahem*

The style of architecture used for a large code base largely depends on how the third party dependencies of that code base are managed. Specifically, whether those dependencies are handled at the global level or at the team/project level.

Contentious statement out of the way. Let's see if I can explain what the heck I just meant.

Let's take the case of a large code base where dependencies are handled "globally". For the sake of this discussion, let's imagine that this means that there's only ever one version of a particular dependency in use at any particular revision of that code base. When the code base is tiny, this approach is what teams tend towards (IME) and updating a dependency is relatively straightforward. However, as the code base grows, or more projects are rolled into it, updating a third party dependency becomes increasingly difficult and time consuming.
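To make "handled globally" a little more concrete: the invariant is that, at any given revision, the whole tree agrees on a single version of each dependency. Here's a toy sketch of that rule in Java (project names, artifacts and versions all invented):

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of the "one version per dependency, tree-wide" rule.
// Everything here is made up; real build tooling would do this for you.
public class SingleVersionCheck {
    public static void main(String[] args) {
        Map<String, Map<String, String>> declaredByProject = Map.of(
                "billing",   Map.of("guava", "13.0", "joda-time", "2.1"),
                "reporting", Map.of("guava", "13.0"),
                "frontend",  Map.of("guava", "12.0"));   // breaks the invariant

        Map<String, String> treeWide = new HashMap<>();
        declaredByProject.forEach((project, deps) ->
                deps.forEach((artifact, version) -> {
                    String existing = treeWide.putIfAbsent(artifact, version);
                    if (existing != null && !existing.equals(version)) {
                        System.out.printf("%s wants %s %s, but the tree uses %s%n",
                                project, artifact, version, existing);
                    }
                }));
    }
}
```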

When a code base reaches a certain size, a choice has to be made. Should third party dependencies be handled at the project level or for the code base as a whole? The downside of handling them for the code base as a whole is that updates appear to become increasingly costly. It therefore seems reasonable to split the code base up in some way and then let the chunks handle themselves.

And this is where the problems start....

If your code base is anything like many I've seen, there's a suite of utility functions and library code that's shared between projects. These are often referred to by their code names, but everyone knows that they're reusable components that should be (ummm...) reused by other projects in the code base (let's call them "client projects"). The problem is that each client project is now handling its own set of dependencies, so this suite of library functions must either keep its own third party dependencies to the bare minimum (to minimise the chances of accidentally requiring multiple versions of the same library in a client project) or be more permissive about dependencies at the cost of more painful integration with client projects later.

Given the evolution of a code base, the former is preferred and the latter is what gets done, at least to start with. Put another way: by the time you realise that there's a need for a shared set of utility functions, the helper dependencies have their tentacles well and truly wrapped round the code to share.

But we know that it's possible to reduce the third party dependencies on core library code to a minimum. Google's Guava libraries demonstrate this. (Pro tip: to get there, never use XML)(only kidding)(not really: hi xalan and xerces and xml-apis.jar) So it's demonstrably possible to have a tiny subset of your code base be sharable without causing a ton of grief when integrating new versions. Hadanza!

That example with the common library of common code (let's call it Odin or something suitably grandiose) is a microcosm of the horror that awaits when reintegrating different projects that have been managing their dependencies separately for some time. Each time the projects being integrated share a third party dependency but at different versions, there's more than just code integration to do. Some parts of one or both projects need to be reworked or rewritten, which increases the amount of time and testing required to do the reintegration.

Given the pain that updating even one lagging dependency can cause, it's far more likely that the client projects, once split out of trunk, will remain split out indefinitely. Which is fine, until they need to communicate.

One approach to resolve the problem of communicating between different projects at runtime is to use XML. This approach is great, unless schema validation is turned on. Then at least one side or the other is going to claim that a perfectly valid message is complete garbage. Well, nuts! So, let's not validate our XML, but use something like XPath to pull out the bits that we find interesting (using something like Schematron perhaps, though it's been a looong time since I looked at it). The alternative is to use something like Protocol Buffers or Thrift, which are used by Google and, given the latter's origin, Facebook.
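Going back to the XPath option for a moment, here's a hedged sketch of the "don't validate, just pull out the bits you care about" approach, using the XPath support that ships with the JDK (the message and its element names are entirely invented):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class LenientXmlReader {
    public static void main(String[] args) throws Exception {
        // A made-up message. Note the <shinyNewField> this reader knows nothing about.
        String xml = "<order><id>42</id><total>12.50</total>"
                   + "<shinyNewField>ignored</shinyNewField></order>";

        // No schema, no validation: just parse it and pull out what we need.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        String id = xpath.evaluate("/order/id", doc);
        String total = xpath.evaluate("/order/total", doc);

        System.out.println("id=" + id + ", total=" + total);
    }
}
```

The unknown element is simply never asked about, so the message sails through where a validating parser would have thrown a wobbly.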

OK. So the various client projects now have a robust mechanism for communicating: by passing some sort of message between instances. That message may take the form of a document (recommended, as this minimises the surface area of API that needs to be agreed between client and server) or an RPC call (less recommended, as it's all too easy to accidentally tightly bind two communicating projects together). As with all things IT-related, it's typically easier to do the less recommended thing.
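As a toy illustration of that difference (every name below is invented), compare the shape of a document-style service with an RPC-style one. The former agrees on a single message; the latter grows a method per interaction and quietly couples the two sides together:

```java
import java.math.BigDecimal;
import java.util.List;

// Document style: the whole agreement between client and server is
// "here is an order document, give me back a receipt".
interface OrderService {
    Receipt process(OrderDocument order);
}

// RPC style: each interaction becomes another method, and before long the
// client knows an awful lot about how the server likes to be called.
interface OrderServiceRpc {
    long createOrder(String customerId);
    void addLineItem(long orderId, String sku, int quantity);
    void applyDiscount(long orderId, String discountCode);
    Receipt finalise(long orderId);
}

// Placeholder message types so the sketch stands on its own.
record OrderDocument(String customerId, List<String> skus) {}
record Receipt(long orderId, BigDecimal total) {}
```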

The astute reader will already be seeing that this is describing an SOA-style architecture. The astute reader would be correct. The astute reader may now have a cookie. Unless they're in the UK, in which case they'll have to give permission for cookies to be used. (joke)(not really). And yes, I know that expanding "SOA-style architecture" leads to a nonsensical phrase. I can live with that.

So: a large code base consisting of lots of independent projects each managing their own third party dependencies likely leads to an SOA-style architecture. Or integration through the database. I know which one I'd prefer.

But what if we stick with the original plan of managing the third party dependencies globally? This leaves us with more options (side note: I like having more options). We'll still need some mechanism for old versions of a project within this one tree to communicate with newer versions of itself (not just across updates to the software, but also if you're scaling horizontally), and, again, XML, protobufs, Thrift or some other data interchange format will help a lot here. So we could also make use of an SOA-style architecture, and that may be advisable.
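Whatever the interchange format, the property we're after is that an old reader keeps working when a newer writer sends it things it doesn't know about. Here's a format-agnostic sketch (field names invented, message modelled as a bare map; Protocol Buffers and Thrift give you this behaviour without the hand-rolling):

```java
import java.util.Map;

// A toy "tolerant reader": ask only for what you know, default what's missing,
// ignore everything else. All field names and values are made up.
public class TolerantReader {
    public static void main(String[] args) {
        // Written by a newer version of the service, which has grown a "priority" field.
        Map<String, String> newerMessage = Map.of(
                "id", "42",
                "total", "12.50",
                "priority", "HIGH");   // unknown to older readers

        // An older reader at work:
        String id = newerMessage.getOrDefault("id", "unknown");
        String total = newerMessage.getOrDefault("total", "0.00");
        String currency = newerMessage.getOrDefault("currency", "GBP"); // not sent yet

        System.out.println(id + " " + total + " " + currency);
    }
}
```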

Alternatively, given that everything is using the same third party dependencies, it's likely to be a lot easier to share code between projects within the same tree. Depending on the rigour of code reviews, or the tools put in place to prevent teams delving into "not public really" APIs, another style would be to use other projects within the same tree in exactly the same way as third party dependencies.
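Here's a sketch of the failure mode that those code reviews and tools are there to prevent (package and class names invented, and everything squashed into one file for brevity):

```java
// In a real tree these would be separate packages in separate projects, and
// InternalIdGenerator would have to be a public class so that the shared
// project's own packages could use it, which means client projects can too.

// The shared project's intended public API (think com.example.odin.api):
final class Ids {
    static String newId() { return InternalIdGenerator.next(); }
}

// The shared project's guts (think com.example.odin.internal): "not public
// really", but perfectly visible to anyone who goes looking.
final class InternalIdGenerator {
    static String next() { return java.util.UUID.randomUUID().toString(); }
}

// A client project elsewhere in the tree. Nothing but code review or
// build-time visibility rules stops this reach past the intended API.
class SneakyClient {
    public static void main(String[] args) {
        System.out.println(InternalIdGenerator.next());   // "should" be Ids.newId()
    }
}
```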

Groovy. So we could use SOA or have a mass of tangled projects? Well, neither sounds that appealing to be honest. So which approach should be chosen?

Let's take our thought experiment through a required update of a third party dependency (let's imagine that not updating the library will accidentally cause a portal to Help Desk to be opened, through which The Angry Users may reach you directly).

In the case of everyone being in the same tree it's likely you're already close to the latest release, as there's always someone who wants that latest whizz-bang feature. If you've lots of dependencies --- and you probably do --- then an update to a third party dependency may be a well practised thing. Glibly speaking, the chances are that the update, whilst not necessarily smooth, can be managed with only minimal wailing and gnashing of teeth.

In the other world, of lots of separate projects each managing their own dependencies, things might well be very different. If each project has assiduously been updating their third party dependencies to the latest and greatest release, then things might be easier than in the case of a single, unified tree as you're attempting to update a smaller code base. Hurrah!

However, it was the pain of these library updates that caused the code base to fracture into a world where SOA made sense. Some poor bastard is going to have to do a massive update, and that's going to hurt. IME, the pain of an update grows non-linearly (and at a multiple greater than one) with the number of intervening revisions skipped. That is, skipping one revision hurts, skipping (say) five really hurts, and skipping a major version may well be justifiable cause for murder when it comes time to integrate the latest and greatest into your tree (aside: that's why people who are gung-ho about branching recommend integrating branches as frequently as makes sense). To make matters more painful, this integration is not being done at a time of the project's choosing: it's being crammed into an already too full release schedule. Yay! A recipe for success if there ever was one.

Taking this back to the start of this post, I hope you can now see that, in my view, how third party dependencies are handled really does have an impact on the architecture of your system.

Now, I've written this "in the Yegge style" (lots of words, aided by a glass or two of wine), so I won't be offended if you all pull this apart in the comments :)