Managing Dependent Libraries with Git
July 07, 2010

When I’m developing something, I like to split out related functionality into modules or libraries whenever I see the opportunity, so that I can share code between projects and generally promote re-use.

Doing this presents a number of interesting challenges, one of which is how to manage changes to these shared modules.

If I have three projects using some module, I can either choose to have them all link to the same version of the module on my disk, or I can give each project its own copy.

On the face of it, linking to the same version sounds like the way to go. After all, we’re sharing, right? If we make a fix to the shared module, we want to get it in all projects, don’t we?

Well, sort of, but not really. It’s how I used to do it, but not any more, and I’ll explain why in a moment. First though, a diversion:

Code Locations

Even if you do want to do it, there is one problem with having multiple projects link to a common module: each project needs to know the relative location of that module, in any given development environment, so that include paths and the like can be set up correctly.

If you are the kind of person who is happy to just hardcode the location of the source code into your projects and force every developer who ever works on it to set up their disks like you do, then you can stop reading at this point. You are a bad person, and I don’t want to work on your projects!

If you are sane, however, then this poses a little bit of a problem. The way I used to get round this was to organise all my source code by reverse domain, in the Java style, relative to some nominal root location (in my case, this was the “projects” folder, which could be anywhere on the disk). So all of my code lived in projects/com/elegantchaos, anything I was using from Matt Gemmell would live in projects/com/mattgemmell/, and so on.

Whilst this gets a bit cumbersome in some ways - the directory hierarchy can get quite deep - it does neatly solve the relative location problem. I can place the projects folder anywhere on my disk and as long as my project uses relative paths, it can still find anything that it’s depending on since it knows that the relative path between the two will remain constant.
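To make that concrete, here is a small sketch that builds the same layout under two different roots and shows that the relative path from a project to the module it depends on does not change. The names “myapp” and “sharedlib” are made up for the illustration, as is the marker file.

```shell
#!/bin/sh
# Sketch only: "myapp" and "sharedlib" are hypothetical project names.
demo_layout() {
    root="$1"   # wherever the "projects" folder happens to live

    # Reverse-domain layout underneath the root
    mkdir -p "$root/com/elegantchaos/myapp" "$root/com/elegantchaos/sharedlib"
    echo "shared" > "$root/com/elegantchaos/sharedlib/marker.txt"

    # From inside the project, the module is always at ../sharedlib,
    # no matter where the root itself is on disk
    (cd "$root/com/elegantchaos/myapp" && cat ../sharedlib/marker.txt)
}

# Two unrelated root locations, same relative path inside each
demo_layout "$(mktemp -d)/projects"
demo_layout "$(mktemp -d)/somewhere/else/projects"
```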

I do still organise my work directory this way, but not specifically to solve the shared module problem, as I’ve realised that the best thing to do is for each project to have its own copy of the shared module, and to manage the propagation of changes between the projects using source control, which brings us back from the diversion…

Take Your Own Copy

So why is it better for each project to have its own copy of the shared module? The reason is fairly obvious really - and has to do with the law of unintended consequences.

Basically, if you set things up so that you are physically sharing the same copy of code between multiple projects, it doesn’t change the fact that you are always working on that code in the context of a specific project at any one time. That means that you’ll make changes and check that they work for the project you’re working in, but you won’t necessarily have time to check the other projects. You especially won’t have time to fix the problems that you’ve almost definitely introduced in the other projects without meaning to - if, that is, you are lucky enough to have created problems that are sufficiently obvious that they come to light immediately.

To a certain extent you can mitigate these problems with nice things like unit tests and continuous integration, both of which I’d highly recommend. They will help you find the problems that changes to shared code cause, but they still won’t alter the fact that your immediate deadline is based on a particular project and you just need to get it working now dammit, and clean up after yourself later.

This is hard enough to manage if you’re working on your own, but add in a team and/or multiple platforms, and it becomes almost unmanageable to share the same code. Imagine if, before checking in any change to your shared module, you had to test every permutation of every project that your team is working on, on every platform you support, to make sure that you haven’t broken anything. You could, quite literally, do a one hour code change then spend the rest of the day integrating and testing (I’ve been there, and it’s not fun).

All of which is a long-winded explanation for why it makes sense to have a local copy of the source code for a shared module in each project that uses it: doing things this way gives you control over when you take changes to the shared module from other projects, and when you push back your changes to the shared module to them.

When you’re working on the change, you can just concentrate on getting the job at hand done and ensuring that your project is working.

Later, when the dust has settled and you have some allotted time to work on the library, you can integrate your library changes up from your project into the “master” copy of the library and perhaps perform additional testing or cleanup to ensure that it fully meets the standards and guarantees set out by that library.

Later still, at a point that is entirely convenient and safe, other projects can grab the modified library, integrate it into their source base, and resolve any issues.

How I Cope With This In Git

So that’s all fine and dandy, but how do we set things up in Git to make this possible?

Let’s say that I have two projects A and B, and shared module X.

What I want is a repository for A, a repository for B, and a repository for X.

I want the project A and project B repositories to contain their own copy of module X so that any changes to it are isolated to just that project. This also means that someone else can just grab project A from a server and get everything they need to build it.

However, I want to be able to push and pull from either project’s copy of module X to/from the master module X repository so that I can propagate changes.

Essentially there are (at least) two ways to solve this in Git: submodules and subtrees (see the “Subtree Merging” section of the Pro Git book). Both approaches have pros and cons which are beyond the scope of this post, but they are well described in the Pro Git book.

For a number of reasons I’ve chosen to go with the subtree approach, which means that I have installed the git-subtree support, and then have to execute various git subtree commands when I want to work with a shared module.
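For reference, the underlying commands look roughly like the sketch below, which plays out the A, B and X scenario in throwaway repositories inside a temporary directory. It assumes git-subtree is installed, and every name in it (x.git, project-a, project-b, the modules/x prefix, the master ref) is a placeholder chosen for the illustration rather than anything from my real projects.

```shell
#!/bin/sh
# Sketch: two projects sharing module X via git subtree, in throwaway repos.
# All names (x.git, project-a, project-b, modules/x, master) are placeholders.
set -e

# Fixed identity so commits work on a machine with no git config
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work="$(mktemp -d)"
cd "$work"

# Master copy of module X: a bare repository seeded with one commit
git init -q --bare x.git
git init -q x-seed
(
    cd x-seed
    echo "version 1" > x.txt
    git add x.txt
    git commit -q -m "Initial version of X"
    git push -q ../x.git HEAD:refs/heads/master
)

# Each project gets its own copy of X under modules/x
for project in project-a project-b; do
    git init -q "$project"
    (
        cd "$project"
        echo "$project" > README
        git add README
        git commit -q -m "Initial commit"
        git subtree add --prefix=modules/x ../x.git master
    )
done

# Change X inside project A, then push that change up to the master copy
(
    cd project-a
    echo "fix made in project A" >> modules/x/x.txt
    git commit -q -am "Fix X from within project A"
    git subtree push --prefix=modules/x ../x.git master
)

# Later, at a convenient point, project B takes the change
(
    cd project-b
    git subtree pull --prefix=modules/x -m "Merge latest X" ../x.git master
)

# Project B's copy now contains the fix made in project A
cat project-b/modules/x/x.txt
```

Until project B runs the pull, its copy of X is untouched by whatever happens in project A, which is the isolation described in the previous section.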

What it boils down to is that there are three module-related actions that I typically have to do in a project: add a module, pull a module, push a module.

To make life simpler, I’ve made myself some shell scripts: subtree-add, subtree-pull and subtree-push. They expect to be run from the root of my project repository, and they expect to find a subfolder called “subtrees”, containing a configuration file for each module.

The configurations are very simple, and basically just define three shell variables which the scripts use. For example, the ECFoundation.subtree file for my foundation library looks like this:

URL="<path to your repository>/ECFoundation.git"

The first variable gives a name for the branch that the git subtree system will use to track the module. The second variable gives a location to place the subtree, relative to the root of the repository. The third variable gives the URL location of the repository containing the master copy of the module.
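As an illustration of a file with all three settings in that order, something like the following would fit the description. Only URL comes from the example above; the names BRANCH and LOCAL and the branch value are hypothetical, and the location is taken from the folder mentioned later in this post.

```shell
# ECFoundation.subtree (illustrative sketch; only URL is taken from the post)
BRANCH="ecfoundation"                             # hypothetical name and value: branch used to track the module
LOCAL="frameworks/ECFoundation"                   # hypothetical name: where the subtree lives in the project
URL="<path to your repository>/ECFoundation.git"  # repository holding the master copy
```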

So when I set up a new project and want to import ECFoundation, I first make a “subtrees” directory and copy ECFoundation.subtree into it.

Next I cd into the root folder for the project in the terminal, and run:

> subtree-add ECFoundation

If all goes well, I end up with a new folder called frameworks/ECFoundation/ containing the latest version of the library.

Later, if I want to push changes I’ve made to ECFoundation in my project, I again cd to the root folder of the project, and run:

> subtree-push ECFoundation

Finally, if I want to pull the latest version of ECFoundation into my project, I cd to the root folder of the project and run:

> subtree-pull ECFoundation
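For anyone who wants the gist without fetching anything, a wrapper of this kind could be sketched as a single shell function along the following lines. This is not the code of the scripts themselves: the function name subtree_run, the config variable names LOCAL and URL, the fixed master ref and the DRY_RUN switch are all assumptions made for the illustration, and it ignores any tracking-branch setting.

```shell
#!/bin/sh
# Hypothetical helper, not the actual subtree-add/pull/push scripts.
# Assumes each config file defines LOCAL (prefix inside the project) and
# URL (master repository); those names and the "master" ref are assumptions.
subtree_run() {
    action="$1"   # add, pull or push
    module="$2"   # e.g. ECFoundation
    config="subtrees/$module.subtree"

    # Must be run from the project root, where the subtrees folder lives
    if [ ! -f "$config" ]; then
        echo "no config found at $config" >&2
        return 1
    fi

    # Load LOCAL and URL from the module's config file
    . "./$config"

    case "$action" in
        add|pull|push)
            # With DRY_RUN set, print the command instead of running it
            ${DRY_RUN:+echo} git subtree "$action" --prefix="$LOCAL" "$URL" master
            ;;
        *)
            echo "unknown action: $action" >&2
            return 1
            ;;
    esac
}
```

Setting DRY_RUN=1 before calling, say, subtree_run pull ECFoundation prints the git subtree command that would be run, which is a cheap way to check a new config file before letting it touch the repository.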

The scripts themselves are pretty simple at the moment, but they do the job.

You can find them on GitHub here.

Comments and improvements to the scripts are welcomed!

Update: These scripts have been updated and rewritten in Python. See my post High Level Git Subtree Scripts for details.