Packaging software using Git

An analysis of schemes people use for packaging

1 Introduction

"Are you rebasing or merging?" seems to be the 64 thousand dollar question over in vcs-pkg discussions. Various people have offered their preferences, and indeed, several case studies of work flows have been presented. What is lacking is an analysis of the work flows themselves: an exploration of which methodology has which advantages, and whether there are scenarios in which the other work flow would have been better.

Oh, what are all these work flows about, you ask? Most of the issues with packaging software for distributions have a few things in common: there is a mainline or upstream source of development. There are zero or more independent lines of development or ongoing bug fixes that are to be managed. And then there is the tree from which the distribution package is to be built. All this talk about packaging work flows is about how best to manage asynchronous development upstream and in the independent lines of development, and how to create a coherent, debuggable, integrated tree from which to build the distribution's package.

The rebasing question goes to the heart of how to handle the independent lines of development using Git, since these lines of development are based off the main line of development and must be periodically synchronized with it. What follows is a first look at a couple of important factors that have a bearing on that question, and on packaging software for a distribution using Git in general. This is heavily geared towards Git (nothing else does rebases so easily, I think), but some of the concepts should be generic. I am not considering the stacked set of quilt patches source controlled with Git in this article (I don't understand that model well enough to do an analysis).

As a teaser, there is a third answer: neither. You can just add an independent line of development, and just let it sit: don't rebase, and don't merge; and in some circumstances that is a winning strategy.

2 Interested constituencies

Take the rebasing issue. First, rebasing is for private branches -- not for anything you publish. Whether you should have a private branch that you rebase depends on a number of factors. To begin with, there is the question of which of the stake holder constituencies is most important to you. There are at least three constituencies involved here:

  1. Upstream developers. These are the people who are the consumers of the independent lines of development. Usually people go out of their way to feed patches and code in the preferred form acceptable to upstream; and it is in our interests to do so: the more code pushed upstream, the less work there is to do ourselves. People upstream want the patch submissions to be clean, without extraneous crud that has to be removed, and would like a nice, clean, uncomplicated history. They care about descriptions of each patch in a series, but are not really interested in the history (they do not care that it took 15 typographical and logic fixes to arrive at this juncture). Upstreams want topic branches to be rebased to their latest version, so that the patches apply cleanly and are minimal.
  2. Downstream topic developers. These are people who are basing their work on your topic branches, and they develop code, and feed their changes back to you. If you rewrite history and rebase your topic branches, downstream developers will find it very hard to merge from you back into their development tree. Indeed, if you publish your topic branches for downstream developers to base their changes on, rebasing is not an option.

    If you publish a branch (whether or not you mean downstream developers to base their work on it), any non-fast-forward push to the published repository is refused by default. This is done for the reason we have been talking about -- rewritten history would make it impossible for others to track the code and merge it into their local clones. So, if you plan on publishing rebased branches (with warnings to downstream developers to only cherry pick from them), you will still have to force the push.

  3. The distribution and its users. This is a very important constituency; and most of us packaging software for distributions are doing all this work precisely for this constituency. Through the integration branch, this is a downstream branch – though one interesting case study uses a single rebased patch branch and throw away integration branches, using a patch series in the integration branch. Since you control the integration branch, this is not strictly downstream.
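The refused push and the forced push mentioned under the second constituency can be tried out safely in a scratch setup. What follows is only a sketch: the bare repository standing in for the published one, the paths, the branch name and the identity are all made up for illustration, and an amended commit stands in for a rebase.

```shell
#!/bin/sh
set -e
# Throw-away identity and directories; nothing here touches a real repository.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
scratch=$(mktemp -d)
git init -q --bare "$scratch/published.git"
git init -q "$scratch/work"
cd "$scratch/work"
git symbolic-ref HEAD refs/heads/topic/a
git remote add origin "$scratch/published.git"

echo one > file; git add file; git commit -q -m "first version"
git push -q origin topic/a

# Rewrite history, as a rebase would: the published commit is replaced.
echo two > file; git commit -q --amend -a -m "rewritten version"

# A plain push is refused because it is not a fast-forward.
if git push -q origin topic/a 2>/dev/null; then
    plain_push=accepted
else
    plain_push=rejected
fi
echo "plain push: $plain_push"

# Forcing the push overwrites the published branch.
git push -q --force origin topic/a
echo "published tip: $(git --git-dir="$scratch/published.git" rev-parse --short topic/a)"
```

Anyone who had already fetched the first version of topic/a now holds a commit that no longer exists in the published repository, which is exactly the inconvenience described above.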

Now, depending on where you are on the totem pole, some of these constituencies are more important than others. At the very top, you don't have an upstream. Take Git development, for example: they just use a mainline and a rebased pending updates branch. All their downstream is cautioned never to base work on the pending updates series. So, rebased lines of development work for them, since they are upstream, and that is the most important constituency in their work-flow.

If you are too low on the totem pole to have any downstream developers, and you can live with throw away integration branches, rebased lines of development work as well. But as soon as you have downstream developers or you are working with other people (yourself on your laptop is another person as well), rebasing stops being an option. A major advantage of a distributed SCM is being able to access the repository from different locations; if you do that, rebasing gets in the way.
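To see why a second copy of a branch suffers from a rebase, here is a small sketch in a scratch repository. The branch named "laptop" is a hypothetical stand-in for the copy of topic/a that lives in another clone; the file names and the identity are likewise invented.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
scratch=$(mktemp -d)
git init -q "$scratch/repo"
cd "$scratch/repo"
git symbolic-ref HEAD refs/heads/upstream
echo A > release; git add release; git commit -q -m "release A"

git checkout -q -b topic/a
echo one > fix-1; git add fix-1; git commit -q -m "topic a: first fix"

# "laptop" remembers where topic/a was before the rebase.
git branch laptop topic/a

git checkout -q upstream
echo B > release; git commit -q -a -m "release B"
git checkout -q topic/a
git rebase -q upstream

# Commits only on the old copy, then commits only on the rebased branch.
git rev-list --left-right --count laptop...topic/a
```

The change itself is identical on both sides, but the rebased commit has a new identity, so the old copy and the rebased branch have diverged rather than one being ahead of the other.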

3 Patch flow characteristics on the topic branches

Are the stake holders the only factor in your decision? Not by a long shot. Take, for instance, the issue of how active your independent lines of development are, how big the patch series is, and whether you want to do a functional test for each topic branch.

4 Other factors

5 Work-flow details

This section is the result of discussion with Bart Trojanowski, and incorporates material from his private communication (with permission).

In the figures that follow:

Each one of these examples has upstream making four releases, labeled A, B, C and D. There are two topic or feature branches, labeled "topic/a" and "topic/b", one of which has two commits, and the other one has one commit. In all the examples, there are four Debian releases that happen during the interval tracked. I have created these examples with an explicit upstream tracking branch; you can get away with fetch and origin/master if you do not want a local tracking branch. The dotted lines represent a topic branch earlier in the history of the workflow which got rebased away (to get the final state, pretend that the nodes linked by the dotted lines do not exist).
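For those who would rather skip the local upstream tracking branch, the fetch and origin/master variant might look like the sketch below. It runs in scratch directories; the "upstream-project" repository, the file names and the identity are assumptions made up for the demonstration.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
scratch=$(mktemp -d)

# Stand-in for the upstream project, with a single release "A".
git init -q "$scratch/upstream-project"
cd "$scratch/upstream-project"
git symbolic-ref HEAD refs/heads/master
echo A > release; git add release; git commit -q -m "release A"

# The packaging repository only ever sees upstream as origin/master.
git init -q "$scratch/packaging"
cd "$scratch/packaging"
git remote add origin "$scratch/upstream-project"
git fetch -q origin
git checkout -q -b topic/a --no-track origin/master
echo fix > fix-a; git add fix-a; git commit -q -m "topic a: fix"

# Upstream makes release "B".
(cd "$scratch/upstream-project" && echo B > release && git commit -q -a -m "release B")

# No local upstream branch: fetch, then merge the remote-tracking branch.
git fetch -q origin
git merge -q -m "merge release B into topic/a" origin/master
git --no-pager log --oneline --graph topic/a
```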

5.1 Merging the topic branches

This case is pretty straightforward. All the branches are available to everyone, and work may be based off any of them. However, the branch history, for any topic branch, is full of merge nodes where code from upstream was merged, which some people see as clutter.

What do maintainers do?

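  # update the local upstream tracking branch
  git checkout upstream
  git pull origin
  # merge the new upstream into each topic branch
  git checkout topic/a
  git merge upstream
  git checkout topic/b
  git merge upstream
  # integrate the topic branches into the long lived master branch
  git checkout master
  git merge topic/a topic/b
  # do the release thing
  git commit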
  

5.2 Rebasing the topic branches

As mentioned earlier, the dotted lines represent a topic branch earlier in the history of the workflow. I want to stress that this work-flow has limited utility, and then only when you do not have people basing their work off your branches. Since you never know who might want to base their work off you in the future …

In this case, the release is made off a throw away release branch, which essentially becomes unusable as soon as a new upstream release comes and there is a rebase operation. However, at any given time, the rebased topic branch is nice, clean, and rebased against the latest upstream, making it really easy to feed patches into the upstream code.

However, downstream developers are seriously inconvenienced, since all the branches are either throw away or rebased.
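The ease of feeding patches upstream from a rebased topic branch can be illustrated with git format-patch. This is a sketch in a scratch repository, not a description of anyone's actual submission procedure; the file names, the output directory and the identity are invented.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
scratch=$(mktemp -d)
git init -q "$scratch/repo"
cd "$scratch/repo"
git symbolic-ref HEAD refs/heads/upstream
echo A > release; git add release; git commit -q -m "release A"

# A topic branch with two commits, started from release A.
git checkout -q -b topic/a
echo one > fix-1; git add fix-1; git commit -q -m "topic a: first fix"
echo two > fix-2; git add fix-2; git commit -q -m "topic a: second fix"

# Upstream moves on to release B.
git checkout -q upstream
echo B > release; git commit -q -a -m "release B"

# Rebase the topic onto the new release, then export it as a patch series.
git checkout -q topic/a
git rebase -q upstream
git format-patch -q -o "$scratch/patches" upstream..topic/a
ls "$scratch/patches"
```

Because the branch sits directly on top of the latest upstream, the series contains one patch per topic commit and nothing else.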

What do maintainers do?

  # update the local upstream tracking branch
  git checkout upstream
  git pull origin
  # replay each topic branch on top of the new upstream
  git checkout topic/a
  git rebase upstream
  git checkout topic/b
  git rebase upstream
  # create throw away branch
  git checkout -b release-A upstream
  git merge topic/a topic/b
  # do the release thing
  git commit
  

5.3 Rebase & merge

The scenario is identical to the rebase case, with the difference that we carry an additional set of branches around for the downstream folks, but we still release from throw away branches. This is almost strictly worse than the next case, since derivative distributors do not have a stable branch from which the packages are built – since we are building packages from a throw away branch.

In the figure, the two "for downstream" branches are never rebased, always merge with upstream, and cherry pick changes from the current topic branches. If there is overlap, integration work will have to be done separately in the "for downstream" branches (but will often be the same disambiguation logic as was needed when rebasing the topic branch). The "for downstream" branches are just like the branches in the pure merge case above; with the difference that merges into the master branch are not happening (since we do not have a long lasting integration branch in this scenario).
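One way to maintain such a "for downstream" branch, as described in the paragraph above, is to merge upstream and then cherry pick whatever the rebased topic branch has that the downstream branch lacks. The sketch below is my reading of that description, run in a scratch repository with invented file names; it uses git cherry to skip commits whose changes are already present, and it is not the command sequence listed further down.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
scratch=$(mktemp -d)
git init -q "$scratch/repo"
cd "$scratch/repo"
git symbolic-ref HEAD refs/heads/upstream
echo A > release; git add release; git commit -q -m "release A"

git checkout -q -b topic/a
echo one > fix-1; git add fix-1; git commit -q -m "topic a: first fix"
# The long lived branch offered to downstream starts at the same point.
git branch downstream/a

git checkout -q upstream
echo B > release; git commit -q -a -m "release B"

# The topic branch is rebased and then gains a second commit.
git checkout -q topic/a
git rebase -q upstream
echo two > fix-2; git add fix-2; git commit -q -m "topic a: second fix"

# The "for downstream" branch is never rebased: merge upstream, then
# cherry pick only what it lacks (the "+" lines of git cherry).
git checkout -q downstream/a
git merge -q -m "merge release B into downstream/a" upstream
for commit in $(git cherry downstream/a topic/a | sed -n 's/^+ //p'); do
    git cherry-pick "$commit" >/dev/null
done
git --no-pager log --oneline --graph downstream/a
```

At the end the two branches carry the same files, but downstream/a got there without any history being rewritten.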

What do maintainers do?

Note: I do not yet understand what the reset and merge -s are doing.

  git checkout upstream
  git pull origin
  git checkout topic/a
  git rebase upstream
  git checkout topic/b
  git rebase upstream

  git checkout downstream/a
  git branch tmp/a
  git reset --hard downstream/a
  git merge -s ours tmp/a
  git branch -d tmp/a

  git checkout downstream/b
  git branch tmp/b
  git reset --hard downstream/b
  git merge -s ours tmp/b
  git branch -d tmp/b

  # create throw away branch
  git checkout -b release-A upstream
  git merge topic/a topic/b
  # do the release thing
  git commit
  

5.4 Merge & rebase

This is my preferred mechanism. I only create the submit branches on demand; and the submit branches are cherry picked and rebased from the topic branches when I am ready to submit patches upstream – so in reality, my work-flow is really like the plain merged branches above, with the proviso that I can create the submit branches easily if I need them.

The submit branches, when they exist, look very much like the topic branches in the pure rebase case, except that we never gather them to make a release (since the release is done off the topic branches, not the submit branch set).
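Creating a submit branch on demand could look something like the following sketch: start a fresh branch at the latest upstream and cherry pick the topic branch's own commits onto it, leaving out the merge nodes. The scratch repository, file names and identity are invented, and this is only one plausible way of doing it.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
scratch=$(mktemp -d)
git init -q "$scratch/repo"
cd "$scratch/repo"
git symbolic-ref HEAD refs/heads/upstream
echo A > release; git add release; git commit -q -m "release A"

# A merged (never rebased) topic branch: commit, merge upstream, commit.
git checkout -q -b topic/a
echo one > fix-1; git add fix-1; git commit -q -m "topic a: first fix"
git checkout -q upstream
echo B > release; git commit -q -a -m "release B"
git checkout -q topic/a
git merge -q -m "merge release B into topic/a" upstream
echo two > fix-2; git add fix-2; git commit -q -m "topic a: second fix"

# On demand: a submit branch from the latest upstream, carrying only
# the topic's own commits, oldest first, without the merge nodes.
git checkout -q -b submit/a upstream
git cherry-pick $(git rev-list --reverse --topo-order --no-merges upstream..topic/a) >/dev/null
git --no-pager log --oneline upstream..submit/a
```

The topic branch keeps its merge history for downstream, while submit/a is a linear series that can be rebased or thrown away at will.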

What do maintainers do?

  # update the local upstream tracking branch
  git checkout upstream
  git pull origin

  # the submit branch creation is optional
  git checkout submit/a
  git rebase upstream
  git checkout submit/b
  git rebase upstream

  # merge the new upstream into each topic branch
  git checkout topic/a
  git merge upstream
  git checkout topic/b
  git merge upstream
  # integrate the topic branches into the long lived master branch
  git checkout master
  git merge topic/a topic/b
  # do the release thing
  git commit
  

6 Conclusion

Based on which stake-holders you have, and which ones are most important to you, how big your topic branch differences are, and how active the topic development is, you can decide whether or not you rebase or merge, and whether you do delayed integration into topic branches or not.

Now for my personal preferences: I am low down on the totem pole, mostly, but I like to publish my topic branches. So I will not rebase my public topic branches. I will have persistent integration branches, since derived distribution folks are likely to need that. I will also always merge new upstream into my topic branches, just in case someone is basing their work off my public topic branch.

But since I have to cater to upstream as well, I plan on having a private, rebasable submission branch for each topic, and cherry picking original commits from the topic branch on to that. The submission branches will be rebased to the latest upstream version before submission, or more often if I feel like doing so.

The topic branches will be named "topic/foo". Submission branches will be named "submission/foo", and there will be a "tmp/bar" name space for ephemeral branches. This will make it easier to script things like new upstream versions.
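As an illustration of the kind of scripting the name spaces allow, the sketch below merges a new upstream version into every branch under "topic/". It runs in a scratch repository with made-up file names and identity; the only thing it relies on is the naming convention itself.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
scratch=$(mktemp -d)
git init -q "$scratch/repo"
cd "$scratch/repo"
git symbolic-ref HEAD refs/heads/upstream
echo A > release; git add release; git commit -q -m "release A"

# Two topic branches following the topic/<name> convention.
for name in a b; do
    git checkout -q -b "topic/$name" upstream
    echo "$name" > "fix-$name"; git add "fix-$name"
    git commit -q -m "topic $name: fix"
done

# A new upstream version arrives.
git checkout -q upstream
echo B > release; git commit -q -a -m "release B"

# Merge it into every branch in the topic/ name space.
for branch in $(git for-each-ref --format='%(refname:short)' refs/heads/topic/); do
    git checkout -q "$branch"
    git merge -q -m "merge new upstream into $branch" upstream
done
git --no-pager log --oneline --graph --all
```

A real script would also have to stop and ask for help when a merge conflicts; here set -e simply aborts at the first failure.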

7 Acknowledgements

Thanks to Martin F. Krafft and James Vega for reviewing this article, and providing feedback and corrections, and many people involved in the vcs-pkg project for the material this analysis is based on. A special mention should be made about Bart Trojanowski, who provided ideas about concrete solutions for handling topic branches.

Author: Manoj Srivastava <srivasta@debian.org>

Date: <2008-04-01 Tue>