From Mercurial to Git

A few months ago I had to figure out how to migrate a 4 GB repository from Mercurial to Git, and trim the size down along the way. Luckily, I wasn’t the first one to have to do that, so there were a number of resources I could reference, namely these two. But of course, every specific case has its own specific problems.

We had mandated username-only author names for the sake of some TeamCity configuration, so in my case my commit author name was jochan. However, this was only a style rule and not enforced by Mercurial itself, which meant that the commits were rife with violations: jochan <jonathan.chan@domain> (which was what TortoiseHg would automatically fill in for you), Jonathan Chan <jonathan.chan@domain>, Jonathan Chan, <jonathan.chan@domain>, jochan [jonathan.chan@domain], and even blank author names, somehow. You get the gist. Git, on the other hand, requires both an author name and an email. The first step was to obtain a table of everyone’s Active Directory username, full name, and email from IT, then use it to parse the author names in the commit logs into the mandatory username <email> format, and save the mapping "hgformat"="gitformat" to be used later. Luckily, the aberrant author names mostly followed regular patterns, and I do mean regular.

import re

# Raw strings keep Python from interpreting \w and \s as string escapes.
null_author             = re.compile(r"^<>$")
full_name_no_email      = re.compile(r"^([A-Z]\w*\s?)+$")
full_name_null_email    = re.compile(r"^([A-Z]\w*\s?)+<>$")
full_name_with_email    = re.compile(r"^([A-Z]\w*\s?)+<.+>$")
username_no_email       = re.compile(r"^\w*$")
username_null_email     = re.compile(r"^\w*\s?<>$")
username_with_email     = re.compile(r"^\w*\s?<.+>$")
username_sqr_email      = re.compile(r"^\w*\s?\[.*\]$")
username_rnd_name       = re.compile(r"^\w*\s?\(.*\)$")
username_address        = re.compile(r"^\w*@.*$")
any_any                 = re.compile(r"^.+<.*>$")
null_any                = re.compile(r"^<.*>$")
any_email               = re.compile(r"^.+\s\S+@\S+$")
any_null                = re.compile(r"^.+$")

A few tens of thousands of commits were surprisingly quick to parse through for author names, which came to ~200 distinct ones:

$ hg log | grep "^user:" | sort -u | sed "s/^user: *//" > authors.txt
$ python authors.py authors.txt > reformatted-authors.txt
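
For illustration, here is a minimal sketch of what a script like authors.py might do. It is not the original script: the DIRECTORY table, the example.com address, and the function names find_person and mapping_line are all made up, and it uses three simplified patterns rather than the full list above. It tries the email first, then a leading username, then a full name, and emits a line in the "hgformat"="gitformat" shape only when it finds a match.

```python
import re

# Hypothetical stand-in for the table obtained from IT.
DIRECTORY = [
    {"username": "jochan", "name": "Jonathan Chan",
     "email": "jonathan.chan@example.com"},
]

# Simplified patterns (assumptions, not the full list used in the post).
EMAIL = re.compile(r"[\w.+-]+@[\w.-]+")
LEADING_USERNAME = re.compile(r"^(\w+)\s?(?:[<\[(].*)?$")
LEADING_FULL_NAME = re.compile(r"^([A-Z]\w*(?:\s[A-Z]\w*)*)\s?(?:<.*>)?$")


def find_person(hg_author, directory):
    """Return the directory entry matching an hg author string, or None."""
    text = hg_author.strip()
    by_username = {p["username"]: p for p in directory}
    by_name = {p["name"]: p for p in directory}
    by_email = {p["email"].lower(): p for p in directory}

    # 1. A known email anywhere in the string is the strongest signal.
    email = EMAIL.search(text)
    if email and email.group(0).lower() in by_email:
        return by_email[email.group(0).lower()]
    # 2. A leading username, optionally followed by <...>, [...] or (...).
    match = LEADING_USERNAME.match(text)
    if match and match.group(1) in by_username:
        return by_username[match.group(1)]
    # 3. A capitalised full name, optionally followed by <...>.
    match = LEADING_FULL_NAME.match(text)
    if match and match.group(1) in by_name:
        return by_name[match.group(1)]
    return None


def mapping_line(hg_author, directory):
    """Build one "hgformat"="gitformat" line, or None if nobody matches."""
    person = find_person(hg_author, directory)
    if person is None:
        return None
    return '"{}"="{} <{}>"'.format(
        hg_author, person["username"], person["email"])


if __name__ == "__main__":
    for author in ["jochan", "Jonathan Chan <>", "jochan [jonathan.chan@domain]", "<>"]:
        print(repr(author), "->", mapping_line(author, DIRECTORY))
```

Anything that comes back as None (blank authors, people missing from the table) would still need to be mapped by hand.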

Now we’re ready to migrate. I initially tried to do it on Windows, but I ran into a bunch of issues about Python and Mercurial imports and whatnot, so I gave up and ran it all on macOS, which worked perfectly (*nix ftw!). I used hg-fast-export to plunk the existing Mercurial repository into an empty Git repository, but there’s some tweaking to do first:

$ git init
$ git config core.ignoreCase false

According to hg-fast-export’s warning, with ignoreCase set to true, commits that only change the case of filenames will show up empty, which we definitely don’t want, hence setting it to false. Finally:

$ ./hg-fast-export.sh -r $source --force -A reformatted-authors.txt
$ git config --bool core.bare true

The --force flag was necessary to deal with closed branches, which otherwise prompt the error Repository has at least one unnamed head. To update the Git repository with new changes from the Mercurial repository, it suffices to run the hg-fast-export script again, but there must be absolutely no changes made to the Git repository in the meantime, or else it will refuse to update. I set the repository to be bare at the end of my script just so that I could prevent myself from accidentally committing things to it while I was experimenting.

Once I was absolutely certain I no longer needed to migrate new changes, it was time to strip down the repository. At several points in the past, some large-ish SQL and ZIP files had been unwittingly committed, which inflated the repository’s size quite significantly: even after the migration, the .git folder was still ~4 GB. To remove them, I used BFG Repo-Cleaner, whose name is a pun on the existing git-filter-branch and the Big Friendly Giant, I suppose. The instructions are straightforward:

$ java -jar $bfg_cleaner/bfg.jar --strip-blobs-bigger-than 40M $target
$ git reflog expire --expire=now --all
$ git gc --prune=now --aggressive

I experimented with several size limits and chose 40 MB based on the size of the smallest SQL file BFG had found. Some Git-hosting services also cap file sizes; GitHub, for instance, warns about files over 50 MB and rejects files over 100 MB. After this file stripping, the .git folder went down to ~1.5 GB in size. Success!
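
If you would rather look at the large files before settling on a limit, a rough sketch like the following can list them. It is not part of the original migration: it assumes git is on the PATH, the function names parse_batch_check and large_blobs are invented for this example, and paths are assumed to decode as text. It feeds git rev-list --objects --all into git cat-file --batch-check and keeps blobs at or above a byte threshold.

```python
import subprocess


def parse_batch_check(output, min_bytes):
    """Parse "<type> <sha> <size> <path>" lines; return big blobs, largest first."""
    blobs = []
    for line in output.splitlines():
        parts = line.split(" ", 3)
        if len(parts) < 3 or parts[0] != "blob":
            continue  # skip commits, trees, tags and malformed lines
        size = int(parts[2])
        path = parts[3] if len(parts) > 3 else ""
        if size >= min_bytes:
            blobs.append((size, path))
    return sorted(blobs, reverse=True)


def large_blobs(repo, min_bytes):
    """List (size, path) for every blob in the repo's history >= min_bytes."""
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    sizes = subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_batch_check(sizes, min_bytes)


if __name__ == "__main__":
    # Example threshold only; adjust to taste.
    for size, path in large_blobs(".", 40 * 1024 * 1024):
        print("{:>12} {}".format(size, path))
```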

You might be aware that the most fundamental difference between Mercurial and Git is that Mercurial branches are a property of the commit (so a commit belongs to a branch), while Git branches are pointers (so a branch points to a commit). Like any modern repository, we frequently open, merge, and close branches to work on specific features; however, after migrating to Git, every single branch ever created came back into existence, since Git has no concept of a “closed” branch. My solution was to use Mercurial to give me a list of closed branches so that I could tag then delete them in Git.

$ hg heads --closed --template "{branch}\n" | tr " " "_" | sort -u > all.log
$ hg heads          --template "{branch}\n" | tr " " "_" | sort -u > open.log
$ comm -2 -3 all.log open.log > closed.log
$ for branch in `cat closed.log`; do \
    git tag "closed/$branch" "$branch"; \
    git branch -df "$branch"; \
  done

Confusingly, hg heads gives you a list of open branches, while hg heads --closed gives you a list of open and closed branches, so branches common between the two files (i.e. the open ones) need to be eliminated to get the closed branches. Additionally, spaces are allowed in Mercurial branch names (for some unfathomable reason) but not Git names, so I opted for an underscore replacement (yes, over the dash. fight me). I tagged them all under the group closed so that they would be easier to find and identify; furthermore, SourceTree appears to let you collapse branches in the same group.
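
The same set difference can be sketched in Python, which makes the logic a little easier to see. This is only an illustration under the assumptions described above (underscores for spaces, tags grouped under closed/); the function names closed_branches and archive_commands are invented, and the commands are returned as lists rather than executed.

```python
def closed_branches(all_heads, open_heads):
    """Branches that have a head in `hg heads --closed` but none in `hg heads`."""
    def sanitize(name):
        # Git refs cannot contain spaces, so swap them for underscores.
        return name.strip().replace(" ", "_")

    all_names = {sanitize(n) for n in all_heads if n.strip()}
    open_names = {sanitize(n) for n in open_heads if n.strip()}
    return sorted(all_names - open_names)


def archive_commands(branches, prefix="closed/"):
    """Build the git commands that tag each branch and then delete it."""
    commands = []
    for branch in branches:
        commands.append(["git", "tag", prefix + branch, branch])
        commands.append(["git", "branch", "-D", branch])
    return commands


if __name__ == "__main__":
    all_heads = ["default", "feature x", "old fix", "feature x"]
    open_heads = ["default", "feature x"]
    for command in archive_commands(closed_branches(all_heads, open_heads)):
        print(" ".join(command))
```

Using sets also sidesteps the case where a branch has several heads and shows up more than once in either list.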

Another annoying difference between Git and Mercurial is that Mercurial accepts both glob and regex syntax in .hgignore, while Git only uses glob. And unfortunately, full translation only goes from glob to regex, because regexes can express patterns that globs can’t. While regex is one of my great loves of computing science, why would you need regex in an ignore file? What kind of complex file organization structures are you keeping? Who uses regex in .*ignore files? We do, apparently. We had tons of .hgignore files at different directory levels on every branch, and something had to be done. The script below, albeit rough and naturally incomplete, worked pretty well, since most of the regexes weren’t that complex, and most of them could have been written as globs anyway.

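#!/bin/bash

git config --bool core.bare false
# Visit every local branch (the sed strips the "*" marking the current one).
for branch in `git branch | sed "s/*/ /g"`; do
    git checkout -f "$branch"
    # List this branch's .hgignore files and start with an empty .gitignore list.
    # (In bash, "> a > b" would only write to b, so these are separate steps.)
    find . -name ".hgignore" > hgignore-files.log
    : > gitignore-files.log
    for file in `cat hgignore-files.log`; do
        newfile=${file/hgignore/gitignore}
        echo "$newfile" >> gitignore-files.log
        cp "$file" "$newfile"
        # Comment out "syntax:" lines, drop ^ and $ anchors, turn \w+ into *, unescape \/.
        sed -i.bak "s/syntax:/#syntax:/; s/^\^//; s/\$$//; s/\\\w\+/*/; s/\\\\\//\//g" "$newfile"
    done
    # Only add and commit when this branch actually had .hgignore files.
    if [[ -s gitignore-files.log ]]; then
        cat gitignore-files.log | xargs git add
        git commit -m "Added .gitignore files."
    fi
done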

Absolutely do not quote me on this. I won’t even try to explain the sed expression in detail because I’ve forgotten most of it. This also commits the newly-minted .gitignore files for you, so you’ll end up with a non-bare repository.
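
For what it’s worth, the same kind of best-effort translation is easier to read in Python. The sketch below is not what I ran: the function name hgignore_line_to_gitignore is made up, and on top of the substitutions in the sed expression it also assumes that .* can become * and \. can become a plain dot. It returns the translated line plus a flag saying whether any regex-only characters are left over, so that the lines needing a human can be found.

```python
import re

# Characters that still mean "this is a regex, not a glob" after translation.
LEFTOVER = re.compile(r"[\\^$+()|{}?]")


def hgignore_line_to_gitignore(line):
    """Translate one regex-syntax .hgignore line; return (text, looks_clean)."""
    text = line.rstrip("\n")
    if text.startswith("syntax:"):
        return "#" + text, True      # keep the directive as a comment
    if not text or text.startswith("#"):
        return text, True            # blank lines and comments pass through
    if text.startswith("^"):
        text = text[1:]              # drop the leading anchor
    if text.endswith("$"):
        text = text[:-1]             # drop the trailing anchor
    text = text.replace(r"\w+", "*")
    text = text.replace(".*", "*")   # assumption beyond the sed expression
    text = text.replace(r"\/", "/")
    text = text.replace(r"\.", ".")  # assumption beyond the sed expression
    return text, LEFTOVER.search(text) is None


if __name__ == "__main__":
    for line in ["syntax: regexp", r"^build\/.*\.log$", r"^(foo|bar)$"]:
        print(hgignore_line_to_gitignore(line))
```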

And lastly, to push the repository and clean up all the extraneous files/logs/scripts:

$ git remote add origin <url>
$ git push --all origin -u
$ git push --tags origin
$ git clean -df

Now to do away with the existing processes, all steps piled on and hacked together, and implement Gitflow.