git filter-branch --filter-renames ?


I recently needed to extract the git history of a portion of an existing
repository.  My initial attempts using --subdirectory-filter, subtrees,
etc. weren't as successful as I'd hoped.

The primary reason for my failures is that this particular source
repository has seen a lot of code movement and in-place renames.  As a
result, filters such as --subdirectory-filter failed to keep commits
prior to the renames.
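To see how much a given file has moved around, something along these lines should work. It's only a sketch: list_historical_names is a name I made up for illustration, and it relies on git log --follow, which tracks a single file across renames.

```shell
# list_historical_names (hypothetical helper): print every path that the
# file currently known as "$1" has had over its history, one per line.
list_historical_names() {
	# --format=format: suppresses the commit header, leaving only the names
	# printed by --name-only; sed then drops the blank separator lines.
	git log --follow --name-only --format=format: -- "$1" |
		sed -e '/^$/d' | sort -u
}
```

Running it as, say, `list_historical_names subdirB/fileA` inside the repository prints the current path plus any earlier ones, which are exactly the paths a plain --subdirectory-filter would lose.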

So, long story short, I've attached below a hacked-together script (yes,
it's sad when one writes a script to call a script :-/) that solves the
problem for me.

My hope is that some other poor sob in my position discovers this
script, uses it and moves on.  If enough people think it's useful
despite the corner cases [1], I'd be happy to work on integrating it into



[1] Namely that if two different files held the same full-path name at
different times in the source repo, you'll get some errant commits in
the history.
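A rough way to check whether a repo is likely to hit this corner case is to look for paths that were added more than once. This is a heuristic sketch only: paths_added_more_than_once is a made-up name, it only looks at history reachable from HEAD, and a path that was deleted and later restored with the same content will show up too.

```shell
# paths_added_more_than_once (hypothetical helper): list paths that appear
# as an addition in more than one commit reachable from HEAD.
paths_added_more_than_once() {
	# --diff-filter=A keeps only added files; uniq -d keeps duplicates.
	git log --diff-filter=A --name-only --format=format: |
		sed -e '/^$/d' | sort | uniq -d
}
```

If this prints nothing, no path was ever re-added, so the errant commits described above shouldn't occur for that history.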

#!/bin/bash
#
# git-filter-renames: Similar to --subdirectory-filter but tracks renames
# Basic use:
#  $ git clone path/to/source_repo dest_repo
#  $ cd dest_repo
#  $ git tag | xargs git tag -d # ours are signed, so would fail to verify
#  $ git remote remove origin
#  $ git gc --aggressive --prune=now --force
#  $ git fsck
#  $ git-filter-renames.sh "[PREFIX] " fileA subdirB/ fileC subdirD/subdirE ...
#  $ rm -rf .git/refs/original
#  $ git gc --aggressive --prune=now --force
#  $ git fsck


if [ $# -le 1 ]; then
	echo >&2 "Usage:"
	echo >&2 "    ${0##*/} '[subj prefix] ' fileA fileB dir1 sub/dir2"
	echo >&2 ""
	exit 1
fi

# the first argument is the subject prefix; the rest are paths to keep
PREFIX="$1"
shift

# when debugging, clear out temp dirs left behind by earlier runs
if [ "${DEBUG:-0}" = 1 ]; then
	rm -rf /tmp/git-filter-renames-*
fi

TMP_DIR="$(mktemp -d /tmp/git-filter-renames-XXXXXX)"


# take in the list of files to preserve
# note: directories are recursed
: >"$TMP_DIR/user_list.txt"
for arg in "$@"; do
	if [ -d "$arg" ]; then
		find "$arg" -type f >>"$TMP_DIR/user_list.txt"
	elif [ -f "$arg" ]; then
		echo "$arg" >>"$TMP_DIR/user_list.txt"
	else
		echo >&2 "What the hell is '$arg'?"
	fi
done

# expand each file into every path it has ever had (via --follow), emitted
# as anchored patterns for the grep in the filter script
: >"$TMP_DIR/trace_list.txt"
while read -r fn <&4; do
	while read -r ofn <&5; do
		echo "^$ofn\$"
	done 5< <(git log --format=format: --follow --name-only -- "$fn" | \
		  sed -e '/^$/d' | sort -u)
done 4<"$TMP_DIR/user_list.txt" | sort -u >>"$TMP_DIR/trace_list.txt"

# stage the filter script
cat >"$TMP_DIR/filter.sh" <<EOF
#!/bin/sh
# drop every tracked file whose path is not in the trace list
git ls-files | \\
	grep -vf "$TMP_DIR/trace_list.txt" | \\
	xargs -r git rm -qrf --cached --ignore-unmatch
EOF
chmod +x "$TMP_DIR/filter.sh"

# stage the msg filter script
cat >"$TMP_DIR/msg_filter.sh" <<EOF
#!/bin/sh
# prepend the prefix to the subject (first) line of each commit message
sed -e "1 s/^/$PREFIX/"
EOF
chmod +x "$TMP_DIR/msg_filter.sh"

# do the filtering
echo >&2 "Doing filtering"
git filter-branch --prune-empty -f --index-filter "$TMP_DIR/filter.sh" \
	--msg-filter "$TMP_DIR/msg_filter.sh"

# cleanup
if [ "${DEBUG:-0}" = 0 ]; then
	rm -rf "$TMP_DIR"
fi