[PATCH 00/30] Add directory rename detection to git

[This series is entirely independent of my rename detection limits series.
However, I have a separate rename detection performance series that depends
on both this series and the rename detection limits series.]

In this patchset, I introduce directory rename detection to merge-recursive,
predominantly so that when files are added to directories on one side of
history and those directories are renamed on the other side of history, the
files will end up in the proper location after a merge or cherry-pick.

However, this isn't limited to that simplistic case.  More interesting
possibilities exist, such as:

  * a file being renamed into a directory which is renamed on the other
    side of history, causing the need for a transitive rename.

  * two (or three or N) directories being merged (with no conflicts so
    long as files/directories within the merged directory have different
    names), and the "merging" being detected as a directory rename for
    each original directory.

  * not all files in a directory being renamed to the same location;
    i.e. perhaps the directory was renamed, but some files within it were
    renamed to a different location.

  * a directory being renamed, which also contained a subdirectory that
    was renamed to some entirely different location.  (And perhaps the
    inner directory itself contained inner directories that were renamed
    to yet other locations).
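
To make the transitive and nested cases above concrete, here is a toy
Python model.  It is only a sketch, not code from this series: remap(),
the dictionary of directory renames, and the paths a/, b/, c/ and
elsewhere/ are all hypothetical, and it assumes that the deepest renamed
directory containing a path is the one that applies.

```python
def remap(path, dir_renames):
    """Rewrite `path` using the deepest renamed directory that contains it."""
    parts = path.split("/")
    # Try the deepest containing directory first, so that an inner
    # directory's own rename wins over its parent's rename.
    for depth in range(len(parts) - 1, 0, -1):
        old_dir = "/".join(parts[:depth])
        if old_dir in dir_renames:
            return dir_renames[old_dir] + "/" + "/".join(parts[depth:])
    return path  # no containing directory was renamed


# Hypothetical directory renames detected on the other side of history:
# b/ moved to c/, except that b/inner/ moved somewhere else entirely.
other_side = {"b": "c", "b/inner": "elsewhere"}

# Transitive rename: our side renamed a/x -> b/x, and the other side
# renamed b/ -> c/, so the file should end up at c/x.
our_rename = ("a/x", "b/x")
merged_target = remap(our_rename[1], other_side)

# Nested rename: a file added under b/inner/ follows the inner rename.
nested_target = remap("b/inner/new.txt", other_side)
```

With these inputs, merged_target is "c/x" and nested_target is
"elsewhere/new.txt"; a path outside any renamed directory is returned
unchanged.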

Also, I found it useful to still detect the directory rename even when
every file within the renamed directory is itself renamed.
For example, if goal/a and goal/b are renamed to priority/alpha and
priority/bravo, we can detect that goal/ was renamed to priority/, so that
if someone adds goal/c on the other side of history, after the merge we'll
end up with priority/c.  (In the absence of a readily available
libmindread.so library that I can link to, we can't rename directly from
goal/c to priority/charlie automatically, and will need to have priority/c
instead.)
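
The goal/ -> priority/ example can be sketched roughly as follows.  This
is a toy Python model rather than the merge-recursive implementation:
infer_dir_renames() and relocate() are made-up names, only a file's
immediate parent directory is considered, and picking the most common
destination directory (while declining to guess on a tie) is an
assumption of the sketch.

```python
import posixpath
from collections import Counter, defaultdict


def infer_dir_renames(file_renames):
    """Guess old_dir -> new_dir from (old_path, new_path) file renames."""
    votes = defaultdict(Counter)
    for old, new in file_renames:
        old_dir, new_dir = posixpath.dirname(old), posixpath.dirname(new)
        if old_dir and old_dir != new_dir:
            votes[old_dir][new_dir] += 1
    dir_renames = {}
    for old_dir, counts in votes.items():
        ranked = counts.most_common(2)
        # Assumption: a tie between destinations is too ambiguous to use.
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue
        dir_renames[old_dir] = ranked[0][0]
    return dir_renames


def relocate(path, dir_renames):
    """Move a newly added path along with its renamed parent directory."""
    old_dir, name = posixpath.split(path)
    new_dir = dir_renames.get(old_dir)
    return posixpath.join(new_dir, name) if new_dir is not None else path


# One side renamed both files, changing their basenames as well.
renames = [("goal/a", "priority/alpha"), ("goal/b", "priority/bravo")]
dir_renames = infer_dir_renames(renames)

# The other side added goal/c; it keeps its basename but moves directory.
print(relocate("goal/c", dir_renames))  # priority/c
```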

Naturally, an attempt to do all of the above brings up all kinds of
interesting edge and corner cases, some of which result in conflicts
that cannot be represented in the index, and others of which might be
considered too complex for users to understand and resolve.  For example:

  * An add/add/add/.../add conflict, all on one side of history (see
    testcase 9e in the new t6043, or any of the testcases in section 5)

  * Doubly, triply, or N-fold transitive renames (testcases 9c & 9d)

In order to prevent such problems, I introduce a couple of basic rules that
limit when directory rename detection applies:

  1) If a subset of to-be-renamed files have a file or directory in the
     way (or would be in the way of each other), "turn off" the directory
     rename for those specific sub-paths and report the conflict to the
     user.

  2) If the other side of history did a directory rename to a path that
     your side of history renamed away, then ignore that particular
     rename from the other side of history for any implicit directory
     renames (but warn the user).

Further, there's a basic question about when directory rename detection
should be applied at all.  I have a simple rule:

  3) If a given directory still exists on both sides of a merge, we do
     not consider it to have been renamed.

Rule 3 may sound obvious at first, but it will probably arise as a
question for some users -- what if someone "mostly" moved a directory but
still left some files around, or, equivalently (from the perspective of the
three-way merge that merge-recursive performs), fully renamed a directory
in one commit and then recreated that directory in a later commit adding
some new files and then tried to merge?  See the big comment in section 4
of the new t6043 for further discussion of this rule.
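
Rules 1 and 3 can be modeled in a few lines of Python.  Again, this is a
hedged sketch and not the code in merge-recursive.c: plan_moves(),
has_dir() and the example trees are hypothetical, trees are modeled as
plain sets of paths, rule 2 is not modeled, and only a path's immediate
parent directory is considered.

```python
import posixpath
from collections import defaultdict


def has_dir(tree, directory):
    """True if any path in `tree` lives underneath `directory`."""
    prefix = directory + "/"
    return any(path.startswith(prefix) for path in tree)


def plan_moves(added, dir_renames, adder_tree, renamer_tree):
    """Return (moves, conflicts) for paths added on the non-renaming side."""
    # Rule 3: a directory that still exists on both sides was not renamed.
    active = {
        old: new
        for old, new in dir_renames.items()
        if not (has_dir(adder_tree, old) and has_dir(renamer_tree, old))
    }
    targets = defaultdict(list)
    for path in added:
        old_dir, name = posixpath.split(path)
        if old_dir in active:
            targets[active[old_dir] + "/" + name].append(path)
    moves, conflicts = {}, []
    for target, sources in targets.items():
        in_the_way = target in renamer_tree or has_dir(renamer_tree, target)
        # Rule 1: something is in the way, or the sources collide with
        # each other, so leave these paths alone and report them.
        if in_the_way or len(sources) > 1:
            conflicts.extend(sources)
        else:
            moves[sources[0]] = target
    return moves, sorted(conflicts)


adder_tree = {"goal/a", "goal/b", "goal/c", "goal/d"}
renamer_tree = {"priority/a", "priority/b", "priority/d"}
moves, conflicts = plan_moves(
    ["goal/c", "goal/d"], {"goal": "priority"}, adder_tree, renamer_tree
)
```

Here goal/c can move to priority/c, while goal/d is reported as a
conflict because priority/d already exists on the renaming side.  If the
renaming side had kept any file under goal/, neither path would be moved.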

This set of rules seems to be reasonably easy to explain, is
self-consistent, allows all conflict cases to be represented without
changing any on-disk data structures or introducing new terminology or
commands for users, prevents excessively complex conflicts that users
might struggle to understand, and brings peace to the Middle East.
Actually, maybe not that last one.

While I feel that this directory rename detection reduces the number of
suboptimal merges and cherry-picks that git performs, there are sadly
still a number of cases that remain suboptimal, or that even newly appear
to be not-quite-consistent with other cases.  The fact that one file
layout might trigger some of the rules above while another "slightly"
different file layout doesn't might occasionally cause some user
grumblings.  I've tried to explore and document these cases in section 8
of the new t6043-merge-rename-directories.sh.

Finally, from an implementation perspective, there's another strong
advantage to the ruleset above: it means that any path to which we want
to apply an implicit directory rename will have a free and open spot
for us to move it into.  Thus, we can just adjust the diff_filepair
from an add or modify into a rename (or adjust a rename diff_filepair
to change the target a little more), and then let process_renames and
process_entry do all their magic.  That allows us to rely on all the
heavy testing already done for those code paths to handle a large
variety of edge and corner cases (e.g. D/F, rename/rename, criss-cross
merges, etc.).  The big trick is just making sure to do all the
necessary checks that we can apply directory rename detection, and then
fixing things up to put it in the expected format, with enough test
cases to make sure we actually got it into the right format.
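
The "adjust the diff_filepair" step can be pictured with a small Python
stand-in.  This is purely illustrative: FilePair is a hypothetical
simplification (the real diff_filepair is a C struct with more fields),
retarget() is a made-up helper, and an added file is modeled here with
its source and destination set to the same path.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FilePair:
    status: str  # "A" (add), "M" (modify) or "R" (rename)
    src: str
    dst: str


def retarget(pair, new_dst):
    """Apply an implicit directory rename to one file pair."""
    if pair.status in ("A", "M"):
        # An add or modify under the old directory becomes a rename
        # into the new directory.
        return FilePair("R", pair.dst, new_dst)
    if pair.status == "R":
        # An existing rename just has its target adjusted a little more.
        return replace(pair, dst=new_dst)
    return pair  # anything else is left untouched in this sketch


added = retarget(FilePair("A", "goal/c", "goal/c"), "priority/c")
renamed = retarget(FilePair("R", "a/x", "goal/x"), "priority/x")
```

After this step both pairs look like ordinary renames (goal/c ->
priority/c and a/x -> priority/x), which is the shape the existing
rename-handling code paths already expect.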

Okay, the last paragraph had a small lie (though I didn't know that when
I originally wrote it).  unpack_trees() aborts early if it detects that an
untracked or dirty file would be overwritten by a merge; otherwise, it
immediately starts modifying the working tree before passing control back
to merge-recursive.  That behavior causes some problems.  Not only has it
always made the code more complex, but because unpack_trees() doesn't
understand renames, it can't appropriately abort early if a path involved
in a rename has untracked or dirty contents in the way of the merge.  And
by the time we detect renames, it's too late to abort early.  So we
instead have to figure out ways of emitting warning messages and writing
something sensible to the working copy without overwriting any of the
user's data.  This was a problem
before directory rename detection, but directory rename detection
increases the number of places where we have to worry about this.

Elijah Newren (30):
  Tighten and correct a few testcases for merging and cherry-picking
  merge-recursive: Fix logic ordering issue
  merge-recursive: Add explanation for src_entry and dst_entry

These three patches provide a few miscellaneous fixups that could be
submitted independent of this series, though the series partially
depends on the fixes in the first one, and the second fix becomes more
important with the rest of the changes in this series.

  directory rename detection: basic testcases
  directory rename detection: directory splitting testcases
  directory rename detection: testcases to avoid taking detection too
  directory rename detection: partially renamed directory
  directory rename detection: files/directories in the way of some
  directory rename detection: testcases checking which side did the
  directory rename detection: more involved edge/corner testcases
  directory rename detection: testcases exploring possibly suboptimal
  directory rename detection: miscellaneous testcases to complete
  directory rename detection: tests for handling overwriting untracked
  directory rename detection: tests for handling overwriting dirty files

These patches add testcases for directory rename detection, trying to
cover the space of possibilities as exhaustively as I can while trying
to avoid excessive overlap in testcases.

  merge-recursive: Move the get_renames() function
  merge-recursive: Introduce new functions to handle rename logic
  merge-recursive: Fix leaks of allocated renames and diff_filepairs
  merge-recursive: Make !o->detect_rename codepath more obvious
  merge-recursive: Split out code for determining diff_filepairs

These five patches make small code reorganizations in preparation for
further changes, though they include some memory leak fixes.

  merge-recursive: Add a new hashmap for storing directory renames
  merge-recursive: Add get_directory_renames()
  merge-recursive: Check for directory level conflicts
  merge-recursive: Add a new hashmap for storing file collisions
  merge-recursive: Add computation of collisions due to dir rename &
  merge-recursive: Check for file level conflicts then get new name
  merge-recursive: When comparing files, don't include trees
  merge-recursive: Apply necessary modifications for directory renames

These eight patches implement the directory rename detection logic.

  merge-recursive: Avoid clobbering untracked files with directory
  merge-recursive: Fix overwriting dirty files involved in renames
  merge-recursive: Fix remaining directory rename + dirty overwrite

These last three deal with untracked and dirty file overwriting
headaches.  The middle patch, in particular, isn't just a fix for
directory rename detection; it also fixes a bug in current versions of git
that overwrites dirty files involved in a rename.  That patch
could be backported and submitted independent of this series, but the
final patch depends heavily on it.

 merge-recursive.c                   | 1212 +++++++++++--
 merge-recursive.h                   |   17 +
 t/t3501-revert-cherry-pick.sh       |    5 +-
 t/t6043-merge-rename-directories.sh | 3277 +++++++++++++++++++++++++++++++++++
 t/t7607-merge-overwrite.sh          |    7 +-
 unpack-trees.c                      |    4 +-
 unpack-trees.h                      |    4 +
 7 files changed, 4413 insertions(+), 113 deletions(-)
 create mode 100755 t/t6043-merge-rename-directories.sh