Re: [PATCH v5 6/6] RFC blame: use a fingerprint heuristic to match ignored lines
- Date: Tue, 9 Apr 2019 11:56:55 -0400
- From: Barret Rhoden <brho@xxxxxxxxxx>
- Subject: Re: [PATCH v5 6/6] RFC blame: use a fingerprint heuristic to match ignored lines
On 4/7/19 5:46 PM, michael@xxxxxxxxx wrote:
From: Michael Platings <michael@xxxxxxxxx>
This is the updated fuzzy matching algorithm, sorry for the delay. It does
highlight a bug in the calculation for the number of lines ("int nr_parent_lines
= e->num_lines - delta;") - if you apply the patch, build it, then try to
./git blame --ignore-rev <the patch commit ID> blame.c then you'll get a segfault
because nr_parent_lines is a negative number. I haven't had time to investigate further
but I have confirmed that the bug is not due to my patch.
Although I couldn't recreate this, I saw how it could happen. I have an
updated version that fixes it. Short version: I just pass through
parent_len instead of trying to recreate it with 'delta.' Recreating
can go awry because e->num_lines may be less than a chunk size.
The matching algorithm might not be obvious so it could do with more commenting.
In the mean time I hope the tests will make the intent clear. In particular I
want to avoid lines being reordered, because for the interesting use cases
usually sequences are unchanged even if they shift across different lines.
I thought you wanted to detect similarity across potentially reordered
commit a 10) #include <sys/a.h>
commit b 11) #include <foo/b.h>
commit c 12) #include <bar/c.h>
commit d 13) #include <dir/d.h>
Where commit X (to be ignored) alphabetized them (or something):
commit X 10) #include <bar/c.h>
commit X 11) #include <dir/d.h>
commit X 12) #include <foo/b.h>
commit X 13) #include <sys/a.h>
In this case, you want to find the line number in the parent that
corresponds to each line in the target (X is the target, in the relevant
blame code). The mapping is: (target -> parent)
10 -> 12
11 -> 13
12 -> 11
13 -> 10
And the parent's line numbers are out of order. But that's fine - at
least with the infrastructure from my patch, since it sorts the ignored
queue at the end of blame_chunk(). The rule in blame.c is that blame
entries attached to a suspect need to be sorted by s_lno. (At least
based on a comment, and my reading of the code).
Maybe we have a different definition of reordering or are having
different issues regarding line numbers? TBH, I didn't follow the
recursion and partitioning strategy too closely.
Regarding the existing implementation I've got to say I find it unhelpful
marking "unblameable" lines with a 000000 commit ID. That commit ID already has
a meaning - lines that aren't yet committed. Further, the purpose of ignoring
commits should be to avoid obscuring other useful information, not to absolutely
refuse to show that commit at all. If there's no other commit to show then it's
harmless to show the commit that would otherwise be ignored.
For me, if I tell git blame to not tell me about a commit, then I don't
want to hear about it. My typical blame session involves running blame,
then digging up the commit it pointed to. If it points to the commit I
already told it to not show, then it'll just annoy me. All 0s made
sense to me as in "don't bother looking this up."
If you *did* want to see the ignored commit, then I'd run it with
--ignore-revs-file="", to disable ignore.
That being said, I realize this is a preference of mine, and I can make
a blame config option to control this. Probably
blame.maskIgnoredCommits or something.
- How about matching *outside* the parent's diff hunk?
I'd like to know what the use case would be for that. For the use case of
looking "through" a reformatting or renaming commit I think it would be unhelpful.
I noticed that I have some commits where a hunk in a diff is broken up
into multiple calls to blame_chunk. So a better way to prhase it would
be "looking outside a blame_chunk()'s chunk."
For example, I have a hunk like this (modified to show the contiguous
lines of context, subtraction, addition):
@@ -256,99 +260,99 @@
- 2 sub
- 5 sub
+ 6 add
blame_chunk() sees that as four separate calls, broken up by the lines
blame_chunk tlno 262 off -4 same 285 plen 83 enl 23
blame_chunk tlno 286 off 56 same 302 plen 2 enl 16
blame_chunk tlno 303 off 42 same 309 plen 5 enl 6
blame_chunk tlno 310 off 41 same 355 plen 0 enl 45
For a lot of those lines, the source of the change was in another diff
hunk - especially that 45 line addition with nothing from the parent.
Part of the reason for that large diff is whitespace changes, so I can
run blame with -w. But there are other scenarios. Here's another one:
the header file alphabetizing case:
@@ -4,9 +4,9 @@
- 1 #include <header.h>
2 context (other headers)
+ 1 #include <header.h>
That shows up as two blame chunks:
blame_chunk tlno 6 off 0 same 6 plen 1 enl 0
blame_chunk tlno 8 off 1 same 9 plen 0 enl 1
What's funny about it is that if the commit we're ignoring did more
damage to the file (changed every line, for instance), it'd be easier to
find the right line in the parent, since we'd have the entire diff hunk.
Anyway, being able to look outside the current blame_chunk would help in
those scenarios. Specifically, I'm talking about letting blame_chunk()
point anywhere in the parent. Right now, it can only look in the
parent's part of the chunk passed to blame_chunk, which can be