Web lists-archives.com

Re: [PATCH v5 6/6] RFC blame: use a fingerprint heuristic to match ignored lines




On 4/9/19 11:56 AM, Barret Rhoden wrote:
Anyway, being able to look outside the current blame_chunk would help in those scenarios.  Specifically, I'm talking about letting blame_chunk() point anywhere in the parent.  Right now, it can only look in the parent's part of the chunk passed to blame_chunk, which can be relatively small.

I hacked up the ability to look outside of a diff chunk. The change to the heuristic-independent part of the code was very minor, both in code and in performance.

The change to make the fingerprinting algorithm from my RFC patch look at the entire parent was pretty minor too - I can also cache the fingerprints. The main drawback is performance, but Michael's new fingerprinting code alleviates this.

Here's a quick analysis. When run on a 1000 line C file, with large changes from an ignored commit, after the file has been paged in (so, run twice):

not ignoring at all:
-------------
real	0m0.062s
user	0m0.042s
sys	0m0.021s

scan only in the parent chunk:
----------------------------
real	0m0.097s
user	0m0.085s
sys	0m0.012s

scan parent chunk, scan entire parent on failure:
-------------------------
real	0m1.773s
user	0m1.752s
sys	0m0.021s

scan the entire parent:
-----------------------
real	0m3.049s
user	0m3.024s
sys	0m0.024s

Scanning the parent chunk first helped a lot. Scanning the entire parent is O(nr_parent_lines * nr_lines_changed). In my test file, that's about 1000 * 600.

It still takes a little while even when checking the parent chunk first. Let's call that one the 'smaller scan.'

Caching the fingerprints (meaning, calculate once and attach to the blame_origin) didn't help much here. It's actually worse, possibly due to fingerprinting more than I needed to.

smaller scan, without caching:
----------------
real	0m1.651s
user	0m1.626s
sys	0m0.025s

smaller scan, with caching:
----------------
real	0m1.774s
user	0m1.753s
sys	0m0.021s


Let's try Michael's new fingerprinting code (get_fingerprint() and fingerprint_similarity())

smaller scan, caching:
-------------------
real	0m0.240s
user	0m0.215s
sys	0m0.025s

smaller scan, no caching:
----------------------
real	0m0.295s
user	0m0.266s
sys	0m0.029s

full parent scan, caching:
--------------------------
real	0m0.377s
user	0m0.356s
sys	0m0.021s

full parent scan, no caching:
-----------------------------
real	0m0.458s
user	0m0.430s
sys	0m0.028s

And for completenesss,

scan only in the parent chunk:
------------------------------
real	0m0.072s
user	0m0.048s
sys	0m0.024s


So first off, Michael's new fingerprinting is much better. Caching the fingerprints also helps. Overall, it's fast enough now that I don't notice the delay.

I'll reroll the patchset with the ability to find lines in the entire parent, and see what you all think.

Barret