Web lists-archives.com

Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)




On Tue, Oct 09 2018, Derrick Stolee wrote:

> The filter needs to store every path that would be considered "not
> TREESAME". It can't store wildcards, so you would need to evaluate the
> wildcard and test all of those paths individually (not a good idea).

If full paths are stored, yes, But have you considered instead of
storing paths, storing all trigrams that can be extracted from changed
paths at that commit?

I.e. instead of a change to "t/t0000-basic.sh" storing
"t/t0000-basic.sh" we'd store ["t/t", "/t0", "t00", "000", "00-" ...]
etc.

That sort of approach would mean that e.g. "t*000*", "*asi*.sh"
etc. could all be indexed, and as long as we could find three
consecutive bytes of fixed string we'd have a chance to short-circuit,
but would need to degrade to a full tree unpack for e.g. "t*". We could
also special-case certain sub-three-char indexes, or to
"bi-grams". E.g. to be able to index '*.c' or 't*' (first char at the
beginning of a string only).

It would mean having to check more things in the bloom filter for each
commit, but that's going to be hot in cache at that point so it'll
probably beat unpacking trees by far, and we could short-circuit exit at
the first one that returned false.