Re: We should add a "git gc --auto" after "git clone" due to commit graph
- Date: Mon, 8 Oct 2018 14:29:47 -0400
- From: Derrick Stolee <stolee@xxxxxxxxx>
- Subject: Re: We should add a "git gc --auto" after "git clone" due to commit graph
On 10/8/2018 2:10 PM, SZEDER Gábor wrote:
On Mon, Oct 08, 2018 at 12:57:34PM -0400, Derrick Stolee wrote:
Nice! These numbers make sense to me, in terms of how many TREESAME queries
we actually need to perform for such a query.
Yeah... because you didn't notice that I deliberately cheated :)
As it turned out, it's not just about the number of diff queries that
we can spare, but, for the speedup _ratio_, it's more about how
expensive those diff queries are.
git.git has a rather flat hierarchy, and 't/' is the 372th entry in
the current root tree object, while 'valgrind/' is the 923th entry,
and the diff machinery spends considerable time wading through the
previous entries. Notice the "carefully chosen path" remark in my
previous email; I think this particular path has the highest number of
preceeding tree entries, and, in addition, 't/' changes rather
frequently, so the diff machinery often has to scan two relatively big
tree objects. Had I chosen 'Documentation/RelNotes/184.108.40.206.txt'
instead, i.e. another path two directories deep, but whose leading
path components are both near the beginning of the tree objects, the
speedup would be much less impressive: 0.282s vs. 0.049s, i.e. "only"
~5.7x instead of ~24.8x.
This is expected. The performance ratio is better when the path is any
of the following:
1. A very deep path (need to walk multiple trees to answer TREESAME)
2. An entry is late in a very wide tree (need to spend extra time
parsing tree object)
3. The path doesn't change very often (need to inspect many TREESAME
pairs before finding enough interesting commits)
4. Some sub-path changes often (so the TREESAME comparison needs to
parse beyond that sub-path often)
Our standard examples (Git and Linux repos) don't have many paths that
have these properties. But: they do exist. In other projects, this is
actually typical. Think about Java projects that frequently have ~5
levels of folders before actually touching a code file.
When I was implementing the Bloom filter feature for Azure Repos, I ran
performance tests on the Linux repo using a random sampling of paths.
The typical speedup was 5x while some outliers were in the 25x range.
But I'm afraid it will take a while until I get around to turn it into
Do you have the code pushed somewhere public where one could take a look? I
Do you have the code pushed somewhere public where one could take a
look? I could provide some early feedback.
Nah, definitely not... I know full well how embarassingly broken this
implementation is, I don't need others to tell me that ;)
There are two questions that I was hoping to answer by looking at your code:
1. How do you store your Bloom filter? Is it connected to the
commit-graph and split on a commit-by-commit basis (storing "$path" as a
key), or is it one huge Bloom filter (storing "$commitid:$path" as key)?
2. Where does your Bloom filter check plug into the TREESAME logic? I
haven't investigated this part, but hopefully it isn't too complicated.