Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
- Date: Sun, 07 Jan 2018 23:42:27 +0100
- From: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx>
On Sun, Jan 07 2018, Derrick Stolee jotted:
>     git log --oneline --raw --parents
>
> Num Packs | Before MIDX | After MIDX |  Rel % | 1 pack %
> ----------+-------------+------------+--------+----------
>         1 |     35.64 s |    35.28 s |  -1.0% |    -1.0%
>        24 |     90.81 s |    40.06 s | -55.9% |   +12.4%
>       127 |    257.97 s |    42.25 s | -83.6% |   +18.6%
>
> The last column is the relative difference between the MIDX-enabled repo
> and the single-pack repo. The goal of the MIDX feature is to present the
> ODB as if it was fully repacked, so there is still room for improvement.
>
> Changing the command to
>
>     git log --oneline --raw --parents --abbrev=40
>
> has no observable difference (sub 1% change in all cases). This is likely
> due to the repack I used putting commits and trees in a small number of
> packfiles so the MRU cache works very well. On more naturally-created
> lists of packfiles, there can be up to 20% improvement on this command.
>
> We are using a version of this patch with an upcoming release of GVFS.
> This feature is particularly important in that space since GVFS performs
> a "prefetch" step that downloads a pack of commits and trees on a daily
> basis. These packfiles are placed in an alternate that is shared by all
> enlistments. Some users have 150+ packfiles and the MRU misses and
> abbreviation computations are significant. Now, GVFS manages the MIDX file
> after adding new prefetch packfiles using the following command:
>
>     git midx --write --update-head --delete-expired --pack-dir=<alt>
(Not a critique of this, just a (stupid) question)
What's the practical use-case for this feature? Since it doesn't help
with --abbrev=40 the speedup is all in the part that ensures we don't
show an ambiguous SHA-1.
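To make the part I'm asking about concrete, here is a rough Python sketch of that uniqueness check as I understand it. It assumes each pack index is nothing more than a sorted list of hex object IDs; the function names and the `minimum` parameter are hypothetical, and this is not how git's C code is actually structured. It only illustrates why the check is repeated per pack without a merged index and becomes a single lookup with one.

```python
from bisect import bisect_left
from heapq import merge


def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def min_unique_len(oid, sorted_oids):
    """Shortest prefix length of oid shared by no *other* entry.

    In a sorted list, the entries with the longest common prefix are
    adjacent to oid's insertion point, so only the neighbors matter.
    """
    i = bisect_left(sorted_oids, oid)
    longest = 0
    for j in (i - 1, i, i + 1):
        if 0 <= j < len(sorted_oids) and sorted_oids[j] != oid:
            longest = max(longest, common_prefix_len(oid, sorted_oids[j]))
    return min(longest + 1, len(oid))


def abbrev_len_multi(oid, packs, minimum=4):
    """One binary search per pack index: cost grows with the pack count."""
    return max([minimum] + [min_unique_len(oid, p) for p in packs])


def abbrev_len_merged(oid, merged, minimum=4):
    """One binary search in a single merged, sorted index."""
    return max(minimum, min_unique_len(oid, merged))


# Toy 8-character IDs standing in for full SHA-1s.
packs = [["aaaa1111", "abcd0000"], ["abce1111", "ffff0000"]]
merged = list(merge(*packs))  # stand-in for a multi-pack index

print(abbrev_len_multi("abcd0000", packs, minimum=1))
print(abbrev_len_merged("abcd0000", merged, minimum=1))
```

Both calls give the same answer; the difference is only how many sorted lists have to be searched to get it.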
The reason we do that at all is because it makes for a prettier UI.
Are there things that both want the pretty SHA-1 and also care about the
throughput? I'd have expected machine parsing to just use
--no-abbrev-commit.
If something cares about throughput and is also, e.g., saving the
abbreviated SHA-1s, isn't it better off picking some arbitrary fixed
length (e.g. --abbrev=20)? After all, the default abbreviation shows
something as short as possible, which may become ambiguous as soon as
the next commit lands.
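A toy Python illustration of that last point, with made-up 8-character IDs standing in for SHA-1s and hypothetical helper names (nothing here is git's code): the shortest unique prefix stops being unique once a similar object shows up, while a longer fixed-length prefix survives in this example. A fixed length only makes a collision less likely; it does not rule one out.

```python
def matches(prefix, oids):
    """All object IDs that start with the given prefix."""
    return [o for o in oids if o.startswith(prefix)]


def shortest_unique_prefix(oid, oids):
    """Shortest prefix of oid that matches exactly one entry in oids."""
    for n in range(1, len(oid) + 1):
        if len(matches(oid[:n], oids)) == 1:
            return oid[:n]
    return oid


oids = ["abcd0000", "ffff0000"]

minimal = shortest_unique_prefix("abcd0000", oids)  # as short as possible today
fixed = "abcd0000"[:4]                              # arbitrary fixed length

# A new object arrives that shares a leading character with ours.
oids.append("aaaa1111")

print(minimal, len(matches(minimal, oids)))  # the saved minimal prefix is now ambiguous
print(fixed, len(matches(fixed, oids)))      # the fixed-length prefix still resolves
```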