Re: [RFC PATCH 00/18] Multi-pack index (MIDX)




On Sunday, January 07, 2018 01:14:41 PM Derrick Stolee 
wrote:
> This RFC includes a new way to index the objects in
> multiple packs using one file, called the multi-pack
> index (MIDX).
...
> The main goals of this RFC are:
> 
> * Determine interest in this feature.
> 
> * Find other use cases for the MIDX feature.

My interest in this feature would be to speed up fetches 
when there is more than one large pack-file with many of the 
same objects that are in other pack-files.   What does your 
MIDX design do when it encounters multiple copies of the 
same object in different pack files?  Does it index them all, 
or does it keep a single copy?

In our Gerrit instance (Gerrit uses jgit), we have multiple 
copies of the Linux kernel repos linked together via the 
alternates file mechanism.  These repos have many different 
references (mostly Gerrit change references), but they share 
most of the common objects from the mainline.  I have found 
that during a large fetch such as a clone, jgit spends a 
significant amount of extra time when the extra large 
pack-files from the other repos are visible to it, usually 
around an extra minute per instance of these (without them, 
the clone takes around 7 minutes).  This adds up easily: with 
a few extra repos, it can almost double the time.

My investigations have shown that this is due to jgit 
searching each of these pack files to decide which version of 
each object to send.  I don't fully understand its selection 
criteria; however, if I shortcut it to just pick the first 
copy of an object that it finds, I regain my lost time.  Does 
git suffer from a similar problem?  If it does not, then it 
likely just uses the first copy of an object it finds (which 
may not be the best copy to send).
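
To make the two lookup strategies concrete, here is a minimal 
Python sketch.  It is not jgit's or git's actual selection 
code: the pack model (a dict of object id to entry size), the 
function names, and the "smallest entry wins" criterion are 
all assumptions chosen only for illustration.

```python
# Hypothetical model: each pack index is a dict mapping object id -> entry size.
# The "smallest entry is best" rule is an assumed stand-in for whatever
# criteria a real implementation applies.

def pick_best(packs, oid):
    """Probe every pack, then choose among all copies found."""
    copies = [(name, index[oid]) for name, index in packs if oid in index]
    return min(copies, key=lambda copy: copy[1]) if copies else None


def pick_first(packs, oid):
    """Stop at the first pack that contains the object."""
    for name, index in packs:
        if oid in index:
            return (name, index[oid])
    return None


# Three packs that share "obj1", as with repos linked via alternates.
packs = [
    ("pack-a", {"obj1": 120, "obj2": 300}),
    ("pack-b", {"obj1": 80}),
    ("pack-c", {"obj1": 95, "obj3": 40}),
]

best = pick_best(packs, "obj1")    # probes all three packs
first = pick_first(packs, "obj1")  # probes only pack-a
```

In this model pick_best always costs one probe per visible 
pack for every object, which is the kind of per-pack overhead 
described above, while pick_first can return a copy that the 
chosen criterion would not have preferred.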

It would be nice if this use case could be improved with 
MIDX.  To do so, it seems that MIDX would need to either 
store only "the best" version of an object (i.e. 
pre-select which one to use), or include the extra 
information needed to make selecting which copy to use 
(perhaps based on the operation being performed) 
fast.
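
The two options can be sketched as follows.  This is only an 
illustration of the idea, not the MIDX format proposed in the 
RFC: the dict-based index, the function names, and the 
`better` callback are hypothetical.

```python
# Hypothetical sketch of two ways a combined index could treat duplicates.
# `packs` is a list of (pack_name, {object_id: metadata}) pairs.

def build_single_copy_index(packs, better):
    """Option 1: keep one pre-selected entry per object id.

    `better(candidate, current)` is an assumed callback returning True
    when the candidate copy should replace the one already stored.
    """
    midx = {}
    for name, index in packs:
        for oid, meta in index.items():
            candidate = (name, meta)
            if oid not in midx or better(candidate, midx[oid]):
                midx[oid] = candidate
    return midx


def build_all_copies_index(packs):
    """Option 2: keep every copy with its metadata under one key,
    so selection can happen later, per operation, with one lookup."""
    midx = {}
    for name, index in packs:
        for oid, meta in index.items():
            midx.setdefault(oid, []).append((name, meta))
    return midx


packs = [
    ("pack-a", {"obj1": 120, "obj2": 300}),
    ("pack-b", {"obj1": 80}),
]

# Assumed criterion for the example: the smaller entry is "better".
single = build_single_copy_index(packs, lambda new, old: new[1] < old[1])
every = build_all_copies_index(packs)
```

Either way, a reader would consult one structure per object 
instead of probing each pack.  The first option moves the 
selection cost to index-build time; the second keeps the 
choice open at the price of a larger index.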

This also leads me to ask, what other additional information 
(bitmaps?) for other operations, besides object location, 
might suddenly be valuable in an index that potentially 
points to multiple copies of objects?  Would such 
information be appropriate in MIDX, or would it be better in 
another index?

Thanks,

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation