Web lists-archives.com

Re: Partial clone design (with connectivity check for locally-created objects)




Ben Peart <peartben@xxxxxxxxx> writes:

> My concern with this proposal is the combination of 1) writing a new
> pack file for every git command that ends up bringing down a missing
> object and 2) gc not compressing those pack files into a single pack
> file.

Your noticing these is a sign that you read the outline of the
design correctly, I think.  

The basic idea is that the local fsck should tolerate missing
objects when they are known to be obtainable from that external
service, but should still be able to diagnose missing objects that
we do not know if the external service has, especially the ones that
have been newly created locally and not yet made available to them
by pushing them back.

So we need a way to tell if an object that we do not have (but we
know about) can later be obtained from the external service.
Maintaining an explicit list of such objects obviously is one way,
but we can get the moral equivalent by using pack files.  After
receiving a pack file that has a commit from such an external
service, if the commit refers to its parent commit that we do not
have locally, the design proposes us to consider that the parent
commit that is missing is available at the external service that
gave the pack to us.  Similarly for missing trees, blobs, and any
objects that are supposed to be "reachable" from objects in such a
packfile.  

We can extend the approach to cover loose objects if we wanted to;
just define an alternate object store used internally for this
purpose and drop loose objects obtained from such an external
service in that object store.

Because we do not want to leave too many loose objects and small
packfiles lying around, we will need a new way of packing these.
Just enumerate these objects known to have come from the external
service (by being in packfiles marked as such or being loose objects
in the dedicated alternate object store), and create a single larger
packfile, which is marked as "holding the objects that are known to
be in the external service".  We do not have such a mode of gc, and
that is a new development that needs to happen, but we know that is
doable.

> That thinking did lead me back to wondering again if we could live
> with a repo specific flag.  If any clone/fetch was "partial" the flag
> is set and fsck ignore missing objects whether they came from a
> "partial" remote or not.

The only reason people run "git fsck" is to make sure that their
local repository is sound and they can rely on the objects you have
as the base of building new stuff on top of.  That is why we are
trying to find a way to make sure "fsck" can be used to detect
broken or missing objects that cannot be obtained from the
lazy-object store, without incurring undue overhead for normal
codepath (i.e. outside fsck).

It is OK to go back to wondering again, but I think that essentially
tosses "git fsck" out of the window and declares that it is OK to
hope that local objects will never go bad.  We can make such an
declaration anytime, but I do not want to see us doing so without
first trying to solve the issue without punting.