Re: [PATCH 06/10] rev-list: add --allow-partial option to relax connectivity checks

On Wed, Mar 08, 2017 at 03:10:54PM -0500, Jeff Hostetler wrote:

> > Even though I do very much like the basic "high level" premise to
> > omit often useless large blobs that are buried deep in the history
> > we would not necessarily need from the initial cloning and
> > subsequent fetches, I find it somewhat disturbing that the code
> > "Assume"s that any missing blob is due to a previous partial clone.
> > Adding this option smells like telling the users that they are not
> > supposed to run "git fsck" because a partially cloned repository is
> > inherently a corrupt repository.
> > 
> > Can't we do a bit better?  If we want to make the world safer again,
> > what additional complexity is required to allow us to tell the
> > "missing by design" and "corrupt repository" apart?
> I'm open to suggestions here.  It would be nice to extend the
> fetch-pack/upload-pack protocol to return a list of the SHAs
> (and maybe the sizes) of the omitted blobs, so that a partial
> clone or fetch would still be able to be integrity checked.

Yeah, the early external-odb patches did this. It lets you do a more
accurate fsck, and it also helps diff avoid faulting in large-object
cases (because we can mark them as binary for "free" by comparing the
size to big_file_threshold).
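
To make the "free" binary check concrete, here is a rough sketch of the idea, not git's actual code. The `OmittedBlob` record and `treat_as_binary` helper are hypothetical names; the only thing taken from git is the threshold, whose default (core.bigFileThreshold) is 512 MiB. The point is that the decision needs only the recorded size, never the blob contents:

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Default value of git's core.bigFileThreshold (512 MiB).
BIG_FILE_THRESHOLD = 512 * 1024 * 1024


@dataclass(frozen=True)
class OmittedBlob:
    """Hypothetical record the server could send for each blob it left out."""
    sha1: str
    obj_type: str
    size: int


def treat_as_binary(sha1: str,
                    omitted: Dict[str, OmittedBlob],
                    threshold: int = BIG_FILE_THRESHOLD) -> Optional[bool]:
    """Decide binary-ness from the recorded size alone.

    Returns None when the object is not in the omitted list, meaning the
    caller has to look at the real object instead.
    """
    entry = omitted.get(sha1)
    if entry is None:
        return None
    # No content is read here, so the blob is never faulted in.
    return entry.size > threshold
```

A diff over a commit that touches an omitted 600 MiB blob would then report it as binary without ever asking the server for it.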

So I think it makes a lot of sense in the large-blob case, where
transmitting a type/size/sha1 tuple is way more efficient than sending
the blob itself. But it's less clear for "sparse" cases where just
enumerating the set of blobs may be prohibitively large.

I have a feeling that the "sparse" thing needs to be handled separately
from "partial". IOW, the client needs to tell the server "I'm only
interested in the path foo/bar, so just send that". Then you don't find
out about the types and sizes outside of that path, but you don't need
to; the sparse path is stored locally and fsck knows to avoid looking
into it.
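
As a sketch of how fsck could tell the cases apart under that split: everything below is hypothetical (the function names, the result labels, and the idea that the omitted-blob list and the sparse path are both available locally as plain Python collections). It also assumes one reading of the above, namely that an object reachable only outside the stored sparse path is simply not expected to be present:

```python
from typing import Iterable, Set


def in_sparse_path(path: str, sparse_paths: Iterable[str]) -> bool:
    """True if path is one of the stored sparse paths or lies beneath one."""
    for prefix in sparse_paths:
        base = prefix.rstrip("/")
        # Match on whole path components so "foo/bar" does not cover "foo/barbaz".
        if path == base or path.startswith(base + "/"):
            return True
    return False


def classify_missing(sha1: str,
                     path: str,
                     omitted_shas: Set[str],
                     sparse_paths: Iterable[str]) -> str:
    """Classify an object that fsck could not find locally."""
    if sha1 in omitted_shas:
        # "partial" case: the server told us it left this blob out.
        return "omitted-by-design"
    if not in_sparse_path(path, sparse_paths):
        # "sparse" case: we never asked for this part of the tree.
        return "outside-sparse-path"
    # We asked for it, nobody said it was omitted, and it is not here.
    return "corrupt"
```

Only the last result would be reported as an error; the other two are "missing by design".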