Re: Partial clone design (with connectivity check for locally-created objects)
- Date: Mon, 7 Aug 2017 15:12:11 -0400
- From: Ben Peart <peartben@xxxxxxxxx>
- Subject: Re: Partial clone design (with connectivity check for locally-created objects)
On 8/4/2017 8:21 PM, Jonathan Tan wrote:
On Fri, 04 Aug 2017 15:51:08 -0700
Junio C Hamano <gitster@xxxxxxxxx> wrote:
Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:
"Imported" objects must be in a packfile that has a "<pack name>.remote"
file with arbitrary text (similar to the ".keep" file). They come from
clones, fetches, and the object loader (see below).
A "homegrown" object is valid if each object it references:
1. is a "homegrown" object,
2. is an "imported" object, or
3. is referenced by an "imported" object.
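The three-case rule above can be sketched as a small graph check. This is only an illustration under stated assumptions, not git's implementation: the `Obj` type, the `homegrown_is_valid` function, and the in-memory dictionaries are hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Obj:
    """Hypothetical stand-in for a git object: an ID plus the IDs it references."""
    oid: str
    refs: list = field(default_factory=list)


def homegrown_is_valid(obj, homegrown, imported):
    """Apply the three-case rule to one homegrown object.

    homegrown and imported map object ID -> Obj for objects present locally.
    """
    # Case 3 needs every ID that some imported object points at,
    # whether or not that target has been downloaded yet.
    promised = {ref for imp in imported.values() for ref in imp.refs}
    for ref in obj.refs:
        if ref in homegrown:   # case 1: another homegrown object
            continue
        if ref in imported:    # case 2: an imported object we have
            continue
        if ref in promised:    # case 3: missing, but an imported object refers to it
            continue
        return False           # missing and nothing vouches for it
    return True
```

For example, with `imported = {"c1": Obj("c1", ["t1"])}`, a homegrown commit `Obj("c2", ["c1", "t1"])` passes even though `t1` is not present locally, while one that references an unknown `t9` fails.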
Overall it captures what was discussed, and I think it is a good
I missed the offline discussion and so am trying to piece together what
this latest design is trying to do. Please let me know if I'm not
understanding something correctly.
From what I can tell, objects are going to be segmented into two
"types" - those that were fetched from a remote source that allows
partial clones/fetches (lazyobject/imported) and those that come from
"regular" remote sources (homegrown) that require all objects to exist locally.
FWIW, the names here are not making things clearer for me. If I'm
correct perhaps "partial" and "normal" would be better to indicate the
type of the source? Anyway...
Once the objects are segmented into the 2 types, the fsck connectivity
check code is updated to ignore missing objects from "partial" remotes
but still expect/validate them from "normal" remotes.
This compromise seems reasonable - don't generate errors for missing
objects for remotes that returned a partial clone but do generate errors
for missing objects from normal clones as a missing object is always an
error in this case.
This segmentation is what is driving the need for the object loader to
build a new local pack file for every command that has to fetch a
missing object. For example, we can't just write a tree object from a
"partial" clone into the loose object store as we have no way for fsck
to treat them differently and ignore any missing objects referenced by
that tree object.
My concern with this proposal is the combination of 1) writing a new
pack file for every git command that ends up bringing down a missing
object and 2) gc not compressing those pack files into a single pack file.
We all know that git doesn't scale well with a lot of pack files as it
has to do a linear search through all the pack files when attempting to
find an object. I can see that very quickly, there would be a lot of
pack files generated and with gc ignoring "partial" pack files, this
would never get corrected.
In our usage scenarios, _all_ of the objects come from "partial" clones
so all of our objects would end up in a series of "partial" pack files
and would have pretty poor performance as a result.
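To make the scaling concern concrete, here is a toy model of the lookup, assuming each pack is represented by a sorted list of object IDs. The names `pack_contains` and `find_object` are hypothetical, and the model deliberately ignores any caching or pack-ordering heuristics that git itself may apply.

```python
from bisect import bisect_left


def pack_contains(sorted_oids, oid):
    """Binary search inside one pack index (cheap, logarithmic)."""
    i = bisect_left(sorted_oids, oid)
    return i < len(sorted_oids) and sorted_oids[i] == oid


def find_object(oid, packs):
    """Probe packs one after another (linear in the number of packs).

    Returns (pack position, packs probed), or (None, packs probed) if absent.
    """
    probed = 0
    for pos, sorted_oids in enumerate(packs):
        probed += 1
        if pack_contains(sorted_oids, oid):
            return pos, probed
    return None, probed
```

With 1000 single-object packs, finding the object in the last pack takes 1000 probes. After consolidating the same objects into one pack, it takes one probe plus a binary search.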
I wondered if it is possible to flag a specific remote as "partial" and
have fsck be able to track any given object back to the remote and then
properly handle the fact that it was missing based on that. I couldn't
think of a good way to do that without some additional data structure
that would have to be built/maintained (i.e. promises).
That thinking did lead me back to wondering again if we could live with
a repo specific flag. If any clone/fetch was "partial" the flag is set
and fsck ignores missing objects whether they came from a "partial"
remote or not.
I'll admit it isn't as robust if someone is mixing and matching remotes
from different servers some of which are partial and some of which are
not. I'm not sure how often that would actually happen but I _am_
certain a single repo specific flag is a _much_ simpler model than
anything else we've come up with so far.
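A rough sketch of the repo-specific flag model, to show both how simple it is and what it gives up. The function name `fsck_missing_errors` and the `repo_is_partial` parameter are hypothetical and stand in for whatever the flag would actually be called.

```python
def fsck_missing_errors(objects, repo_is_partial):
    """Return the missing object IDs that a connectivity check would report.

    objects maps each locally present object ID to the IDs it references.
    repo_is_partial models the single repo-wide flag, set once any
    clone/fetch was partial.
    """
    missing = sorted(
        {ref for refs in objects.values() for ref in refs if ref not in objects}
    )
    # With the flag set, every missing object is tolerated, regardless of
    # which remote it should have come from.
    return [] if repo_is_partial else missing
```

The tradeoff is visible in the last line: once the flag is set, an object that is missing because of real corruption in data from a "normal" remote is no longer reported either.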
I doubt you want to treat all fetches/clones the same way as the
"lazy object" loading, though. You may critically rely on the
corporate central server that will give the objects it "promised"
when you cloned from it lazily (i.e. it may have given you a commit,
but not its parents or objects contained in its tree--you still know
that the parents and the tree and its contents will later be
available and rely on that fact). You trust that and build on top,
so the packfile you obtained when you cloned from such a server
should count as "imported". But if you exchanged wip changes with
your colleagues by fetching or pushing peer-to-peer, without the
corporate central server knowing, you would want to treat objects in
packs (or loose objects) you obtained that way as "not imported".
That's true. I discussed this with a teammate and we might need to make
extensions.lazyObject be the name of the "corporate central server"
remote instead, and have a "loader" setting within that remote, so that
we can distinguish that objects from this server are "imported" but
objects from other servers are not.
The connectivity check shouldn't be slow in this case because fetches
are usually onto tips that we have (so we don't hit case 3).
Also I think "imported" vs "homegrown" may be a bit of a misnomer; the
idea to split objects into two camps sounds like a good idea, and
"imported" probably is an OK name to use for the category that is a
group of objects that you know/trust are backed by your lazy
loader. But the other one does not have to be "home"-grown.
Well, the names are not that important, but I think the line between
the two classes should not be "everything that came from clone and
fetch is imported", which is a more important point I am trying to
Maybe "imported" vs "non-imported" would be better. I agree that the
objects in the non-"imported" group could still be obtained from
Thanks for your comments.