Re: Partial clone design (with connectivity check for locally-created objects)
- Date: Tue, 8 Aug 2017 12:45:40 -0400
- From: Ben Peart <peartben@xxxxxxxxx>
On 8/7/2017 3:41 PM, Junio C Hamano wrote:
Ben Peart <peartben@xxxxxxxxx> writes:
My concern with this proposal is the combination of 1) writing a new
pack file for every git command that ends up bringing down a missing
object and 2) gc not compressing those pack files into a single pack.
Your noticing these is a sign that you read the outline of the
design correctly, I think.
The basic idea is that the local fsck should tolerate missing
objects when they are known to be obtainable from that external
service, but should still be able to diagnose missing objects that
we do not know if the external service has, especially the ones that
have been newly created locally and not yet made available to them
by pushing them back.
This helps me a lot as now I think I understand the primary requirement
we're trying to solve for. Let me rephrase it and see if this makes sense:
We need to be able to identify whether an object was created locally
(and should pass more strict fsck/connectivity tests) or whether it came
from a remote (and so any missing objects could presumably be fetched
from the server).
I agree it would be nice to solve this (and not just punt on fsck - even
if it is an opt-in behavior).
We've discussed a couple of different possible solutions, each of which
have different tradeoffs. Let me try to summarize here and perhaps
suggest some other possibilities:
Promised list

This provides an external data structure that allows us to flag objects
that came from a remote server (vs. those created locally).
The biggest drawback is that this data structure can get very large and
become difficult/expensive to generate/transfer/maintain.
It also (at least in one proposal) requires protocol and server-side
changes to support it.
Annotated via filename
This idea is to annotate the file names of objects that came from a
remote server (pack files and loose objects) with a unique file
extension (.remote) that indicates whether they are locally created or not.
To make this work, git must know about both types of loose objects and
pack files and search both locations when looking for objects.
Another drawback of this is that commands (repack, gc) that optimize
loose objects and pack files must now be aware of the different
extensions and handle both while not merging remote and non-remote objects.
In short, we're creating separate object stores - one for locally
created objects and one for everything else.
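As a rough sketch of what the lookup side might look like under this
scheme (in Python; the ".remote" suffix and the fan-out layout below
follow the proposal - none of this is existing git behavior):

```python
import os
import tempfile
import zlib

def find_loose_object(objdir, oid):
    """Return (path, is_remote) for a loose object, or None if absent.

    Checks the normal fan-out location first, then the hypothetical
    ".remote"-suffixed variant for objects that came from a server.
    """
    base = os.path.join(objdir, oid[:2], oid[2:])
    for path, is_remote in ((base, False), (base + ".remote", True)):
        if os.path.exists(path):
            return path, is_remote
    return None

# Simulate an object directory holding one object fetched from a remote.
objdir = tempfile.mkdtemp()
oid = "ce013625030ba8dba906f756967f9e9ca394464a"  # blob id for "hello\n"
os.makedirs(os.path.join(objdir, oid[:2]))
with open(os.path.join(objdir, oid[:2], oid[2:] + ".remote"), "wb") as f:
    f.write(zlib.compress(b"blob 6\x00hello\n"))

path, is_remote = find_loose_object(objdir, oid)
```

A repack/gc aware of this scheme would have to do the analogous dance for
pack files, keeping ".remote" packs from being merged with local ones.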
Now a couple of different ideas:
Annotated via flags
The fundamental idea here is that we add the ability to flag locally
created objects on the object itself.
Given that at the core, "Git is a simple key-value data store" can we
take advantage of that fact and include a "locally created" bit as a
property on every object?
I could not think of a good way to accomplish this, as it is ultimately
changing the object format, which creates rapidly expanding ripples of
change.
For example, the object header currently consists of the type, a space,
the length, and a NUL. Even if we could add a "local" property (by
adding a fifth item, taking over the space, creating new object types,
etc.), the fact that the header is included in the SHA-1 means that push
would become problematic: flipping the bit would change the object's
SHA-1 as well as that of every tree and commit that references it.
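To illustrate why the header cannot carry such a bit, here is a small
Python sketch of how git derives an object id; the header is hashed
together with the content, so the five-field "local" variant below
(which is purely hypothetical) necessarily produces a different id:

```python
import hashlib

def git_object_id(obj_type: str, content: bytes) -> str:
    # Git hashes "<type> <length>\0" followed by the raw content,
    # so anything added to the header changes the object id.
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

blob_id = git_object_id("blob", b"hello\n")

# Hypothetical header carrying a fifth "local" field -- not real git:
flagged_id = hashlib.sha1(b"blob 6 local\x00hello\n").hexdigest()
assert blob_id != flagged_id  # flipping the bit yields a different id
```

For a real blob, blob_id matches what `git hash-object` would print for
the same content.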
Local list

Given that the number of locally created objects is usually very small
in comparison to the total number of objects (even just due to history),
it makes more sense to track locally created objects instead of remote
ones.
The biggest advantage of this over the "promised list" is that the
"local list" being maintained is _significantly_ smaller (often orders
of magnitude smaller).
Another advantage over the "promised list" solution is that it doesn't
require any server side or protocol changes.
On the client, when objects are created (write_loose_object?), the new
objects are added to the "local list" and, in the connectivity check
(fsck), if the object is not in the "local list," the connectivity check
can be skipped as any missing object can presumably be retrieved from
the server.
A simple file format could be used (header + list of SHA-1 values) and
write_loose_object could do a trivial append. In fsck, the file could be
loaded into a hashmap to make for fast existence checks.
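A minimal sketch of that file format in Python (the 4-byte magic and
version field are my own illustration - the proposal only says "header";
the set stands in for the hashmap fsck would build):

```python
import hashlib
import os
import struct
import tempfile

MAGIC = b"LOCL"                 # hypothetical magic for the local-list file
VERSION = struct.pack(">I", 1)  # hypothetical format version

def append_local(path, oid_hex):
    """Trivial append of a 20-byte SHA-1, as write_loose_object could do."""
    is_new = not os.path.exists(path)
    with open(path, "ab") as f:
        if is_new:
            f.write(MAGIC + VERSION)
        f.write(bytes.fromhex(oid_hex))

def load_local(path):
    """Load the list into a set for O(1) existence checks in fsck."""
    with open(path, "rb") as f:
        if f.read(8)[:4] != MAGIC:
            raise ValueError("not a local-list file")
        data = f.read()
    return {data[i:i + 20].hex() for i in range(0, len(data), 20)}

path = os.path.join(tempfile.mkdtemp(), "local-list")
oid = hashlib.sha1(b"blob 6\x00hello\n").hexdigest()
append_local(path, oid)   # object created locally -> record it
local = load_local(path)  # fsck side: build the membership set
```

fsck would then run its strict connectivity check only for oids found in
this set and skip the rest.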
Entries could be removed from the "local list" for objects later fetched
from a server (though I had a hard time contriving a scenario where this
would happen so I consider this optional).
On the surface, this seems like the simplest solution that meets the
requirements.
Multiple object stores

This is a different way of providing separate object stores than the
"Annotated via filename" proposal. It should be a cleaner/more elegant
solution that enables several other capabilities, but it is also more
work to implement (isn't that always the case?).
We create an object store abstraction layer that enables multiple object
store providers to exist. The order in which they are called should be
configurable based on the command (especially have/read vs create/write).
This enables features like tiered storage: in memory, pack, loose,
alternate, etc.
The connectivity check in fsck would then only traverse and validate
objects that existed via the local object store providers.
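A sketch of what such a provider layer could look like (the class names,
the `local` attribute, and the helper functions are all illustrative
assumptions, not a proposed API):

```python
class ObjectStore:
    """Minimal provider: an in-memory oid -> content mapping."""
    local = True  # remote-backed providers override this to False

    def __init__(self):
        self.objects = {}

    def has(self, oid):
        return oid in self.objects

    def read(self, oid):
        return self.objects[oid]

    def oids(self):
        return set(self.objects)

class RemoteStore(ObjectStore):
    local = False  # objects here can always be re-fetched from the server

def read_object(stores, oid):
    # Providers are consulted in a configurable order.
    for store in stores:
        if store.has(oid):
            return store.read(oid)
    raise KeyError(oid)

def fsck_candidates(stores):
    # The connectivity check traverses only local providers' objects.
    return set().union(*(s.oids() for s in stores if s.local))

loose, remote = ObjectStore(), RemoteStore()
loose.objects["aaaa"] = b"local blob"
remote.objects["bbbb"] = b"fetched blob"
stores = [loose, remote]
```

With this split, fsck_candidates yields only the locally created
objects, while read_object transparently serves both tiers.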
While I like the flexibility of this design and hope we can attain it in
the long term for its other benefits, it's a bit overkill for this
specific problem. The big drawback of this model is the cost to design
and implement it.