Re: Partial clone design (with connectivity check for locally-created objects)

On 8/7/2017 3:41 PM, Junio C Hamano wrote:
> Ben Peart <peartben@xxxxxxxxx> writes:
>
>> My concern with this proposal is the combination of 1) writing a new
>> pack file for every git command that ends up bringing down a missing
>> object and 2) gc not compressing those pack files into a single pack
>
> Your noticing these is a sign that you read the outline of the
> design correctly, I think.
>
> The basic idea is that the local fsck should tolerate missing
> objects when they are known to be obtainable from that external
> service, but should still be able to diagnose missing objects that
> we do not know if the external service has, especially the ones that
> have been newly created locally and not yet made available to them
> by pushing them back.

This helps me a lot as now I think I understand the primary requirement we're trying to solve for. Let me rephrase it and see if this makes sense:

We need to be able to identify whether an object was created locally (and should pass stricter fsck/connectivity tests) or whether it came from a remote (and so any missing objects could presumably be fetched from the server).

I agree it would be nice to solve this (and not just punt on fsck, even if it is an opt-in behavior).

We've discussed a couple of different possible solutions, each of which has different tradeoffs. Let me try to summarize here and perhaps suggest some other possibilities:

Promised list
This provides an external data structure that allows us to flag objects that came from a remote server (versus created locally).

The biggest drawback is that this data structure can get very large and become difficult/expensive to generate/transfer/maintain.

It also (at least in one proposal) required protocol and server side changes to support it.

Annotated via filename
This idea is to annotate the file names of objects that came from a remote server (pack files and loose objects) with a unique file extension (.remote) that marks them as not locally created.

To make this work, Git must understand both types of loose objects and pack files and search in both locations when looking for objects.

Another drawback of this is that commands (repack, gc) that optimize loose objects and pack files must now be aware of the different extensions and handle both while not merging remote and non-remote objects.

In short, we're creating separate object stores - one for locally created objects and one for everything else.

Now a couple of different ideas:

Annotated via flags
The fundamental idea here is that we add the ability to flag locally created objects on the object itself.

Given that, at its core, "Git is a simple key-value data store," can we take advantage of that fact and include a "locally created" bit as a property on every object?

I could not think of a good way to accomplish this as it is ultimately changing the object format which creates rapidly expanding ripples of change.

For example, the object header currently consists of the type, a space, the length, and a NUL byte. Even if we could add a "local" property (either by adding a fifth item, taking over the space, creating new object types, etc.), the header is included in the SHA-1. Push would therefore become problematic: flipping the bit would change the object's SHA-1 and, with it, the trees and commits that reference it.
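To make the hashing constraint concrete, here is a minimal Python sketch of how an object ID is derived from the header plus the content. The function names are mine, and the `flagged_id` variant with an extra " local" token in the header is purely hypothetical (no such field exists in Git). It is only meant to show that any change to the header produces a different ID.

```python
import hashlib

def object_id(obj_type: str, content: bytes) -> str:
    """SHA-1 over '<type> <length>' + NUL + content."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

def flagged_id(obj_type: str, content: bytes) -> str:
    """HYPOTHETICAL: same object, but with a made-up 'local' header field."""
    header = f"{obj_type} {len(content)} local".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

if __name__ == "__main__":
    data = b"hello world\n"
    print(object_id("blob", data))   # ID of the unflagged object
    print(flagged_id("blob", data))  # a different ID for identical content
```

Since trees record the IDs of their entries and commits record the ID of their tree, the changed ID would propagate all the way up, which is the ripple effect described above.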

Local list
Given the number of locally created objects is usually very small in comparison to the total number of objects (even just due to history), it makes more sense to track locally created objects instead of promised/remote objects.

The biggest advantage of this over the "promised list" is that the "local list" being maintained is _significantly_ smaller (often orders of magnitude smaller).

Another advantage over the "promised list" solution is that it doesn't require any server side or protocol changes.

On the client, when objects are created (write_loose_object?), the new objects are added to the "local list." In the connectivity check (fsck), if an object is not in the "local list," the check can be skipped for it, as any missing object can presumably be retrieved from the server.

A simple file format could be used (header + list of SHA1 values) and write_loose_object could do a trivial append. In fsck, the file could be loaded into a hashmap to make for fast existence checks.
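As a rough illustration of the "local list" idea, here is a minimal Python sketch. Everything in it is an assumption made for the example: the 8-byte `MAGIC` header, the raw 20-byte record layout, and the function names are not an existing Git format or API. It shows a trivial append on object creation, loading the file into a set for fast existence checks, and one possible reading of the fsck rule (only missing objects that are in the local list are reported).

```python
import os

MAGIC = b"LOCL\x00\x00\x00\x01"  # hypothetical signature + version
OID_LEN = 20                     # raw SHA-1 size in bytes

def append_local(path: str, hex_oid: str) -> None:
    """Append one object ID; write the header first if the file is new."""
    raw = bytes.fromhex(hex_oid)
    if len(raw) != OID_LEN:
        raise ValueError("expected a 40-hex-digit SHA-1")
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "ab") as f:
        if is_new:
            f.write(MAGIC)
        f.write(raw)

def load_local(path: str) -> set:
    """Load all IDs into a set (the 'hashmap' used by the check)."""
    with open(path, "rb") as f:
        data = f.read()
    if data[:len(MAGIC)] != MAGIC:
        raise ValueError("bad local-list header")
    body = data[len(MAGIC):]
    usable = len(body) - len(body) % OID_LEN  # ignore a torn trailing record
    return {body[i:i + OID_LEN].hex() for i in range(0, usable, OID_LEN)}

def missing_local_objects(referenced, present, local):
    """Report only missing objects that were created locally.

    Missing objects not in the local list are presumed fetchable.
    """
    return [oid for oid in referenced if oid not in present and oid in local]
```

A fixed-size record keeps the append trivial and lets a reader discard a partially written final entry without any extra bookkeeping.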

Entries could be removed from the "local list" for objects later fetched from a server (though I had a hard time contriving a scenario where this would happen so I consider this optional).

On the surface, this seems like the simplest solution that meets the stated requirements.

Object DB
This is a different way of providing separate object stores than the "Annotated via filename" proposal. It should be a cleaner/more elegant solution that enables several other capabilities but it is also more work to implement (isn't that always the case?).

We create an object store abstraction layer that enables multiple object store providers to exist. The order in which they are called should be configurable based on the command (especially have/read vs. create/write). This enables features like tiered storage: in memory, pack, loose, alternate, large, remote.

The connectivity check in fsck would then only traverse and validate objects that existed via the local object store providers.
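To show the shape of such an abstraction, here is a very small Python sketch. The `ObjectStore` and `ObjectDB` classes, the `is_local` flag, and the method names are all hypothetical and not an existing Git interface. In-memory dictionaries stand in for the real pack, loose, and remote providers.

```python
from typing import Dict, List, Optional

class ObjectStore:
    """Hypothetical provider; a dict stands in for real storage."""
    def __init__(self, name: str, is_local: bool) -> None:
        self.name = name
        self.is_local = is_local
        self._objects: Dict[str, bytes] = {}

    def write(self, oid: str, data: bytes) -> None:
        self._objects[oid] = data

    def read(self, oid: str) -> Optional[bytes]:
        return self._objects.get(oid)

class ObjectDB:
    """Consults providers in a configurable order for reads."""
    def __init__(self, read_order: List[ObjectStore],
                 write_target: ObjectStore) -> None:
        self.read_order = read_order
        self.write_target = write_target

    def read(self, oid: str) -> Optional[bytes]:
        for store in self.read_order:
            data = store.read(oid)
            if data is not None:
                return data
        return None

    def write(self, oid: str, data: bytes) -> None:
        self.write_target.write(oid, data)

    def exists_locally(self, oid: str) -> bool:
        """True only if a local provider holds the object."""
        return any(s.is_local and s.read(oid) is not None
                   for s in self.read_order)
```

Under this sketch, a connectivity check would walk only the objects for which `exists_locally` is true and would leave anything held solely by a non-local provider to be fetched on demand.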

While I like the flexibility of this design and hope we can obtain it in the long term for its other benefits, it's a bit overkill for this specific problem. The big drawback of this model is the cost to design and implement it.