Re: How hard would it be to implement sparse fetching/pulling?
- Date: Sat, 2 Dec 2017 18:24:49 -0000
- From: "Philip Oakley" <philipoakley@xxxxxxx>
- Subject: Re: How hard would it be to implement sparse fetching/pulling?
From: "Jeff Hostetler" <git@xxxxxxxxxxxxxxxxx>
Sent: Friday, December 01, 2017 5:23 PM
On 11/30/2017 6:43 PM, Philip Oakley wrote:
From: "Vitaly Arbuzov" <vit@xxxxxxxx>
On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov <vit@xxxxxxxx> wrote:
It's great, I didn't expect that anyone is actively working on this.
I'll check out your branch, meanwhile do you have any design docs that
describe these changes or can you define high level goals that you
want to achieve?
On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler <git@xxxxxxxxxxxxxxxxx>
On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:
I have, for separate reasons been _thinking_ about the issue ($dayjob is
defence, so a similar partition would be useful).
The changes would almost certainly need to be server side (as well as
side), as it is the server that decides what is sent over the wire in the
pack files, which would need to be a 'narrow' pack file.
Yes, there will need to be both client and server changes.
In the current 3 part patch series, the client sends a "filter_spec"
to the server as part of the fetch-pack/upload-pack protocol.
If the server chooses to honor it, upload-pack passes the filter_spec
to pack-objects to build an "incomplete" packfile omitting various
objects (currently blobs). Proprietary servers will need similar
changes to support this feature.
Discussing this feature in the context of the defense industry
makes me a little nervous. (I used to be in that area.)
I'm viewing the desire for codebase partitioning from a soft layering of
risk view (perhaps a more UK than USA approach ;-)
What we have in the code so far may be a nice start, but
probably doesn't have the assurances that you would need
for actual deployment. But it's a start....
True. I need to get some of my collegues more engaged...
If we had such a feature then all we would need on top is a separate
tool that builds the right "sparse" scope for the workspace based on
paths that developer wants to work on.
In the world where more and more companies are moving towards large
monorepos this improvement would provide a good way of scaling git to
meet this demand.
The 'companies' problem is that it tends to force a client-server,
on-line mentality. I'm also wanting the original DVCS off-line capability
still be available, with _user_ control, in a generic sense, of what they
have locally available (including files/directories they have not yet
at, but expect to have. IIUC Jeff's work is that on-line view, without
I'd commented early in the series at [1,2,3].
Yes, this does tend to lead towards an always-online mentality.
However, there are 2 parts:
[a] dynamic object fetching for missing objects, such as during a
random command like diff or blame or merge. We need this
regardless of usage -- because we can't always predict (or
dry-run) every command the user might run in advance.
Making something "useful" happen here when off-line is an obvious goal.
[b] batch fetch mode, such as using partial-fetch to match your
sparse-checkout so that you always have the blobs of interest
to you. And assuming you don't wander outside of this subset
of the tree, you should be able to work offline as usual.
If you can work within the confines of [b], you wouldn't need to
always be online.
I feel this is the area that does need ensure a capability to avoid any
perception of the much maligned 'Embrace, extend, and extinguish' by
I don't think this should be viewed as a type of sparse checkout - it's just
a checkout of what you have (under the hood it could use the same code
We might also add a part [c] with explicit commands to back-fill or
alter your incomplete view of the ODB (as I explained in response
to the "git diff <commit1> <commit2>" comment later in this thread.
At its core, my idea was to use the object store to hold markers for the
'not yet fetched' objects (mainly trees and blobs). These would be in a
known fixed format, and have the same effect (conceptually) as the
sub-module markers - they _confirm_ the oid, yet say 'not here, try
We do have something like this. Jonathan can explain better than I, but
basically, we denote possibly incomplete packfiles from partial clones
and fetches as "promisor" and have special rules in the code to assert
that a missing blob referenced from a "promisor" packfile is OK and can
be fetched later if necessary from the "promising" remote.
The remote interaction is one area that may need thought, especially in a
triangle workflow, of which there are a few.
The main problem with markers or other lists of missing objects is
that it has scale problems for large repos. Suppose I have 100M
blobs in my repo. If I do a blob:none clone, I'd have 100M missing
blobs that would need tracking. If I then do a batch fetch of the
blobs needed to do a sparse checkout of HEAD, I'd have to remove
those entries from the tracking data. Not impossible, but not
** Ahhh. I see. That's a consequence of having all the trees isn't it. **
I've always thought that limiting the trees is at the heart of the Narrow
OK so if you have flat, wide structures with 10k files/directories per tree
then it's still a fair sized problem, but it should *scale logarithmically*
for the part of the tree structure that's not being downloaded.
You never have to add a marker for a blob that you have no containing tree
for. Nor for the tree that contained the blob's tree, all the way up to
primary line of descent to the tree of concern. All those trees are never
down loaded, there are few markers (.gitNarrowTree files) for those tree
stubs, certainly no 100M missing blob markers.
The comaprison with submodules mean there is the same chance of
de-synchronisation with triangular and upstream servers, unless managed.
The server side, as noted, will need to be included as it is the one that
decides the pack file.
Options for a server management are:
- "I accept narrow packs?" No; yes
- "I serve narrow packs?" No; yes.
- "Repo completeness checks on reciept": (must be complete) || (allow
narrow to nothing).
we have new config settings for the server to allow/reject
and we have code in fsck/gc to handle these incomplete packfiles.
For server farms (e.g. Github..) the settings could be global, or by
(note that the completeness requirement and narrow reciept option are not
incompatible - the recipient server can reject the pack from a narrow
subordinate as incomplete - see below)
For now our scope is limited to partial clone and fetch. We've not
* Marking of 'missing' objects in the local object store, and on the
The missing objects are replaced by a place holder object, which used the
same oid/sha1, but has a short fixed length, with content
<oid>”. The chance that that string would actually have such an oid clash
the same as all other object hashes, so is a *safe* self-referential
Again, there is a scale problem here. If I have 100M missing blobs,
I can't afford to create 100M loose place holder files. Or juggle
a 2GB file of missing objects on various operations.
As above, I'm also trimming the trees, so in general, there would be no
missing blobs, just the content of the directory one was interested in.
That's not quite true if higher level trees have blob references in them
that are otherwise unwanted - they may each need a marker. [Or maybe a
special single 'tree-of-blobs' marker for them all thus only one marker per
tree - over-thinking maybe...]
* The stored object already includes length (and inferred type), so we do
know what it stands in for. Thus the local index (index file) should be
to be recreated from the object store alone (including the ‘promised /
narrow / missing’ files/directory markers)
* the ‘same’ as sub-modules.
The potential for loss of synchronisation with a golden complete repo is
just the same as for sub-modules. (We expected object/commit X here, but
it’s not in the store). This could happen with a small user group who
have locally narrow clones, who interact with their local narrow server
for ‘backup’, and then fail to push further upstream to a server that
mandates completeness. They could create a death by a thousand narrow
cuts. Having a golden upstream config reference (indicating which is the
upstream) could allow checks to ensure that doesn’t happen.
The fsck can be taught the config option of 'allowNarrow'.
We've updated fsck to be aware of partial clones.
The narrowness would be defined in a locally stored '.gitNarrowIgnore'
(which can include the size constraints being developed elsewhere on the
As a safety it could be that the .gitNarrowIgnore is sent with the pack
that fold know what they missed, and fsck could check that they are
not narrower than some specific project .gitNarrowIgnore spec.
Currently, we store the filter_spec used for the partial clone (or the
first partial fetch) as a default for subsequent fetches, but we limit
it there. That is, for operations like checkout or blame or whatever,
it doesn't matter why a blob is missing or what filter criteria was
used to cause it to be omitted -- just that it is. Any time we need
a missing object, we have to go get it -- whether that is via dynamic
or bulk fetching.
Deciding *if* we have to get it, while still being 'useful', is part of the
question I raised above. In my world view, we already have the intersting
blobs, so we shouldn't need to get anything. diff's, blame's, checkout's,
simply go with the stub values and everything is cushty.
The benefit of this that the off-line operation capability of Git
which GVFS doesn’t quite do (accidental lock in to a client-server model
all those other VCS systems).
Yes, there are limitations of GVFS and you must be online to use it.
It essentially does a blob:none filter and dynamically faults in every
blob. (But it is also using a kernel driver and daemon to dynamically
populate the file system when a file is first opened for writing --
much like copy-on-write virtual memory. And yes, these 2 operations
are independent, but currently combined in the GVFS code.)
And yes, there is work here to make sure that most normal
operations can continue to work offline.
I believe that its all doable, and that Jeff H's work already puts much
it in place, or touches those places
That said, it has been just _thinking_, without sufficient time to delve
into the code.
Thanks for the great work.