
Re: How hard would it be to implement sparse fetching/pulling?




From: "Jeff Hostetler" <git@xxxxxxxxxxxxxxxxx>
Sent: Friday, December 01, 2017 2:30 PM
On 11/30/2017 8:51 PM, Vitaly Arbuzov wrote:
I think it would be great if we agreed at a high level on the desired
user experience, so let me put a few possible use cases here.

1. Init and fetch into a new repo with a sparse list.
Preconditions: origin blah exists and has a lot of folders inside of
src including "bar".
Actions:
git init foo && cd foo
git config core.sparseAll true # New flag to activate all sparse
operations by default so you don't need to pass options to each
command.
echo "src/bar" > .git/info/sparse-checkout
git remote add origin blah
git pull origin master
Expected results: foo contains src/bar folder and nothing else,
objects that are unrelated to this tree are not fetched.
Notes: This should work the same way when fetch/merge/checkout operations
are used in the right order.

With the current patches (parts 1,2,3) we can pass a blob-ish
to the server during a clone that refers to a sparse-checkout
specification.

I hadn't appreciated this capability. I see it as important, and it should be available both ways, so that a .gitNarrow spec can be imposed from the server side, as well as by the requester.

It could also be used to assist in the 'precious/secret' blob problem, so that AWS keys are never pushed, nor available for fetching!

There's a bit of a chicken-n-egg problem getting
things set up.  So if we assume your team would create a series
of "known enlistments" under version control, then you could

s/enlistments/entitlements/ I presume?

just reference one by <branch>:<path> during your clone.  The
server can lookup that blob and just use it.

    git clone --filter=sparse:oid=master:templates/bar URL

And then the server will filter-out the unwanted blobs during
the clone.  (The current version only filters blobs; you still
get full commits and trees.  That will be revisited later.)
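
As a rough sketch of the "known enlistment" idea above, the following sets up a throwaway repository, commits a sparse specification at templates/bar, and shows that master:templates/bar resolves to a blob the server could look up. The repository layout, the demo identity, and the spec content are assumptions for illustration only; the --filter clone itself is not exercised here.

```shell
#!/bin/sh
set -eu

# Throwaway repository standing in for the team's shared repo (assumption).
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" symbolic-ref HEAD refs/heads/master   # ensure the branch is "master"

# Commit a sparse-checkout specification at templates/bar.
mkdir -p "$repo/templates"
printf 'src/bar/\n' > "$repo/templates/bar"
git -C "$repo" add templates/bar
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "Add sparse template for src/bar"

# <branch>:<path> resolves to the blob that holds the specification.
oid=$(git -C "$repo" rev-parse master:templates/bar)
type=$(git -C "$repo" cat-file -t "$oid")
spec=$(git -C "$repo" cat-file -p master:templates/bar)
echo "master:templates/bar -> $oid ($type)"
echo "$spec"
```

Keeping the specification under version control means a client only has to name it; the pattern list itself never travels in the request.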

I'm for the idea that only the in-hierarchy trees should be sent.
The server should also be able to reply that it is only sending a narrow clone, with the given (accessible?) spec.


On the client side, the partial clone installs local config
settings into the repo so that subsequent fetches default to
the same filter criteria as used in the clone.


I don't currently have provision to send a full sparse-checkout
specification to the server during a clone or fetch.  That
seemed like too much to try to squeeze into the protocols.
We can revisit this later if there is interest, but it wasn't
critical for the initial phase.

Agreed. I think it should be somewhere 'visible' to the user, but it could be set up by the server admin / repo maintainer if they don't have write access. But there could still be the catch-22; maybe one starts with a <commit | toptree> : <tree> pair to define an origin point (it's not as refined as a .gitNarrow spec file, but it is definitive). The toptree option could even allow sub-tree clones... maybe.



2. Add a file and push changes.
Preconditions: all steps above followed.
touch src/bar/baz.txt && git add -A && git commit -m "added a file"
git push origin master
Expected results: changes are pushed to remote.

I don't believe partial clone and/or partial fetch will cause
any changes for push.

I suspect that pushes could be rejected if the user 'pretends' to modify files or trees outside their area. It does need the user to be able to spoof part of a tree they don't have, so an upstream / remote would immediately know it was a spoof, but locally the narrow clone doesn't have enough detail about the 'bad' oid. It would be right to reject such attempts!
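
One way such a rejection might be sketched on the server is a pre-receive style check that lists paths changed outside an allowed prefix. This is only an illustration: paths_outside is a hypothetical helper, not part of git, and the "src/bar/" prefix and demo repository are assumptions. A real hook would read old/new commit pairs from stdin and exit non-zero when the list is non-empty.

```shell
#!/bin/sh
set -eu

# Hypothetical helper (not part of git): print every path changed between
# two commits that lies outside the allowed prefix.
paths_outside() {
    # $1 = repository, $2 = old commit, $3 = new commit, $4 = allowed prefix
    git -C "$1" diff --name-only "$2" "$3" |
    while IFS= read -r path; do
        case "$path" in
            "$4"*) ;;                      # inside the allowed area
            *) printf '%s\n' "$path" ;;    # outside: report it
        esac
    done
}

# Demo on a throwaway repository.
repo=$(mktemp -d)
git init -q "$repo"
commit() {
    git -C "$repo" add -A
    git -C "$repo" -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"
}
mkdir -p "$repo/src/bar" "$repo/src/other"
echo one > "$repo/src/bar/a.txt"
echo one > "$repo/src/other/b.txt"
commit "base"
echo two > "$repo/src/bar/a.txt"
commit "change inside src/bar"
echo two > "$repo/src/other/b.txt"
commit "change outside src/bar"

inside=$(paths_outside "$repo" HEAD~2 HEAD~1 "src/bar/")
outside=$(paths_outside "$repo" HEAD~1 HEAD "src/bar/")
echo "first push, paths outside area:  [$inside]"
echo "second push, paths outside area: [$outside]"
```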



3. Clone a repo with a sparse list as a filter.
Preconditions: same as for #1
Actions:
echo "src/bar" > /tmp/blah-sparse-checkout
git clone --sparse /tmp/blah-sparse-checkout blah # Clone should be
the only command that would require a specific option to be passed.
Expected results: same as for #1 plus /tmp/blah-sparse-checkout is
copied into .git/info/sparse-checkout

I presume clone and fetch are treated equivalently here.


There are 2 independent concepts here: clone and checkout.
Currently, there isn't any automatic linkage of the partial clone to
the sparse-checkout settings, so you could do something like this:

I see an implicit link that clearly one cannot checkout (inflate/populate) a file/directory that one does not have in the object store. But that does not imply the reverse linkage. The regular sparse checkout should be available independently of the local clone being a narrow one.

    git clone --no-checkout --filter=sparse:oid=master:templates/bar URL
    git cat-file ... templates/bar >.git/info/sparse-checkout
    git config core.sparsecheckout true
    git checkout ...

I've been focused on the clone/fetch issues and have not looked
into the automation to couple them.
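
The checkout half of that sequence can be tried today without any partial-clone support. The sketch below uses a full local clone with --no-checkout, then enables sparse checkout by hand; the repository contents and the "src/bar/" pattern are assumptions for illustration.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
origin="$work/origin"
clone="$work/clone"

# Source repository with two directories under src/.
git init -q "$origin"
git -C "$origin" symbolic-ref HEAD refs/heads/master
mkdir -p "$origin/src/bar" "$origin/src/other"
echo bar > "$origin/src/bar/a.txt"
echo other > "$origin/src/other/b.txt"
git -C "$origin" add -A
git -C "$origin" -c user.name=demo -c user.email=demo@example.com commit -q -m "init"

# Full clone, but delay populating the work tree.
git clone -q --no-checkout "$origin" "$clone"

# Enable sparse checkout and restrict the work tree to src/bar.
git -C "$clone" config core.sparseCheckout true
mkdir -p "$clone/.git/info"
echo "src/bar/" > "$clone/.git/info/sparse-checkout"
git -C "$clone" checkout -q master

ls -R "$clone/src"
```

All objects are still present in the clone's object store here; only the work tree is narrowed, which is the independence between the two concepts described above.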


I foresee that large files and certain files need to be filterable for fetch-clone, and that might not be (backward) compatible with the sparse-checkout.



4. Showing log for sparsely cloned repo.
Preconditions: #3 is followed
Actions:
git log
Expected results: recent changes that affect src/bar tree.

If I understand your meaning, log would only show changes
within the sparse subset of the tree.  This is not on my
radar for partial clone/fetch.  It would be a nice feature
to have, but I think it would be better to think about it
from the point of view of sparse-checkout rather than clone.

One option may be to make a marker for the tree/blob a first-class citizen. So the oid (and worktree file) has content ".gitNarrowTree <oid>" or ".gitNarrowBlob <oid>" as required (*), which is safe, and allows a consistent alter-ego view of the tree contents and hence for git-log et al.

(*) I keep flip flopping between a single object marker, and distinct object markers for the types. It partly depends on whether one can know in advance, locally, what the oid type should be, and how it should be embedded in the object store - need to re-check the specs.

I'm tending toward distinct types to cope with the D/F conflict in the worktrees - the directory must be created (holds the name etc), and the alter-ego content then must be placed in a _known_ sub-file ".gitNarrowTree" (without the oid in the file name, but included in the content). Presence of a ".gitNarrowTree" should be standalone in the directory when that part of the work-tree is clean.



5. Showing diff.
Preconditions: #3 is followed
Actions:
git diff HEAD^ HEAD
Expected results: changes from the most recent commit affecting
src/bar folder are shown.
Notes: this can be a tricky operation as filtering must be done to
remove results from unrelated subtrees.

I don't have any plan for this and I don't think it fits within
the scope of clone/fetch.  I think this too would be a sparse-checkout
feature.
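
Until something like that exists, the results expected in use cases 4 and 5 can be approximated by passing a pathspec explicitly. The sketch below is only an illustration on a throwaway repository; the file names and commit messages are assumptions.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
git init -q "$repo"
commit() {
    git -C "$repo" add -A
    git -C "$repo" -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"
}
mkdir -p "$repo/src/bar" "$repo/src/other"
echo one > "$repo/src/bar/a.txt"
echo one > "$repo/src/other/b.txt"
commit "init"
echo two > "$repo/src/other/b.txt"
commit "other only"
echo three > "$repo/src/bar/a.txt"
echo three > "$repo/src/other/b.txt"
commit "bar and other"

# Limit history and diff output to src/bar with a pathspec.
log_subjects=$(git -C "$repo" log --format=%s -- src/bar)
diff_names=$(git -C "$repo" diff --name-only HEAD^ HEAD -- src/bar)
echo "$log_subjects"
echo "$diff_names"
```

The "other only" commit is left out of the log, and the diff lists only the file under src/bar even though the last commit also touched src/other.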


See my note about first class citizens for marker OIDs



*Note that I intentionally didn't mention use cases that are related
to filtering by blob size as I think we should logically consider them
as a separate, although related, feature.

I've grouped blob-size and sparse filter together for the
purposes of clone/fetch since the basic mechanisms (filtering,
transport, and missing object handling) are the same for both.
They do lead to different end-uses, but that is above my level
here.



What do you think about these examples above? Is that something that
more-or-less fits into current development? Are there other important
flows that I've missed?

These are all good ideas and it is good to have someone else who
wants to use partial+sparse thinking about it and looking for gaps
as we try to make a complete end-to-end feature.

-Vitaly

Thanks
Jeff


Philip