Re: New Ft. for Git : Allow resumable cloning of repositories.




On Fri, Mar 8, 2019 at 11:13 PM Jonathan Tan <jonathantanmy@xxxxxxxxxx> wrote:
> This is indeed a nice feature to have, and thanks for details of how
> this would be accomplished.
>
> One issue is that when cloning a repository, we do not download many
> files - we only download one dynamically generated packfile containing
> all the objects we want.

So the packfile is dynamically generated for one specific client
request, and the server discards it as soon as the connection between
them closes.
Is this the reason why we cannot pause and resume the transfer the way
download managers do?

I read through the 'Git Internals' chapter of the Pro Git ebook and the
following thought came to me:

Assume a pack file as follows:
---
$ git verify-pack -v .git/objects/pack/pack-978e03944f5c581011e6998cd0e9e30000905586.idx
b042a60ef7dff760008df33cee372b945b6e884e blob   22054 5799 1463
033b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5 blob   9 20 7262 1 \
  b042a60ef7dff760008df33cee372b945b6e884e
.git/objects/pack/pack-978e03944f5c581011e6998cd0e9e30000905586.pack: ok
---

Here the 033b blob refers to the b042 blob as its base, and both blobs
are different versions of the same file.
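
To make the relationship explicit, here is a minimal Python sketch that
parses the `git verify-pack -v` lines above. It assumes the documented
column order (SHA-1, type, size, size in pack, offset, and for deltified
objects a depth and base SHA-1). The names PackEntry, parse_verify_pack
and SAMPLE are made up for this example.

```python
from typing import List, NamedTuple, Optional

# Sample copied from the listing above, with the wrapped line joined.
SAMPLE = """\
b042a60ef7dff760008df33cee372b945b6e884e blob   22054 5799 1463
033b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5 blob   9 20 7262 1 b042a60ef7dff760008df33cee372b945b6e884e
.git/objects/pack/pack-978e03944f5c581011e6998cd0e9e30000905586.pack: ok
"""


class PackEntry(NamedTuple):
    sha: str
    type: str
    size: int
    size_in_pack: int
    offset: int
    depth: Optional[int] = None      # only present for deltified objects
    base_sha: Optional[str] = None   # only present for deltified objects


def parse_verify_pack(output: str) -> List[PackEntry]:
    """Parse object lines; skip summary lines such as '...pack: ok'."""
    entries = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 5 or len(fields[0]) != 40:
            continue  # not an object line
        sha, obj_type = fields[0], fields[1]
        size, size_in_pack, offset = (int(f) for f in fields[2:5])
        if len(fields) >= 7:
            # Deltified object: two extra columns, depth and base SHA-1.
            entries.append(PackEntry(sha, obj_type, size, size_in_pack,
                                     offset, int(fields[5]), fields[6]))
        else:
            entries.append(PackEntry(sha, obj_type, size, size_in_pack, offset))
    return entries


entries = parse_verify_pack(SAMPLE)
for e in entries:
    kind = f"delta against {e.base_sha[:4]}" if e.base_sha else "stored in full"
    print(e.sha[:4], kind)
```

Run on the sample, this reports b042 as stored in full and 033b as a
delta against b042.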

Before this pack was made, both of these blobs were stored separately
and in full, and thus took more space.
The packfile saves space by storing only the latest version in full,
plus its delta against the earlier version. Both the delta and the
latest version are stored in compressed form, right?

Now, here is another approach to save space without needing to create a pack:

Earlier both the blobs had their object files as:

.git/objects/03/3b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5
.git/objects/b0/42a60ef7dff760008df33cee372b945b6e884e
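
For reference, a minimal Python sketch of how such a loose-object path
comes about, assuming a SHA-1 repository: the object ID is the SHA-1 of
a "<type> <size>\0" header followed by the content, the file is
zlib-compressed, and the first two hex digits become the directory name.
The helper names object_path and loose_object are made up for this
example.

```python
import hashlib
import zlib
from typing import Tuple


def object_path(sha: str) -> str:
    """Map an object ID to its loose-object path under .git/objects."""
    return f".git/objects/{sha[:2]}/{sha[2:]}"


def loose_object(content: bytes, obj_type: str = "blob") -> Tuple[str, str, bytes]:
    """Return (object ID, loose path, compressed bytes) for some content."""
    # Git hashes the header plus the content, not the content alone.
    store = f"{obj_type} {len(content)}".encode() + b"\0" + content
    sha = hashlib.sha1(store).hexdigest()
    return sha, object_path(sha), zlib.compress(store)


sha, path, compressed = loose_object(b"")
print(sha, path)
print(object_path("033b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5"))
```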

Let's say b042 is the latest version and 033b is its earlier version.

What git does in the packfile could be done right here, by:

storing the latest version in
.git/objects/b0/42a60ef7dff760008df33cee372b945b6e884e and its delta
in .git/objects/03/3b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5. To the
delta we can add a header that tells git to look up
.git/objects/b0/42a60ef7dff760008df33cee372b945b6e884e and apply the
delta to it to get the earlier version.
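
The idea above can be sketched in a few lines of Python. This is only a
toy under loud assumptions: the "full" / "delta <base-sha>" header, the
JSON copy/insert instructions and the in-memory dict standing in for
.git/objects are all invented for illustration. None of it is git's
real loose-object or delta format.

```python
import difflib
import hashlib
import json
import zlib
from typing import Dict, List


def object_id(content: bytes) -> str:
    """Blob ID computed the usual way: SHA-1 over header plus content."""
    return hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()


def make_delta(base: bytes, target: bytes) -> List[list]:
    """Hypothetical delta: copy ranges from base, insert literal bytes."""
    ops = []
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(["copy", i1, i2])
        elif j2 > j1:  # replace or insert; pure deletes add nothing
            ops.append(["insert", target[j1:j2].decode("latin-1")])
    return ops


def apply_delta(base: bytes, ops: List[list]) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1].encode("latin-1")
    return bytes(out)


def store_full(store: Dict[str, bytes], content: bytes) -> str:
    sha = object_id(content)
    store[sha] = zlib.compress(b"full\n" + content)
    return sha


def read_object(store: Dict[str, bytes], sha: str) -> bytes:
    header, _, body = zlib.decompress(store[sha]).partition(b"\n")
    if header == b"full":
        return body
    # Invented header "delta <base-sha>": fetch the base, then apply.
    base_sha = header.split()[1].decode()
    return apply_delta(read_object(store, base_sha), json.loads(body))


def store_delta(store: Dict[str, bytes], base_sha: str, content: bytes) -> str:
    base = read_object(store, base_sha)
    sha = object_id(content)
    body = json.dumps(make_delta(base, content)).encode()
    store[sha] = zlib.compress(b"delta " + base_sha.encode() + b"\n" + body)
    return sha


objects: Dict[str, bytes] = {}
latest = b"line one\nline two\nline three\n"
earlier = b"line one\nline 2\n"
latest_sha = store_full(objects, latest)
earlier_sha = store_delta(objects, latest_sha, earlier)
restored = read_object(objects, earlier_sha)
```

One consequence visible even in the toy: the earlier version cannot be
read until its base object is present, so a transfer of such files
would have to deliver or track objects in dependency order.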

Doing this eliminates the big packfile, and all the objects are spread
across folders as individual files. We could then make the clone
resumable, right?

Please point out what I missed here.
Is it possible to do the above? If yes, then what was the reason to
introduce the concept of a packfile?

> You might be interested in some work I'm doing to offload part of the
> packfile response to CDNs:
>
> https://public-inbox.org/git/cover.1550963965.git.jonathantanmy@xxxxxxxxxx/
>
> This means that when cloning/fetching, multiple files could be
> downloaded, meaning that a scheme like you suggest would be more
> worthwhile. (In fact, I allude to such a scheme in the design document
> in patch 5.)

I am currently reading through all the discussion on this strategy.