Re: How hard would it be to implement sparse fetching/pulling?




Found some details here: https://github.com/jeffhostetler/git/pull/3

Looking at the commits, I see that you've done a lot of work already,
including packing, filtering, fetching, cloning, etc.
Which areas aren't complete yet? Do you need any help with the
implementation?


On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov <vit@xxxxxxxx> wrote:
> Hey Jeff,
>
> That's great, I didn't expect anyone to be actively working on this.
> I'll check out your branch. Meanwhile, do you have any design docs that
> describe these changes, or can you outline the high-level goals that
> you want to achieve?
>
> On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler <git@xxxxxxxxxxxxxxxxx> wrote:
>>
>>
>> On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:
>>>
>>> Hi guys,
>>>
>>> I'm looking for ways to improve fetch/pull/clone time for large git
>>> (mono)repositories with unrelated source trees (that span multiple
>>> services).
>>> I've found the sparse checkout approach appealing and helpful for most
>>> client-side operations (e.g. status, reset, commit, etc.).
>>> The problem is that there is no feature like sparse fetch/pull in git,
>>> which means that ALL objects in unrelated trees are always fetched.
>>> This can take a lot of time for large repositories and results in some
>>> practical scalability limits for git.
>>> This forced some large companies like Facebook and Google to move to
>>> Mercurial, as they were unable to improve the client-side experience
>>> with git, while Microsoft has developed GVFS, which seems to be a step
>>> back to the CVCS world.
>>>
>>> I'd like to get feedback (from more experienced git users than I am)
>>> on what it would take to implement sparse fetching/pulling
>>> (downloading only objects related to the sparse-checkout list).
>>> Are there any issues with missing hashes?
>>> Are there any fundamental reasons why it can't be done?
>>> Can we get away with only client-side changes or would it require
>>> special features on the server side?
>>>
>>> If we had such a feature, then all we would need on top is a separate
>>> tool that builds the right "sparse" scope for the workspace based on
>>> the paths that the developer wants to work on.
>>>
>>> In a world where more and more companies are moving towards large
>>> monorepos, this improvement would provide a good way of scaling git
>>> to meet this demand.
>>>
>>> PS. Please don't advise splitting things up, as there are some good
>>> reasons why many companies decide to keep their code in a monorepo,
>>> which you can easily find online. So let's keep that part out of
>>> scope.
>>>
>>> -Vitaly
>>>
>>
>>
>> This work is in progress now.  A short summary of the current
>> parts 1, 2, and 3 can be found in [1].
>>
>>> * jh/object-filtering (2017-11-22) 6 commits
>>> * jh/fsck-promisors (2017-11-22) 10 commits
>>> * jh/partial-clone (2017-11-22) 14 commits
>>
>>
>> [1]
>> https://public-inbox.org/git/xmqq1skh6fyz.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxx/T/
>>
>> I have a branch that contains V5 of all 3 parts:
>> https://github.com/jeffhostetler/git/tree/core/pc5_p3
>>
>> This is a WIP, so there are some rough edges....
>> I hope to have a V6 out before the weekend with some
>> bug fixes and cleanup.
>>
>> Please give it a try and see if it fits your needs.
>> Currently, there are filter methods to omit all blobs,
>> to omit all large blobs, and to match a sparse-checkout
>> specification.
>>
>> Let me know if you have any questions or problems.
>>
>> Thanks,
>> Jeff
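
To make the three filter methods Jeff mentions a bit more concrete, here is a
minimal Python sketch of the idea. It is not code from the jh/* branches and
does not reflect git's actual implementation or command-line syntax. The
`Blob` record, the function names, and the prefix-based sparse matching are
all hypothetical, chosen only for illustration. In particular, a real
sparse-checkout specification uses gitignore-style patterns, which this
sketch reduces to plain directory prefixes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Blob:
    """Hypothetical stand-in for a blob the server could send."""
    path: str  # path of the blob within the tree
    size: int  # size in bytes


# A filter returns True when the blob should be sent to the client.
BlobFilter = Callable[[Blob], bool]


def omit_all_blobs() -> BlobFilter:
    """Send no blobs at all (commits and trees only)."""
    return lambda blob: False


def omit_large_blobs(limit: int) -> BlobFilter:
    """Send only blobs smaller than `limit` bytes."""
    return lambda blob: blob.size < limit


def match_sparse_prefixes(prefixes: List[str]) -> BlobFilter:
    """Send only blobs under the given directories.

    Assumption: plain directory prefixes stand in for a real
    sparse-checkout specification, which is pattern based.
    """
    normalized = [p.rstrip("/") + "/" for p in prefixes]
    return lambda blob: any(blob.path.startswith(p) for p in normalized)


def select(blobs: List[Blob], keep: BlobFilter) -> List[str]:
    """Return the paths of the blobs that pass the filter."""
    return [b.path for b in blobs if keep(b)]


blobs = [
    Blob("services/a/main.c", 100),
    Blob("services/b/big.bin", 5_000_000),
    Blob("README", 10),
]

print(select(blobs, omit_all_blobs()))
print(select(blobs, omit_large_blobs(1024)))
print(select(blobs, match_sparse_prefixes(["services/a"])))
```

In a design like this, a client that asks for the sparse filter would receive
only the blobs under the paths it works on, and would have to fetch anything
else on demand later.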