
Re: [GSoC] Discussion of "Submodule related work" project

On Fri, Mar 10, 2017 at 3:27 AM, Valery Tolstov <me@xxxxxxxxxxxx> wrote:
> Have some questions about "Submodule related work" project
> First of all, I would like to add this task to the project, if I'll take it:
> https://public-inbox.org/git/1488913150.8812.0@xxxxxxxxxxxxxx/T/
> What do you think about this task?

That is a nice project, though my gut feeling is that it is too small
for a GSoC project on its own.

>> Cleanup our test suite. Do not use a repo itself as a submodule for itself
> Not quite familiar with submodules yet, why this is considered to be
> ineligible (i.e. using repo as a submodule for itself)?

(a bit of background on submodules)

man gitglossary (then searching for submodule):

       submodule
           A repository that holds the history of a separate project inside
           another repository (the latter of which is called superproject).

       superproject
           A repository that references repositories of other projects in its
           working tree as submodules. The superproject knows about the names
           of (but does not hold copies of) commit objects of the contained
           submodules.

Here is an example that I just found on GitHub[1]. It is a game
(so it includes graphics, game code etc). But it makes use of a library[2],
which could be used by different projects.

[1] https://github.com/stephank/orona
[2] https://github.com/stephank/villain

Now why would a repo be ineligible to use itself as a submodule?
There is nothing wrong with it *technically* (which is why we do such things
in the test suite.)

But what are the use cases for it? Why would a project bind itself
as a submodule? (You can get the same effect of having the source code
by just checking out that other version.) Well, now that I think about it,
it may be useful if you want to test old versions of yourself, e.g. for
networking compatibility. But for that you'd probably still not use submodules.

So the use case of using submodules for another copy of itself is
*very rare* if it exists at all out there. And the Git test suite
should rather test use cases that are not these weird corner cases,
but rather pay attention to the common case first.

I thought this project would have been solved partially already, but I was
wrong (see $ git grep "submodule add \./\."). This also doesn't seem large
enough for a summer project, after thinking about it further.

>> (Advanced datastructure knowledge required?) Protect submodule from gc-ing
>> interesting HEADS.
> Can you provide a small example that shows the problem, please?

Let's use this example from above:

$ git clone --recursive https://github.com/stephank/orona
    # now we have 2 repositories, the orona repo as well as its submodule
    # at node_modules/villain
    # "Let's inspect the Readmes/license files, if they are ok to use
    # Oh! the submodule is MIT licensed but doesn't have the full
    # license text, I can contribute and make a patch for it."
$ cd node_modules/villain
$ git add LICENSE
$ git commit -a -m "add license full text"
$ cd ../.. # go back to the superproject
$ git add  node_modules/villain
$ git commit -a -m "update game to include latest lib"
$ git checkout -b "fix_license"
    # note how I forget to actually push it / pull request it!
    # All we need for the demonstration is a local commit
    # in the submodule that is referenced by the superproject...
    # ... "Let's test the pristine copy of the game!" ...
$ git checkout origin/master
$ git submodule update
    # ... which gets lost here. The submodule commit
    # is only referenced by a superproject commit.

.. time passes ..

    # "My disk is so full, maybe I can clean up all these random
    # accumulated projects, to have more disk space again."
    # my cleanup script may do this:

$ cd node_modules/villain
$ git reflog expire --all --expire=all
$ git gc --prune=all
$ cd ../..

$ git branch
    # "Oh what about this 'fix_license' branch?
    #  Did I actually send that upstream?"
$ git checkout fix_license
$ git submodule update
error: no such remote ref 96016818b2ed9eb0ca72552b18f8339fc20850b4
Fetched in submodule path 'villain', but it did not contain
96016818b2ed9eb0ca72552b18f8339fc20850b4. Direct fetching of that
commit failed.
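
To find out whether a checkout is affected before running
"git submodule update", one can check if the commit recorded by the
superproject is still present in the submodule. A minimal sketch;
check_submodule_commit is a made-up helper name, not a Git command, and
it assumes the submodule is already initialized at the given path:

```shell
# check_submodule_commit PATH [REV]
# Hypothetical helper, to be run at the top level of the superproject.
# Succeeds if the submodule commit that REV (default: HEAD) records at
# PATH exists in the submodule's object store.
check_submodule_commit () {
    path=$1
    rev=${2:-HEAD}
    # A submodule shows up in the superproject tree as an entry of
    # type "commit" (a gitlink); its third field is the commit id.
    sha=$(git ls-tree "$rev" -- "$path" | awk '$2 == "commit" { print $3 }')
    [ -n "$sha" ] || return 2
    # cat-file -e exits non-zero if the object is missing.
    git -C "$path" cat-file -e "$sha^{commit}" 2>/dev/null
}
```

In the example above, "check_submodule_commit node_modules/villain
fix_license" would be expected to fail after the gc, because the
license commit was pruned from the submodule.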

> And why advanced datastructure knowledge is expected?

I am not quite sure how to approach this problem, so I put
a "warning; it may be complicated" sticker on it. ;)

The problem is that until now a submodule was considered
its own repository, in full control of what to keep and delete,
how to name its branches and so on.

git-gc only pays attention to commits (and their history) reachable
from branches and to commits mentioned in the reflog.
(That is why we had to delete the reflog. As we were making
the license commit on a "detached HEAD", there was no branch
that we would have had to delete as well.)

However it should also consider all commits referenced
by the superproject valuable.

In this case the superproject has a branch "fix_license",
so gc in the superproject keeps that superproject commit.
The submodule commit it points to is a different matter:
the pointer lives in the superproject, and the gc operation
in the submodule doesn't look there.

One way to fix it is to figure out if there is a superproject
at gc time and then collect all valuable hashes (submodule
pointers) before actually performing the gc.
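
A rough sketch of that collection step, done from the outside with
plumbing commands. list_submodule_pointers is a hypothetical helper
name; it only looks at commits reachable from the superproject's refs
(not its reflog), and it walks every commit, so it is slow on large
histories:

```shell
# list_submodule_pointers PATH
# Hypothetical helper, to be run at the top level of the superproject.
# Prints each distinct submodule commit id that any commit reachable
# from any superproject ref records at PATH.
list_submodule_pointers () {
    path=$1
    git rev-list --all |
    while read -r commit
    do
        # Prints "160000 commit <id>	<path>" if PATH is a gitlink
        # in this commit, nothing if PATH does not exist there.
        git ls-tree "$commit" -- "$path"
    done |
    awk '$2 == "commit" { print $3 }' |
    sort -u
}
```

A gc in the submodule would then have to treat every id printed by
"list_submodule_pointers node_modules/villain" as reachable.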

But that may be expensive, so we would rather record
it on the fly, e.g. when making the commit in the superproject
we'd record in the submodule that the hash given by the
submodule pointer is valuable.

This could be done by having a ref (=branch) in the submodule
that points at all the interesting submodule commits.
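
What such on-the-fly recording could look like when done by hand: a
sketch only, not how Git does (or will) implement it. The helper name
and the refs/superproject-keep/ namespace are made up for this
illustration. As a single ref names a single commit, the sketch
creates one ref per recorded commit:

```shell
# record_submodule_pointer PATH
# Hypothetical helper, to be run at the top level of the superproject
# right after committing. Creates a ref in the submodule for the commit
# that the superproject's HEAD records at PATH, so that gc in the
# submodule considers that commit reachable.
record_submodule_pointer () {
    path=$1
    sha=$(git ls-tree HEAD -- "$path" | awk '$2 == "commit" { print $3 }')
    [ -n "$sha" ] || return 1
    # refs/superproject-keep/ is an invented namespace, not one that
    # Git itself defines or knows about.
    git -C "$path" update-ref "refs/superproject-keep/$sha" "$sha"
}
```

In the example above, running "record_submodule_pointer
node_modules/villain" after the superproject commit should let the
license commit survive the later reflog expiry and gc.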

So despite this being prominent on the ideas page (because of a lot
of text), how to actually solve it may be controversial.

> Maybe you have something else about this project to say.

If I remember correctly, shell -> C conversion projects are
easy (both for writing the code as well as for mentoring).

> git archive(/bundle) to have a --recurse-submodules flag
> to include the submodule contents.

is actually an interesting project as well, despite its short description.
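
To give an idea of what such a flag would have to do, here is a rough
approximation with existing commands. It exports a directory tree
rather than producing a single archive file, and export_with_submodules
is a made-up name. It assumes all submodules are initialized, that it
is run at the top level of the superproject, and that the destination
is given as an absolute path:

```shell
# export_with_submodules DEST
# Hypothetical helper. Extracts the superproject's HEAD into DEST and
# then each submodule, at the commit its superproject records, into the
# matching subdirectory.
export_with_submodules () {
    ARCHIVE_DEST=$1
    export ARCHIVE_DEST    # must be visible inside "submodule foreach"
    mkdir -p "$ARCHIVE_DEST" || return 1
    git archive --format=tar HEAD | tar -xf - -C "$ARCHIVE_DEST" || return 1
    # foreach provides $displaypath (path relative to where we started)
    # and $sha1 (commit recorded in the immediate superproject).
    git submodule --quiet foreach --recursive '
        git archive --format=tar --prefix="$displaypath/" "$sha1" |
        tar -xf - -C "$ARCHIVE_DEST"
    '
}
```

For the game from above this would be called as, e.g.,
"export_with_submodules /tmp/orona-export".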