
[RFD] Long term plan with submodule refs?




> The relationship is indeed currently useful, but if the long term plan
> is to strongly discourage detached submodule HEAD, then I would think
> that these patches are in the wrong direction. (If the long term plan is
> to end up supporting both detached and linked submodule HEAD, then these
> patches are fine, of course.) So I think that the plan referenced in
> Junio's email (that you linked above) still needs to be discussed.

This email presents different approaches.

Objective
=========
This document summarizes the current situation of Git submodules
and starts a discussion of where they could be headed long term.
It shows different ways in which submodule refs could evolve.

Background
==========
Submodules in Git are currently treated as independent repositories.
This is okay for current workflows, such as using a library that is
rarely updated. Other workflows that require tighter integration between
submodule and superproject are possible but cumbersome, because they need
an additional step: updating the gitlink pointer in the superproject.

Other discussions of the past:
"Re-attach HEAD?"
  https://public-inbox.org/git/20170501180058.8063-1-sbeller@xxxxxxxxxx/
"Semantics of checkout --recursive for submodules on a branch"
  https://public-inbox.org/git/20170630003851.17288-1-sbeller@xxxxxxxxxx/
"A new type of symref?"
  https://public-inbox.org/git/xmqqvamqg2fy.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxx/

Workflows
=========
* Obtaining a copy of the Superproject tightly coupled with submodules
  solved via git clone --recurse-submodules=<pathspec>
* Changing the submodule selection
  solved via submodule.active flags
* Changing the remote / Interacting with a different remote for all submodules
  -> needs to be solved, not a core issue of this discussion
* Syncing to the latest upstream
  solved via git pull --recurse-submodules
* Working on a local feature in one submodule
  -> How do refs work spanning superproject/submodule?
* Working on a feature spanning multiple submodules
  -> How do refs work spanning multiple repos?
* Working on a bug fix (Changing the feature that you currently work on, branches)
  -> How does switching branches in the superproject affect submodules?

This discussion should revolve around how refs are handled in submodules in
relation to a superproject.

Possible data models and workflow implications
==============================================
The following sections present different data models that could support a
submodule-heavy workflow, each with its pros and cons.

Keep everything as is, superproject and submodule have their own refs
---------------------------------------------------------------------
In this alternative we'd just make existing commands nicer, e.g.
git-status and git-log would give information about the superproject's
gitlink, similar to how they give information about a remote branch.

We might want to introduce an option that stages the submodule's new commit
(the gitlink) in the superproject once a commit is done in the submodule.

Pros:
 * easiest to implement
 * easy to understand when having a git background already
 
Cons:
 * Current tools that manage multiple repositories (e.g. repo, git-slave)
   have "branches in parallel", i.e. each repo has a branch of the same
   name, instead of using a superproject to manage the state of all repos
   involved. So users of such tools may be confused by submodules.
 * when using a detached HEAD in the submodule, we may run into git-gc issues.
 

Use replicated refs in submodules
---------------------------------
This approach will replicate the superproject refs into the submodule
ref namespace, e.g. git-branch learns about --recurse-submodules, which
creates a branch of a given name in all submodules. These (topic) branches
should be kept in sync with the superproject.

Pros:
 * This seemed intuitive to Gerrit users
 * 'quick' to implement, most of the commands are already there,
   just git-branch is needed to have the workflows mentioned above complete.
Cons:
 * What does "git checkout -b A B" mean? (special case: B == HEAD)
   Is the branch name replicated as a string into the submodule operation,
   or do we dereference the superproject's gitlink and walk from there?
   If we take the superproject's gitlink, then why do we have the branches
   in the submodule in the first place? If we take the string as-is,
   then it might confuse users.
 * refs are not updated atomically across superproject and submodule, by
   design; this relies on superproject and submodule staying in sync via hope.
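
A minimal sketch of the branch replication described above, emulated with
git submodule foreach instead of a --recurse-submodules option of git-branch.
The helper names branch_recursive and demo_commit, the throwaway repositories
and the placeholder URL are assumptions made for illustration only. The
superproject and each submodule are updated by separate ref writes, which is
the non-atomicity listed in the cons.

```shell
#!/bin/sh
# Hypothetical helper: create branch $1 in the superproject found in the
# current directory and in every populated submodule. Each repository is
# updated on its own; nothing ties the ref updates together.
branch_recursive () {
    git branch "$1" &&
    git submodule --quiet foreach --recursive "git branch '$1'"
}

# Throwaway identity so the demo does not depend on user configuration.
demo_commit () {
    git -C "$1" -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "$2"
}

# Demo: a superproject with one submodule that already exists in place.
super=$(mktemp -d)
git init -q "$super"
git init -q "$super/sub"
demo_commit "$super/sub" "submodule: initial"
# The URL is a placeholder; nothing is fetched because sub/ already exists.
(cd "$super" && git submodule --quiet add https://example.invalid/sub.git sub)
demo_commit "$super" "superproject: add sub"

(cd "$super" && branch_recursive topic)

# Both repositories now carry a branch named "topic".
git -C "$super" rev-parse --verify --quiet refs/heads/topic
git -C "$super/sub" rev-parse --verify --quiet refs/heads/topic
```

If the second step fails halfway, some repositories have the branch and
others do not, so any real implementation would need a recovery story.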

No submodule refstore at all
----------------------------
Use refs and commits in the superproject to stitch submodule changes
together. Disallow branches in the submodule. This restriction applies only
to the working tree inside the superproject, so the output of git-branch
changes depending on whether the working tree is inside or outside the
superproject working tree.

The messages of git-status inside the superproject working tree are changed,
as "detached HEAD"s are common in submodules and sound scary. Maybe they
could say "following the superproject" instead.

Pros:
 * solves the atomicity issue from the prior proposal
Cons:
 * In a submodule one must use a worktree outside the superproject
   to do upstream work.
 * As the detached HEAD is not referenced, we have git-gc issues.


New type of symbolic refs
=========================
A symbolic ref can currently only point at a ref or another symbolic ref.
This proposal showcases different scenarios on how this could change in the
future.

HEAD pointing at the superproject's index
-----------------------------------------
Introduce a new kind of symbolic ref that points at the gitlink entry in the
superproject's index. The format is

  "repo:" <superprojects gitdir> '\0' <gitlink-path> '\0'

Just like existing symrefs, the content of the ref will be read and followed.
On reading "repo:", the sha1 is obtained in a way equivalent to:

    git -C <superproject> ls-files -s <gitlink-path> | awk '{ print $2}'

Ref write operations driven by the submodule, affecting symrefs
  e.g. git checkout <other branch> (in the submodule)

In this scenario only the HEAD is optionally attached to the superproject,
so we can rewrite the HEAD to be anything else, such as a branch just fine.
Once the HEAD is not pointing at the superproject any more, we'll leave the
submodule alone in operations driven by the superproject.
To get back on the superproject branch, we'd need to invent new UX, such as
   git checkout --attach-superproject
which mirrors the existing --detach.

Ref write operations driven by the submodule, affecting target ref
  e.g. git commit, reset --hard, update-ref (in the submodule)

The HEAD stays the same, pointing at the superproject.
The gitlink is changed to the target sha1, using

  git -C <superproject> update-index --add \
      --cacheinfo 160000,$SHA1,<gitlink-path>

This will affect the superproject's index, so a commit in the superproject
is then needed.
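
The read and write commands above can be tried as a round trip with current
Git. This is a sketch under assumptions: gitlink_read, gitlink_write and
demo_commit are hypothetical helper names, the repositories are throwaway,
and the "repo:" symref itself does not exist, so only the index side is
shown.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of Git.

# Print the object name recorded for gitlink $2 in the index of superproject $1.
gitlink_read () {
    git -C "$1" ls-files -s -- "$2" | awk '$1 == "160000" { print $2 }'
}

# Record object name $3 as gitlink $2 in the index of superproject $1.
gitlink_write () {
    git -C "$1" update-index --add --cacheinfo "160000,$3,$2"
}

# Throwaway identity so the demo does not depend on user configuration.
demo_commit () {
    git -C "$1" -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "$2"
}

# Demo: a superproject with a nested repository at sub/.
super=$(mktemp -d)
git init -q "$super"
git init -q "$super/sub"
demo_commit "$super/sub" "submodule: initial"
sha=$(git -C "$super/sub" rev-parse HEAD)

# A commit in the submodule would trigger this write under the proposal.
gitlink_write "$super" sub "$sha"
# Following the proposed symref would boil down to this read.
gitlink_read "$super" sub
```

The write only stages the gitlink, so `git -C "$super" status` would report
sub as a new staged entry until the superproject commits it.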

Ref write operations driven by the superproject, changing the gitlink
  e.g. git checkout <tree-ish>, git reset --hard (in the superproject)

This will change the gitlink in the superproject's index, such that the HEAD
in the submodule changes, which would trigger an update of the
submodule's working tree.

Superproject operations spanning index and worktree
  e.g. git reset --mixed

As the submodule's HEAD is defined in the index, we would reset it to the
version in the last commit. As --mixed promises not to touch the working tree,
the submodule's worktree would not be touched. In effect, git reset --mixed in
the superproject is the same as --soft in the submodule.
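
The index side of this can be observed with current Git. The sketch below
assumes throwaway repositories and hypothetical helper names (demo_commit,
gitlink_in_index). It does not implement the proposed symref; it only shows
that a mixed reset moves the gitlink in the index back to the committed value
while the submodule's checkout stays where it was.

```shell
#!/bin/sh
# Throwaway identity so the demo does not depend on user configuration.
demo_commit () {
    git -C "$1" -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "$2"
}

# Hypothetical helper: object name recorded for gitlink $2 in the index of $1.
gitlink_in_index () {
    git -C "$1" ls-files -s -- "$2" | awk '{ print $2 }'
}

super=$(mktemp -d)
git init -q "$super"
git init -q "$super/sub"

# The superproject commits the gitlink at the submodule's first commit.
demo_commit "$super/sub" "sub: first"
first=$(git -C "$super/sub" rev-parse HEAD)
git -C "$super" update-index --add --cacheinfo "160000,$first,sub"
demo_commit "$super" "super: record sub at first"

# The submodule moves on and the new commit is staged in the superproject.
demo_commit "$super/sub" "sub: second"
second=$(git -C "$super/sub" rev-parse HEAD)
git -C "$super" update-index --cacheinfo "160000,$second,sub"

# A mixed reset restores the index from the last superproject commit.
git -C "$super" reset -q --mixed

index_after=$(gitlink_in_index "$super" sub)
checkout_after=$(git -C "$super/sub" rev-parse HEAD)
echo "gitlink in index: $index_after"
echo "submodule checkout: $checkout_after"
```

With the proposed symref, the submodule's HEAD would read the first commit
from the index while its files still match the second one, which is the
state a --soft reset leaves behind.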

Consistency considerations (gc)
  e.g. git gc --aggressive --prune=now

The repacking logic is already aware of a detached HEAD, such that
using this new symref mechanism would not generate problems as long as
we keep the HEAD attached to the superproject. However when commits/objects
are created while the HEAD is attached to the superproject and then HEAD
switches to a local branch, there are problems with the created objects
as they seem unreachable now.

This problem is not new as a superproject may record submodule objects
that are not reachable from any of the submodule branches. Such objects
fall prey to overzealous packing in the submodule.

This proposal, however, exposes this problem a lot more, as the submodule
has less need for branches.
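
The reachability gap can be reproduced with current Git. This is a sketch
with throwaway repositories and a hypothetical demo_commit helper: a commit
is created on a detached HEAD in the submodule, recorded in the
superproject's index, and then left behind when the submodule switches back
to its branch.

```shell
#!/bin/sh
# Throwaway identity so the demo does not depend on user configuration.
demo_commit () {
    git -C "$1" -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "$2"
}

super=$(mktemp -d)
git init -q "$super"
git init -q "$super/sub"
demo_commit "$super/sub" "sub: base"
branch=$(git -C "$super/sub" symbolic-ref --short HEAD)

# Work on a detached HEAD, as a submodule following the superproject would.
git -C "$super/sub" checkout -q --detach
demo_commit "$super/sub" "sub: made while detached"
orphan=$(git -C "$super/sub" rev-parse HEAD)

# The superproject records the new commit in its index.
git -C "$super" update-index --add --cacheinfo "160000,$orphan,sub"

# The submodule switches back to its local branch.
git -C "$super/sub" checkout -q "$branch"

# No ref in the submodule contains the commit the superproject points at.
refs_containing=$(git -C "$super/sub" for-each-ref --contains "$orphan")
echo "refs containing $orphan: ${refs_containing:-none}"
```

In this demo only the submodule's reflog still mentions the commit; the
superproject's gitlink is not something the submodule's repacking consults.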

Pros:
 * easy to tell if a submodule is attached to the superproject
 * no atomicity issues
 * once enough commands implement this behavior, it may be easier to understand
   than previous alternatives and feel more intuitive
Cons:
 * gc issues for now
 * lots of work as it revamps submodules a lot.
 
This last proposal might be differentiated further, e.g. the submodule HEAD
could point at the superproject's gitlink in the index, in its HEAD, or in
another branch.

Any feedback welcome!
Stefan