
Re: [PATCH 12/17] Documentation: describe split commit-graphs




On Wed, May 08 2019, Derrick Stolee wrote:

> On 5/8/2019 1:20 PM, SZEDER Gábor wrote:
>> On Wed, May 08, 2019 at 08:53:57AM -0700, Derrick Stolee via GitGitGadget wrote:
>>> From: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
>> Consider the following sequence of events:
>>
>>   1. There are three commit-graph files in the repository.
>>
>>   2. A git process opens the base commit-graph and commit-graph-1 for
>>      reading.  It doesn't yet open commit-graph-2, because the (for
>>      argument's sake not very fair) scheduler takes the CPU away.
>>
>>   3. Meanwhile, a 'git fetch', well, fetches from a remote, and
>>      upon noticing that it got a lot of commits it decides to collapse
>>      commit-graph-1 and -2 and the new commits, writing a brand new
>>      commit-graph-1.
>>
>>   4. A second fetch fetches from a second remote, and writes
>>      commit-graph-2 (no collapsing this time).
>>
>>   5. Now the crappy scheduler finally decides that it's time to wake
>>      up the waiting git process from step 2, which then finds the new
>>      commit-graph-2 file and opens it for reading.
>>
>>   6. At this point this poor git process has file handles for:
>>
>>      - the base commit-graph file, which is unchanged.
>>
>>      - the old commit-graph-1 which has since been replaced, and does
>>        not yet contain info about the old commit-graph-2 or the
>>        commits received in the first fetch.
>>
>>      - the new commit-graph-2, containing info only about commits
>>        received in the second fetch, and whose parents' graph
>>        positions point either to the base commit-graph (good, since
>>        unchanged) or to the new commit-graph-1 (uh-oh).
>>
>> What happens next?  If this process tries to access the parent of a
>> commit from commit-graph-2, and the metadata about this parent is in
>> the new commit-graph-1, then I expect all kinds of weird bugs.
>>
>> But will a git process ever try to access a commit that didn't yet
>> exist in the repository when it started opening the commit-graph
>> files?
>
> I'll ignore the improbability of this turn of events (two writes happening
> during the span of trying to read two files) and focus on the fact that
> we can prevent issues here using the 4th TODO item in my cover letter:

FWIW the sort of scenario SZEDER is describing is something I deal with
in production a lot. It doesn't require an unfair scheduler, just that
you have differently nice(1)'d processes accessing the same repo.

So if you have batch "cron" processes, their I/O scheduling follows
their nice(1) scheduling. It's not atypical to e.g. have some
background thingy sit for seconds or even minutes on an I/O syscall
while the kernel decides everyone else has right of way, since you
nice'd it not caring whether it finishes in 10 seconds or 10 hours.

>  4. It would be helpful to add a new optional chunk that contains the
>     trailing hash for the lower level of the commit-graph stack. This chunk
>     would only be for the commit-graph-N files, and would provide a simple
>     way to check that the stack is valid on read, in case we are still
>     worried about other processes reading/writing in the wrong order.
>
> If we have this chunk -- you have convinced me that we need it -- then we
> could ignore the "new" commit-graph-2 because its base graph hash does not
> match. We can continue without dying because we can always parse the "missing"
> commits from the packs.
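The check being proposed could be modeled like this (a Python sketch, not
git's actual C code; the layer dicts with a "base_hash" field stand in for
the proposed chunk recording the trailing hash of the layer below):

```python
# Illustrative model of the proposed validity check: each incremental
# layer records the trailing hash of the layer below it, so a reader
# can detect that a lower layer was rewritten underneath it.

def validate_chain(layers):
    """layers: list of dicts, bottom layer first. Each non-base layer
    carries 'base_hash', the trailing hash it expects below it.
    Returns the longest valid prefix of the chain."""
    valid = []
    for layer in layers:
        expected = layer.get("base_hash")
        if valid and expected != valid[-1]["hash"]:
            # Mismatch: a lower layer was replaced since this one was
            # written. Ignore this layer and everything above it; the
            # "missing" commits can still be parsed from the packs.
            break
        valid.append(layer)
    return valid
```

In SZEDER's scenario above, the new commit-graph-2 would fail this check
against the old commit-graph-1 and simply be dropped from the chain.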

Having read the actual code for this & what the format will look like
(last time I didn't have that context[1]) I really don't get why we
don't save ourselves a lot of trouble and do away with this "N" in the
filename idea from the start.

It's just a slight change to the iteration in
prepare_commit_graph_one().

Instead of looping through files N at a time we'd have a discovery step
where we'd need to open() all the files, see which ones say "my parent
hash is X", and then create a list of those hashes in order to read a
bunch of commit-graph-<HASH> files.
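Sketched out (again just an illustrative Python model; the "own hash ->
base hash" mapping stands in for whatever each file would record about
its parent, which is an assumption, not an existing format):

```python
# Hypothetical discovery step for commit-graph-<HASH> files: given what
# each file declares as its parent ("base") hash, order the files into
# a chain without relying on any N in the file name.

def discover_chain(graphs):
    """graphs: dict mapping each file's own hash -> the hash of its
    base layer (None for the bottom layer). Returns the chain,
    bottom layer first."""
    # A "tip" is a hash that no other file uses as its base.
    bases = {b for b in graphs.values() if b is not None}
    tips = [h for h in graphs if h not in bases]
    # Walk one tip back down to the bottom, then reverse.
    chain = []
    cur = tips[0] if tips else None
    while cur is not None:
        chain.append(cur)
        cur = graphs[cur]
    chain.reverse()
    return chain
```

Stale files left behind by a concurrent writer would just show up as
extra tips and could be ignored (or cleaned up by "gc" later).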

Is that a bit painful? Sure, but way less painful than dealing with the
caveats I'd mentioned in [1] and SZEDER details here.

And the obvious thing would be to save ourselves most of that work
every time we read by writing a .git/objects/commit-graphs/info file on
"commit-graph write", which would contain the <HASH> at the tip of the
"latest" chain. We could also have some side-index listing the whole
"current" chain in order (but I'm more paranoid about locking/updates
to such a thing; maybe we could put it in the last file in a new
chunk...).

If we didn't have such side-indexes then, given the way the current
loading works, we'd need to traverse the files back to front, and
*then* load them front to back (as it does now). That's a slight pain,
obviously, but not a big deal.
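Something like this (Python sketch; the info-file name, layout, and the
base-hash accessor are all assumptions, since none of this exists yet):

```python
# Sketch of loading with a tip pointer: read the "latest" hash from an
# info file, follow each layer's base pointer back to front, then hand
# the layers to the loader front to back, as the current code expects.

import os

def read_chain(graph_dir, read_base_hash):
    """read_base_hash(path) -> hash of the layer below, or None.
    Returns the list of file paths, bottom layer first."""
    with open(os.path.join(graph_dir, "info")) as f:
        tip = f.read().strip()
    # Back to front: from the tip down to the base layer.
    back_to_front = []
    cur = tip
    while cur is not None:
        path = os.path.join(graph_dir, "commit-graph-" + cur)
        back_to_front.append(path)
        cur = read_base_hash(path)
    # Front to back: the order the loader actually wants.
    return list(reversed(back_to_front))
```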

With commit-graph-<HASH> all these unlink() race conditions go away,
partial reads due to concurrent graph writing become a non-issue (we'd
just leave the old files; "gc" deals with them later), and there's no
need to carefully fsync() files/dirs etc. as we juggle the N and N+1
files.
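The write side could then be the plain write-temp-then-rename pattern
(Python sketch; all file names here are made up for illustration):

```python
# Why hash-named files sidestep the races: writers only ever create
# new files and flip the tip pointer last. rename() is atomic on
# POSIX, so readers see either the old state or the new one, never a
# partially written file; old graphs stay on disk for "gc" to prune.

import os

def write_graph_layer(graph_dir, layer_hash, data):
    final = os.path.join(graph_dir, "commit-graph-" + layer_hash)
    tmp = final + ".lock"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # data is durable before it's visible
    os.rename(tmp, final)
    return final

def update_tip(graph_dir, layer_hash):
    # Flipping the tip pointer is the only "update in place", and it's
    # a single atomic rename.
    tmp = os.path.join(graph_dir, "info.lock")
    with open(tmp, "w") as f:
        f.write(layer_hash + "\n")
    os.rename(tmp, os.path.join(graph_dir, "info"))
```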

It also becomes easy to "chain" graphs across repos, e.g. via
alternates. Say, the scenario GitHub/GitLab have where there's a
"main" repo and other objects on another delta island.

In that case the repo would have a local "tip" file with the last link
in its chain, some of which would then refer back to <HASHes> in other
"parent" alternates.

As long as such a setup has a "gc" process that's not overly eager
about pruning old stuff and considers that constellation of repos as a
whole, that should just work. You can freely optimize and rewrite
graphs across repos, just be careful about unlinking old stuff.

I don't see how it would work with commit-graph-N without a *lot* of
painful orchestration (where e.g. you *must* guarantee that the parent
repo ends in N, all child repos start at N+1).

> This allows incremental writes without updating the file format.

FWIW this is some of what I was talking about in [2]. In
ds/commit-graph-format-v2 I had feedback to the effect[3] that the
particular way in which you proposed to update the format (changing the
header) wouldn't be worth it, since old clients dealt with it so badly.

But as noted in [3] I see zero reason for why we can't update the
existing format, we just add new chunks. That allows us to add any new
data in backwards-compatible ways.
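That works because a reader's chunk-table parsing naturally skips IDs it
doesn't recognize. A Python sketch (the 12-byte table-entry layout
follows the documented commit-graph format; the unknown chunk ID in the
test is made up):

```python
# Why adding chunks is backwards compatible: the commit-graph file
# carries a table of (4-byte chunk ID, 8-byte big-endian offset)
# entries, and an old reader simply ignores IDs it doesn't know.

import struct

def parse_chunk_table(buf, num_chunks):
    """Return {chunk_id: offset} for the IDs this reader knows,
    silently skipping unknown ones."""
    known = {b"OIDF", b"OIDL", b"CDAT"}  # chunks an old reader knows
    chunks = {}
    for i in range(num_chunks):
        cid, off = struct.unpack_from(">4sQ", buf, i * 12)
        if cid in known:
            chunks[cid] = off
        # Unknown chunk: skip the entry; the known chunks' data is
        # still addressable via their own offsets.
    return chunks
```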

I see nothing wrong with a solution that has split files in principle,
just with the currently proposed commit-graph-N way of doing that.

I just wonder if we're looking at a "Y" solution to an "X-Y" problem
where "X" was unduly dismissed. If updating the format was a non-issue
(which seems to me to be the case), what then?

I imagine we'd still have just a "commit-graph" file; to write a new
one we'd "cp" that one, munge the copy to append something new to the
end, and "mv" it in place. It seems to me a (sane) split-file plan is
better, but I'm not in your head; maybe it was much better for reasons
I'm not imagining before I apparently talked you out of changing the
format itself :)

1. https://public-inbox.org/git/87bm0jirjj.fsf@xxxxxxxxxxxxxxxxxxx/
2. https://public-inbox.org/git/87r298hhlc.fsf@xxxxxxxxxxxxxxxxxxx/
3. https://public-inbox.org/git/87lfzprkfc.fsf@xxxxxxxxxxxxxxxxxxx/