
Re: Finer timestamps and serialization in git

Derrick Stolee <stolee@xxxxxxxxx>:
> What it sounds like you are doing is piping a 'git fast-import' process into
> reposurgeon, and testing that reposurgeon does the same thing every time.
> Of course this won't be consistent if 'git fast-import' isn't consistent.

It's not actually import that fails to have consistent behavior, it's export.

That is, if I fast-import a given stream, I get indistinguishable
in-core commit DAGs every time. (It would be pretty alarming if this
weren't true!)

What I have no guarantee of is the other direction.  In a multibranch repo,
fast-export writes out branches in an order I cannot predict and which
appears from the outside to be randomly variable.

> But what you should do instead is store a fixed file from one run of
> 'git fast-import' and send that file to reposurgeon for the repeated test.
> Don't rely on fast-import being consistent and instead use fixed input for
> your test.
> If reposurgeon is providing the input to _and_ consuming the output from
> 'git fast-import', then yes you will need to have at least one integration
> test that runs the full pipeline. But for regression tests covering complicated
> logic in reposurgeon, you're better off splitting the test (or mocking out
> 'git fast-import' with something that provides consistent output given
> fixed input).

And I'd do that... but the problem is more fundamental than you seem to
understand.  git fast-export can't ship a consistent output order because
it doesn't retain metadata sufficient to totally order child branches.

This is why I wanted unique timestamps.  That would solve the problem:
the child commits at any branch point would be ordered by their commit date.

But I had a realization just now.  A much smaller change would do it.
Suppose branch creations had creation stamps with a weak uniqueness property:
for any given parent node, the creation stamps of all branches originating
there are guaranteed to be unique.

If that were true, there would be an implied total ordering of the
repository.  The rules for writing out a totally ordered dump would go
like this:

1. At any given step there is a set of active branches and a cursor
on each such branch.  Each cursor points at a commit and caches the
creation stamp of the current branch.

2. Look at the set of commits under the cursors.  Write the oldest one.
If multiple commits have the same commit date, break ties by their
branch creation stamps.

3. Bump that cursor forward. If you're at a branch creation, it
becomes multiple cursors, one for each child branch.
If you're at a join, some cursors go away.
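
To make rules 1-3 concrete, here is a rough sketch in Python. It is not
git code; the commit records, the "stamp" field, and the function name are
all made up for illustration. The cursors are modeled as a priority queue
keyed on (commit date, branch creation stamp), and a commit after a join
only becomes eligible once all of its parents have been written.

```python
import heapq

def ordered_dump(commits):
    """Return commit ids in a reproducible order.

    commits: list of dicts with keys id, date, stamp, parents, where
    stamp is the (hypothetical) creation stamp of the commit's branch.
    """
    by_id = {c["id"]: c for c in commits}
    children = {c["id"]: [] for c in commits}
    for c in commits:
        for p in c["parents"]:
            children[p].append(c["id"])

    def key(cid):
        # Oldest commit first; ties broken by branch creation stamp.
        c = by_id[cid]
        return (c["date"], c["stamp"], cid)

    # The heap holds the commits currently "under a cursor".
    heap = [key(c["id"]) for c in commits if not c["parents"]]
    heapq.heapify(heap)
    written, order = set(), []
    while heap:
        cid = heapq.heappop(heap)[2]
        written.add(cid)
        order.append(cid)
        # A branch point pushes several children (new cursors); a join
        # is pushed only once its last parent has been written.
        for child in children[cid]:
            if all(p in written for p in by_id[child]["parents"]):
                heapq.heappush(heap, key(child))
    return order

history = [
    {"id": "root",  "date": 1, "stamp": 0, "parents": []},
    {"id": "topic", "date": 2, "stamp": 1, "parents": ["root"]},
    {"id": "fix",   "date": 2, "stamp": 2, "parents": ["root"]},
    {"id": "merge", "date": 3, "stamp": 0, "parents": ["topic", "fix"]},
]
print(ordered_dump(history))  # ['root', 'topic', 'fix', 'merge']
```

"topic" and "fix" have the same commit date, so only the stamp decides
which is written first, and the result does not depend on the order in
which the commits are handed to the function.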

Here's the clever bit - you make the creation stamp nothing but a
counter that says "This was the Nth branch creation."  And it is
set by these rules:

4. If the branch creation stamp is undefined at branch creation time,
number it in any way you like as long as each stamp is unique. A
defined, documented order would be nice but is not necessary for
streams to round-trip.

5. When writing an export stream, you always utter a reset at the
point of branch creation.

6. When reading an import stream, the ordinal for a new branch is
defined as the number of resets you have seen.

Rules 5 and 6 together guarantee that branch creation ordinals round-trip
through export streams.  Thus, streams round-trip and I can have my
regression tests with no change to git's visible interface at all!
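
A sketch of the reading side (rule 6), again in Python and again only an
illustration: the function name is made up, ordinals are assumed to be
zero-based (resets seen before the current one), and only the exact
byte count form of the "data" command is handled, not the delimited form.
Payloads are skipped so that a line inside a commit message that happens
to begin with "reset " is not counted.

```python
def branch_ordinals(stream: bytes) -> dict:
    """Map each ref named in a reset command to its creation ordinal."""
    ordinals = {}
    resets_seen = 0
    pos = 0
    while pos < len(stream):
        end = stream.find(b"\n", pos)
        if end == -1:
            end = len(stream)
        line = stream[pos:end]
        pos = end + 1
        if line.startswith(b"data "):
            # Skip the payload; it is not part of the command stream.
            pos += int(line[5:])
        elif line.startswith(b"reset "):
            ref = line[6:].decode()
            # Only the first reset of a ref counts as its creation.
            ordinals.setdefault(ref, resets_seen)
            resets_seen += 1
    return ordinals

stream = (
    b"reset refs/heads/master\n"
    b"commit refs/heads/master\n"
    b"committer A <a@example.com> 1 +0000\n"
    b"data 12\n"
    b"reset trick\n"
    b"\n"
    b"reset refs/heads/feature\n"
    b"from refs/heads/master\n"
)
print(branch_ordinals(stream))
# {'refs/heads/master': 0, 'refs/heads/feature': 1}
```

If the exporter always writes its resets in ordinal order (rule 5), a
reader counting this way recovers the same ordinals.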

I could write this code.
		Eric S. Raymond <http://www.catb.org/~esr/>