Web lists-archives.com

Re: Finer timestamps and serialization in git




"Eric S. Raymond" <esr@xxxxxxxxxxx> writes:
> Jakub Narebski <jnareb@xxxxxxxxx>:

>> As far as I understand it this would slow down receiving new commits
>> tremendously.  Currently great care is taken to not have to parse the
>> commit object during fetch or push if it is not necessary (thanks to
>> things such as reachability bitmaps, see e.g. [1]).
>> 
>> With this restriction you would need to parse each commit to get at
>> commit timestamp and committer, check if the committer+timestamp is
>> unique, and bump it if it is not.
>
> So, I'd want to measure that rather than simply assuming it's a blocker.
> Clocks are very cheap these days.

Clocks may be cheap, but parsing is not.

You can receive new commits in the repository by creating them, and from
other repository (via push or fetch).  In the second case you often get
many commits at once.

In [1] it is described how using "bitmap index" you can avoid parsing
commits when deciding which objects to send to the client; they can be
directly copied to the client (added to the packfile that is sent to
client).  Thanks to this reachability bitmap (bit vector) the time to
clone Linux repository decreased from 57 seconds to 1.6 seconds.

It is not a direct correspondence, but there most probably would be the
same problem with requiring fractional timestamp+committer identity to
be unique on the receiving side.

[1]: https://githubengineering.com/counting-objects/

>> Also, bumping timestamp means that the commit changed, means that its
>> contents-based ID changed, means that all commits that follow it needs
>> to have its contents changed...  And now you need to rewrite many
>> commits.
>
> What "commits that follow it?" By hypothesis, the incoming commit's
> timestamp is bumped (if it's bumped) when it's first added to a branch
> or branches, before there are following commits in the DAG.

Errr... the main problem is with distributed nature of Git, i.e. when
two repositories create different commits with the same
committer+timestamp value.  You receive commits on fetch or push, and
you receive many commits at once.

Say you have two repositories, and the history looks like this:

 repo A:   1<---2<---a<---x<---c<---d      <- master

 repo B:   1<---2<---X<---3<---4           <- master

When you push from repo A to repo B, or fetch in repo B from repo A you
would get the following DAG of revisions

 repo B:   1<---2<---X<---3<---4           <- master
                 \
                  \--a<---x<---c<---d      <- repo_A/master

Now let's assume that commits X and x have the came committer and the
same fractional timestamp, while being different commits.  Then you
would need to bump timestamp of 'x', changing the commit.  This means
that 'c' needs to be rewritten too, and 'd' also:

 repo B:   1<---2<---X<---3<---4           <- master
                 \
                  \--a<---x'<--c'<--d'     <- repo_A/master

And now for the final nail in the coffing of the Bazaar-esque idea of
changing commits on arrival.  Say that repository A created new commits,
and pushed them to B.  You would need to rewrite all future commits from
this repository too, and you would always fetch all commits starting
from the first "bumped"

 repo A:   1<---2<---a<---x<---c<---d<---E   <- master

transfer of [<---x<---c<---d<---E], instead of [<--E], because 'x', 'c',
and 'd' are missing in repo B.

 repo B:   1<---2<---X<---3<---4             <- master
                 \
                  \--a<---x'<--c'<--d'<--E'  <- repo_A/master

And there is yet another problem.  Let's assume that repo B created some
history on top of bump-rewritten commits:

 repo B:   1<---2<---X<---3<---4             <- master
                 \
                  \--a<---x'<--c'<--d'<--E'  <- repo_A/master
                                \
                                 \--5        <- next

Then if in repo A you fetch from repo B (remember, in Git there is no
concept of central repository), you would get the following history

                  /--X'<--3'<--4'            <- repo_B/master
                 /
 repo A:   1<---2<---a<---x<---c<---d<---E   <- master
                     \
                      \---x'<--c'
                                \
                                 \--5        <- repo_B/master

(because 'X' is now incoming, it needs to be "bumped", therefore
changing 3' and 4').

The history without all this rewriting looks like this:

                  /--X<---3<---4'            <- repo_B/master
                 /           
 repo A:   1<---2<---a<---x<---c<---d<---E   <- master
                                \
                                 \--5        <- repo_B/master

Notice the difference?

>>    And you also break the assumptions that the same commits have
>> the same contents (including date) and the same ID in different
>> repositories (some of which may include additional branches, some of
>> which may have been part of network of related repositories, etc.).

See repo A and repo B in above example.

> Wait...unless I completely misunderstand the hash-chain model, doesn't the
> hash of a commit depend on the hashes of its parents?  If that's the case,
> commits cannot have portable hashes. If it's not, please correct me.
>
> But if it's not, how does your first objection make sense?

Hash of a commit depend in hashes of its parents (Merkle tree). That is
why signing a commit (or a tag pointing to the commit) signs a whole
history of a commit.

>>> You don't need a daemon now to write commits to a repository. You can
>>> just add stuff to the object store, and then later flip the SHA-1 on a
>>> reference, we lock those indivdiual references, but this sort of thing
>>> would require a global write lock. This would introduce huge concurrency
>>> caveats that are non-issues now.
>>>
>>> Dumb clients matter. Now you can e.g. have two libgit2 processes writing
>>> to ref A and B respectively in the same repo, and they never have to
>>> know about each other or care about IPC.
>
> How do they know they're not writing to the same ref?  What keeps
> *that* operation atomic?

Because different refs are stored in different files (at least for
"live" refs that are stores in loose ref format).  The lock is taken on
ref (to update ref and its reflog in sync), there is no need to take
global lock on all refs.

>> You do realize that dates may not be monotonic (because of imperfections
>> in clock synchronization), thus the fact that the date is different from
>> parent does not mean that is different from ancestor.
>
> Good point. That means the O(log2 n) version of the check has to be done
> all the time.  Unfortunate.

Especially with around 1 million of commits (Linux kernel, Chromium,
AOSP), or even 3M commits (MS Windows repository).

>>>> That's the simple case. The complicated case is checking for date
>>>> collisions on *other* branches. But there are ways to make that fast,
>>>> too. There's a very obvious one involving a presort that is is O(log2
>>>> n) in the number of commits.
>> 
>> I don't think performance hit you would get would be acceptable.
>
> Again, it's bad practice to assume rather than measure. Human intuitions
> about this sort of thing are notoriously unreliable.

Techniques created to handle very large repositories (with respect to
number of commits) that make it possible for Git to avoid parsing commit
objects, namely bitmap index (for 'git fetch'/'clone') and serialized
commit graph (for 'git log') lead to _significant_ performance
improvements.

The performance changes from "waiting for Git to finish" to "done in the
blink of eye" (well, almost).

>>>> Excuse me, but your premise is incorrect.  A git DAG isn't just "any" DAG.
>>>> The presence of timestamps makes a total ordering possible.
>>>>
>>>> (I was a theoretical mathematician in a former life. This is all very
>>>> familiar ground to me.)
>> 
>> Maybe in theory, when all clock are synchronized.
>
> My assertion does not depend on synchronized clocks, because it doesn't have to.
>
> If the timestamps in your repo are unique, there *is* a total ordering - 
> by timestamp. What you don't get is guaranteed consistency with the
> topo ordering - that is you get no guarantee that a child's timestamp
> is greater than its parents'. That really would require a common
> timebase.
>
> But I don't need that stronger property, because the purpose of
> totally ordering the repo is to guarantee the uniqueness of action
> stamps.  For that, all I need is to be able to generate a unique cookie
> for each commit that can be inserted in its action stamp.

For cookie to be unique among all forks / clones of the same repository
you need either centralized naming server, or for the cookie to be based
on contents of the commit (i.e. be a hash function).

>                                                          For my use cases
> that cookie should *not* be a hash, because hashes always break N years
> down.  It should be an eternally stable product of the commit metadata.

Well, the idea for SHA-1 <--> NewHash == SHA-256 transition is to avoid
having a flag day, and providing full interoperability between
repositories and Git installations using the old hash ad using new
hash^1.  This will be done internally by using SHA-1 <--> SHA-256
mapping.  So after the transition all you need is to publish this
mapping somewhere, be it with Internet Archive or Software Heritage.
Problem solved.

P.S. Could you explain to me how one can use action stamp, e.g.
<esr@xxxxxxxxxxx!2019-05-15T20:01:15.473209800Z>, to quickly find the
commit it refers to?  With SHA-1 id you have either filesystem pathname
or the index file for pack to find it _fast_.

Footnotes:
----------
1. That is why where would be no "major format break", thus no place for
   incompatibile format changes.

Best,
--
Jakub Narębski