Web lists-archives.com

Re: [RFC] Generation Number v2




Derrick Stolee <stolee@xxxxxxxxx> writes:
> On 10/31/2018 8:54 AM, Ævar Arnfjörð Bjarmason wrote:
>> On Tue, Oct 30 2018, Junio C Hamano wrote:
>>> Derrick Stolee <stolee@xxxxxxxxx> writes:
>>>>
>>>> In contrast, maximum generation numbers and corrected commit
>>>> dates both performed quite well. They are frequently the top
>>>> two performing indexes, and rarely significantly different.
>>>>
>>>> The trade-off here now seems to be: which _property_ is more important,
>>>> locally-computable or backwards-compatible?
>>>
>>> Nice summary.
>>>
>>> As I already said, I personally do not think being compatible with
>>> currently deployed clients is important at all (primarily because I
>>> still consider the whole thing experimental), and there is a clear
>>> way forward once we correct the mistake of not having a version
>>> number in the file format that tells the updated clients to ignore
>>> the generation numbers.  For longer term viability, we should pick
>>> something that is immutable, reproducible, computable with minimum
>>> input---all of which would lead to being incrementally computable, I
>>> would think.
>>
>> I think it depends on what we mean by backwards compatibility. None of
>> our docs are saying this is experimental right now, just that it's
>> opt-in like so many other git-config(1) options.
>>
>> So if we mean breaking backwards compatibility in that we'll write a new
>> file or clobber the existing one with a version older clients can't use
>> as an optimization, fine.
>>
>> But it would be bad to produce a hard error on older clients, but
>> avoiding that seems as easy as just creating a "commit-graph2" file in
>> .git/objects/info/.
>
> Well, we have a 1-byte version number following the "CGPH" header in
> the commit-graph file, and clients will ignore the commit-graph file
> if that number is not "1". My hope for backwards-compatibility was
> to avoid incrementing this value and instead use the unused 8th byte.

How?  Some of the considered new generation numbers were backwards
compatibile in the sense that for graph traversal the old code for
(minimum) generation numbers could be used.  But that is for reading.
If the old client tried to update new generation number using old
generation number algorithm (assuming locality), it would write some
chimaera that may or may not work.

> However, it appears that we are destined to increment that version
> number, anyway. Here is my list for what needs to be in the next
> version of the commit-graph file format:

As a reminder, here is how the commit-graph format looks now [1]

  HEADER:
    4-byte signature:                  CGPH
    1-byte version number:             1
    1-byte Hash Version:               1 = SHA-1
    1-byte number (C) of "chunks":     3 or 4
    1-byte (reserved for later use)    <ignored>
  CHUNK LOOKUP:
    (C + 1) * 12 bytes listing the table of contents for the chunks
  CHUNK DATA:
    OIDF
    OIDL
    CDAT
    EDGE [optional, depends on graph structure]
  TRAILER:
    checksum

[1]: Documentation/technical/commit-graph-format.txt


> 1. A four-byte hash version.

Why do you need a four-byte hash version?  255 possible hash functions
is not enough?

> 2. File incrementality (split commit-graph).

This would probably require a segment lookup part, and one chunk of each
type per segment (or even one chunk lookup table per segment).  It may
or may not need generation number which can support segmenting (here
maximal generation numbers have the advantage); but you can assume that
if segment(A) > segment(B) then reach.ix(A) > reach.ix(B),
i.e. lexicographical-like sort.

> 3. Reachability Index versioning

I assume that you mean here versioning for the reachability index that
is stored in the CDAT section (if it is in a separate chunk, it should
be easy to version).

The easiest thing to do when encountering unknown or not supported
generation number (perhaps the commit-graph file was created in the past
by earlier version of Git, perahs it came from server with newer Git, or
perhaps the same repository is sometimes accessed using newer and
sometimes older Git version), is to set commit->generation_number to
GENERATION_NUMBER_ZERO, as you wrote in [2].

We could encode backward-compatibility (for using) somewhat, but I
wonder how much it would be of use.

[2]: https://public-inbox.org/git/61a829ce-0d29-81c9-880e-7aef1bec916e@xxxxxxxxx/

> Most of these changes will happen in the file header. The chunks
> themselves don't need to change, but some chunks may be added that
> only make sense in v2 commit-graphs.

What I would like to see in v2 commit-graph format would be some kind
convention for naming / categorizing chunks, so that Git would be able
to handle unknown chunks gracefully.

In PNG format there are various types of chunks.  Quoting Wikipedia:

  Chunk types [in PNG format] are given a four-letter case sensitive
  ASCII type/name; compare FourCC. The case of the different letters in
  the name (bit 5 of the numeric value of the character) is a bit field
  that provides the decoder with some information on the nature of
  chunks it does not recognize.

  The case of the first letter indicates whether the chunk is critical
  or not. If the first letter is uppercase, the chunk is critical; if
  not, the chunk is ancillary. Critical chunks contain information that
  is necessary to read the file. If a decoder encounters a critical
  chunk it does not recognize, it must abort reading the file or supply
  the user with an appropriate warning.

Currently this is solved with the format version.  For given format
version some chunks are necessary (critical); if there would be another
type of chunk that must be understood, we can simply increase format
version. 

  The case of the second letter indicates whether the chunk is "public"
  (either in the specification or the registry of special-purpose public
  chunks) or "private" (not standardised). Uppercase is public and
  lowercase is private. This ensures that public and private chunk names
  can never conflict with each other (although two private chunk names
  could conflict).

For Git this could be "official" and "experimental"... though I don't
see people sharing commit-graph files with experimental chunks that can
be used only by some local unpublished fork of Git.

  The third letter must be uppercase to conform to the PNG
  specification. It is reserved for future expansion. Decoders should
  treat a chunk with a lower case third letter the same as any other
  unrecognised chunk.

In short: reserved.  This could be the fate of all of those 4 flags that
we do not need.

  The case of the fourth letter indicates whether the chunk is safe to
  copy by editors that do not recognize it. If lowercase, the chunk may
  be safely copied regardless of the extent of modifications to the
  file. If uppercase, it may only be copied if the modifications have
  not touched any critical chunks.

If the data contained in the chunks is immutable, then it could be
copied when updating commit-graph file... but what to do with new
commits, with what to fill the data for a new commit for a copied
unknown chunk?  All zeros?  All ones?  Some value defined in the chunk
itself?

Best,
-- 
Jakub Narębski