Re: reftable [v5]: new ref storage format




On Sun, Aug 6, 2017 at 9:56 AM, Ævar Arnfjörð Bjarmason
<avarab@xxxxxxxxx> wrote:
> On Sun, Aug 06 2017, Shawn Pearce jotted:
>
>> 5th iteration of the reftable storage format.
>
> I haven't kept up with all of the discussion, sorry if these comments
> repeat something that's already mentioned.
>
>> ### Version 1
>>
>> A repository must set its `$GIT_DIR/config` to configure reftable:
>>
>>     [core]
>>         repositoryformatversion = 1
>>     [extensions]
>>         reftable = true
>
> David Turner's LMDB proposal specified an extensions.refStorage config
> variable instead. I think this is a much better idea, cf. the mistake we
> already made with grep.extendedRegexp & grep.patternType. I.e. to have
> 'extensions.refStorage = reftable' instead of 'extensions.reftable =
> true'.
>
> If we grow another storage backend this'll become messy, and it won't be
> obvious to the user that the configuration is mutually exclusive (which
> it surely will be), so we'll end up having to special case it similar to
> the grep.[extendedRegexp,patternType] (i.e. either make one override the
> other, or make specifying >1 an error, a hassle with the config API).

Good catch. I've fixed this to use extensions.refStorage. Thanks!
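
Spelled out, and assuming the value is simply the backend name as in
your example above, the config would read roughly like what this sketch
writes. The scratch file is only for illustration; it is not a real
`$GIT_DIR/config`.

```shell
#!/bin/sh
# Sketch only: write the proposed config into a scratch file.
# The value "reftable" follows the 'extensions.refStorage = reftable'
# suggestion quoted above; it is an assumption, not a final spelling.
cfg=$(mktemp)
cat >"$cfg" <<'EOF'
[core]
    repositoryformatversion = 1
[extensions]
    refStorage = reftable
EOF
cat "$cfg"
```

A second backend would then be a different value of the same key rather
than a second boolean, so the two can't both be enabled by accident.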


>> Performance testing indicates reftable is faster for lookups (51%
>> faster, 11.2 usec vs.  5.4 usec), although reftable produces a
>> slightly larger file (+ ~3.2%, 28.3M vs 29.2M):
>>
>> format    |  size  | seek cold | seek hot  |
>> ---------:|-------:|----------:|----------:|
>> mh-alt    | 28.3 M | 23.4 usec | 11.2 usec |
>> reftable  | 29.2 M | 19.9 usec |  5.4 usec |
>>
>> [mh-alt]: https://public-inbox.org/git/CAMy9T_HCnyc1g8XWOOWhe7nN0aEFyyBskV2aOMb_fe+wGvEJ7A@xxxxxxxxxxxxxx/
>
> Might be worth noting "based on WIP Java implementation". I started
> searching for patches for this new format & found via
> <CAJo=hJtrdCOF-RxzXfyLx7R-1f2-7pZVO_UOg28J=wUDNdf3yw@xxxxxxxxxxxxxx>
> that it's JGit only.
>
> Also if one wanted to run these tests via JGit using your WIP code where
> does that code live / how to test it?

git fetch https://googlers.googlesource.com/sop/jgit reftable mh-chunk

The reftable branch has my code; mh-chunk has the WIP I did for the
experiments above.

Running from tip of JGit is ... interesting? I load the workspace into
Eclipse and let Eclipse compile, and then use a shell script to pull
in the relevant classes:

--snip--
#!/bin/sh

S=$HOME/git/jgit
C=$S/org.eclipse.jgit/bin
C=$C:$S/org.eclipse.jgit.pgm/bin
C=$C:$S/org.eclipse.jgit.http.apache/bin
C=$C:$S/org.eclipse.jgit.lfs/bin
C=$C:$S/org.eclipse.jgit.ui/bin
C=$C:$HOME/Downloads/slf4j-1.7.13/slf4j-api-1.7.13.jar
C=$C:$HOME/Downloads/slf4j-1.7.13/slf4j-simple-1.7.13.jar
C=$C:$HOME/Documents/jgit/.metadata/.plugins/org.eclipse.pde.core/.bundle_pool/plugins/org.kohsuke.args4j_2.0.21.v201301150030.jar
C=$C:$HOME/Documents/jgit/.metadata/.plugins/org.eclipse.pde.core/.bundle_pool/plugins/com.jcraft.jsch_0.1.54.v20170116-1932.jar

exec java -Xmx1g -Xms1g -cp "$C" org.eclipse.jgit.pgm.Main "$@"
--snap--

It's commands like:

  ./jgit.sh debug-write-reftable ~/foo.refs ~/foo.reftable

to convert an ls-remote style output into a reftable. Then to benchmark:

  ./jgit.sh debug-benchmark-reftable \
    --test=SEEK_HOT --ref=refs/heads/master \
    --tries=60000 \
    ~/foo.refs ~/foo.reftable
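
If you don't have a big repository handy to run ls-remote against, a
synthetic input can be generated along these lines. This is a sketch:
`gen_refs` is a made-up helper, and it assumes the input is one
`<sha-1><TAB><refname>` line per ref, sorted by name, as `git ls-remote`
prints. The object ids are fake.

```shell
#!/bin/sh
# gen_refs N: print N fake "<40-hex-id><TAB><refname>" lines.
# Zero-padding the branch number keeps the names in sorted order.
gen_refs() {
  n=$1
  i=1
  while [ "$i" -le "$n" ]; do
    printf '%040x\trefs/heads/branch-%07d\n' "$i" "$i"
    i=$((i + 1))
  done
}

gen_refs 3
```

Redirect the output to a file (e.g. `gen_refs 100000 >~/foo.refs`) and
pass that as the first argument to the commands above.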


>> ### LMDB
>>
>> David Turner proposed [using LMDB][dt-lmdb], as LMDB is lightweight
>> (64k of runtime code) and has a GPL-compatible license.
>>
>> A downside of LMDB is its reliance on a single C implementation.  This
>> makes embedding inside JGit (a popular reimplementation of Git)
>> difficult, and hoisting onto virtual storage (for JGit DFS) virtually
>> impossible.
>
> This rationale as stated reads a bit too much like https://xkcd.com/927/

Hah. True. :)

But it's technically correct. The best kind of correct.
https://www.youtube.com/watch?v=hou0lU8WMgo

> I.e. surely the actual problem isn't that there's a single C
> implementation of LMDB, since that's one more than the C implementation
> that exists of this new format already.

Fair point, but I think this format is easier to implement than LMDB.
We also had bitmap indexes in JGit a year before we had them in C git.

> Also isn't this info out of date now that this exists:
> https://github.com/lmdbjava/lmdbjava ? That project has been implemented
> after David's initial LMDB patches on-list, but I don't know if it
> implements the subset of the LMDB format needed for his proposed ref
> storage.

Looks pretty complete. It's a Java wrapper around the C implementation
of LMDB, which may be sufficient for reference storage. Keys are
limited to 511 bytes, so insanely long reference names would have to
be rejected. Reftable allows reference names up to the file's
`page_size`, minus overhead (~15 bytes) and value (20 bytes).
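
To put rough numbers on that, here is a small sketch of the arithmetic.
The `page_size` of 4096 is an assumed example value, the 15-byte
overhead is the approximate figure above, and `fits` is a hypothetical
helper, not part of either implementation.

```shell
#!/bin/sh
page_size=4096      # assumed example; the real value is per file
overhead=15         # approximate per-record overhead, as noted above
value_len=20        # SHA-1 value
lmdb_max_key=511    # LMDB key size limit mentioned above

reftable_max_name=$((page_size - overhead - value_len))
echo "reftable max name: $reftable_max_name bytes"
echo "LMDB max key:      $lmdb_max_key bytes"

# fits <refname> <limit>: succeed if the name is at most <limit> bytes.
fits() {
  len=$(( $(printf '%s' "$1" | wc -c) ))
  [ "$len" -le "$2" ]
}

# A 611-byte name: "refs/heads/" plus 600 x's.
long=refs/heads/$(printf '%0600d' 0 | tr 0 x)
fits "$long" "$lmdb_max_key" || echo "too long for LMDB"
fits "$long" "$reftable_max_name" && echo "fits in reftable"
```

With a smaller `page_size` the reftable limit shrinks accordingly, so
the gap depends on how the file is written.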

A downside for JGit is getting these two open source projects cleared.
We would have to get approval from our sponsor (Eclipse Foundation) to
use both lmdbjava (Apache License) and LMDB (LMDB license). Plus it
looks like lmdbjava still relies on local disk and isn't giving us a
way to patch in a virtual filesystem the way I need to at $DAY_JOB.


$DAY_JOB is likely to put reftable into production in the coming
month, even if we don't have consensus about using the format in
git-core.