Web lists-archives.com

Re: reftable: new ref storage format




On Thu, Jul 13, 2017 at 12:32 PM, Jeff King <peff@xxxxxxxx> wrote:
> On Wed, Jul 12, 2017 at 05:17:58PM -0700, Shawn Pearce wrote:
>
>> ### Problem statement
>>
>> Some repositories contain a lot of references (e.g.  android at 866k,
>> rails at 31k).  The existing packed-refs format takes up a lot of
>> space (e.g.  62M), and does not scale with additional references.
>> Lookup of a single reference requires linearly scanning the file.
>
> I think the linear scan is actually an implementation short-coming. Even
> though the records aren't fixed-length, the fact that newlines can only
> appear as end-of-record is sufficient to mmap and binary search a
> packed-refs file (you just have to backtrack a little when you land in
> the middle of a record).
>
> I wrote a proof of concept a while ago, but got stuck on integrating it
> into the ref code, because of some of the assumptions that it made.
> Michael Haggerty has been doing several rounds of refactors to remove
> those assumptions. I think we're pretty close (I've actually seen the
> endgame where packed-refs is fully binary searched, but I think there
> are a few more cleanups necessary to cover all cases).

You are correct, this is possible with the current packed-refs format.
It just hasn't materialized in a shipping implementation yet.


>> Atomic pushes modifying multiple references require copying the
>> entire packed-refs file, which can be a considerable amount of data
>> moved (e.g. 62M in, 62M out) for even small transactions (2 refs
>> modified).
>
> I think your definition of atomic here doesn't match what git.git does.

:-(


> Our atomic push just takes the lock on all of the refs, and then once it
> has all of them, commits all of the locks. So it's atomic in the sense
> that you either get all or none of the writes (modulo a commit failure
> in the middle, which we naturally have no rollback plan for). But it can
> be done without touching the packed-refs file at all.
>
> I imagine that you're looking at atomicity from the perspective of a
> reader. In the git.git scheme, the reader may see a half-committed
> transaction. If you dispense with loose refs entirely and treat the
> packed-refs file as a single poorly-implemented key/value database, then
> you get reader atomicity (but O(size_of_database) write performance).

Yes, I was hoping for reader atomicity. But I may OK foregoing that if
the transaction either all goes through, or all fails. A partially
stuck transaction because the process died in the middle of the commit
step creates a mess for an administrator to undo. Does she rename
"foo.lock" to "foo"? Or delete "foo.lock"?


>> Repositories with many loose references occupy a large number of disk
>> blocks from the local file system, as each reference is its own file
>> storing 41 bytes.  This negatively affects the number of inodes
>> available when a large number of repositories are stored on the same
>> filesystem.  Readers are also penalized due to the larger number of
>> syscalls required to traverse and read the `$GIT_DIR/refs` directory.
>
> In my experience, the syscalls involved in loose refs aren't usually a
> big deal. If you have 800k refs, they're not all changing constantly. So
> a single pack-refs "fixes" performance going forward. What _is_ a big
> deal is that the packing process is complicated, readers have a very
> un-atomic view because of the myriad of files involved, and you get
> annoying lock contention during packing, as well as between deletions
> that have to rewrite packed-refs.
>
> But I'm not sure if you meant to contrast here a system where we didn't
> use packed-refs at all (though of course such a system is very much not
> atomic by the definition above).

No, I really did mean the current system. Gerrit Code Review servers
create a lot of references throughout the day. Its easy to accumulate
a few thousand new loose references in a 24 hour period. Even if you
have 600k existing refs in packed-refs, you still have 2k new/modified
refs since last nightly cron ran git gc.


>> ### Objectives
>>
>> - Near constant time lookup for any single reference, even when the
>>   repository is cold and not in process or kernel cache.
>
> Good goal, though TBH I'd be happy with O(log n).
>
> A related one is being able to traverse a subset of refs in
> O(nr_traversed). E.g., "git tag -l" should not have to do work
> proportional to what is in refs/changes. That falls out of most
> proposals that allow fast lookups, but notably not a straight
> hash-table.

Thanks, I missed that in this list, even though it was an explicit
objective going into this work. I added:

- Efficient lookup of an entire namespace, such as `refs/tags/`.


>> - Occupy less disk space for large repositories.
>
> Good goal.  Just to play devil's advocate, the simplest way to do that
> with the current code would be to gzip packed-refs (and/or store sha1s
> as binary). That works against the "mmap and binary search" plan,
> though. :)

Yes it does. I tried to cover that later under "Alternatives
considered > bzip". :)


>> ### Reference name encoding
>>
>> Reference names should be encoded with UTF-8.
>
> Don't we usually treat refnames as byte sequences (subject to a few
> rules, as in check_ref_format())? It seems like the encoding should be
> out-of-scope for the storage format.

True that git-core treats them as byte sequences, but JGit treats them as UTF-8.


>> ## File format
>
> OK, let me try to summarize to see if I understand.
>
> The reftable file is a sequence of blocks, each of which contains a
> finite set of heavily-compressed refs. You have to read each block
> sequentially, but since they're a fixed size, that's still a
> constant-time operation (I'm ignoring the "restarts" thing for now). You
> find the right block by reading the index.  So lookup really is more
> like O(block_size * log(n/block_size)), but block_size being a constant,
> it drops out to O(log n).

Yes. My "near constant time" claim was because I had my head buried
thinking about disk IO operations when I wrote this, not the algorithm
that happens on the CPU. One of the applications for reftable is on a
slower-than-usual storage system, where reading a block of a file
costs enough milliseconds that it doesn't matter how good or bad the
CPU algorithm is.


> Linear scans are easy, because everything is in sorted order. So you
> just find the first entry via binary search, and then walk forward.

Yup.


> Updates are where things get dicier. It looks like you just write a new
> partial reftable file with your updates. And then if there are N
> reftables present, readers actually have to do a list-merge of the
> results they get from all of them (where the results from reftable.5
> trump ones from reftable.4).

Correct.


> So basically we're just journaling updates into a directory of atomic
> reftable updates. And then to keep the reader's job from getting too
> painful, a write occasionally has to compact into a single reftable,
> rewriting the entire ref store.

Correct.


> That's what I see as the biggest
> weakness here. If you keep too large a reftable stack, then readers have
> to spend a lot of extra effort on lookups. But if you keep too small a
> stack, then you are frequently rewriting the whole database.

But you say this yourself below, you can do this using a geometric
scheme or something to bound the cost of these rewrites such that you
aren't frequently rewriting the whole database.


> Technically writes are still O(n). Because of the journaling you
> amortize the whole-rewrite cost across several updates, but it's still
> O(n/c). That seems like the biggest weakness of the scheme to me.
>
> I think there's some cleverness you can use with compacting in a
> geometric scheme, though, to amortize up to a certain bound. I didn't
> see any discussion of that, though.

I think I left this as an exercise to the implementer. The
"Transactions > Compaction" section talks about the compaction
algorithm being applied only near the top of the stack, ignoring the
base table(s).


>> Compaction is similar to the update process, but an explicit temporary
>> file must be used:
>>
>> 1. Atomically create `$GIT_DIR/reftable.lock`.
>> 2. `readdir($GIT_DIR)` to determine the highest suffix ordinal, `n`.
>> 3. Compute the update transaction (e.g. compare expected values).
>> 4. Select files from (2) to collapse as part of this transaction.
>> 5. Create temp file by `mktemp("$GIT_DIR/.reftableXXXXXX")`.
>> 6. Write modified and collapsed references to temp file.
>> 7. Rename temp file to `reftable.${n + 1}`.
>> 8. Delete collapsed files `reftable.${n}`, `reftable.${n - 1}`, ...
>> 9. Delete `reftable.lock`.
>
> I had originally assumed you'd just compact back down to the reftable
> file after some N updates (say, 10). But here, it looks like you'd
> always compact 0-9 into 10, and then 10-19 into 20, and so on, and the
> ordinal would go up forever.
>
> I think that's OK, as it would take a long time to get unwieldy. And I
> think you have to do it that way, as you can't atomically replace
> "reftable" and delete .1-.9 at the same time.
>
>> Because `reftable.9` can disappear after `reftable.10` is created,
>> readers receiving ENOENT when opening `reftable.9` must peform
>> another readdir to look for new reftables.
>
> But after compaction, won't having "reftable.10" but no ".9" be the
> steady state?

Yes, it will be. You'd have maybe "reftable" and "reftable.10", and
nothing else.


> As a reader, how can I tell the difference between these
> two cases:
>
>   1. Somebody created .10 and deleted .9 and lower.
>
>   2. Somebody created .11 and deleted .10 and lower, while I was trying
>      to read .9.
>
> Is basically every read going to require:
>
>   1. readdir to find the highest ordinal
>
>   2. keep walking down the stack until you get ENOENT
>
>   3. readdir again to make sure there's not a new ordinal
>
> But in that case, if step 3 turns up a new reftable.11, how do I know
> whether it's a compaction (in which case I need to restart my read from
> .11) or if it's just another update-on-top? In a busy repository, you
> might see a lot of update-on-tops.

I think I was imaging updates are less frequent than reads, and a
reader is going to readdir(), and then immediately open every file in
the stack to setup the merge-join iteration. If the reader retains the
file descriptor, the reader can keep that file in their stack.

There is risk of a reader live-locking; the reader might have done a
readdir, starts opening the stack, sees ENOENT. In which case the
reader starts over. If an updater is performing compactions faster
than a reader can readdir and open paths, its live-lock for the
reader. That certainly is one motivation to not always perform a
compaction.


>> [...specifics...]
>
> I liked a lot of what I saw in the rest of it (e.g., handling symrefs,
> which packed-refs does not). Some bits seemed complicated. E.g., I
> actually wonder how much restarts help in practice if you have
> reasonably-sized blocks, and they complicate things a lot).

My implementation lets me tweak the restart table with a command line
option, so I was able to run a number of experiments for you. A 64k
block doesn't fit more than a few thousand references, so restart 4000
effectively disables restarts.

block  restart  size  lookup
4k     16       29M    90.2 usec
8k     16       29M    76.4 usec

64k    16       29M   147.7 usec
64k    64       28M   134.3 usec
64k    256     27M   143.4 usec

4k     4000     28M   104.0 usec
8k     4000     28M   117.1 usec
64k   4000     27M   288.5 usec

Turning off restarts shrinks the file, and increases lookup time.

For $REASONS, I favor a larger block size in some cases, even though
the lookup times get worse. For example, being able to use 64k/64 may
be a sweet spot for that particular IO system I mentioned above.

Fun fact: gzip packed-refs for this data set is 27M. The 64k/256 is
only 432 KiB larger than gzip (default compression).


> Likewise
> some bits are optional for very small reftable files to reduce overhead.
> But if you have very small reftables, it's going to be fast either way.
> If you waste 4K to store 200 bytes, that's fine

Not really. I store a lot of very small pack files, the 1k idx v2
header is pretty annoying for these. It dwarfs the rest of the
information, in cases it dwarfs the pack file itself. Not being able
to forgo the fanout table when you have a pack of 4 objects of 3 trees
and 1 commit is pretty damn annoying.

> as long as you're still
> wasting only 4K when you store 200 megabytes.

I don't think this is fair. Its better thought of as a ratio.

It depends on the parameters to the writer, but reftable was "wasting"
anywhere between 20K-44K of NUL byte padding for the various
experiments above, and the index was anywhere from 12K-185K, depending
on the block size (smaller block size == larger index).

Wasting 44K on padding, 12K on index, to compress 62M down to 27M...
a penalty of 0.2% of that 27M. That seems acceptable.

5 refs in reftable is ~204 bytes, because of the optional features
being disabled on small files. If reftable was forced to fill out to a
full 4K block, that is a penalty of 1907%. This might seem like
nothing, but for cases where the table has to travel on the network,
or is being stored in a tail-block-optimized filesystem, its a huge
waste to pad the file out.


> I also realize that beggars can't be choosers. If you have a working
> system that performs well, I should consider shutting up. :)

I have it in Java for JGit; I don't yet have a C implementation.


> One thing I didn't see is reflogs. They don't strictly have to be part
> of a ref-storage solution. But they do still consume at least one inode
> per ref in the current system. If you reflog everything, which in my
> opinion you should. Having an audit trail of ref updates is very useful
> for debugging (both bugs in Git, and trying to figure out what went
> wrong when a user goes berserk with "git push --prune").

Yea, I glossed over that and ignored them. Mostly because one system
where I want to use reftable already has a separate database handling
the reflogs. In another (Gerrit Code Review), we disable reflogs for
the insane refs/changes/ namespace, as nearly every reference is
created once, and never modified.

One could abuse the reftable format to store a reflog. If the refname
field is just a sequence of bytes, one could store keys like refname +
'\0' + timestamp, and reuse the symbolic ref value format to store the
old/new/message, as its just a length delimited string.

I'm loath to allocate more bits to denote a reflog entry vs. ref entry
in the same file, but I could see some advantages to that. Commits
would only have to write one new reftable for the combined update +
log record.