[PATCH 00/17] [RFC] Commit-graph: Write incremental files
- Date: Wed, 08 May 2019 08:53:47 -0700 (PDT)
- From: "Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx>
- Subject: [PATCH 00/17] [RFC] Commit-graph: Write incremental files
This patch series is marked as RFC quality because it is missing some key
features and tests, but hopefully starts a concrete discussion of how the
incremental commit-graph writes can work. If this is a good direction, then
it would replace ds/commit-graph-format-v2.
The commit-graph is a valuable performance feature for repos with large
commit histories, but suffers from the same problem as git repack: it
rewrites the entire file every time. This can be slow when there are
millions of commits, especially after we stopped reading from the
commit-graph file during a write in 43d3561 (commit-graph write: don't die
if the existing graph is corrupt).
Instead, create a "stack" of commit-graphs where the existing commit-graph
file is "level 0" and other levels are in files
$OBJDIR/info/commit-graphs/commit-graph-N for positive N. Each level is
closed under reachability with its lower levels, and the idea of "graph
position" now considers the concatenation of the commit orders from each
level. See PATCH 12 for more details.
When writing, we don't always want to add a new level to the stack. This
would eventually result in performance degradation, especially when
searching for a commit (before we know its graph position). We decide to
merge levels of the stack when the new commits we will write satisfy two
1. The expected size of the new file is more than half the size of the tip
of the stack.
2. The new file contains more than 64,000 commits.
The first condition alone would prevent more than a logarithmic number of
levels. The second condition is a stop-gap to prevent performance issues
when another process starts reading the commit-graph stack as we are merging
a large stack of commit-graph files. The reading process could be in a state
where the new file is not ready, but the levels above the new file were
already deleted. Thus, the commits that were merged down must be parsed from
The performance is necessarily amortized across multiple writes, so I tested
by writing commit-graphs from the (non-rc) tags in the Linux repo. My test
included 72 tags, and wrote everything reachable from the tag using
--stdin-commits. Here are the overall perf numbers:
git commit-graph write --stdin-commits: 8m 12s
git commit-graph write --stdin-commits --split: 45s
The test using --split included at least six full collapses to the full
commit-graph. I believe the commit-graph stack had at most three levels
during this test.
This series is long because I felt the need to refactor write_commit_graph()
before making such a sweeping change to the format.
* Patches 1-4: these are small changes which either fix issues or just
provide clean-up. These are mostly borrowed from
* Patches 5-11: these provide a non-functional refactor of
write_commit_graph() into several methods using a "struct
write_commit_graph_context" to share across the methods.
* Patches 12-16: Implement the split commit-graph feature.
* Patch 17: Demonstrate the value by writing a split commit-graph during
git fetch when the new config setting fetch.writeCommitGraph is true.
TODO: There are several things missing that need to be added before this
series is ready for full review and merging:
1. The documentation for git commit-graph needs updating for the --split
2. We likely want config settings for the merge strategy. This is mentioned
in the design doc, and could be saved for later.
3. We want to update the git commit-graph verify subcommand to understand
the commit-graph stack and optionally only verify the tip of the stack.
This allows faster (amortized) verification if we are verifying
immediately after writes and trusting the files at rest.
4. It would be helpful to add a new optional chunk that contains the
trailing hash for the lower level of the commit-graph stack. This chunk
would only be for the commit-graph-N files, and would provide a simple
way to check that the stack is valid on read, in case we are still
worried about other processes reading/writing in the wrong order.
5. Currently, --split essentially implies --append since we either (a)
don't change the existing stack and only add commits, or (b) add all
existing commits while merging files. However, if you would use --append
with --split, the append logic will trigger a merge with the current tip
(at minimum). Some care should be taken to make this more clear.
commit-graph write: don't die if the existing graph is corrupt
Derrick Stolee (17):
commit-graph: fix the_repository reference
commit-graph: return with errors during write
commit-graph: collapse parameters into flags
commit-graph: remove Future Work section
commit-graph: create write_commit_graph_context
commit-graph: extract fill_oids_from_packs()
commit-graph: extract fill_oids_from_commit_hex()
commit-graph: extract fill_oids_from_all_packs()
commit-graph: extract count_distinct_commits()
commit-graph: extract copy_oids_to_commits()
commit-graph: extract write_commit_graph_file()
Documentation: describe split commit-graphs
commit-graph: lay groundwork for incremental files
commit-graph: load split commit-graph files
commit-graph: write split commit-graph files
commit-graph: add --split option
fetch: add fetch.writeCommitGraph config setting
Documentation/technical/commit-graph.txt | 157 +++-
builtin/commit-graph.c | 31 +-
builtin/commit.c | 5 +-
builtin/fetch.c | 17 +
builtin/gc.c | 7 +-
commit-graph.c | 946 ++++++++++++++++-------
commit-graph.h | 19 +-
commit.c | 2 +-
t/t5318-commit-graph.sh | 28 +-
9 files changed, 887 insertions(+), 325 deletions(-)
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-184/derrickstolee/graph/incremental-v1