Proposal for "fetch-any-blob Git protocol" and server design




As described in "Background" below, there have been at least two patch sets to support "partial clones" and on-demand blob fetches, in which the server side of on-demand blob fetches was treated at least in outline. This proposal treats that server side in detail.

== Background

The desire for Git to support (i) missing blobs and (ii) fetching them as needed from a remote repository has surfaced on the mailing list a few times, most recently in the form of RFC patch sets [1] [2].

A local repository that supports (i) will be created by a "partial clone", that is, a clone with some special parameters (exact parameters are still being discussed) that does not download all blobs normally downloaded. Such a repository should support (ii), which is what this proposal describes.

== Design

A new endpoint "server" is created. The client will send a message in the following format:

----
fbp-request = PKT-LINE("fetch-blob-pack")
              1*want
              flush-pkt
want = PKT-LINE("want" SP obj-id)
----

The client sends one or more "want" lines, each naming the SHA-1 of a blob that it wants, followed by a flush-pkt.
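The request grammar above can be sketched as follows. This is a minimal illustration, not part of any patch set: the helper names `pkt_line` and `build_fetch_blob_pack_request` are hypothetical, and the sketch follows the grammar literally (no trailing LF inside each pkt-line), which a real implementation might choose to do differently.

```python
FLUSH_PKT = b"0000"          # a flush-pkt is the special length "0000"
MAX_PKT_LEN = 65520          # largest total pkt-line length (header included)


def pkt_line(payload: bytes) -> bytes:
    """Frame a payload as a pkt-line: 4 hex digits of total length, then the payload."""
    length = len(payload) + 4  # the length field counts its own 4 bytes
    if length > MAX_PKT_LEN:
        raise ValueError("payload too long for a single pkt-line")
    return b"%04x" % length + payload


def build_fetch_blob_pack_request(blob_ids):
    """Build an fbp-request: the command line, 1*want, then a flush-pkt."""
    if not blob_ids:
        raise ValueError("the grammar requires at least one want line")
    parts = [pkt_line(b"fetch-blob-pack")]
    for oid in blob_ids:
        # Only 40-digit lowercase hex SHA-1s are accepted in this sketch.
        if len(oid) != 40 or any(c not in "0123456789abcdef" for c in oid):
            raise ValueError("not a 40-digit hex SHA-1: %r" % (oid,))
        parts.append(pkt_line(b"want " + oid.encode("ascii")))
    parts.append(FLUSH_PKT)
    return b"".join(parts)
```

For a single blob, the request is three pieces on the wire: the "fetch-blob-pack" pkt-line, one "want" pkt-line, and the closing "0000".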

The server will then reply:

----
server-reply = flush-pkt | PKT-LINE("ERR" SP message)
----

If there was no error, the server will then send the requested blobs in a packfile, formatted as described in "Packfile Data" in pack-protocol.txt with "side-band-64k" enabled.
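A client-side reader for this reply might look like the sketch below. It assumes the usual side-band convention (band 1 carries pack data, band 2 progress messages, band 3 a fatal error) and that the multiplexed stream ends with a flush-pkt; the function names `read_pkt_line` and `read_blob_pack_reply` are hypothetical.

```python
def read_pkt_line(stream):
    """Read one pkt-line from a binary stream; return None for a flush-pkt."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("truncated pkt-line header")
    length = int(header, 16)
    if length == 0:
        return None  # flush-pkt
    if length < 4:
        raise ValueError("invalid pkt-line length %d" % length)
    payload = stream.read(length - 4)
    if len(payload) != length - 4:
        raise EOFError("truncated pkt-line payload")
    return payload


def read_blob_pack_reply(stream):
    """Parse server-reply, then demultiplex the side-band packfile stream."""
    first = read_pkt_line(stream)
    if first is not None:
        if first.startswith(b"ERR "):
            message = first[4:].decode("utf-8", "replace").rstrip("\n")
            raise RuntimeError("server error: " + message)
        raise ValueError("unexpected reply line: %r" % (first,))

    pack = bytearray()
    progress = []
    while True:
        pkt = read_pkt_line(stream)
        if pkt is None:
            break  # flush-pkt ends the side-band stream
        if not pkt:
            raise ValueError("empty side-band packet")
        band, data = pkt[0], pkt[1:]
        if band == 1:
            pack += data           # packfile bytes
        elif band == 2:
            progress.append(data)  # human-readable progress
        elif band == 3:
            raise RuntimeError("server error: " + data.decode("utf-8", "replace"))
        else:
            raise ValueError("unknown side-band channel %d" % band)
    return bytes(pack), progress
```

The returned pack bytes would then be handed to whatever indexes a packfile on the client (for example, the equivalent of "git index-pack").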

Any server that supports "partial clone" will also support this, and the client will automatically assume this. (How a client discovers "partial clone" is not covered by this proposal.)

The server will perform reachability checks on requested blobs through the equivalent of "git rev-list --use-bitmap-index" (as "git upload-pack" does when the uploadpack.allowReachableSHA1InWant option is set), unless a config option suppresses them. Without an up-to-date bitmap these checks can be slow, so server administrators are strongly advised to regenerate the bitmap regularly (or to suppress reachability checks).
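To make the check concrete, here is a toy model of what "reachable" means. It is only an illustration over an in-memory object graph with made-up object names; it is not how Git implements the check, which is where the bitmap index comes in. The `enforce` flag stands in for the proposed (so far unnamed) config option.

```python
def reachable_objects(ref_tips, edges):
    """Return every object reachable from the ref tips.

    `edges` maps an object id to the ids it points to
    (commit -> parents and tree, tree -> subtrees and blobs).
    """
    seen = set()
    stack = list(ref_tips)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(edges.get(oid, ()))
    return seen


def unreachable_wants(wants, ref_tips, edges, enforce=True):
    """Return the wanted ids the server should refuse; empty if checks are off."""
    if not enforce:
        return []
    reachable = reachable_objects(ref_tips, edges)
    return [w for w in wants if w not in reachable]
```

A server following this model would answer with an "ERR" pkt-line whenever the returned list is non-empty. The full graph walk shown here is the expensive part that a bitmap index is meant to avoid.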

=== Endpoint support for forward compatibility

This "server" endpoint requires that the first line be understood, but will ignore any other lines starting with words that it does not understand. This allows new "commands" to be added (distinguished by their first lines) and existing commands to be "upgraded" with backwards compatibility.
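The rule can be sketched as a small parser working on already-decoded pkt-line payloads. The table `KNOWN_COMMANDS` and the function `parse_request` are hypothetical names; the only command listed is the one this proposal defines.

```python
# Command name (the first line) -> leading words this server understands.
KNOWN_COMMANDS = {
    "fetch-blob-pack": {"want"},
}


def parse_request(lines):
    """Parse the request lines that precede the flush-pkt.

    The first line must be a known command. Later lines whose leading word
    is unknown are skipped, so newer clients can send extra lines to older
    servers without breaking them.
    """
    if not lines:
        raise ValueError("empty request")
    command = lines[0]
    if command not in KNOWN_COMMANDS:
        raise ValueError("unknown command: %r" % (command,))
    understood = KNOWN_COMMANDS[command]
    args = {word: [] for word in understood}
    for line in lines[1:]:
        word, _, rest = line.partition(" ")
        if word not in understood:
            continue  # forward compatibility: ignore what we do not understand
        args[word].append(rest)
    return command, args
```

Under this scheme, a future client could add, say, a line carrying commit hints; a server built from this sketch would still serve the "want" lines and silently drop the rest.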

=== Related improvements possible with new endpoint

Previous protocol upgrade suggestions have had to face the difficulty of allowing updated clients to discover the server support while not slowing down (for example, through extra network round-trips) any client, whether non-updated or updated. The introduction of "partial clone" allows clients to rely on the guarantee that any server that supports "partial clone" supports "fetch-blob-pack", and we can extend the guarantee to other protocol upgrades that such repos would want.

One such upgrade is "ref-in-want" [3]. The full details can be obtained from that email thread, but to summarize, the patch set eliminates the need for the initial ref advertisement and allows communication in ref name globs, making it much easier for multiple load-balanced servers to serve large repos to clients - this is something that would greatly benefit the Android project, for example, and possibly many others.

Bundling support for "ref-in-want" with "fetch-blob-pack" simplifies matters for the client, in that a client needs to handle only one "version" of server (a server that supports both). If "ref-in-want" were added later instead of now, clients would need to handle two "versions" (one with only "fetch-blob-pack" and one with both "fetch-blob-pack" and "ref-in-want").

As for its implementation, that email thread already contains a patch set that makes it work with the existing "upload-pack" endpoint; I can update that patch set to use the proposed "server" endpoint (with a "fetch-commit-pack" message) if need be.

== Client behavior

This proposal is concerned with server behavior only, but it is useful to envision how the client would use this to ensure that the server behavior is useful.

=== Indication to use the proposed endpoint

The client will probably already record that at least one of its remotes (the one that it successfully performed a "partial clone" from) supports this new endpoint (if not, it can’t determine whether a missing blob was caused by repo corruption or by the "partial clone"). This knowledge tells the client that the server supports both "fetch-blob-pack" and "fetch-commit-pack" (for the latter, the client can fall back to "fetch-pack"/"upload-pack" when fetching from other servers).

=== Multiple remotes

Fetches of missing blobs should (at least by default?) go to the remote that sent the tree that points to them. This means that if there are multiple remotes, the client needs to remember which remote it learned about a given missing blob from.
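One way a client might do that bookkeeping is sketched below. The class `MissingBlobOrigins` is hypothetical, and so is the policy that the first remote to mention a blob wins; the proposal does not specify either.

```python
class MissingBlobOrigins:
    """Remember which remote each missing blob was learned about from."""

    def __init__(self):
        self._origin = {}  # blob id -> remote name

    def record(self, blob_id, remote):
        # Assumption: keep the first remote that sent a tree naming this blob.
        self._origin.setdefault(blob_id, remote)

    def plan_fetches(self, blob_ids):
        """Group wanted blobs by remote, so each remote gets one request."""
        plan = {}
        for blob_id in blob_ids:
            remote = self._origin.get(blob_id)
            if remote is None:
                raise KeyError("no known remote for blob %s" % blob_id)
            plan.setdefault(remote, []).append(blob_id)
        return plan
```

Each entry of the returned plan would become one "fetch-blob-pack" request to the named remote, with one "want" line per blob.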

== Alternatives considered

The "fetch-blob-pack" and "fetch-commit-pack" messages could be split into their own endpoints. It seemed more reasonable to combine them, since they serve similar use cases (large repos) and combining them (for example) reduces the number of binaries in PATH, but I do not feel strongly about this.

The client could supply commit information about the blobs it wants (or other information that could help the reachability analysis). However, these lines wouldn’t be used by the proposed server design. And if we do discover that these lines are useful, the protocol could be extended with new lines that contain this information (since old servers will ignore all lines that they do not understand).

We could extend "upload-pack" to allow blobs in "want" lines instead of having a new endpoint. Due to a quirk in the Git implementation (but possibly not other implementations like JGit), this is already supported [4]. However, each invocation would require the server to generate an unnecessary ref list, and would cost both the server and the client more network traffic.

Also, the new "server" endpoint might be made to be discovered through another mechanism (for example, a capability advertisement on another endpoint). It is probably simpler to tie it to the "partial clone" feature, though, since they are so likely to be used together.

[1] <20170304191901.9622-1-markbt@xxxxxxxxxx>
[2] <1488999039-37631-1-git-send-email-git@xxxxxxxxxxxxxxxxx>
[3] <cover.1485381677.git.jonathantanmy@xxxxxxxxxx>
[4] <20170309003547.6930-1-jonathantanmy@xxxxxxxxxx>