Web lists-archives.com

[RFC PATCH 02/18] midx: specify midx file format

A multi-pack-index (MIDX) file indexes the objects in multiple
packfiles in a single pack directory. After a simple fixed-size
header, the version 1 file format uses chunks to specify
different regions of the data that correspond to different types
of data, including:

- List of OIDs in lex-order
- A fanout table into the OID list
- List of packfile names (relative to pack directory)
- List of object metadata
- Large offsets (if needed)

By adding extra optional chunks, we can easily extend this format
without invalidating written v1 files.

One value in the header corresponds to a number of "base MIDX files"
and will always be zero until the value is used in a later patch.

We considered using a hashtable format instead of an ordered list
of objects to reduce the O(log N) lookups to O(1) lookups, but two
main issues arose that lead us to abandon the idea:

- Extra space required to ensure collision counts were low.
- We need to identify the two lexicographically closest OIDs for
  fast abbreviations. Binary search allows this.

The current solution presents multiple packfiles as if they were
packed into a single packfile with one pack-index.

Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
 Documentation/technical/pack-format.txt | 85 +++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 8e5bf60be3..ab459ef142 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -160,3 +160,88 @@ Pack file entry: <+
     corresponding packfile.
     20-byte SHA-1-checksum of all of the above.
+== midx-*.midx files have the following format:
+The multi-pack-index (MIDX) files refer to multiple pack-files.
+In order to allow extensions that add extra data to the MIDX format, we
+organize the body into "chunks" and provide a lookup table at the beginning
+of the body. The header includes certain length values, such as the number
+of packs, the number of base MIDX files, hash lengths and types.
+All 4-byte numbers are in network order.
+	4-byte signature:
+	    The signature is: {'M', 'I', 'D', 'X'}
+	4-byte version number:
+	    Git currently only supports version 1.
+	1-byte Object Id Version (1 = SHA-1)
+	1-byte Object Id Length (H)
+	1-byte number (I) of base multi-pack-index files:
+	    This value is currently always zero.
+	1-byte number (C) of "chunks"
+	4-byte number (P) of pack files
+	(C + 1) * 12 bytes providing the chunk offsets:
+	    First 4 bytes describe chunk id. Value 0 is a terminating label.
+	    Other 8 bytes provide offset in current file for chunk to start.
+	    (Chunks are provided in file-order, so you can infer the length
+	    using the next chunk position if necessary.)
+	The remaining data in the body is described one chunk at a time, and
+	these chunks may be given in any order. Chunks are required unless
+	otherwise specified.
+	OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+	    The ith entry, F[i], stores the number of OIDs with first
+	    byte at most i. Thus F[255] stores the total
+	    number of objects (N). The number of objects with first byte
+	    value i is (F[i] - F[i-1]) for i > 0.
+	OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+	    The OIDs for all objects in the MIDX are stored in lexicographic
+	    order in this chunk.
+	Object Offsets (ID: {'O', 'O', 'F', 'F'}) (N * 8 bytes)
+	    Stores two 4-byte values for every object.
+	    1: The pack-int-id for the pack storing this object.
+	    2: The offset within the pack.
+		If all offsets are less than 2^31, then the large offset chunk
+		will not exist and offsets are stored as in IDX v1.
+		If there is at least one offset value larger than 2^32-1, then
+		the large offset chunk must exist. If the large offset chunk
+		exists and the 31st bit is on, then removing that bit reveals
+		the row in the large offsets containing the 8-byte offset of
+		this object.
+	[Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'})
+	    8-byte offsets into large packfiles.
+	Packfile Name Lookup (ID: {'P', 'L', 'O', 'O'}) (P * 4 bytes)
+	    P * 4 bytes storing the offset in the packfile name chunk for
+	    the null-terminated string containing the filename for the
+	    ith packfile. The filename is relative to the MIDX file's parent
+	    directory.
+	Packfile Names (ID: {'P', 'N', 'A', 'M'})
+	    Stores the packfile names as concatenated, null-terminated strings.
+	    Packfiles must be listed in lexicographic order for fast lookups by
+	    name. This is the only chunk not guaranteed to be a multiple of four
+	    bytes in length, so it should be the last chunk for alignment reasons.
+	H-byte HASH-checksum of all of the above.