Web lists-archives.com

Re: [GSoC][RFC] Proposal: Make pack access code thread-safe

On Mon, Apr 8, 2019 at 6:26 AM Philip Oakley <philipoakley@xxxxxxx> wrote:
> On 08/04/2019 02:23, Duy Nguyen wrote:
> > On Mon, Apr 8, 2019 at 5:52 AM Christian Couder
> > <christian.couder@xxxxxxxxx> wrote:
> >>> Git has a very optimized mechanism to compactly store
> >>> objects (blobs, trees, commits, etc.) in packfiles[2]. These files are
> >>> created by[3]:
> >>>
> >>> 1. listing objects;
> >>> 2. sorting the list with some good heuristics;
> >>> 3. traversing the list with a sliding window to find similar objects in
> >>> the window, in order to do delta decomposing;
> >>> 4. compress the objects with zlib and write them to the packfile.
> >>>
> >>> What we are calling pack access code in this document, is the set of
> >>> functions responsible for retrieving the objects stored at the
> >>> packfiles. This process consists, roughly speaking, in three parts:
> >>>
> >>> 1. Locate and read the blob from packfile, using the index file;
> >>> 2. If the blob is a delta, locate and read the base object to apply the
> >>> delta on top of it;
> >>> 3. Once the full content is read, decompress it (using zlib inflate).
> >>>
> >>> Note: There is a delta cache for the second step so that if another
> >>> delta depends on the same base object, it is already in memory. This
> >>> cache is global; also, the sliding windows, are global per packfile.
> >> Yeah, but the sliding windows are used only when creating pack files,
> >> not when reading them, right?
> > These windows are actually for reading. We used to just mmap the whole
> > pack file in the early days but that was impossible for 4+ GB packs on
> > 32-bit platforms, which was one of the reasons, I think, that sliding
> > windows were added, to map just the parts we want to read.
> Another "32-bit problem" should also be expressly considered during the
> GSoC work because of the MS Windows definition of uInt / long to be only
> 32 bits, leading to much of the Git code failing on the Git for Windows
> port and on the Git LFS (for Windows) for packs and files greater than
> 4Gb. https://github.com/git-for-windows/git/issues/1063

Thanks for pointing it out. I didn't get it, thought, if your
suggestion was to also propose tackling this issue in this GSoC
project. Was it that? I read the link but it seems to be a kind of
unrelated problem from what I'm planing to do with the pack access
code (which is tread-safety). I may have understood this wrongly,
though. Please, let me know if that's the case :)

> Mainly it is just substitution of size_t for long, but there can be
> unexpected coercions when mixed data types get coerced down to a local
> 32-bit long. This is made worse by it being implementation defined, so
> one needs to be explicit about some casts up to pointer/memsized types.
> >>> # Points to work on
> >>>
> >>> * Investigate pack access call chains and look for non-thread-safe
> >>> operations on then.
> >>> * Protect packfile.c read-and-write global variables, such as
> >>> pack_open_windows, pack_open_fds and etc., using mutexes.
> >> Do you want to work on making both packfile reading and packfile
> >> writing thread safe? Or just packfile reading?
> > Packfile writing is probably already or pretty close to thread-safe
> > (at least the main writing code path in git-pack-objects; the
> > streaming blobs to a pack, i'm not so sure).
> --
> Philip