Re: State of NewHash work, future directions, and discussion
- Date: Mon, 11 Jun 2018 20:09:47 +0200
- From: Duy Nguyen <pclouds@xxxxxxxxx>
- Subject: Re: State of NewHash work, future directions, and discussion
On Sat, Jun 9, 2018 at 10:57 PM brian m. carlson wrote:
> Since there's been a lot of questions recently about the state of the
> NewHash work, I thought I'd send out a summary.
> == Status
> I have patches to make the entire codebase work, including passing all
> tests, when Git is converted to use a 256-bit hash algorithm.
> Obviously, such a Git is incompatible with the current version, but it
> means that we've fixed essentially all of the hard-coded 20 and 40
> constants (and therefore Git doesn't segfault).
This is so cool!
> == Future Design
> The work I've done necessarily involves porting everything to use
> the_hash_algo. Essentially, when the piece I'm currently working on is
> complete, we'll have a transition stage 4 implementation (all NewHash).
> Stage 2 and 3 will be implemented next.
> My vision of how data is stored is that the .git directory is, except
> for pack indices and the loose object lookup table, entirely in one
> format. It will be all SHA-1 or all NewHash. This algorithm will be
> stored in the_hash_algo.
> I plan on introducing an array of hash algorithms into struct repository
> (and wrapper macros) which stores, in order, the output hash, and if
> used, the additional input hash.
I'm actually thinking that putting the_hash_algo inside struct
repository is a mistake. We have code that's supposed to work without
a repo, and that shows it does not really make sense to force the use
of a partially-valid repo there. Keeping the_hash_algo a separate
variable sounds more elegant.
> If people are interested, I've done some analysis on availability of
> implementations, performance, and other attributes described in the
> transition plan and can send that to the list.
I quickly skimmed through that document. I have two more concerns that
are less about any specific hash algorithm:
- how does a larger hash size affect git (I guess you covered the cpu
  aspect, but what about cache-friendliness, disk usage, memory?)
- how does all the function redirection (from abstracting away SHA-1)
  affect git performance? E.g. hashcmp could be optimized and inlined
  by the compiler. Now it probably can still optimize the memcmp(,,20),
  but we stack another indirect function call on top. I guess I might
  just be paranoid and this is not a big deal after all.