Web lists-archives.com

Re: Hash algorithm analysis

On Mon, Jun 11 2018, Linus Torvalds wrote:

> On Mon, Jun 11, 2018 at 12:29 PM Jonathan Nieder <jrnieder@xxxxxxxxx> wrote:
>> Yves Orton and Linus Torvalds prefer[5] SHA3 over SHA2 because of how
>> it is constructed.
> Yeah, I really think that it's a mistake to switch to something that
> has the same problem SHA1 had.
> That doesn't necessarily mean SHA3, but it does mean "bigger
> intermediate hash state" (so no length extension attack), which could
> be SHA3, but also SHA-512/256 or K12.
> Honestly, git has effectively already moved from SHA1 to SHA1DC.
> So the actual known attack and weakness of SHA1 should simply not be
> part of the discussion for the next hash. You can basically say "we're
> _already_ on the second hash, we just picked one that was so
> compatible with SHA1 that nobody even really noticed.
> The reason to switch is
>  (a) 160 bits may not be enough
>  (b) maybe there are other weaknesses in SHA1 that SHA1DC doesn't catch.
>  (c) others?
> Obviously all of the choices address (a).

FWIW I updated our docs 3 months ago to try to address some of this:

> But at least for me, (b) makes me go "well, SHA2 has the exact same
> weak inter-block state attack, so if there are unknown weaknesses in
> SHA1, then what about unknown weaknesses in SHA2"?
> And no, I'm not a cryptographer. But honestly, length extension
> attacks were how both md5 and sha1 were broken in practice, so I'm
> just going "why would we go with a crypto choice that has that known
> weakness? That's just crazy".

What do you think about Johannes's summary of this being a non-issue for
Git in

> From a performance standpoint, I have to say (once more) that crypto
> performance actually mattered a lot less than I originally thought it
> would. Yes, there are phases that do care, but they are rare.

One real-world case is rebasing[1]. As noted in that E-Mail of mine a
year ago we can use SHA1DC v.s. OpenSSL as a stand-in for the sort of
performance difference we might expect between hash functions, although
as you note this doesn't account for the difference in length.

With our perf tests, in t/perf on linux.git:

    $ GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_REPEAT_COUNT=10 GIT_PERF_MAKE_COMMAND='if pwd | grep -q $(git rev-parse origin/master); then make -j8 CFLAGS=-O3 DC_SHA1=Y; else make -j8 CFLAGS=-O3 OPENSSL_SHA1=Y; fi' ./run origin/master~ origin/master -- p3400-rebase.sh
    Test                                                            origin/master~    origin/master
    3400.2: rebase on top of a lot of unrelated changes             1.38(1.19+0.11)   1.40(1.23+0.10) +1.4%
    3400.4: rebase a lot of unrelated changes without split-index   4.07(3.28+0.66)   4.62(3.71+0.76) +13.5%
    3400.6: rebase a lot of unrelated changes with split-index      3.41(2.94+0.38)   3.35(2.87+0.37) -1.8%

On a bigger monorepo I have here:

    Test                                                            origin/master~    origin/master
    3400.2: rebase on top of a lot of unrelated changes             1.39(1.19+0.17)   1.34(1.16+0.16) -3.6%
    3400.4: rebase a lot of unrelated changes without split-index   6.67(3.37+0.63)   6.95(3.90+0.62) +4.2%
    3400.6: rebase a lot of unrelated changes with split-index      3.70(2.85+0.45)   3.73(2.85+0.41) +0.8%

I didn't paste any numbers in that E-Mail a year ago, maybe I produced
them differently, but this is clerly not that of a "big difference". But
this is one way to see the difference.

> For example, I think SHA1 performance has probably mattered most for
> the index and pack-file, where it's really only used as a fancy CRC.
> For most individual object cases, it is almost never an issue.

Yeah there's lots of things we could optimize there, but we are going to
need to hash things to create the commit in e.g. the rebase case, but
much of that could probably be done more efficiently without switching
the hash.

> From a performance angle, I think the whole "256-bit hashes are
> bigger" is going to be the more noticeable performance issue, just
> because things like delta compression and zlib - both of which are
> very *real* and present performance issues - will have more data that
> they need to work on. The performance difference between different
> hashing functions is likely not all that noticeable in most common
> cases as long as we're not talking orders of magnitude.
> And yes, I guess we're in the "approaching an order of magnitude"
> performance difference, but we should actually compare not to OpenSSL
> SHA1, but to SHA1DC. See above.
> Personally, the fact that the Keccak people would suggest K12 makes me
> think that should be a front-runner, but whatever. I don't think the
> 128-bit preimage case is an issue, since 128 bits is the brute-force
> cost for any 256-bit hash.
> But hey, I picked sha1 to begin with, so take any input from me with
> that historical pinch of salt in mind ;)

1. https://public-inbox.org/git/87tw3f8vez.fsf@xxxxxxxxx/