Re: [RFC PATCH 00/18] Multi-pack index (MIDX)




On Mon, Jan 08 2018, Jeff King jotted:

> On Mon, Jan 08, 2018 at 05:20:29AM -0500, Jeff King wrote:
>
>> I.e., what if we did something like this:
>>
>> diff --git a/sha1_name.c b/sha1_name.c
>> index 611c7d24dd..04c661ba85 100644
>> --- a/sha1_name.c
>> +++ b/sha1_name.c
>> @@ -600,6 +600,15 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
>>  	if (len == GIT_SHA1_HEXSZ || !len)
>>  		return GIT_SHA1_HEXSZ;
>>
>> +	/*
>> +	 * A default length of 10 implies a repository big enough that it's
>> +	 * getting expensive to double check the ambiguity of each object,
>> +	 * and the chance that any particular object of interest has a
>> +	 * collision is low.
>> +	 */
>> +	if (len >= 10)
>> +		return len;
>> +
>
> Oops, this really needs to terminate the string in addition to returning
> the length (so it was always printing 40 characters in most cases). The
> correct patch is below, but it performs the same.
>
> diff --git a/sha1_name.c b/sha1_name.c
> index 611c7d24dd..5921298a80 100644
> --- a/sha1_name.c
> +++ b/sha1_name.c
> @@ -600,6 +600,17 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
>  	if (len == GIT_SHA1_HEXSZ || !len)
>  		return GIT_SHA1_HEXSZ;
>
> +	/*
> +	 * A default length of 10 implies a repository big enough that it's
> +	 * getting expensive to double check the ambiguity of each object,
> +	 * and the chance that any particular object of interest has a
> +	 * collision is low.
> +	 */
> +	if (len >= 10) {
> +		hex[len] = 0;
> +		return len;
> +	}
> +
>  	mad.init_len = len;
>  	mad.cur_len = len;
>  	mad.hex = hex;

That looks much more sensible, leaving aside other potential benefits of
MIDX.

Given the argument Linus made in e6c587c733 ("abbrev: auto size the
default abbreviation", 2016-09-30) maybe we should add a small integer
to the length for good measure, i.e. something like:

	if (len >= 10) {
		int extra = 2; /* or just 1? or maybe 0 ... */
		hex[len + extra] = 0;
		return len + extra;
	}
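
One caveat with the snippet above: once "extra" is added, the result
can exceed the full hex length when len is already close to 40. A
minimal standalone sketch of a clamped variant follows. The helper name
abbrev_with_margin and the HEXSZ constant are made up for illustration
(HEXSZ stands in for GIT_SHA1_HEXSZ), and it assumes the buffer already
holds the full 40-character hex name plus a terminating NUL:

```c
#define HEXSZ 40 /* stand-in for GIT_SHA1_HEXSZ; an assumption of this sketch */

/*
 * Hypothetical helper, not part of git: truncate a full hex object
 * name in place to len + extra characters, never exceeding HEXSZ.
 * Assumes hex holds HEXSZ characters followed by a NUL.
 */
static int abbrev_with_margin(char *hex, int len, int extra)
{
	int n = len + extra;

	/* Clamp so we never write past the end of the full name. */
	if (n > HEXSZ)
		n = HEXSZ;
	hex[n] = '\0';
	return n;
}
```

With a 40-character name in the buffer, abbrev_with_margin(hex, 10, 2)
leaves the first 12 characters in place and returns 12.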

I tried running:

    git log --pretty=format:%h --abbrev=7 | perl -nE 'chomp; say length'|sort|uniq -c|sort -nr

on several large repos. Passing --abbrev=7 forces something like the
disambiguation we had before Linus's patch. On e.g. David Turner's
2015-04-03-1M-git.git test repo the distribution of abbreviation
lengths (count, then length) is:

     952858 7
      44541 8
       2861 9
        168 10
         17 11
          2 12

And the default abbreviation picks 12. I haven't yet found a case where
that's too short, but if we wanted to be extra safe we could just add a
hex digit or two to the abbreviated SHA-1.
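
For context, here is a rough sketch of how the auto-sized default can
be derived from the object count, as I read the birthday-bound argument
in e6c587c733: count the bits needed to represent the number of
objects, halve that and round up (a collision is expected around the
square root, and a hex digit carries 4 bits), and never go below 7. The
names bits_needed, auto_abbrev_len and MIN_ABBREV are illustrative, not
git's own, and the real code works from an approximate object count:

```c
#define HEXSZ 40     /* stand-in for GIT_SHA1_HEXSZ */
#define MIN_ABBREV 7 /* assumed floor for very small repos */

/* Number of bits needed to represent count (0 for count == 0). */
static int bits_needed(unsigned long count)
{
	int bits = 0;

	while (count) {
		bits++;
		count >>= 1;
	}
	return bits;
}

/*
 * Sketch of the sizing rule: roughly 2^bits objects are expected to
 * collide at about bits/2 bits of prefix, and each hex digit is
 * 4 bits, so the two factors combine into "divide by 2, round up".
 */
static int auto_abbrev_len(unsigned long object_count)
{
	int len = (bits_needed(object_count) + 1) / 2;

	if (len < MIN_ABBREV)
		len = MIN_ABBREV;
	if (len > HEXSZ)
		len = HEXSZ;
	return len;
}
```

Under this sketch a length of 12 corresponds to an object count that
needs 23 or 24 bits, i.e. roughly 4 million to 16 million objects.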