Web lists-archives.com

Re: [PATCH v2 05/13] midx: check both pack and index names for containment




Am 05.04.2019 um 20:06 schrieb Jeff King:
> A midx file (and the struct we parse from it) contains a list of all of
> the covered packfiles, mentioned by their ".idx" names (e.g.,
> "pack-1234.idx", etc). And thus calls to midx_contains_pack() expect
> callers to provide the idx name.
>
> This works for most of the calls, but the one in open_packed_git_1()
> tries to feed a packed_git->pack_name, which is the ".pack" name,
> meaning we'll never find a match (even if the pack is covered by the
> midx).
>
> We can fix this by converting the ".pack" to ".idx" in the caller.
> However, that requires allocating a new string. Instead, let's make
> midx_contains_pack() a bit friendlier, and allow it take _either_ the
> .pack or .idx variant.
>
> All cleverness in the matching code is credited to René. Bugs are mine.

I didn't consider it to be particularly tricky -- but then kept the dots
in both filename extension strings, which is not going to fly, as they
are skipped by the common-prefix loop..  Thanks for fixing that.

> There's no test here, because while this does fix _a_ bug, it's masked
> by another bug in that same caller. That will be covered (with a test)
> in the next patch.
>
> Helped-by: René Scharfe <l.s.r@xxxxxx>
> Signed-off-by: Jeff King <peff@xxxxxxxx>
> ---
> I was tempted to suggest that the midx struct just store the base name
> without ".idx" at all, but having callers pass that is no less tricky
> than passing ".idx" (they still have to allocate a new string).

The middle part of the comparison function would become:

	if (!*idx_name && (!strcmp(idx_or_pack_name, ".idx") ||
			   !strcmp(idx_or_pack_name, ".pack"))
		return 0;

No allocations needed -- except when building the list.

(And this time we'd actually need the dots.)

>
>  midx.c | 36 ++++++++++++++++++++++++++++++++++--
>  midx.h |  2 +-
>  2 files changed, 35 insertions(+), 3 deletions(-)
>
> diff --git a/midx.c b/midx.c
> index 8a505fd423..0ceca1938f 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -307,7 +307,39 @@ int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct mu
>  	return nth_midxed_pack_entry(m, e, pos);
>  }
>
> -int midx_contains_pack(struct multi_pack_index *m, const char *idx_name)
> +/* Match "foo.idx" against either "foo.pack" _or_ "foo.idx". */
> +static int cmp_idx_or_pack_name(const char *idx_or_pack_name,
> +				const char *idx_name)
> +{
> +	/* Skip past any initial matching prefix. */
> +	while (*idx_name && *idx_name == *idx_or_pack_name) {
> +		idx_name++;
> +		idx_or_pack_name++;
> +	}
> +
> +	/*
> +	 * If we didn't match completely, we may have matched "pack-1234." and
> +	 * be left with "idx" and "pack" respectively, which is also OK. We do
> +	 * not have to check for "idx" and "idx", because that would have been
> +	 * a complete match (and in that case these strcmps will be false, but
> +	 * we'll correctly return 0 from the final strcmp() below.
> +	 *
> +	 * Technically this matches "fooidx" and "foopack", but we'd never have
> +	 * such names in the first place.
> +	 */
> +	if (!strcmp(idx_name, "idx") && !strcmp(idx_or_pack_name, "pack"))
> +		return 0;

This is asymmetric, and thus the function should not be used for sorting,
where it would be called for random pairs of values. For the binary search
it's fine, assuming the list contains only index filenames (ending with
".idx"), as the search string is always passed in as idx_or_pack_name.

And an extension-insensitive lookup works fine in a strcmp()-sorted list
because the order wouldn't change when sorting it again with a looser
stable extension-sensitive sort.

> +
> +	/*
> +	 * This not only checks for a complete match, but also orders based on
> +	 * the first non-identical character, which means our ordering will
> +	 * match a raw strcmp(). That makes it OK to use this to binary search
> +	 * a naively-sorted list.
> +	 */
> +	return strcmp(idx_or_pack_name, idx_name);

At this point we could also compare the chars like this:

	return (unsigned char)(*idx_or_pack_name) - (unsigned char)(*idx_name);

This avoids a function call, but doesn't look very pretty.  And I'm not
fully sure the casts are correct.

> +}
> +
> +int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
>  {
>  	uint32_t first = 0, last = m->num_packs;
>
> @@ -317,7 +349,7 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_name)
>  		int cmp;
>
>  		current = m->pack_names[mid];
> -		cmp = strcmp(idx_name, current);
> +		cmp = cmp_idx_or_pack_name(idx_or_pack_name, current);
>  		if (!cmp)
>  			return 1;
>  		if (cmp > 0) {
> diff --git a/midx.h b/midx.h
> index 774f652530..26dd042d63 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -43,7 +43,7 @@ struct object_id *nth_midxed_object_oid(struct object_id *oid,
>  					struct multi_pack_index *m,
>  					uint32_t n);
>  int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct multi_pack_index *m);
> -int midx_contains_pack(struct multi_pack_index *m, const char *idx_name);
> +int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name);
>  int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
>
>  int write_midx_file(const char *object_dir);
>