Web lists-archives.com

Re: Bug in pathspec handling (in conjunction with submodules)




On 11/26, Johannes Schindelin wrote:
> Hi Duy & Brandon,
> 
> in 74ed43711fd (grep: enable recurse-submodules to work on <tree> objects,
> 2016-12-16), the do_match() function in tree-walk.c was changed so that it
> can recurse across submodule boundaries.
> 
> However, there is a bug, and I *think* there may be two bugs actually. Or
> even three.
> 
> First of all, here is an MCVE that I distilled from
> https://github.com/git-for-windows/git/issues/1371:
> 
> 	git init repo
> 	cd repo
> 
> 	git init submodule
> 	git -C submodule commit -m initial --allow-empty
> 
> 	touch "[bracket]"
> 	git add "[bracket]"
> 	git commit -m bracket
> 	git add submodule
> 	git commit -m submodule
> 
> 	git rev-list HEAD -- "[bracket]"
> 
> Nothing fancy, just adding a file with brackets in the name, then a
> submodule, then showing the commit history filtered by the funny file
> name.
> 
> However, the log prints *both* commits. Clearly the submodule commit
> should *not* be shown.
> 
> Now, how does this all happen?
> 
> Since the pathspec contains brackets, parse_pathspec() marks it as
> containing wildcards and sets nowildcard_len to 0.
> 
> Now, note that [bracket] *is* a wildcard expression: it should only match
> a single character that is one of  a, b, c, e, k, r or t.
> 
> I think this is the first bug: `git rev-list` should not even match the
> commit that adds the file [bracket] because its file name does not match
> that expression. From where I sit, it would appear that f1a2ddbbc2d
> (tree_entry_interesting(): optimize wildcard matching when base is
> matched, 2010-12-15) simply added the fnmatch() code without disabling the
> literal match_entry() code when the pathspec contains a pattern.

I can see both sides to this, wanting to try matching literally first
and then trying the wildcards, so I don't really have an opinion on
how/if we should fix that.

> 
> But it does not stop there: there is *another* bug which causes the
> pattern to somehow match the submodule. I *guess* the idea of
> https://github.com/git/git/commit/74ed43711#diff-7a08243175f2cae66aedf53f7dce3bdfR1015
> was to allow a pattern like *.c to match files in a submodule, but the
> pattern [bracket] should not match any file in submodule/. I think that
> that code needs to be a little bit more careful to try to match the
> submodule's name against the pattern (it seems to interpret nowildcard_len
> == 0 to mean that the wildcard is `*`).

This is a much bigger issue and I'm surprised it took this long to find
this bug.  And of course its due to one of my earlier contributions to
the project :)

> 
> However, the commit introducing that code wanted to teach *grep* (not
> *rev-list*) a new trick, and it relies on the `recursive` flag of the
> pathspec to be set.

This is the root cause of the bug.  The added code to match against
submodules was intended to allow for matching past submodule boundaries
for those commands (like grep) which are recursing submodules.  So
really there should be an additional flag which is passed in to trigger
this logic instead of relying on the recursive flag of the pathspec.  Or
we can add a recurse_submodules flag to the pathspec struct and respect
that flag instead of the 'recursive' flag.

I have a quick patch to do just that which I'll send shortly.

> 
> And now it gets really interesting. Or confusing, depending on your mental
> condition. This recursive flag of the pathspec is set in
> ll_diff_tree_paths() (yep, changing the flag in the passed-in opt
> structure... which I found a bit... unexpected, given the function name, I
> would have been less surprised if that function only diff'ed the trees and
> used the options without changing the options). That flag-change was
> introduced in
> https://github.com/git/git/commit/bc96cc87dbb2#diff-15203e8cd8ee9191113894de9d97a8a6R149
> which is another patch that changed the tree diff machinery to accommodate
> `git grep` (but maybe not really paying a lot of attention to the fact
> that the same machinery is called repeatedly by the revision machinery,
> too).
> 
> I am really confused by this code mainly due to the fact that the term
> "recursive" is pretty ambiguous in that context: does it refer to
> directories/tree objects, or to submodules? I guess it is used for both
> when there should be two flags so that rev-list can recurse over tree
> objects but not submodules (unless told to do so).
> 
> The problem, of course, is that `git rev-list HEAD -- '[bracket]'` never
> recurses into the submodule. And therefore, the promised "more accurate
> matching [...] in the submodule" is never performed. And the commit adding
> the submodule is never pruned.
> 
> Since I am not really familiar with all that tree diff code (and as a
> general rule to protect my mental health, I try my best to stay away from
> submodules, too), but you two are, may I ask you gentle people to have a
> closer look to fix those bugs?
> 
> Thanks,
> Dscho

-- 
Brandon Williams