Web lists-archives.com

Re: [PATCH] t4062: stop using repetition in regex

Am 09.08.2017 um 00:09 schrieb Junio C Hamano:
> René Scharfe <l.s.r@xxxxxx> writes:
>> Am 08.08.2017 um 16:49 schrieb Johannes Schindelin:
>>> Hi René,
>>> On Tue, 8 Aug 2017, René Scharfe wrote:
>>>> OpenBSD's regex library has a repetition limit (RE_DUP_MAX) of 255.
>>>> That's the minimum acceptable value according to POSIX.  In t4062 we use
>>>> 4096 repetitions in the test "-G matches", though, causing it to fail.
>>>> Do the same as the test "-S --pickaxe-regex" in the same file and search
>>>> for a single zero instead.  That still suffices to trigger the buffer
>>>> overrun in older versions (checked with b7d36ffca02^ and --valgrind on
>>>> Linux), simplifies the test a bit, and avoids exceeding OpenBSD's limit.
>>> I am afraid not. The 4096 is precisely the page size required to trigger
>>> the bug on Windows against which this regression test tries to safeguard.
>> Checked with b7d36ffca02^ on MinGW now as well and found that it
>> segfaults with the proposed change ten out of ten times.
> That is a strange result but I can believe it.
> The reason why I find it strange is that the test wants to run
> diff_grep() in diffcore-pickaxe.c with one == NULL (because we are
> looking at difference between an initial empty commit and the
> current commit that adds 4096-zeroes.txt file), which makes the
> current blob (i.e. a page of '0' that may be mmap(2)ed without
> trailing NUL to terminate it) scanned via regexec() to look for the
> search string.
> I can understand why Dscho originally did "^0{4096}$"; it is to
> ensure that the whole page is scanned for 4096 zeroes and then the
> code would somehow make sure that there is no more byte until the
> end of line, which will force regexec (but not regexec_buf that knows
> where the buffer ends) to look at the 4097th byte that does not exist.
> If you change the pattern to just "0" that is not anchored, I'd expect
> regexec() that does not know how big the haystack is to just find "0"
> at the first byte and happily return without segfaulting (because it
> does not even have to scan the remainder of the buffer).
> So I find Dscho's concern quite valid, even though I do believe you
> when you say the code somehow segfaults.  I just can not tell
> how/why it would segfault, though---it is possible that regexec()
> implementation is stupid and does not realize that it can return early
> reporting success without looking at the rest of the buffer, but
> somehow I find it unlikely.
> Puzzled.

Good point.  Valgrind reports:

==57466== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==57466==  Access not within mapped region at address 0x4027000
==57466==    at 0x4C2EDF4: strlen (vg_replace_strmem.c:458)
==57466==    by 0x59D9F76: regexec@@GLIBC_2.3.4 (regexec.c:240)
==57466==    by 0x54D96E: diff_grep (diffcore-pickaxe.c:0)
==57466==    by 0x54DAC3: pickaxe_match (diffcore-pickaxe.c:149)

And you can see in our version in compat/regex/regexec.c:241 that the
first thing regexec() does is calling strlen().

So to avoid depending on that implementation detail we'd need to use
a search string that won't be found (e.g. "1") or with unlimited
repetition (e.g. "0*"), right?