Web lists-archives.com

Re: Occasional git p4 test failures because of stray fast-import processes




On Wed, 27 Feb 2019 at 09:49, SZEDER Gábor <szeder.dev@xxxxxxxxx> wrote:
>
> Hi Luke,
>
> I saw rare failures in test 6 'git p4 sync uninitialized repo' in
> 't9800-git-p4-basic.sh' on Travis CI, because the 'cleanup_git'
> function failed to do its job.  The (redacted) trace looks like this:

Thanks for the *very* detailed analysis!

I think you are right - git-p4 should wait() for all of its children,
and that ought to fix this.

I think I may have even added the bit of code you mention (about a
decade ago now).

I'll have a look and see what can be done.

Thanks!
Luke



>
>   + cleanup_git
>   + retry_until_success rm -r /home/szeder/src/git/t/trash directory.t9800-git-p4-basic/git
>   + time_in_seconds
>   + cd /
>   + /usr/bin/python -c import time; print(int(time.time()))
>   + timeout=1551233042
>   + rm -r /home/szeder/src/git/t/trash directory.t9800-git-p4-basic/git
>   + test_must_fail test -d /home/szeder/src/git/t/trash directory.t9800-git-p4-basic/git
>   test_must_fail: command succeeded: test -d /home/szeder/src/git/t/trash directory.t9800-git-p4-basic/git
>   + eval_ret=1
>   + :
>   not ok 6 - git p4 sync uninitialized repo
>
> Trying to reproduce this failure with stock Git can be tricky: I've
> seen
>
>   ./t9800-git-p4-basic.sh --stress -r 1,2,6,22
>
> fail within the first 10 tries, but it also run overnight without a
> single failure...  However, the following two-liner patch can reliably
> trigger this failure:
>
> diff --git a/fast-import.c b/fast-import.c
> index b7ba755c2b..54715018b3 100644
> --- a/fast-import.c
> +++ b/fast-import.c
> @@ -3296,6 +3296,7 @@ int cmd_main(int argc, const char **argv)
>                 rc_free[i].next = &rc_free[i + 1];
>         rc_free[cmd_save - 1].next = NULL;
>
> +       sleep(1);
>         start_packfile();
>         set_die_routine(die_nicely);
>         set_checkpoint_signal();
> diff --git a/t/lib-git-p4.sh b/t/lib-git-p4.sh
> index b3be3ba011..2d2ef50bfa 100644
> --- a/t/lib-git-p4.sh
> +++ b/t/lib-git-p4.sh
> @@ -190,6 +190,7 @@ kill_p4d () {
>
>  cleanup_git () {
>         retry_until_success rm -r "$git"
> +       sleep 2
>         test_must_fail test -d "$git" &&
>         retry_until_success mkdir "$git"
>  }
>
>
> What's going on is this: 'git p4' invokes 'git fast-import' via
> Python's subprocess.Popen(), and then there are two chain of events
> racing with each other:
>
>   - In the foreground:
>
>     - 'git p4' notices that there are no p4 branches and die()s as
>       expected, but leaves its child fast-import behind
>     - 'test_i18ngrep' quickly verifies that 'git p4' died with the
>       expected error message
>     - the cleanup function removes the "$TRASH_DIRECTORY/git"
>       directory, and
>     - checks that the directory is indeed gone.
>
>   - Meanwhile in the background:
>
>     - 'git fast-import' starts up, kicks off the dashed external
>       'git-fast-import' process,
>     - which opens a tmp packfile in "$TRASH_DIRECTORY/git" for writing
>       (the start_packfile() call in the patch context above), creating
>       any leading directories if necessary,
>     - notices that its stdin is closed (because 'git p4' died) and it
>       has nothing left to do, so
>     - it cleans up and exits.  Note that this cleanup only removes the
>       tmp packfile, but leaves any newly created leading directories
>       behind.
>
> While unlikely, it does apparently happen from time to time that the
> test's cleanup function first removes "$TRASH_DIRECTORY/git", but then
> 'git fast-import' re-creates it for its packfile before the cleanup
> function checks the result of the removal.  The two well-placed
> sleep()s in the patch above force such a problematic scheduling.
>
> There are 4 test cases running 'test_must_fail git p4 sync': the above
> patch triggers a failure in 3 of them, and with a bit of extra modding
> I could trigger a failure in the fourth one as well.
>
> We could work this around by waiting for 'git p4' and all its child
> processes in the affected tests themselves, using the same shell
> trickery as in ef09036cf3 (t6500: wait for detached auto gc at the end
> of the test script, 2017-04-13), but this feels like, well, a
> workaround.
>
> I think the proper solution would be to ensure that 'git p4' kills and
> waits for all its child processes when die()ing.  Alternatively (or in
> addition?), it could perform all the necessary sanity-checking (and
> potential die()ing) before starting the 'git fast-import' process in
> the first place.
>
> I've glanced through all subprocess.Popen() callsites in 'git p4' and
> found most of them OK, in the sense that they wait for whatever child
> process they create.  Alas, there was one exception: p4CmdList() can
> invoke an optional callback function before wait()ing on its 'p4'
> child, and the streamP4FilesCb() callback function can die() without
> waiting for the 'p4' process (but it does wait() for 'git
> fast-import'!).
>
> On a related note: this check for the just-removed directory was added
> in 23aee4199a (git-p4: retry kill/cleanup operations in tests with
> timeout, 2015-11-19), which mentions flaky cleanup operations.
> Perhaps this is the same flakiness?!
>
> Anyway, as I mentioned elsewhere before, I have no idea why/how 'git
> p4' works, so I'll leave it up to you how it's best to deal with this
> issue...
>
>
> Gábor
>