Web lists-archives.com

Occasional git p4 test failures because of stray fast-import processes

Hi Luke,

I saw rare failures in test 6 'git p4 sync uninitialized repo' in
't9800-git-p4-basic.sh' on Travis CI, because the 'cleanup_git'
function failed to do its job.  The (redacted) trace looks like this:

  + cleanup_git
  + retry_until_success rm -r /home/szeder/src/git/t/trash directory.t9800-git-p4-basic/git
  + time_in_seconds
  + cd /
  + /usr/bin/python -c import time; print(int(time.time()))
  + timeout=1551233042
  + rm -r /home/szeder/src/git/t/trash directory.t9800-git-p4-basic/git
  + test_must_fail test -d /home/szeder/src/git/t/trash directory.t9800-git-p4-basic/git
  test_must_fail: command succeeded: test -d /home/szeder/src/git/t/trash directory.t9800-git-p4-basic/git
  + eval_ret=1
  + :
  not ok 6 - git p4 sync uninitialized repo

Trying to reproduce this failure with stock Git can be tricky: I've

  ./t9800-git-p4-basic.sh --stress -r 1,2,6,22

fail within the first 10 tries, but it also run overnight without a
single failure...  However, the following two-liner patch can reliably
trigger this failure:

diff --git a/fast-import.c b/fast-import.c
index b7ba755c2b..54715018b3 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -3296,6 +3296,7 @@ int cmd_main(int argc, const char **argv)
 		rc_free[i].next = &rc_free[i + 1];
 	rc_free[cmd_save - 1].next = NULL;
+	sleep(1);
diff --git a/t/lib-git-p4.sh b/t/lib-git-p4.sh
index b3be3ba011..2d2ef50bfa 100644
--- a/t/lib-git-p4.sh
+++ b/t/lib-git-p4.sh
@@ -190,6 +190,7 @@ kill_p4d () {
 cleanup_git () {
 	retry_until_success rm -r "$git"
+	sleep 2
 	test_must_fail test -d "$git" &&
 	retry_until_success mkdir "$git"

What's going on is this: 'git p4' invokes 'git fast-import' via
Python's subprocess.Popen(), and then there are two chain of events
racing with each other:

  - In the foreground:
    - 'git p4' notices that there are no p4 branches and die()s as
      expected, but leaves its child fast-import behind
    - 'test_i18ngrep' quickly verifies that 'git p4' died with the
      expected error message
    - the cleanup function removes the "$TRASH_DIRECTORY/git"
      directory, and
    - checks that the directory is indeed gone.

  - Meanwhile in the background:
    - 'git fast-import' starts up, kicks off the dashed external
      'git-fast-import' process,
    - which opens a tmp packfile in "$TRASH_DIRECTORY/git" for writing
      (the start_packfile() call in the patch context above), creating
      any leading directories if necessary,
    - notices that its stdin is closed (because 'git p4' died) and it
      has nothing left to do, so
    - it cleans up and exits.  Note that this cleanup only removes the
      tmp packfile, but leaves any newly created leading directories

While unlikely, it does apparently happen from time to time that the
test's cleanup function first removes "$TRASH_DIRECTORY/git", but then
'git fast-import' re-creates it for its packfile before the cleanup
function checks the result of the removal.  The two well-placed
sleep()s in the patch above force such a problematic scheduling.

There are 4 test cases running 'test_must_fail git p4 sync': the above
patch triggers a failure in 3 of them, and with a bit of extra modding
I could trigger a failure in the fourth one as well.

We could work this around by waiting for 'git p4' and all its child
processes in the affected tests themselves, using the same shell
trickery as in ef09036cf3 (t6500: wait for detached auto gc at the end
of the test script, 2017-04-13), but this feels like, well, a

I think the proper solution would be to ensure that 'git p4' kills and
waits for all its child processes when die()ing.  Alternatively (or in
addition?), it could perform all the necessary sanity-checking (and
potential die()ing) before starting the 'git fast-import' process in
the first place.

I've glanced through all subprocess.Popen() callsites in 'git p4' and
found most of them OK, in the sense that they wait for whatever child
process they create.  Alas, there was one exception: p4CmdList() can
invoke an optional callback function before wait()ing on its 'p4'
child, and the streamP4FilesCb() callback function can die() without
waiting for the 'p4' process (but it does wait() for 'git

On a related note: this check for the just-removed directory was added
in 23aee4199a (git-p4: retry kill/cleanup operations in tests with
timeout, 2015-11-19), which mentions flaky cleanup operations.
Perhaps this is the same flakiness?!

Anyway, as I mentioned elsewhere before, I have no idea why/how 'git
p4' works, so I'll leave it up to you how it's best to deal with this