Web lists-archives.com

Re: [RFC PATCH 5/5] split-index: smudge and add racily clean cache entries to split index




On Thu, Sep 06, 2018 at 02:26:49PM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> On Thu, Sep 06 2018, SZEDER Gábor wrote:
> > Several tests failed occasionally when the test suite was run with
> > 'GIT_TEST_SPLIT_INDEX=yes'.  Here are those that I managed to trace
> > back to this racy split index problem, starting with those failing
> > more frequently, with a link to a failing Travis CI build job for
> > each.  The highlighted line shows when the racy file was written,
> > which is not always in the failing test (but note that those lines are
> > in the 'after failure' fold, and your browser might unhelpfully fold
> > it up before you could take a good look).
> 
> Thanks for working on this. When I package up git I run the tests
> under a few different modes, in the case of split index I've been
> doing:
> 
>     GIT_TEST_SPLIT_INDEX=true GIT_SKIP_TESTS="t3903 t4015.77"

Yeah, I noticed that you mentioned this in an unrelated thread the
other day, that's why I put you on Cc.  ... and that's why I pulled
this series from the backburner and cleaned it up for submission.
(Gah, most of these commits have an author date back in February...)

> Since those were the ones I spotted failing under that mode, but
> I still had occasional other failures, I don't have a record of
> which, maybe some of these other tests you mention, maybe not.

I poked around the Travis CI API, and managed to get the logs of all
build jobs that failed with 'GIT_TEST_SPLIT_INDEX=yes' but succeeded
without it.  Here is the list of failed test scripts, along with how
many times they failed:

      1 t0027-auto-crlf.sh
      1 t0090-cache-tree.sh
      1 t3404-rebase-interactive.sh
      1 t5520-pull.sh
      1 t5573-pull-verify-signatures.sh
      1 t5608-clone-2gb.sh
      1 t7406-submodule-update.sh
      2 t2200-add-update.sh
      2 t4002-diff-basic.sh
      2 t5504-fetch-receive-strict.sh
      3 t0028-working-tree-encoding.sh
      4 t1000-read-tree-m-3way.sh
      6 t4015-diff-whitespace.sh
     10 t4024-diff-optimize-common.sh
     17 t3030-merge-recursive.sh
     17 t3402-rebase-merge.sh
     17 t3501-revert-cherry-pick.sh
     17 t6022-merge-rename.sh
     17 t6032-merge-large-rename.sh
     17 t6034-merge-rename-nocruft.sh
     17 t6042-merge-rename-corner-cases.sh
     17 t6043-merge-rename-directories.sh
     17 t6046-merge-skip-unneeded-updates.sh
     17 t7003-filter-branch.sh
     17 t7601-merge-pull-config.sh
     23 t3903-stash.sh

I doubt that this racy split index problem is responsible for all
these failures; e.g. I looked at the failures of a few merge-related
test scripts, and these logs show that they tend to fail because of
memory corruption, i.e. with 'Aborted (core dumped)' or 'Segmentation
fault (core dumped)'.

> To test how this this series improves things, I've been running
> this on a 56 core CentOS 7.5 machine:
> 
>     while true; do GIT_TEST_SPLIT_INDEX=yes prove -j$(parallel --number-of-cores) t3903-stash.sh t4024-diff-optimize-common.sh t4015-diff-whitespace.sh t2200-add-update.sh t0090-cache-tree.sh && echo "OK $(date) $(git describe)" >>log2 || echo "FAIL $(date) $(git describe)" >>log2; done
> 
> While, in another window to get some load on the machine (these seem to
> fail more under load):
> 
>     while true; do prove -j$(parallel --number-of-cores) t[156789]*.sh; done
> 
> The results with this series applied up to 4/5. I.e. without the actual
> fix:
> 
>      92 OK v2.19.0-rc2-6-ged839bd155
>       8 FAIL v2.19.0-rc2-6-ged839bd155
> 
> I.e. when running this 100 times, I got 8 failures. So 8%.

Lucky you ;)

I could only get t3903 to fail on me, but not the others.  That was
enough to eventually track down and fix the problem (fun! ;), and then
I looked at the logs of failed git/git Travis CI build jobs to see,
what other failures might have been caused by it.

> With this patch applied:
> 
>     389 OK v2.19.0-rc2-5-g05a5a13935
>      11 FAIL v2.19.0-rc2-5-g05a5a13935
> 
> This time I ran the tests 400 times, and got 11 failures, i.e. a
> ~2.8% failure rate. I don't have a full account of what stuff
> failed (this was just scrolling past in my terminal), but most
> were:
> 
>     t0090-cache-tree.sh          (Wstat: 256 Tests: 21 Failed: 3)
>       Failed tests:  10-12
>       Non-zero exit status: 1
> 
> I.e. these tests:
> 
>     ok 10 - commit --interactive gives cache-tree on partial commit
>     ok 11 - commit in child dir has cache-tree
>     ok 12 - reset --hard gives cache-tree

But hey, 't0090 --verbose-log -x -i' just failed on me with the fix
applied while writing this email, yay!  In its logs I see the
following errors in all three tests you mention:

  error: index uses ?½þA extension, which we do not understand
  fatal: index file corrupt

Test 13 then starts with 'rm -f .git/index', and then all is well for
the remaining tests.

There was a recent discussion about a similar error starting at:

  https://public-inbox.org/git/20180901214157.hxlqmbz3fds7hsdl@ltop.local/

and leading to a fix in 6c003d6ffb (reopen_tempfile(): truncate opened
file, 2018-09-04).  I certainly hope that not my fix is buggy, but
combined with 'GIT_TEST_SPLIT_INDEX=yes' it can occasionally trigger
the same error, and Peff's (Cc'd) fix will resolve it as well.

Could you run your stress tests with Peff's fix merged with mine?
Apparetly your setup is much more capable of triggering these issues
than mine...