
[PATCH v3 0/5] Fix and extend encoding handling in fast export/import




While stress testing `git filter-repo`, I noticed an issue with
encoding; further digging led to the fixes and features in this series.
See the individual commit messages for details.

Changes since v2 (full range-diff below):
  * Modified the test cases to pass on Windows[1], as verified via a
    gitgitgadget pull request[2].  This required adding a couple of new
    files (which store the desired bytes) and checking the size of the
    output instead of checking for particular bytes.  Because the
    expected byte sequences differ in length, the size check still
    distinguishes the cases.

[1] Failures of previous patchset on Windows noticed and reported by Dscho;
    explanation from Hannes is that Windows munges users' command lines to
    force them to be characters instead of bytes.
[2] https://github.com/gitgitgadget/git/pull/187
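
To make the size-based checks easier to follow, here is a rough sketch of
the arithmetic behind the numbers the tests compare against.  The 240-byte
baseline is taken from the test comments rather than computed, the variable
names are only illustrative, and the "broken" case assumes the new message
file holds the same bytes as the v2 printf string ("Pi: \360; Invalid: \377").

```shell
#!/bin/sh
# Size of the commit object when nothing is re-encoded (from the test comments).
original_size=240

# Bytes in the header that is dropped when the message is re-encoded to UTF-8.
header_len=$(printf 'encoding iso-8859-7\n' | wc -c)

# Pi is one byte (\xF0) in ISO-8859-7 and two bytes (\xCF\x80) in UTF-8.
pi_iso_len=$(printf '\360' | wc -c)
pi_utf8_len=$(printf '\317\200' | wc -c)

# Re-encoded commit: lose the header, gain one byte for Pi.
reencoded_size=$((original_size - header_len + pi_utf8_len - pi_iso_len))

# Broken message, left untouched: the simple message plus "; Invalid: \377".
extra_len=$(printf '; Invalid: \377' | wc -c)
broken_size=$((original_size + extra_len))

echo "reencoded=$reencoded_size unchanged=$original_size broken=$broken_size"
```

The three results are all different, which is why comparing
"git cat-file -s" output is enough to tell the cases apart without
grepping for raw bytes.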

Elijah Newren (5):
  t9350: fix encoding test to actually test reencoding
  fast-import: support 'encoding' commit header
  fast-export: avoid stripping encoding header if we cannot reencode
  fast-export: differentiate between explicitly utf-8 and implicitly
    utf-8
  fast-export: do automatic reencoding of commit messages only if
    requested

Documentation/git-fast-import.txt            |  7 ++
builtin/fast-export.c                        | 44 ++++++++++--
fast-import.c                                | 11 ++-
t/t9300-fast-import.sh                       | 20 ++++++
t/t9350-fast-export.sh                       | 75 +++++++++++++++++---
t/t9350/broken-iso-8859-7-commit-message.txt |  1 +
t/t9350/simple-iso-8859-7-commit-message.txt |  1 +
7 files changed, 142 insertions(+), 17 deletions(-)
create mode 100644 t/t9350/broken-iso-8859-7-commit-message.txt
create mode 100644 t/t9350/simple-iso-8859-7-commit-message.txt

Range-diff:
1:  9cc04242bd ! 1:  2d7bb64acf t9350: fix encoding test to actually test reencoding
    @@ -32,15 +32,26 @@
     -	git commit -s -m den file &&
     -	git fast-export wer^..wer >iso8859-1.fi &&
     -	sed "s/wer/i18n/" iso8859-1.fi |
    -+	git commit -s -m "$(printf "Pi: \360")" file &&
    ++	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
     +	git fast-export wer^..wer >iso-8859-7.fi &&
     +	sed "s/wer/i18n/" iso-8859-7.fi |
      		(cd new &&
      		 git fast-import &&
    ++		 # The commit object, if not re-encoded, would be 240 bytes.
    ++		 # Removing the "encoding iso-8859-7\n" header drops 20 bytes.
    ++		 # Re-encoding the Pi character from \xF0 in iso-8859-7 to
    ++		 # \xCF\x80 in utf-8 adds a byte.  Grepping for specific bytes
    ++		 # would be nice, but Windows apparently munges user data
    ++		 # in the form of bytes on the command line to force them to
    ++		 # be characters instead, so we are limited for portability
    ++		 # reasons in subsequent similar tests in this file to check
    ++		 # for size rather than what bytes are present.
    ++		 test 221 -eq "$(git cat-file -s i18n)" &&
    ++		 # Also make sure the commit does not have the "encoding" header
      		 git cat-file commit i18n >actual &&
     -		 grep "Áéí óú" actual)
     -
    -+		 grep $(printf "\317\200") actual)
    ++		 ! grep ^encoding actual)
      '
     +
      test_expect_success 'import/export-marks' '
    @@ -54,3 +65,11 @@
      	git checkout -b copy rein &&
      	git mv file file3 &&
      	git commit -m move1 &&
    +
    + diff --git a/t/t9350/simple-iso-8859-7-commit-message.txt b/t/t9350/simple-iso-8859-7-commit-message.txt
    + new file mode 100644
    + --- /dev/null
    + +++ b/t/t9350/simple-iso-8859-7-commit-message.txt
    +@@
    ++Pi: �
    + \ No newline at end of file
2:  0cd023ac7a = 2:  9fa5695017 fast-import: support 'encoding' commit header
3:  1fddf51402 ! 3:  dfc76573e9 fast-export: avoid stripping encoding header if we cannot reencode
    @@ -35,7 +35,7 @@
      --- a/t/t9350-fast-export.sh
      +++ b/t/t9350-fast-export.sh
     @@
    - 		 grep $(printf "\317\200") actual)
    + 		 ! grep ^encoding actual)
      '
      
     +test_expect_success 'encoding preserved if reencoding fails' '
    @@ -43,15 +43,26 @@
     +	test_when_finished "git reset --hard HEAD~1" &&
     +	test_config i18n.commitencoding iso-8859-7 &&
     +	echo rosten >file &&
    -+	git commit -s -m "$(printf "Pi: \360; Invalid: \377")" file &&
    ++	git commit -s -F "$TEST_DIRECTORY/t9350/broken-iso-8859-7-commit-message.txt" file &&
     +	git fast-export wer^..wer >iso-8859-7.fi &&
     +	sed "s/wer/i18n-invalid/" iso-8859-7.fi |
     +		(cd new &&
     +		 git fast-import &&
     +		 git cat-file commit i18n-invalid >actual &&
    -+		 grep ^encoding actual)
    ++		 grep ^encoding actual &&
    ++		 # Also verify that the commit has the expected size; i.e.
    ++		 # that no bytes were re-encoded to a different encoding.
    ++		 test 252 -eq "$(git cat-file -s i18n-invalid)")
     +'
     +
      test_expect_success 'import/export-marks' '
      
      	git checkout -b marks master &&
    +
    + diff --git a/t/t9350/broken-iso-8859-7-commit-message.txt b/t/t9350/broken-iso-8859-7-commit-message.txt
    + new file mode 100644
    + --- /dev/null
    + +++ b/t/t9350/broken-iso-8859-7-commit-message.txt
    +@@
    ++Pi: �; Invalid: �
    + \ No newline at end of file
4:  4a2e04b3ae = 4:  83b3656b76 fast-export: differentiate between explicitly utf-8 and implicitly utf-8
5:  44aacb1a0b ! 5:  2063122293 fast-export: do automatic reencoding of commit messages only if requested
    @@ -95,14 +95,14 @@
      	test_config i18n.commitencoding iso-8859-7 &&
      	test_tick &&
      	echo rosten >file &&
    - 	git commit -s -m "$(printf "Pi: \360")" file &&
    + 	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
     -	git fast-export wer^..wer >iso-8859-7.fi &&
     +	git fast-export --reencode=yes wer^..wer >iso-8859-7.fi &&
      	sed "s/wer/i18n/" iso-8859-7.fi |
      		(cd new &&
      		 git fast-import &&
     @@
    - 		 grep $(printf "\317\200") actual)
    + 		 ! grep ^encoding actual)
      '
      
     +test_expect_success 'aborting on iso-8859-7' '
    @@ -110,7 +110,7 @@
     +	test_when_finished "git reset --hard HEAD~1" &&
     +	test_config i18n.commitencoding iso-8859-7 &&
     +	echo rosten >file &&
    -+	git commit -s -m "$(printf "Pi: \360")" file &&
    ++	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
     +	test_must_fail git fast-export --reencode=abort wer^..wer >iso-8859-7.fi
     +'
     +
    @@ -119,13 +119,21 @@
     +	test_when_finished "git reset --hard HEAD~1" &&
     +	test_config i18n.commitencoding iso-8859-7 &&
     +	echo rosten >file &&
    -+	git commit -s -m "$(printf "Pi: \360")" file &&
    ++	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
     +	git fast-export --reencode=no wer^..wer >iso-8859-7.fi &&
     +	sed "s/wer/i18n-no-recoding/" iso-8859-7.fi |
     +		(cd new &&
     +		 git fast-import &&
    ++		 # The commit object, if not re-encoded, is 240 bytes.
    ++		 # Removing the "encoding iso-8859-7\n" header would drop 20
    ++		 # bytes.  Re-encoding the Pi character from \xF0 in
    ++		 # iso-8859-7 to \xCF\x80 in utf-8 would add a byte.  I would
    ++		 # grep for the specific bytes, but Windows lamely does not
    ++		 # allow that, so just search for the expected size.
    ++		 test 240 -eq "$(git cat-file -s i18n-no-recoding)" &&
    ++		 # Also make sure the commit has the "encoding" header
     +		 git cat-file commit i18n-no-recoding >actual &&
    -+		 grep $(printf "\360") actual)
    ++		 grep ^encoding actual)
     +'
     +
      test_expect_success 'encoding preserved if reencoding fails' '
    @@ -133,7 +141,7 @@
      	test_when_finished "git reset --hard HEAD~1" &&
      	test_config i18n.commitencoding iso-8859-7 &&
      	echo rosten >file &&
    - 	git commit -s -m "$(printf "Pi: \360; Invalid: \377")" file &&
    + 	git commit -s -F "$TEST_DIRECTORY/t9350/broken-iso-8859-7-commit-message.txt" file &&
     -	git fast-export wer^..wer >iso-8859-7.fi &&
     +	git fast-export --reencode=yes wer^..wer >iso-8859-7.fi &&
      	sed "s/wer/i18n-invalid/" iso-8859-7.fi |

-- 
2.21.0.782.g2063122293