Re: [PATCH v5 3/5] fast-export: avoid stripping encoding header if we cannot reencode

On Mon, May 13, 2019 at 04:17:24PM -0700, Elijah Newren wrote:
> When fast-export encounters a commit with an 'encoding' header, it tries
> to reencode in utf-8 and then drops the encoding header.  However, if it
> fails to reencode in utf-8 because e.g. one of the characters in the
> commit message was invalid in the old encoding, then we need to retain
> the original encoding or otherwise we lose information needed to
> understand all the other (valid) characters in the original commit
> message.

Minor question: should this be "utf-8" or "UTF-8"?
We mostly write "UTF-8" in Git.

>
> Signed-off-by: Elijah Newren <newren@xxxxxxxxx>
> ---
>  builtin/fast-export.c                        |  7 +++++--
>  t/t9350-fast-export.sh                       | 21 ++++++++++++++++++++
>  t/t9350/broken-iso-8859-7-commit-message.txt |  1 +
>  3 files changed, 27 insertions(+), 2 deletions(-)
>  create mode 100644 t/t9350/broken-iso-8859-7-commit-message.txt
>
> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> index 9e283482ef..7734a9f5a5 100644
> --- a/builtin/fast-export.c
> +++ b/builtin/fast-export.c
> @@ -642,9 +642,12 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
>  	printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
>  	if (show_original_ids)
>  		printf("original-oid %s\n", oid_to_hex(&commit->object.oid));
> -	printf("%.*s\n%.*s\ndata %u\n%s",
> +	printf("%.*s\n%.*s\n",
>  	       (int)(author_end - author), author,
> -	       (int)(committer_end - committer), committer,
> +	       (int)(committer_end - committer), committer);
> +	if (!reencoded && encoding)
> +		printf("encoding %s\n", encoding);
> +	printf("data %u\n%s",
>  	       (unsigned)(reencoded
>  			  ? strlen(reencoded) : message
>  			  ? strlen(message) : 0),
> diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
> index c721026260..4fd637312a 100755
> --- a/t/t9350-fast-export.sh
> +++ b/t/t9350-fast-export.sh
> @@ -118,6 +118,27 @@ test_expect_success 'iso-8859-7' '
>  		 ! grep ^encoding actual)
>  '
>
> +test_expect_success 'encoding preserved if reencoding fails' '
> +
> +	test_when_finished "git reset --hard HEAD~1" &&
> +	test_config i18n.commitencoding iso-8859-7 &&
> +	echo rosten >file &&
> +	git commit -s -F "$TEST_DIRECTORY/t9350/broken-iso-8859-7-commit-message.txt" file &&
> +	git fast-export wer^..wer >iso-8859-7.fi &&
> +	sed "s/wer/i18n-invalid/" iso-8859-7.fi |
> +		(cd new &&
> +		 git fast-import &&
> +		 git cat-file commit i18n-invalid >actual &&
> +		 # Make sure the commit still has the encoding header
> +		 grep ^encoding actual &&
> +		 # Verify that the commit has the expected size; i.e.
> +		 # that no bytes were re-encoded to a different encoding.
> +		 test 252 -eq "$(git cat-file -s i18n-invalid)" &&
> +		 # ...and check for the original special bytes
> +		 grep $(printf "\360") actual &&
> +		 grep $(printf "\377") actual)
> +'
> +
>  test_expect_success 'import/export-marks' '
>
>  	git checkout -b marks master &&
> diff --git a/t/t9350/broken-iso-8859-7-commit-message.txt b/t/t9350/broken-iso-8859-7-commit-message.txt
> new file mode 100644
> index 0000000000..d06ad75b44
> --- /dev/null
> +++ b/t/t9350/broken-iso-8859-7-commit-message.txt
> @@ -0,0 +1 @@
> +Pi: ?; Invalid: ?
> \ No newline at end of file
> --
> 2.21.0.782.gd8be4ee826
>