Web lists-archives.com

[PATCH v6 0/7] convert: add support for different encodings




From: Lars Schneider <larsxschneider@xxxxxxxxx>

Hi,

Patches 1-4, 6 are preparation and helper functions.
Patch 5,7 are the actual change.

This series depends on Torsten's 8462ff43e4 (convert_to_git():
safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is already
in next.

Changes since v5:

* added 'core.checkRoundtripEncoding' to let the end user define a list
  of encodings for which Git performs conversion round trip checks
  (Junio's feedback)
* add a test case for round trip conversions
* move all round trip related changes into a new patch
* rename has_missing_utf_bom() to is_missing_required_utf_bom() (Junio)
* remove dead code in conversion round trip check (Junio)
* use *.proj files instead of *.txt files as example (Torsten's feedback)
* recommend explicitly setting the eol attribute if working-tree-encoding
  attribute is used (Torsten)
* improve test case and check that 'git ls-files --eol' works as expected
* rename variables from checkout_encoding to working_tree_encoding


Thanks,
Lars

   RFC: https://public-inbox.org/git/BDB9B884-6D17-4BE3-A83C-F67E2AFA2B46@xxxxxxxxx/
    v1: https://public-inbox.org/git/20171211155023.1405-1-lars.schneider@xxxxxxxxxxxx/
    v2: https://public-inbox.org/git/20171229152222.39680-1-lars.schneider@xxxxxxxxxxxx/
    v3: https://public-inbox.org/git/20180106004808.77513-1-lars.schneider@xxxxxxxxxxxx/
    v4: https://public-inbox.org/git/20180120152418.52859-1-lars.schneider@xxxxxxxxxxxx/
    v5: https://public-inbox.org/git/20180129201855.9182-1-tboegi@xxxxxx/


Base Ref:
Web-Diff: https://github.com/larsxschneider/git/commit/9e31832f3c
Checkout: git fetch https://github.com/larsxschneider/git encoding-v6 && git checkout 9e31832f3c


### Interdiff (v5..v6):

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e25b2c92b..d7a56054a5 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -530,6 +530,12 @@ core.autocrlf::
    This variable can be set to 'input',
    in which case no output conversion is performed.

+core.checkRoundtripEncoding::
+   A comma separated list of encodings that Git performs UTF-8 round
+   trip checks on if they are used in an `working-tree-encoding`
+   attribute (see linkgit:gitattributes[5]). The default value is
+   `SHIFT-JIS`.
+
 core.symlinks::
    If false, symbolic links are checked out as small plain files that
    contain the link text. linkgit:git-update-index[1] and
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index a8dbf4be30..ea5a9509c6 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -283,10 +283,10 @@ visualize the content.

 In these cases you can tell Git the encoding of a file in the working
 directory with the `working-tree-encoding` attribute. If a file with this
-attributes is added to Git, then Git reencodes the content from the
-specified encoding to UTF-8 and stores the result in its internal data
-structure (called "the index"). On checkout the content is encoded
-back to the specified encoding.
+attribute is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
+content in its internal data structure (called "the index"). On checkout
+the content is reencoded back to the specified encoding.

 Please note that using the `working-tree-encoding` attribute may have a
 number of pitfalls:
@@ -296,32 +296,38 @@ number of pitfalls:
   expected encoding. Consequently, these files will appear different
   which typically causes trouble. This is in particular the case for
   older Git versions and alternative Git implementations such as JGit
-  or libgit2 (as of January 2018).
+  or libgit2 (as of February 2018).

-- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
-  errors as the conversion might not be round trip safe.
+- Reencoding content to non-UTF encodings can cause errors as the
+  conversion might not be UTF-8 round trip safe. If you suspect your
+  encoding to not be round trip safe, then add it to `core.checkRoundtripEncoding`
+  to make Git check the round trip encoding (see linkgit:git-config[1]).
+  SHIFT-JIS (Japanese character set) is known to have round trip issues
+  with UTF-8 and is checked by default.

 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').

-Use the `working-tree-encoding` attribute only if you cannot store a file in
-UTF-8 encoding and if you want Git to be able to process the content as
-text.
+Use the `working-tree-encoding` attribute only if you cannot store a file
+in UTF-8 encoding and if you want Git to be able to process the content
+as text.

-Use the following attributes if your '*.txt' files are UTF-16 encoded
-with byte order mark (BOM) and you want Git to perform automatic line
-ending conversion based on your platform.
+As an example, use the following attributes if your '*.proj' files are
+UTF-16 encoded with byte order mark (BOM) and you want Git to perform
+automatic line ending conversion based on your platform.

 ------------------------
-*.txt      text working-tree-encoding=UTF-16
+*.proj     text working-tree-encoding=UTF-16
 ------------------------

-Use the following attributes if your '*.txt' files are UTF-16 little
+Use the following attributes if your '*.proj' files are UTF-16 little
 endian encoded without BOM and you want Git to use Windows line endings
-in the working directory.
+in the working directory. Please note, it is highly recommended to
+explicitly define the line endings with `eol` if the `working-tree-encoding`
+attribute is used to avoid ambiguity.

 ------------------------
-*.txt      working-tree-encoding=UTF-16LE text eol=CRLF
+*.proj         working-tree-encoding=UTF-16LE text eol=CRLF
 ------------------------

 You can get a list of all available encodings on your platform with the
@@ -331,6 +337,13 @@ following command:
 iconv --list
 ------------------------

+If you do not know the encoding of a file, then you can use the `file`
+command to guess the encoding:
+
+------------------------
+file foo.proj
+------------------------
+

 `ident`
 ^^^^^^^
diff --git a/config.c b/config.c
index 1f003fbb90..d0ada9fcd4 100644
--- a/config.c
+++ b/config.c
@@ -1172,6 +1172,11 @@ static int git_default_core_config(const char *var, const char *value)
        return 0;
    }

+   if (!strcmp(var, "core.checkroundtripencoding")) {
+       check_roundtrip_encoding = xstrdup(value);
+       return 0;
+   }
+
    if (!strcmp(var, "core.notesref")) {
        notes_ref_name = xstrdup(value);
        return 0;
diff --git a/convert.c b/convert.c
index 13fad490ce..71dffc7167 100644
--- a/convert.c
+++ b/convert.c
@@ -269,7 +269,7 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 static void trace_encoding(const char *context, const char *path,
               const char *encoding, const char *buf, size_t len)
 {
-   static struct trace_key coe = TRACE_KEY_INIT(CHECKOUT_ENCODING);
+   static struct trace_key coe = TRACE_KEY_INIT(WORKING_TREE_ENCODING);
    struct strbuf trace = STRBUF_INIT;
    int i;

@@ -289,6 +289,39 @@ static void trace_encoding(const char *context, const char *path,
    strbuf_release(&trace);
 }

+static int check_roundtrip(const char* enc_name)
+{
+   /*
+    * check_roundtrip_encoding contains a string of space and/or
+    * comma separated encodings (eg. "UTF-16, ASCII, CP1125").
+    * Search for the given encoding in that string.
+    */
+   const char *found = strcasestr(check_roundtrip_encoding, enc_name);
+   const char *next = found + strlen(enc_name);
+   int len = strlen(check_roundtrip_encoding);
+   return (found && (
+           /*
+            * check that the found encoding is at the
+            * beginning of check_roundtrip_encoding or
+            * that it is prefixed with a space or comma
+            */
+           found == check_roundtrip_encoding || (
+               found > check_roundtrip_encoding &&
+               (*(found-1) == ' ' || *(found-1) == ',')
+           )
+       ) && (
+           /*
+            * check that the found encoding is at the
+            * end of check_roundtrip_encoding or
+            * that it is suffixed with a space or comma
+            */
+           next == check_roundtrip_encoding + len || (
+               next < check_roundtrip_encoding + len &&
+               (*next == ' ' || *next == ',')
+           )
+       ));
+}
+
 static struct encoding {
    const char *name;
    struct encoding *next;
@@ -333,7 +366,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
            error(error_msg, path, enc->name);


-   } else if (has_missing_utf_bom(enc->name, src, src_len)) {
+   } else if (is_missing_required_utf_bom(enc->name, src, src_len)) {
        const char *error_msg = _(
            "BOM is required for '%s' if encoded as %s");
        const char *advise_msg = _(
@@ -367,22 +400,25 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
    trace_encoding("destination", path, default_encoding, dst, dst_len);

    /*
-    * UTF supports lossless round tripping [1]. UTF to other encoding are
-    * mostly round trip safe as Unicode aims to be a superset of all other
-    * character encodings. However, the SHIFT-JIS (Japanese character set)
-    * is an exception as some codes are not round trip safe [2].
+    * UTF supports lossless conversion round tripping [1] and conversions
+    * between UTF and other encodings are mostly round trip safe as
+    * Unicode aims to be a superset of all other character encodings.
+    * However, certain encodings (e.g. SHIFT-JIS) are known to have round
+    * trip issues [2]. Check the round trip conversion for all encodings
+    * listed in core.checkRoundTripEncoding.
     *
-    * Reverse the transformation of 'dst' and check the result with 'src'
-    * if content is written to Git. This ensures no information is lost
-    * during conversion to/from UTF-8.
+    * The round trip check is only performed if content is written to Git.
+    * This ensures that no information is lost during conversion to/from
+    * the internal UTF-8 representation.
     *
     * Please note, the code below is not tested because I was not able to
-    * generate a faulty round trip without iconv error.
+    * generate a faulty round trip without an iconv error. Iconv errors
+    * are already caught above.
     *
     * [1] http://unicode.org/faq/utf_bom.html#gen2
     * [2] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
     */
-   if ((conv_flags & CONV_WRITE_OBJECT) && !strcmp(enc->name, "SHIFT-JIS")) {
+   if ((conv_flags & CONV_WRITE_OBJECT) && check_roundtrip(enc->name)) {
        char *re_src;
        int re_src_len;

@@ -390,6 +426,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
                         enc->name, default_encoding,
                         &re_src_len);

+       trace_printf("Checking roundtrip encoding for %s...\n", enc->name);
        trace_encoding("reencoded source", path, enc->name,
                   re_src, re_src_len);

@@ -397,10 +434,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
            memcmp(src, re_src, src_len)) {
            const char* msg = _("encoding '%s' from %s to %s and "
                        "back is not the same");
-           if (conv_flags & CONV_WRITE_OBJECT)
-               die(msg, path, enc->name, default_encoding);
-           else
-               error(msg, path, enc->name, default_encoding);
+           die(msg, path, enc->name, default_encoding);
        }

        free(re_src);
@@ -1165,8 +1199,12 @@ static struct encoding *git_path_check_encoding(struct attr_check_item *check)
    if (!strcasecmp(value, default_encoding))
        return NULL;

-   enc = xcalloc(1, sizeof(struct convert_driver));
-   enc->name = xstrdup_toupper(value);  /* aways use upper case names! */
+   enc = xcalloc(1, sizeof(*enc));
+   /*
+    * Ensure encoding names are always upper case (e.g. UTF-8) to
+    * simplify subsequent string comparisons.
+    */
+   enc->name = xstrdup_toupper(value);
    *encoding_tail = enc;
    encoding_tail = &(enc->next);

@@ -1228,7 +1266,7 @@ struct conv_attrs {
    enum crlf_action attr_action; /* What attr says */
    enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
    int ident;
-   struct encoding *checkout_encoding; /* Supported encoding or default encoding if NULL */
+   struct encoding *working_tree_encoding; /* Supported encoding or default encoding if NULL */
 };

 static void convert_attrs(struct conv_attrs *ca, const char *path)
@@ -1262,7 +1300,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
            else if (eol_attr == EOL_CRLF)
                ca->crlf_action = CRLF_TEXT_CRLF;
        }
-       ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
+       ca->working_tree_encoding = git_path_check_encoding(ccheck + 5);
    } else {
        ca->drv = NULL;
        ca->crlf_action = CRLF_UNDEFINED;
@@ -1344,7 +1382,7 @@ int convert_to_git(const struct index_state *istate,
        len = dst->len;
    }

-   ret |= encode_to_git(path, src, len, dst, ca.checkout_encoding, conv_flags);
+   ret |= encode_to_git(path, src, len, dst, ca.working_tree_encoding, conv_flags);
    if (ret && dst) {
        src = dst->buf;
        len = dst->len;
@@ -1373,7 +1411,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
    if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
        die("%s: clean filter '%s' failed", path, ca.drv->name);

-   encode_to_git(path, dst->buf, dst->len, dst, ca.checkout_encoding, conv_flags);
+   encode_to_git(path, dst->buf, dst->len, dst, ca.working_tree_encoding, conv_flags);
    crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, conv_flags);
    ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
 }
@@ -1405,7 +1443,7 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
        }
    }

-   ret |= encode_to_worktree(path, src, len, dst, ca.checkout_encoding);
+   ret |= encode_to_worktree(path, src, len, dst, ca.working_tree_encoding);
    if (ret) {
        src = dst->buf;
        len = dst->len;
@@ -1877,7 +1915,7 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
    if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
        return NULL;

-   if (ca.checkout_encoding)
+   if (ca.working_tree_encoding)
        return NULL;

    if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
diff --git a/convert.h b/convert.h
index 1d9539ed0b..765abfbd60 100644
--- a/convert.h
+++ b/convert.h
@@ -56,6 +56,7 @@ struct delayed_checkout {
 };

 extern enum eol core_eol;
+extern char *check_roundtrip_encoding;
 extern const char *get_cached_convert_stats_ascii(const struct index_state *istate,
                          const char *path);
 extern const char *get_wt_convert_stats_ascii(const char *path);
diff --git a/environment.c b/environment.c
index 10a32c20ac..5bae9131ad 100644
--- a/environment.c
+++ b/environment.c
@@ -50,6 +50,7 @@ int check_replace_refs = 1;
 char *git_replace_ref_base;
 enum eol core_eol = EOL_UNSET;
 int global_conv_flags_eol = CONV_EOL_RNDTRP_WARN;
+char *check_roundtrip_encoding = "SHIFT-JIS";
 unsigned whitespace_rule_cfg = WS_DEFAULT_RULE;
 enum branch_track git_branch_track = BRANCH_TRACK_REMOTE;
 enum rebase_setup_type autorebase = AUTOREBASE_NEVER;
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 0f36d4990a..5dcdd5f899 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -4,7 +4,7 @@ test_description='working-tree-encoding conversion via gitattributes'

 . ./test-lib.sh

-GIT_TRACE_CHECKOUT_ENCODING=1 && export GIT_TRACE_CHECKOUT_ENCODING
+GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING

 test_expect_success 'setup test repo' '
    git config core.eol lf &&
@@ -22,6 +22,8 @@ test_expect_success 'setup test repo' '
 test_expect_success 'ensure UTF-8 is stored in Git' '
    git cat-file -p :test.utf16 >test.utf16.git &&
    test_cmp_bin test.utf8.raw test.utf16.git &&
+
+   # cleanup
    rm test.utf8.raw test.utf16.git
 '

@@ -122,6 +124,10 @@ test_expect_success 'eol conversion for UTF-16 encoded files on checkout' '
    cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-16 >crlf.utf16.raw &&
    cp crlf.utf16.raw eol.utf16 &&

+   cat >expectIndexLF <<-\EOF &&
+       i/lf    w/-text attr/text               eol.utf16
+   EOF
+
    git add eol.utf16 &&
    git commit -m eol &&

@@ -130,11 +136,19 @@ test_expect_success 'eol conversion for UTF-16 encoded files on checkout' '
    git -c core.eol=crlf checkout eol.utf16 &&
    test_cmp_bin crlf.utf16.raw eol.utf16 &&

+   # Although the file has CRLF in the working tree, ensure LF in the index
+   git ls-files --eol eol.utf16 >actual &&
+   test_cmp expectIndexLF actual &&
+
    # UTF-16 with LF (Unix line endings)
    rm eol.utf16 &&
    git -c core.eol=lf checkout eol.utf16 &&
    test_cmp_bin lf.utf16.raw eol.utf16 &&

+   # The file LF in the working tree, ensure LF in the index
+   git ls-files --eol eol.utf16 >actual &&
+   test_cmp expectIndexLF actual&&
+
    rm crlf.utf16.raw crlf.utf8.raw lf.utf16.raw lf.utf8.raw &&

    # cleanup
@@ -195,4 +209,45 @@ test_expect_success 'error if encoding garbage is already in Git' '
    git reset --hard $BEFORE_STATE
 '

+test_expect_success 'check roundtrip encoding' '
+   text="hallo there!\nroundtrip test here!" &&
+   printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
+   printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
+   echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
+
+   # SHIFT-JIS encoded files are round-trip checked by default...
+   GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
+       grep "Checking roundtrip encoding for SHIFT-JIS" &&
+   git reset &&
+
+   # ... unless we overwrite the Git config!
+   test_config core.checkRoundTripEncoding "garbage" &&
+   ! GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
+       grep "Checking roundtrip encoding for SHIFT-JIS" &&
+   test_unconfig core.checkRoundTripEncoding &&
+   git reset &&
+
+   # UTF-16 encoded files should not be round-trip checked by default...
+   ! GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
+       grep "Checking roundtrip encoding for UTF-16" &&
+   git reset &&
+
+   # ... unless we tell Git to check it!
+   test_config_global core.checkRoundTripEncoding "UTF-16, UTF-32" &&
+   GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
+       grep "Checking roundtrip encoding for UTF-16" &&
+   git reset &&
+
+   # ... unless we tell Git to check it!
+   # (here we also check that the casing of the encoding is irrelevant)
+   test_config_global core.checkRoundTripEncoding "UTF-32, utf-16" &&
+   GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
+       grep "Checking roundtrip encoding for UTF-16" &&
+   git reset &&
+
+   # cleanup
+   rm roundtrip.shift roundtrip.utf16 &&
+   git reset --hard HEAD
+'
+
 test_done
diff --git a/utf8.c b/utf8.c
index f033fec1c2..5113d26e56 100644
--- a/utf8.c
+++ b/utf8.c
@@ -562,7 +562,7 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
    );
 }

-int has_missing_utf_bom(const char *enc, const char *data, size_t len)
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
 {
    return (
       !strcmp(enc, "UTF-16") &&
diff --git a/utf8.h b/utf8.h
index 26b5e91852..62f86fba64 100644
--- a/utf8.h
+++ b/utf8.h
@@ -93,6 +93,6 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
  *     Section 3.10, D98, page 132
  * [3] https://encoding.spec.whatwg.org/#utf-16le
  */
-int has_missing_utf_bom(const char *enc, const char *data, size_t len);
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);

 #endif



### Patches

Lars Schneider (7):
  strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  strbuf: add xstrdup_toupper()
  utf8: add function to detect prohibited UTF-16/32 BOM
  utf8: add function to detect a missing UTF-16/32 BOM
  convert: add 'working-tree-encoding' attribute
  convert: add tracing for 'working-tree-encoding' attribute
  convert: add round trip check based on 'core.checkRoundtripEncoding'

 Documentation/config.txt         |   6 +
 Documentation/gitattributes.txt  |  73 +++++++++++
 config.c                         |   5 +
 convert.c                        | 256 ++++++++++++++++++++++++++++++++++++++-
 convert.h                        |   2 +
 environment.c                    |   1 +
 sha1_file.c                      |   2 +-
 strbuf.c                         |  13 +-
 strbuf.h                         |   1 +
 t/t0028-working-tree-encoding.sh | 253 ++++++++++++++++++++++++++++++++++++++
 utf8.c                           |  37 ++++++
 utf8.h                           |  25 ++++
 12 files changed, 671 insertions(+), 3 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh


base-commit: 8a2f0888555ce46ac87452b194dec5cb66fb1417
--
2.16.1