Web lists-archives.com

Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings




On Wed, Mar 7, 2018 at 12:30 PM,  <lars.schneider@xxxxxxxxxxxx> wrote:
> Check that new content is valid with respect to the user defined
> 'working-tree-encoding' attribute.
>
> Signed-off-by: Lars Schneider <larsxschneider@xxxxxxxxx>
> ---
> diff --git a/convert.c b/convert.c
> @@ -266,6 +266,58 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
> +static int validate_encoding(const char *path, const char *enc,
> +                     const char *data, size_t len, int die_on_error)
> +{
> +       /* We only check for UTF here as UTF?? can be an alias for UTF-?? */
> +       if (startscase_with(enc, "UTF")) {
> +               /*
> +                * Check for detectable errors in UTF encodings
> +                */
> +               if (has_prohibited_utf_bom(enc, data, len)) {
> +                       const char *error_msg = _(
> +                               "BOM is prohibited in '%s' if encoded as %s");
> +                       /*
> +                        * This advice is shown for UTF-??BE and UTF-??LE encodings.
> +                        * We cut off the last two characters of the encoding name
> +                        # to generate the encoding name suitable for BOMs.

s/#/*/

> +                        */
> +                       const char *advise_msg = _(
> +                               "The file '%s' contains a byte order "
> +                               "mark (BOM). Please use %s as "
> +                               "working-tree-encoding.");
> +                       char *upper_enc = xstrdup_toupper(enc);
> +                       upper_enc[strlen(upper_enc)-2] = '\0';

Due to startscase_with(...,"UTF"), we know at this point that the
string is at least 3 characters long, thus it's safe to back up by 2.
Good.

> +                       advise(advise_msg, path, upper_enc);
> +                       free(upper_enc);
> +                       if (die_on_error)
> +                               die(error_msg, path, enc);
> +                       else {
> +                               return error(error_msg, path, enc);
> +                       }
> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
> @@ -62,6 +62,46 @@ test_expect_success 'check $GIT_DIR/info/attributes support' '
>  for i in 16 32
>  do
> +       test_expect_success "check prohibited UTF-${i} BOM" '
> +               test_when_finished "git reset --hard HEAD" &&
> +
> +               echo "*.utf${i}be text working-tree-encoding=utf-${i}be" >>.gitattributes &&
> +               echo "*.utf${i}le text working-tree-encoding=utf-${i}le" >>.gitattributes &&

v10 is checking only hyphenated lowercase encoding name; earlier
versions checked uppercase. For better coverage, it would be nice to
check several combinations: all uppercase, all lowercase, mixed case,
hyphenated, not hyphenated.

I'm not suggesting running all the tests repeatedly but rather just
varying the format of the encoding name in these tests you're adding.
For instance, the above could instead be:

    echo "*.utf${i}be text working-tree-encoding=UTF-${i}be" >>.gitattributes &&
    echo "*.utf${i}le text working-tree-encoding=utf${i}LE" >>.gitattributes &&

or something.

> +               # Here we add a UTF-16 (resp. UTF-32) files with BOM (big/little-endian)
> +               # but we tell Git to treat it as UTF-16BE/UTF-16LE (resp. UTF-32).
> +               # In these cases the BOM is prohibited.
> +               cp bebom.utf${i}be.raw bebom.utf${i}be &&
> +               test_must_fail git add bebom.utf${i}be 2>err.out &&
> +               test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out &&
> +
> +               cp lebom.utf${i}le.raw lebom.utf${i}be &&
> +               test_must_fail git add lebom.utf${i}be 2>err.out &&
> +               test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out &&
> +
> +               cp bebom.utf${i}be.raw bebom.utf${i}le &&
> +               test_must_fail git add bebom.utf${i}le 2>err.out &&
> +               test_i18ngrep "fatal: BOM is prohibited .* utf-${i}le" err.out &&
> +
> +               cp lebom.utf${i}le.raw lebom.utf${i}le &&
> +               test_must_fail git add lebom.utf${i}le 2>err.out &&
> +               test_i18ngrep "fatal: BOM is prohibited .* utf-${i}le" err.out
> +       '
> +
> +       test_expect_success "check required UTF-${i} BOM" '
> +               test_when_finished "git reset --hard HEAD" &&
> +
> +               echo "*.utf${i} text working-tree-encoding=utf-${i}" >>.gitattributes &&

This is another opportunity for checking a variation (case,
hyphenation) of the encoding name rather than using only hyphenated
lowercase.

> +
> +               cp nobom.utf${i}be.raw nobom.utf${i} &&
> +               test_must_fail git add nobom.utf${i} 2>err.out &&
> +               test_i18ngrep "fatal: BOM is required .* utf-${i}" err.out &&
> +
> +               cp nobom.utf${i}le.raw nobom.utf${i} &&
> +               test_must_fail git add nobom.utf${i} 2>err.out &&
> +               test_i18ngrep "fatal: BOM is required .* utf-${i}" err.out
> +       '