
Re: Cygwin fails to utilize Unicode replacement character




On 04.09.2018 at 21:53, Steven Penny wrote:
On Tue, 4 Sep 2018 20:41:48, Thomas Wolff wrote:
...
the .notdef glyph is not an appropriate indication of illegal encoding (like broken UTF-8 bytes)

true, but neither is U+2592. as far as i know U+2592 is not defined officially
anywhere as being a representation of anything other than "MEDIUM SHADE".
Traditionally, many terminals used to display the DEL character as a checkered block, which is more or less the MEDIUM SHADE.
This makes the glyph appear somewhat "erroneous" by convention.

Corinna originally added it in 2009:

http://cygwin.com/git/gitweb.cgi?p=newlib-cygwin.git&a=commitdiff&h=161211d

with no justification of why it was chosen that i can tell.
Justification is traditional usage of the symbol as described above.

similarly, mintty
actually changed from U+FFFD to U+2592 in 2009:

http://github.com/mintty/mintty/commit/90c11d3

with actually a good reason, which was to avoid ambiguity with fonts that didn't have U+FFFD. but again, no reason why U+2592 was chosen. i personally see both sides of the argument but i tend to land on the side of any standards if they exist.

Here is the standard for U+FFFD:

http://unicode.org/charts/nameslist/n_FFF0.html
FFFD     �     Replacement Character
          •    used to replace an incoming character whose value is unknown or unrepresentable in Unicode

if we were to use something other than U+FFFD, I would propose U+25A1, as it is
also defined by Unicode:

   25A1     □     White Square
   •    may be used to represent a missing ideograph

http://unicode.org/charts/nameslist/n_25A0.html
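
For reference, the official names of the three code points under discussion can be checked against the Unicode Character Database. A minimal sketch using Python's standard unicodedata module (CANDIDATES and describe are illustrative names, not something from this thread):

```python
import unicodedata

# Code points discussed in this thread as possible "error" indicators.
CANDIDATES = ("\u2592", "\uFFFD", "\u25A1")

def describe(ch: str) -> str:
    """Return a 'U+XXXX NAME' string for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for ch in CANDIDATES:
    print(describe(ch))
```

This only reports the character names; it says nothing about whether a given font actually contains a glyph for them.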
Quoting yourself from your other response:
U+2592 MEDIUM SHADE is *only* used in cases of invalid UTF-8. In case of missing character - the ".notdef" glyph is used
This is my point. We have two use cases here:
• invalid code point -> MEDIUM SHADE
• valid code point with no glyph in font -> .notdef glyph -> WHITE SQUARE
Now if you switch to FFFD REPLACEMENT CHARACTER for invalid code points, then, considering that it does not exist in most actual fonts and that the console does not apply font fallback, it will resolve to WHITE SQUARE, thus folding the two different use cases into the same appearance, which is bad.
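
The distinction between the two cases can be illustrated outside the console. The following is a rough sketch in Python, not how Cygwin or mintty implement it: a custom decode error handler substitutes U+2592 for broken UTF-8 input, while valid code points pass through untouched (whether the font then has a glyph for them is a separate question the decoder cannot answer). The handler name "medium_shade" and the function decode_for_display are made up for this example.

```python
import codecs

MEDIUM_SHADE = "\u2592"  # marker for invalid encoding, as argued above

def medium_shade_errors(exc):
    """Decode error handler: emit one MEDIUM SHADE per undecodable span."""
    if isinstance(exc, UnicodeDecodeError):
        # Return the replacement text and the position to resume decoding at.
        return (MEDIUM_SHADE, exc.end)
    raise exc  # encoding/translation errors are out of scope for this sketch

# "medium_shade" is a hypothetical handler name chosen for this example.
codecs.register_error("medium_shade", medium_shade_errors)

def decode_for_display(raw: bytes) -> str:
    """Decode UTF-8 input, marking broken byte sequences with U+2592."""
    return raw.decode("utf-8", errors="medium_shade")

if __name__ == "__main__":
    broken = b"a\xffb"  # 0xFF can never appear in valid UTF-8
    print(decode_for_display(broken))                # invalid bytes marked with U+2592
    print(broken.decode("utf-8", errors="replace"))  # Python's default marker is U+FFFD
```

With errors="replace", the invalid byte becomes U+FFFD, which a font lacking that glyph would render as its .notdef box, so the two cases would look the same on screen; with the custom handler they stay visually distinct.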
Thomas

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple