Re: Invalid UTF-8 byte? (was: Re: utf)




On Mon, 02 Apr 2018, rhkramer@xxxxxxxxx wrote:
> The wikipedia article is rather interesting, in a quick skim, I learned some 
> interesting things about UTF-8, especially the property of self-
> synchronization.

Yes, UTF-8 is a brilliant design.

> I had trouble reading that large table--but if I simply take the red boxes at 
> face value, maybe there are 10 or so bytes that are not valid UTF-8.  I'll 
> probably first consider the bytes that tomas also mentions, i.e., decimal 254 
> and 255).

In that table, the columns are the four least significant bits (second
hex digit) of a byte, and the rows are the four most significant bits
(first hex digit).  For example, byte 0xC1 is at row C, column 1.
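
A minimal sketch of that row/column lookup in Python.  The function
name table_cell is made up for illustration; it is not something from
the article.

```python
def table_cell(byte):
    """Return the (row, column) hex digits of a byte, as laid out in the table."""
    row = byte >> 4      # four most significant bits: first hex digit
    col = byte & 0x0F    # four least significant bits: second hex digit
    return format(row, "X"), format(col, "X")


print(table_cell(0xC1))  # row 'C', column '1'
```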

The "2-byte", "3-byte" and "4-byte" are comments that remind you of the
self-synchronizing nature of UTF-8, and that these bytes would be
invalid outside of that position in a UTF-8 sequence that encodes a
single code point (but they would be valid in the correct position).
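
To illustrate the self-synchronizing property, here is a rough Python
sketch.  It assumes the input is already valid UTF-8, and the helper
names are invented for the example.  Continuation bytes always look
like 10xxxxxx, so from any offset you can back up (at most 3 bytes) to
find where the current code point starts.

```python
def is_continuation(byte):
    """True for bytes of the form 10xxxxxx (0x80..0xBF)."""
    return (byte & 0xC0) == 0x80


def start_of_code_point(data, pos):
    """Back up from pos to the first byte of the code point covering it.

    Assumes data is valid UTF-8, so at most 3 continuation bytes
    can follow a lead byte.
    """
    start = pos
    while start > 0 and pos - start < 3 and is_continuation(data[start]):
        start -= 1
    return start


data = "a\u20acb".encode("utf-8")   # 'a', EURO SIGN (3 bytes), 'b'
print(start_of_code_point(data, 3))  # offset of the euro sign's lead byte
```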

The stuff in "red" on that table is always invalid in UTF-8: if you
find one of those bytes in a data file, that file is *not* valid UTF-8
(but it could be valid UTF-16, valid UTF-32, or valid ISO-8859-*, etc.).
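
A quick way to look for such bytes, sketched in Python.  This assumes
the red cells are 0xC0, 0xC1 and 0xF5 through 0xFF, which is my reading
of the table and of RFC 3629; the names below are made up for the
example.  Note that a file with none of these bytes can still be
invalid UTF-8 for other reasons, so this is a cheap first test, not a
full validation.

```python
# Bytes that cannot appear anywhere in well-formed UTF-8 (assumed set:
# 0xC0 and 0xC1 could only start overlong forms, 0xF5..0xFF would
# encode values past 0x10FFFF or are not used at all).
NEVER_VALID = frozenset({0xC0, 0xC1}) | frozenset(range(0xF5, 0x100))


def find_never_valid(data):
    """Return (offset, byte) pairs for bytes that are never valid UTF-8."""
    return [(i, b) for i, b in enumerate(data) if b in NEVER_VALID]


print(find_never_valid(b"ok\xfe\xff"))  # reports offsets 2 and 3
```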

> I guess I have a followup question--are those two bytes (or either one of 
> them) also unused in all possible "code pages"?  

For Unicode, yes, because Unicode can't go past code point 0x10ffff.
And that isn't about to change anytime soon (lots of software hardcodes
it somehow, e.g., by limiting the number of UTF-8 bytes that can be used
to encode a single code point...).  I have not read the Unicode standard
to check what it says about future expansions related to the valid code
point range, though.
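
As one example of that hardcoding, here is how Python 3 behaves (a
small sketch; the helper name is invented).  The highest code point
encodes to four bytes, and a four-byte sequence that would decode to
0x110000 is rejected.

```python
MAX_CODE_POINT = 0x10FFFF

# The highest code point takes four bytes in UTF-8: F4 8F BF BF.
highest = chr(MAX_CODE_POINT).encode("utf-8")


def beyond_range_is_rejected():
    """Check that the bytes for 0x110000 (F4 90 80 80) fail to decode."""
    try:
        b"\xf4\x90\x80\x80".decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


print(highest.hex(), beyond_range_is_rejected())
```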

> The problem is that I copy snippets of text from all kinds of sources into 
> those text files (which are formatted like mbox files), so I might find one or 
> both of those bytes in the file already.

Then it isn't a valid Unicode text file in UTF-8 format, and it needs to
be converted (or fixed) first to be encoded in UTF-8 :-)
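
A rough sketch of "check, then convert" in Python.  The function names
are made up, and the source encoding (ISO-8859-1 here) is purely an
assumption: with snippets pasted from many sources, each snippet may
need a different guess, so treat this as a starting point only.

```python
def first_invalid_offset(data):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def to_utf8(data, assumed_encoding="iso-8859-1"):
    """Re-encode data as UTF-8, assuming a legacy encoding if it is not UTF-8."""
    if first_invalid_offset(data) is None:
        return data                      # already valid UTF-8, leave it alone
    return data.decode(assumed_encoding).encode("utf-8")


print(first_invalid_offset(b"abc\xffdef"))  # offset of the 0xFF byte
```

Reading the file in binary mode (open(path, "rb")) and passing the
bytes to first_invalid_offset tells you where to start looking.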

-- 
  Henrique Holschuh