Re: Invalid UTF-8 byte? (was: Re: utf)




-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mon, Apr 02, 2018 at 03:18:38PM -0300, Henrique de Moraes Holschuh wrote:
> On Mon, 02 Apr 2018, rhkramer@xxxxxxxxx wrote:
> > The wikipedia article is rather interesting, in a quick skim, I learned some 
> > interesting things about UTF-8, especially the property of self-
> > synchronization.
> 
> Yes, UTF-8 is a brilliant design.
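
For what it's worth, the self-synchronization property is easy to see in
code. This is only a sketch, assuming Python 3 and well-formed input; the
names is_continuation and resync are made up for illustration. Continuation
bytes always look like 10xxxxxx, so from any offset you can skip forward to
the next character boundary without decoding from the start:

```python
def is_continuation(byte: int) -> bool:
    # In UTF-8, every non-initial byte of a multi-byte character
    # has the bit pattern 10xxxxxx.
    return byte & 0xC0 == 0x80


def resync(data: bytes, offset: int) -> int:
    """Return the index of the first character boundary at or after offset."""
    while offset < len(data) and is_continuation(data[offset]):
        offset += 1
    return offset


text = "naïve €uro".encode("utf-8")
# Offset 8 lands in the middle of the three-byte euro sign;
# resync() moves on to the start of the following character.
start = resync(text, 8)
print(start, text[start:].decode("utf-8"))
```

This assumes the bytes really are valid UTF-8; on arbitrary input a run of
10xxxxxx bytes tells you nothing.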

Possibly relevant, definitely entertaining, Rob Pike's account
of UTF-8's gestation [1]

Yeah. Elegant design. Until the Unicode Consortium left Microsoft
near it (Byte Order Mark, I'm looking at you!).

[...]

> > I guess I have a followup question--are those two bytes (or either one of 
> > them) also unused in all possible "code pages"?  

I'm not sure what you mean here: there are two layers at work (at least
if you have UTF-8 encoded Unicode). As Henrique says, if you assume
both to be "correct" then you get more illegal things. But sometimes
UTF-8 encoding is used for other things (notably Emacs encodes a superset
of Unicode to be able to express "raw byte values" next to "Unicode
characters").

> > The problem is that I copy snippets of text from all kinds of sources into 
> > those text files (which are formatted like mbox files), so I might find one or 
> > both of those bytes in the file already.
> 
> Then it isn't a valid unicode text file in UTF-8 format, and it needs to
> be converted (or fixed) first to be encoded in UTF-8 :-)

Agreed: if you don't know what's coming in, you'd better plan for anything :)
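
One way to "plan for anything" is to check a file before trusting it. A
rough sketch, assuming Python 3 and its strict built-in UTF-8 decoder; the
function name first_invalid_utf8 is hypothetical:

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte) where strict UTF-8 decoding first fails, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first offending byte.
        return err.start, data[err.start]
    return None


good = "héllo".encode("utf-8")
bad = b"abc\xffdef"  # 0xFF never occurs in valid UTF-8

print(first_invalid_utf8(good))
print(first_invalid_utf8(bad))

# A lossy fallback: substitute U+FFFD for whatever cannot be decoded.
print(bad.decode("utf-8", errors="replace"))
```

The errors="replace" fallback throws information away, so if you can guess
where a snippet came from, converting it from its real encoding is the
better fix.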

Cheers
- -- t
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlrCeTcACgkQBcgs9XrR2kbRtgCfaRHoodlkFFt8Gm0Oq438ymvg
0oMAn2NkpsqMJ3Tcy5BvAJIpTvfG8mdj
=iVqF
-----END PGP SIGNATURE-----