Re: Invalid UTF-8 byte? (was: Re: utf)
- Date: Mon, 2 Apr 2018 20:40:55 +0200
- From: <tomas@xxxxxxxxxx>
- Subject: Re: Invalid UTF-8 byte? (was: Re: utf)
-----BEGIN PGP SIGNED MESSAGE-----
On Mon, Apr 02, 2018 at 03:18:38PM -0300, Henrique de Moraes Holschuh wrote:
> On Mon, 02 Apr 2018, rhkramer@xxxxxxxxx wrote:
> > The wikipedia article is rather interesting, in a quick skim, I learned some
> > interesting things about UTF-8, especially the property of self-
> > synchronization.
> Yes, UTF-8 is a brilliant design.
Possibly relevant, definitely entertaining, Rob Pike's account
of UTF-8's gestation 
Yeah. Elegant design. Until the Unicode Consortium left Microsoft
near it (Byte Order Mark, I'm looking at you!).
> > I guess I have a followup question--are those two bytes (or either one of
> > them) also unused in all possible "code pages"?
I'm not sure what you mean here: there are two layers at work (at least
if you have UTF-8 encoded Unicode): the byte-level encoding, and the
code points it carries. As Henrique says, if you require both layers to
be "correct", then more byte sequences become illegal. But sometimes
UTF-8 encoding is used for other things (notably Emacs encodes a superset
of Unicode, to be able to express "raw byte values" next to "Unicode
characters").
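
To make the two layers concrete, here is a rough sketch in Python. The
helper name `decodes_strictly` and the sample byte strings are my own
picks for illustration, not anything from this thread. Some bytes break
the UTF-8 byte layout itself, while others follow the layout but would
encode something Unicode forbids, such as a surrogate.

```python
def decodes_strictly(data: bytes) -> bool:
    """True if data is accepted by Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Encoding layer: 0xFF is not a legal byte anywhere in UTF-8.
print(decodes_strictly(b"\xff"))

# Unicode layer: ED A0 80 follows the three-byte layout, but it would
# encode U+D800, a surrogate, which is not allowed in UTF-8.
print(decodes_strictly(b"\xed\xa0\x80"))

# Python's "surrogatepass" error handler relaxes only the Unicode layer,
# so the same bytes decode to a lone surrogate.
print(b"\xed\xa0\x80".decode("utf-8", "surrogatepass") == "\ud800")

# An ordinary character (the euro sign) passes both layers.
print(decodes_strictly("\u20ac".encode("utf-8")))
```

A decoder that relaxes the second layer, as "surrogatepass" does here,
is similar in spirit to tools that reuse the UTF-8 scheme for more than
strict Unicode.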
> > The problem is that I copy snippets of text from all kinds of sources into
> > those text files (which are formatted like mbox files), so I might find one or
> > both of those bytes in the file already.
> Then it isn't a valid unicode text file in UTF-8 format, and it needs to
> be converted (or fixed) first to be encoded in UTF-8 :-)
Agreed: if you don't know what's coming in, you'd better plan for anything :)
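
One way to plan for it, as a minimal sketch in Python (the function name
`first_invalid_utf8` is made up for this example): attempt a strict
decode of the incoming bytes and report where it first fails, so the
snippet can be converted or fixed before it goes into the file.

```python
def first_invalid_utf8(data: bytes):
    """Return the offset of the first byte sequence that breaks strict
    UTF-8 decoding, or None if the whole buffer decodes cleanly."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # exc.start is the index in data where decoding failed.
        return exc.start
    return None


if __name__ == "__main__":
    # Clean input: "h\u00e9llo" encoded as UTF-8.
    print(first_invalid_utf8("h\u00e9llo".encode("utf-8")))
    # A stray 0xFF byte after three ASCII characters.
    print(first_invalid_utf8(b"abc\xffdef"))
```

This only tells you that something is wrong and where. Guessing what
encoding the offending snippet was really in is a separate problem.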
- -- t
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
-----END PGP SIGNATURE-----