
Re: Invalid UTF-8 byte? (was: Re: utf)




-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, Apr 04, 2018 at 08:18:23AM -0300, Henrique de Moraes Holschuh wrote:
> On Tue, 03 Apr 2018, Michael Lange wrote:
> > I believe (please anyone correct me if I am wrong) that "text" files
> > won't contain any null byte; many text editors even refuse to open such a
> 
> Depends on the encoding.  For ASCII, ISO-8859-* and UTF-8 (and any other
> modern encoding AFAIK, other than modified UTF-8), any zero bytes map
> one-to-one to the NUL character/code point.  I don't recall how it is on
> other common encodings of the 80's and 90's, though.

Try UTF-16, what Microsoft (and a couple of years ago Apple) love to
call "Unicode": for mostly "Western" text (ASCII or Latin-1 range),
every second byte is NUL!
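
A quick way to see this, as a minimal sketch (the sample string and
variable names are just for illustration; "utf-16-le" is picked so that
no byte order mark gets prepended):

```python
# Encode a plain ASCII string as UTF-16 (little endian, no BOM).
text = "hello"
encoded = text.encode("utf-16-le")

# Each ASCII character becomes two bytes, the second of which is zero.
print(encoded)  # b'h\x00e\x00l\x00l\x00o\x00'

# Count the zero bytes: one per character for this sample.
zero_count = encoded.count(0)
print(zero_count, "zero bytes in", len(encoded), "bytes")

# The same text in UTF-8 contains no zero byte at all.
utf8_has_zero = b"\x00" in text.encode("utf-8")
print("UTF-8 contains a zero byte:", utf8_has_zero)
```

So a "no zero bytes means text" heuristic would misclassify perfectly
ordinary UTF-16 files.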

> Some even-more-modern encodings (modified UTF-8 :p) simply do NOT use
> bytes with the value of zero when encoding characters, so NUL is encoded
> by a different sequence, and you can safely use a byte with the value of
> zero for some out-of-band control [...]

Yes, the problem is that someone else before you could have been doing
exactly that.

I'd guard against that. It's not exactly difficult: the traditional
"escape" mechanism (aka character stuffing) does it pretty well...
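
Roughly like this, as a sketch only: the escape byte (0x1B), the
two-byte escape codes and all function names below are arbitrary choices
of mine for illustration, not any standard framing format. The idea is
that after stuffing, the payload contains no zero byte, so a zero byte
is free to serve as out-of-band separator even if the input already had
zeros in it.

```python
DELIM = 0x00     # byte reserved for out-of-band use (record separator)
ESC = 0x1B       # escape byte; arbitrary choice for this sketch
ESC_ZERO = 0x30  # ESC followed by this stands for a literal zero byte


def stuff(payload: bytes) -> bytes:
    """Escape the payload so the result contains no DELIM byte."""
    out = bytearray()
    for b in payload:
        if b == ESC:
            out += bytes([ESC, ESC])       # literal ESC is doubled
        elif b == DELIM:
            out += bytes([ESC, ESC_ZERO])  # literal zero is escaped
        else:
            out.append(b)
    return bytes(out)


def unstuff(data: bytes) -> bytes:
    """Reverse stuff(); reject malformed escape sequences."""
    out = bytearray()
    it = iter(data)
    for b in it:
        if b != ESC:
            out.append(b)
            continue
        nxt = next(it, None)  # byte following the escape, if any
        if nxt == ESC:
            out.append(ESC)
        elif nxt == ESC_ZERO:
            out.append(DELIM)
        else:
            raise ValueError("invalid escape sequence")
    return bytes(out)


def join_records(records):
    """Stuff each record, then separate the records with DELIM."""
    return bytes([DELIM]).join(stuff(r) for r in records)


def split_records(stream: bytes):
    """Split on DELIM, then undo the stuffing in each chunk."""
    return [unstuff(chunk) for chunk in stream.split(bytes([DELIM]))]


records = [b"plain", b"with\x00zero", b"esc\x1bbyte", b""]
stream = join_records(records)
roundtrip = split_records(stream)
print(roundtrip == records)
```

The cost is a slightly longer stream whenever the input contains the
escape or the reserved byte; the gain is that it no longer matters what
whoever produced the data was using zero bytes for.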

Cheers
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlrE3JcACgkQBcgs9XrR2kYW7ACeMG0SQB23RSySoeSJBItB+Eji
QEgAnipwAcoVJuzynJVBO1CR2rrLeuFs
=xhja
-----END PGP SIGNATURE-----