Re: Invalid UTF-8 byte? (was: Re: utf)




On Wed, 04 Apr 2018, tomas@xxxxxxxxxx wrote:
> On Wed, Apr 04, 2018 at 08:18:23AM -0300, Henrique de Moraes Holschuh wrote:
> > On Tue, 03 Apr 2018, Michael Lange wrote:
> > > I believe (please anyone correct me if I am wrong) that "text" files
> > > won't contain any null byte; many text editors even refuse to open such a
> > 
> > Depends on the encoding.  For ASCII, ISO-8859-* and UTF-8 (and any other
> > modern encoding AFAIK, other than modified UTF-8), any zero bytes map
> > one-to-one to the NUL character/code point.  I don't recall how it is on
> > other common encodings of the 80's and 90's, though.
> 
> Try UTF-16, what Microsoft (and a couple of years ago Apple) love to
> call "Unicode": in more "Western" contexts every second byte is NULL!

Ah, yes.  I forgot about them, indeed.  UTF-16BE and UTF-16LE will have
zero bytes in the resulting byte stream.  And I suppose one could call
them "modern encodings", even if they are horrifying to use when
compared to UTF-8 (UTF-16 has byte-order issues) or UTF-32 (UTF-16 has
surrogate pairs).
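
A quick way to see the zero bytes is to encode a short ASCII-only string
and count them. This is only an illustrative sketch; the sample string
and the helper name zero_count are made up for the example.

```python
# Count zero bytes produced by different encodings of the same text.
text = "Hello"  # sample "Western" text, chosen only for illustration


def zero_count(data: bytes) -> int:
    """Return how many bytes in data have the value zero."""
    return data.count(b"\x00")


utf8 = text.encode("utf-8")
utf16le = text.encode("utf-16-le")  # low byte first, so every second byte is zero
utf16be = text.encode("utf-16-be")  # high byte first, so the stream starts with zero

for name, data in (("utf-8", utf8), ("utf-16-le", utf16le), ("utf-16-be", utf16be)):
    print(name, len(data), "bytes,", zero_count(data), "zero bytes")
```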

> > Some even-more-modern encodings (modified UTF-8 :p) simply do NOT use
> > bytes with the value of zero when encoding characters, so NUL is encoded
> > by a different sequence, and you can safely use a byte with the value of
> > zero for some out-of-band control [...]
> 
> Yes, the problem is that someone else before you could have been doing
> exactly that.

You can use modified-UTF-8 bit-packing to encode anything, and the result
will be zero-free (and it will restore the zeroes when decoded).  The
price is a size increase (it is a variant of UTF-8 that uses two bytes
to encode NUL, which would take just one byte in normal UTF-8).  There
are much better bit packing schemes if you just need to escape zeroes
;-)
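
The NUL handling and its size cost can be sketched in a few lines. This
covers only the part discussed here (NUL written as the two bytes C0 80);
the modified UTF-8 used by Java also treats characters outside the BMP
differently, which this sketch ignores. The function names are
hypothetical.

```python
def encode_mutf8_nul(text: str) -> bytes:
    """Encode as UTF-8, but write each NUL as the two-byte sequence C0 80."""
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_mutf8_nul(data: bytes) -> str:
    """Reverse encode_mutf8_nul.

    The byte 0xC0 never occurs in valid UTF-8, so replacing C0 80 cannot
    touch any other character's encoding.
    """
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


sample = "a\x00b\x00c"
plain = sample.encode("utf-8")      # 5 bytes, two of them zero
packed = encode_mutf8_nul(sample)   # 7 bytes, none of them zero

print(len(plain), len(packed), b"\x00" in packed)
```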

That said, it is always safe to break valid "modified UTF-8" into
records using zeroes, as long as you don't expect the result to be valid
UTF-8 (it isn't, because NULs will be encoded using a non-minimal byte
sequence, which is invalid even though a lax decoder *will* decode it to
a zero) or valid modified UTF-8 (it isn't, because a zero byte is not a
valid encoding for NUL in modified UTF-8).  But a lax UTF-8 or modified
UTF-8 decoder *would* parse "modified UTF-8 with zeroes as record
separators" and reconstruct the Unicode text properly (but it would read
the record separators as NULs, so you'd get extra NULs in the resulting
text).

That, of course, assumes you have Unicode text as the input (encoding
doesn't matter, as long as you know it), and recode it to modified UTF-8
before you add the zeroes as end-of-record marks.  This is not about
bit-packing generic binary data.
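
The whole procedure (recode to modified UTF-8, then add zeroes as
end-of-record marks) might look like the sketch below. It assumes the
input is already decoded Unicode text, handles only the NUL part of
modified UTF-8, and uses hypothetical helper names. It also shows the two
caveats above: a strict UTF-8 decoder rejects the stream, and a lax
decode of the whole stream turns the record marks into extra NULs.

```python
SEP = b"\x00"  # end-of-record mark; never produced by to_mutf8()


def to_mutf8(text: str) -> bytes:
    # NUL handling only: write NUL as C0 80 so no zero byte is emitted.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def from_mutf8(data: bytes) -> str:
    # Lax decoder: accepts C0 80 and also raw zero bytes as NUL.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


def pack(records: list) -> bytes:
    """Concatenate records, terminating each one with a zero byte."""
    return b"".join(to_mutf8(r) + SEP for r in records)


def unpack(stream: bytes) -> list:
    """Split on zero bytes first, then decode each record separately."""
    parts = stream.split(SEP)
    return [from_mutf8(p) for p in parts[:-1]]  # drop the empty tail


records = ["plain", "has\x00NUL", "caf\u00e9"]
stream = pack(records)
roundtrip = unpack(stream)

# A strict UTF-8 decoder refuses the non-minimal C0 80 sequence.
try:
    stream.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lax decode of the whole stream reads the record marks as NULs.
lax_text = from_mutf8(stream)
```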

> I'd guard against that. It's not exactly difficult, the traditional
> "escape" mechanism (aka character stuffing) does it pretty well...

Yes, any bitstuffing/escape-based wrapping would do.
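
For comparison, a minimal byte-stuffing sketch that works on arbitrary
bytes rather than on text. The escape byte (0x1B) and the two escape
codes are arbitrary choices made for this example, not part of any
standard; the only requirement is that the output never contains a zero
byte.

```python
ESC = 0x1B                       # arbitrary escape byte for this sketch
_ESCAPE = {0x00: 0x01, ESC: 0x02}
_UNESCAPE = {code: byte for byte, code in _ESCAPE.items()}


def stuff(data: bytes) -> bytes:
    """Escape zero and ESC bytes so the output contains no zero byte."""
    out = bytearray()
    for b in data:
        if b in _ESCAPE:
            out.append(ESC)
            out.append(_ESCAPE[b])
        else:
            out.append(b)
    return bytes(out)


def unstuff(data: bytes) -> bytes:
    """Reverse stuff(); raise ValueError on malformed input."""
    out = bytearray()
    it = iter(data)
    for b in it:
        if b == ESC:
            code = next(it, None)
            if code not in _UNESCAPE:
                raise ValueError("truncated or unknown escape sequence")
            out.append(_UNESCAPE[code])
        else:
            out.append(b)
    return bytes(out)


payload = bytes([0x00, 0x01, ESC, 0x00, 0x41])
stuffed = stuff(payload)
print(stuffed.hex(), unstuff(stuffed) == payload)
```

Each escaped byte costs one extra byte, so the overhead depends on how
often zero and the escape byte occur in the input.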

-- 
  Henrique Holschuh