Invalid UTF-8 byte? (was: Re: utf)




On Monday, April 02, 2018 03:39:05 AM Andre Majorel wrote:
> > Why? UTF (especially UTF-8) is vastly superior for all purposes:
> I wouldn't say that. UTF-8 breaks a number of assumptions. For
> instance,
> 1) every character has the same size,
> 2) every byte sequence is a valid character,

A few weeks ago, I was looking for a single byte that can never appear in valid 
UTF-8 (a byte that is invalid on its own, not merely an invalid sequence of 
bytes).  At the time, I tried some googling, but it looked rather hopeless 
(maybe it was my googling that was hopeless).

I know that your statement does not imply there is such a byte, but maybe you 
(or someone else reading this) know(s)?

(The reason I wanted such a byte was to use it as a record separator in a set 
of text files that I use as an askSam "workalike" (or "worksimilar"), so that I 
could use msort, which depends on a 1-byte record separator to separate the 
records ;-), while sorting.  Some of the files already include UTF-8, and, in 
the future, I anticipate all will be in UTF-8.)
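
One way to check the question by brute force, as a minimal Python sketch: encode 
every Unicode code point (skipping the surrogates, which UTF-8 cannot encode) 
and note which byte values never turn up.  The function name and the sample 
records are made up for illustration, and whether msort accepts such a byte as 
its separator is an assumption this sketch does not test.

```python
def bytes_never_used_in_utf8():
    """Return the byte values that occur in no valid UTF-8 encoding."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points cannot be encoded as UTF-8
        used.update(chr(cp).encode("utf-8"))
    return sorted(set(range(256)) - used)


unused = bytes_never_used_in_utf8()
print(" ".join(f"{b:02X}" for b in unused))
# prints: C0 C1 F5 F6 F7 F8 F9 FA FB FC FD FE FF

# Hypothetical use: one of those bytes as a 1-byte record separator.
SEP = b"\xff"
data = "na\u00efve record".encode("utf-8") + SEP + "second record".encode("utf-8")
records = [chunk.decode("utf-8") for chunk in data.split(SEP)]
print(records)
```

Since none of these bytes can occur inside a valid UTF-8 character, splitting 
on one of them cannot cut a multi-byte character in half.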



> 3) the equality or inequality of two characters comes down to
>    the equality or inequality of the bytes they encode to.