
Re: Invalid UTF-8 byte? (was: Re: utf)




On Monday, April 02, 2018 06:43:28 PM Michael Lange wrote:
> On Mon, 2 Apr 2018 08:37:54 -0400
> 
> rhkramer@xxxxxxxxx wrote:
> > A few weeks ago, I was looking for a byte that, in UTF-8, would be a
> > totally invalid byte (not an invalid sequence of bytes).  At the time,
> > I tried some googling, but it looked rather hopeless (maybe it was my
> > googling that was hopeless).
> > 
> > I know that your statement does not imply there is such a byte, but
> > maybe you (or someone else reading this) know(s)?
> > 
> > (The reason I wanted such a byte was to use it as a record separator in
> > a set of text files that I use as an askSam "workalike" (or
> > "worksimilar"), so that I could use msort, which depends on a 1-byte
> > record separator to separate the records ;-), while sorting.  Some
> > of the files already include UTF-8, and, in the future, I anticipate all
> > will be in UTF-8.)
> 
> maybe you could use the null byte?

Thanks!

Surprisingly (to me), this (and maybe several of the other control characters) 
might work: I searched one of the files, and it contains no null bytes.
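
A search like that can be sketched in a few lines of Python. This is only an illustration, not the tool I actually used; the function name count_byte is made up for the example, and it assumes the file is read as raw bytes rather than decoded text.

```python
def count_byte(path, byte_value):
    """Count how often a single byte value (0-255) occurs in a file."""
    needle = bytes([byte_value])
    total = 0
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files need not fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            total += chunk.count(needle)
    return total
```

A result of 0 for count_byte(path, 0x00) means the null byte is free to serve as a separator in that file; the same call with another value checks any other candidate.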

Next I'll have to refresh my memory on how to replace the existing From with 
From preceded by the null character, i.e., something like:

Find: \n\nFrom 
Replace with: \n\n0x00\nFrom

I'll probably look into doing that with something like Awk or Perl.  I'll have 
to review how to represent hex 00 in the Awk or Perl statement.
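
For comparison, here is a minimal sketch of the same find-and-replace in Python rather than Awk or Perl. The function names add_separators and convert_file are hypothetical, and the sketch assumes each record starts with "From " after a blank line and that the whole file fits in memory. (In Perl, the null byte can be written as \x00 inside a double-quoted string or a substitution.)

```python
def add_separators(data: bytes) -> bytes:
    """Put a line holding a single null byte before each 'From ' record."""
    # b"\x00" is the null byte; working on bytes avoids any decoding step.
    return data.replace(b"\n\nFrom ", b"\n\n\x00\nFrom ")


def convert_file(src: str, dst: str) -> None:
    """Read src as raw bytes and write the separated version to dst."""
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(add_separators(data))
```

Note that a record at the very start of the file is not preceded by a blank line, so it gets no separator in front of it; that is harmless if the separator only needs to sit between records.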

(I didn't check whether any 0xff bytes are present in the file; I suspect 
there aren't, so I could use that as well.)
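
On the original question of a byte that is never valid in UTF-8: as far as I understand the encoding, a handful of byte values, 0xff among them, never appear in any well-formed UTF-8 text. The brute-force sketch below checks this by encoding every code point and recording which byte values turn up. The function name is made up for the example, and it assumes Python's built-in "utf-8" codec follows the standard.

```python
def bytes_unused_by_utf8():
    """Return the byte values that no UTF-8 encoded character contains."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        used.update(chr(cp).encode("utf-8"))
    return sorted(set(range(256)) - used)


if __name__ == "__main__":
    print([hex(b) for b in bytes_unused_by_utf8()])
```

This should report 0xc0, 0xc1, and 0xf5 through 0xff. Any of those works as a one-byte separator that cannot collide with the text, at the cost that the file as a whole is no longer valid UTF-8, so tools that validate their input may complain.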