
Re: Invalid UTF-8 byte? (was: Re: utf)




Thanks to tomas and Henrique!

The Wikipedia article is rather interesting. In a quick skim, I learned some 
interesting things about UTF-8, especially the property of 
self-synchronization.

I had trouble reading that large table--but if I simply take the red boxes at 
face value, maybe there are 10 or so bytes that are not valid UTF-8.  I'll 
probably first consider the bytes that tomas also mentions, i.e., decimal 254 
and 255.
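
For what it's worth, the count can be checked by brute force instead of being 
read off the table. What follows is only a minimal Python sketch, and the 
function name is made up for illustration: it encodes every Unicode code point 
as UTF-8 and reports which byte values never show up in any encoding.

```python
def bytes_never_in_utf8():
    """Return the sorted byte values that occur in no valid UTF-8 encoding."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points cannot be encoded as UTF-8
        used.update(chr(cp).encode("utf-8"))
    return sorted(set(range(256)) - used)

unused = bytes_never_in_utf8()
print(len(unused), [hex(b) for b in unused])
```

On Python 3 this reports 13 byte values: 0xC0, 0xC1, and 0xF5 through 0xFF, so 
decimal 254 (0xFE) and 255 (0xFF) are both among them.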

I guess I have a follow-up question--are those two bytes (or either one of 
them) also unused in all possible "code pages"?  

The problem is that I copy snippets of text from all kinds of sources into 
those text files (which are formatted like mbox files), so I might find one or 
both of those bytes in the file already.

I guess it's not a big deal, as I will either:

   * search the file (with a hex editor, I guess) to see if decimal 254 or 255 
is in use already--if only one or a few cases, I might replace it with 
something else before adding additional instances to serve as a temporary 
record separator (for use by msort), or

   * use one of the other utilities that I've since found which can apparently 
sort mbox files while keeping emails intact.  I have to read up on those (or 
it?) again as there were, iirc, also some limitations there that might not let 
me accomplish what I want: usually, sorting the emails by the title in the 
mbox "From " header (and usually not by the email From: header).
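
The first option (checking whether decimal 254 or 255 is already in use) does 
not strictly need a hex editor. A rough Python sketch, with a made-up function 
name and a hypothetical file name in the comment, counts the candidate bytes 
in any binary stream:

```python
import io

def count_candidate_bytes(stream, candidates=(0xFE, 0xFF), chunk_size=1 << 20):
    """Count how often each candidate byte value occurs in a binary stream."""
    counts = {b: 0 for b in candidates}
    while True:
        chunk = stream.read(chunk_size)  # read in chunks so large files fit
        if not chunk:
            break
        for b in candidates:
            counts[b] += chunk.count(bytes([b]))
    return counts

# Demo on in-memory data. For a real file, pass open("notes.mbox", "rb")
# instead ("notes.mbox" is a hypothetical name).
sample = "From one\ncaf\u00e9\n".encode("utf-8") + b"\xff" + "\u00fe".encode("latin-1")
demo = count_candidate_bytes(io.BytesIO(sample))
print(demo)
```

A count of zero for a byte means it is free to use as a temporary record 
separator in that file; a nonzero count means those occurrences would have to 
be replaced first.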

Thanks again!


On Monday, April 02, 2018 09:05:52 AM Henrique de Moraes Holschuh wrote:
> On Mon, 02 Apr 2018, rhkramer@xxxxxxxxx wrote:
> > A few weeks ago, I was looking for a byte that, in UTF-8, would be a
> > totally invalid byte (not an invalid sequence of bytes).  At the time, I
> > tried some googling, but it looked rather hopeless (maybe it was my
> > googling that was hopeless).
> 
> 0xff should work.  But any of those in RED on the wikipedia article
> about UTF-8 would do for Unicode text:
> 
> https://en.wikipedia.org/wiki/UTF-8