Re: Invalid UTF-8 byte? (was: Re: utf)
- Date: Mon, 2 Apr 2018 13:41:28 -0400
- From: rhkramer@xxxxxxxxx
- Subject: Re: Invalid UTF-8 byte? (was: Re: utf)
Thanks to tomas and Henrique!
The Wikipedia article is rather interesting; in a quick skim, I learned some
interesting things about UTF-8, especially the property of
self-synchronization.
I had trouble reading that large table--but if I simply take the red boxes at
face value, maybe there are 10 or so bytes that are not valid UTF-8. I'll
probably first consider the bytes that tomas also mentions, i.e., decimal 254
and 255.
I guess I have a followup question--are those two bytes (or either one of
them) also unused in all possible "code pages"?
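
One quick way to probe that question is sketched below. It assumes Python 3 and its bundled codecs, and the list of code pages is an arbitrary sample rather than anything exhaustive; the helper names are made up for illustration. Latin-1 and Windows-1252, for instance, both assign printable characters to bytes 254 and 255, so those bytes are not free in every code page.

```python
# Probe how bytes 254 (0xFE) and 255 (0xFF) behave in UTF-8 versus a few
# legacy code pages. CODE_PAGES is an arbitrary sample, not a complete list.
CANDIDATES = (0xFE, 0xFF)
CODE_PAGES = ("latin-1", "cp1252", "cp437", "koi8-r")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def decodes_in(byte_value: int, encoding: str) -> bool:
    """Return True if the single byte is assigned a character in encoding."""
    try:
        bytes([byte_value]).decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


for b in CANDIDATES:
    print(f"0x{b:02X}: valid on its own in UTF-8? {is_valid_utf8(bytes([b]))}")
    for cp in CODE_PAGES:
        print(f"  assigned a character in {cp}? {decodes_in(b, cp)}")
```

So text pasted from a source that was really Latin-1 or Windows-1252 could already contain either byte, which is why checking the existing files first still matters.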
The problem is that I copy snippets of text from all kinds of sources into
those text files (which are formatted like mbox files), so I might find one or
both of those bytes in the file already.
I guess it's not a big deal, as I will either:
* search the file (with a hex editor, I guess) to see if decimal 254 or 255
is in use already--if only one or a few cases, I might replace it with
something else before adding additional instances to serve as a temporary
record separator (for use by msort), or
* use one of the other utilities that I've since found which can apparently
sort mbox files while keeping emails intact (I have to read up on those (or
it?) again as there were, iirc, also some limitations there that might not let
me accomplish what I want (usually, sorting the emails by the title in the
mbox "From " header (and usually not by the email From: header))).
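
As an alternative to a hex editor for the first option, a small script could count the candidate bytes. This is only a sketch, assuming Python 3.8 or later; the function name and the file name in the usage note are hypothetical.

```python
# Count how often the candidate separator bytes already occur in a file.
# Reads in binary mode, so the file's text encoding does not matter.
SEPARATOR_CANDIDATES = (0xFE, 0xFF)  # decimal 254 and 255


def count_candidate_bytes(path, candidates=SEPARATOR_CANDIDATES,
                          chunk_size=1 << 20):
    """Return a dict mapping each candidate byte value to its count in path."""
    counts = {b: 0 for b in candidates}
    with open(path, "rb") as fh:
        # Read in chunks so that large mbox-style files need not fit in memory.
        while chunk := fh.read(chunk_size):
            for b in candidates:
                counts[b] += chunk.count(b)  # bytes.count accepts an int 0..255
    return counts
```

Calling `count_candidate_bytes("notes.mbox")` (a made-up file name) returns a dict keyed by byte value; a count of zero for a byte means it is not yet present and so could serve as the temporary record separator.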
On Monday, April 02, 2018 09:05:52 AM Henrique de Moraes Holschuh wrote:
> On Mon, 02 Apr 2018, rhkramer@xxxxxxxxx wrote:
> > A few weeks ago, I was looking for a byte that, in UTF-8, would be a
> > totally invalid byte (not an invalid sequence of bytes). At the time, I
> > tried some googling, but it looked rather hopeless (maybe it was my
> > googling that was hopeless).
> 0xff should work. But any of those in RED on the wikipedia article
> about UTF-8 would do for Unicode text: