Re: Invalid UTF-8 byte? (was: Re: utf)




On Tuesday, April 03, 2018 08:30:04 AM Greg Wooledge wrote:
> WHAT ARE YOU TRYING TO DO?

I am building (and have built several iterations of) a free-format database to 
work something like askSam.  It is a mashup of several applications, things 
like recoll, kmail, nail, and kate, and the data is stored in mbox-formatted 
files.

Each record is treated as an email.

(Earlier iterations of the mashup used nedit, some nedit macros, and a custom 
file format using a 4-byte sequence (0x80 0x81 0x82 0x83) as the record 
separator.)

I occasionally want to sort the data in order by what I call the record title, 
which is stored in quotation marks ("") after the "From" in the mbox header.
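
To make the sorting goal concrete, here is a minimal sketch in Python.  It 
assumes (this is a guess at the layout, not something confirmed above) that 
each record begins with an mbox "From " separator line and that the title is 
the first double-quoted string on that line.  The function names are 
hypothetical.

```python
import re

# Assumption: the record title is the first "..." string on the "From " line.
TITLE_RE = re.compile(rb'"([^"]*)"')


def split_records(data: bytes) -> list[bytes]:
    """Split raw mbox bytes into records, each starting at a 'From ' line."""
    records = []
    current = []
    for line in data.splitlines(keepends=True):
        # A new "From " line closes the record collected so far.
        if line.startswith(b"From ") and current:
            records.append(b"".join(current))
            current = []
        current.append(line)
    if current:
        records.append(b"".join(current))
    return records


def record_title(record: bytes) -> bytes:
    """Return the quoted title from the record's first line, or b'' if none."""
    first_line = record.split(b"\n", 1)[0]
    match = TITLE_RE.search(first_line)
    return match.group(1) if match else b""


def sort_by_title(data: bytes) -> bytes:
    """Return the mbox bytes with records ordered by title, ignoring case."""
    records = split_records(data)
    # Make sure every record ends with a newline so that reordered records
    # do not run together.
    records = [r if r.endswith(b"\n") else r + b"\n" for r in records]
    records.sort(key=lambda r: record_title(r).lower())
    return b"".join(records)
```

This works on bytes rather than text, so it does not care whether the file is 
valid UTF-8.  It relies on body lines that begin with "From " having already 
been escaped with ">", as the mbox format requires.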

I have seen some utilities that might sort the mbox files, but they may require 
that I somehow move the files into IMAP or perform other manipulations that 
might be inconvenient.

msort was mentioned on this list within the last few months and might do the 
job for me if I can insert a temporary one-byte record separator into the file 
(in addition to the mbox From line).
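
As a rough sketch of the temporary-separator idea (Python, with hypothetical 
function names), the separator byte can be inserted in front of every "From " 
line and stripped again after sorting.  How msort should be told about the 
separator is not covered here.

```python
def add_separators(data: bytes, sep: bytes) -> bytes:
    """Insert the one-byte separator before every mbox 'From ' line."""
    if len(sep) != 1:
        raise ValueError("separator must be exactly one byte")
    if sep in data:
        raise ValueError("separator byte already occurs in the data")
    out = []
    for line in data.splitlines(keepends=True):
        if line.startswith(b"From "):
            out.append(sep)
        out.append(line)
    return b"".join(out)


def remove_separators(data: bytes, sep: bytes) -> bytes:
    """Strip the temporary separator again once sorting is done."""
    return data.replace(sep, b"")
```

Because add_separators refuses a byte that already occurs in the data, removing 
the separator afterwards restores the original bytes exactly.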

Most likely this would be only a temporary addition, and I would need to do 
things like make sure that the chosen byte does not otherwise occur in the 
file.  It sounds like there are at least a few candidates.
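
A small Python sketch of that uniqueness check follows.  The bytes 0xC0, 0xC1, 
and 0xF5 through 0xFF never appear in valid UTF-8, so they are the natural 
first candidates; the function names are hypothetical, and a file that is not 
clean UTF-8 could still contain any of them, which is why the file is scanned 
as well.

```python
# Bytes that cannot occur anywhere in well-formed UTF-8.
NEVER_VALID_UTF8 = [0xC0, 0xC1] + list(range(0xF5, 0x100))


def unused_bytes(data: bytes) -> list[int]:
    """Return every byte value (0-255) that does not occur in the data."""
    present = set(data)
    return [b for b in range(256) if b not in present]


def separator_candidates(data: bytes) -> list[int]:
    """Return the never-valid-UTF-8 bytes that are also absent from the data."""
    present = set(data)
    return [b for b in NEVER_VALID_UTF8 if b not in present]
```

Reading the whole mbox file in binary mode and passing its contents to 
separator_candidates gives the list of bytes that are safe to use for this 
particular file.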


> 
> There was a glimpse a few messages back that looked like you were trying
> to parse information out of an mbox-format mail folder.  (I.e. a flat
> file that has a concatenated series of mbox-format mail messages in it,
> with all the silliness and problems inherent in this format, like having
> to prefix body lines with ">" if they begin with the word "From".)
> 
> "I want to write a shell script to parse an mbox folder..." is enough
> to send most people running away screaming.  What other horrors are we
> in store for next?
> 
> Of course, that might be a red herring, since you didn't actually tell
> us what your goal is, or what your inputs are, and we're having to
> guess at the moment based on tiny hints and information leaks.