
Re: searching non plain text files

There is no hammer that will crack them all. The best solution is to be able to recognize what type of file it is and do whatever processing is appropriate for that file. Many file formats are difficult to deal with. The original Microsoft Word file format was never totally documented, I think, because it was too complicated. You can probably find utilities for each file format, but even then you must deal with each type separately.

Also note that many utilities were not designed for use in an online environment. Since the subject is PHP I assume you need to do this online.

There are utilities that will search files for ASCII text, but even those are less likely to work with Unicode text. If the text is just ASCII, you could search for runs of bytes in the range that is normally printed (32 to 126, decimal). It is a very inaccurate algorithm. Note that a PDF file can be written with no binary data at all; in other words, no bytes outside the printable range apart from line endings, with any binary data stored in an ASCII encoding. In practice, though, most PDF files contain compressed binary streams, so a plain byte scan usually will not find their text.
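
As a rough illustration of the byte-range idea, here is a minimal sketch in Python (the same logic should port to PHP with preg_match_all). The function names printable_runs and contains_text, and the minimum run length of 4, are my own choices for the example, not anything standard. It treats bytes 0x20 to 0x7E plus tab as printable, which is an assumption that only suits ASCII text.

```python
import re
from typing import List


def printable_runs(data: bytes, min_len: int = 4) -> List[bytes]:
    """Return every run of at least min_len printable ASCII bytes.

    "Printable" here means 0x20-0x7E plus tab; that is an assumption
    of this sketch and it ignores all non-ASCII encodings.
    """
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    return pattern.findall(data)


def contains_text(data: bytes, needle: str, min_len: int = 4) -> bool:
    """Report whether needle occurs inside any printable run of data."""
    target = needle.encode("ascii")
    return any(target in run for run in printable_runs(data, min_len))


if __name__ == "__main__":
    sample = b"\x00\x01Hello world\x00\xff\xfeab\x00searching files\x02"
    print(printable_runs(sample))
    print(contains_text(sample, "world"))
    # UTF-16 text interleaves null bytes, so no run reaches min_len.
    print(printable_runs("Hello".encode("utf-16-le")))
```

The last line shows why this breaks down for Unicode: the same word stored as UTF-16 produces no runs at all.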

You need to do some studying. For example, the Portable Executable format used for most Windows executables has a "Magic Number" (the characters "MZ") at the beginning, and the 4-byte field at offset 0x3C holds a pointer to the PE signature ("PE\0\0", the letters "P" and "E" followed by two null bytes). For Unix/Linux systems see COFF and ELF; an ELF file begins with the byte 0x7F followed by the letters "ELF". That is just for executables; you need to study the many other formats too.
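
To make the magic-number check concrete, here is a small sketch in Python that looks only at the leading bytes of a file. The function name sniff_executable and its return labels are made up for the example, and it covers only the PE and ELF signatures; COFF object files and every other format would need their own checks.

```python
import struct
from typing import Optional


def sniff_executable(data: bytes) -> Optional[str]:
    """Guess an executable type from the file's leading bytes.

    Returns "ELF", "PE", "MZ" (DOS-style header without a PE
    signature), or None when neither magic number is present.
    """
    if data[:4] == b"\x7fELF":
        return "ELF"
    if data[:2] == b"MZ":
        # The little-endian 32-bit value at offset 0x3C is the file
        # offset of the PE signature.
        if len(data) >= 0x40:
            (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
            if data[pe_offset:pe_offset + 4] == b"PE\x00\x00":
                return "PE"
        return "MZ"
    return None


def sniff_file(path: str, head_size: int = 4096) -> Optional[str]:
    """Read the first head_size bytes of path and sniff them.

    head_size is an arbitrary choice; a PE signature located beyond
    it would be reported as plain "MZ".
    """
    with open(path, "rb") as handle:
        return sniff_executable(handle.read(head_size))
```

Because it never looks at the file name, this also works for files that have no suffix or a misleading one.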

If you can be more specific, perhaps someone can provide a more specific answer. For example, if there are requirements that limit what needs to be searched for, a narrower solution might be possible.

Friday, December 14, 2018 8:19 PM

Can anyone point me to instruction/advice about
opening and reading files that are not plain text:

word processing docs, pdf, ps, image files,
even compiled code.

I am writing a search function to search file systems
and don't know a lot about the formatting of non plain
text files.

The immediate concern is line breaks in word
processing docs, pdf and ps files.

Then detecting compiled code files so I can
leave them alone. This type of file might not
have a suffix to consider.

Then the various image files that might be

Even suffixes aren't a guarantee of the content.


Jeff K.