File Nature Determination: How the Preprocessor Works
HandyFile Find and Replace can process files in many
different formats: ANSI text files, Unicode files, UTF-8 files, and Microsoft Word and
Microsoft Excel documents. It can also find and display image files in
the most common formats (GIF, JPEG, PNG, BMP, TIFF, Windows metafiles and icons).
Such flexibility is not possible without preprocessing the
files first. Every file that the TW software encounters while traversing the
specified folder(s) is first classified, and then a set of statistical and deterministic
algorithms is applied to it.
The sequence below illustrates the preprocessing steps that the TW
undertakes before starting the actual file search and/or replace operation.
- The finder encounters a new file, say, one named file.ext.
- A set of tests is applied to the file to determine whether it can be processed.
- If the file matches at least one of the supplied file search masks, the system continues processing the file.
- If the file matches at least one of the supplied file exclude masks, the system skips the file.
- The system checks the file timestamp. If a date and time restriction is set on the Date Tab and the current file does not match it, the programme skips the file.
- The system checks the file's read-only attribute. If the Properties Tab settings do not allow processing read-only files and the current file is read-only, the system skips the file.
- If a file size restriction is set on the Properties Tab, the system checks the current file size against it. If the current file does not match, the system skips the file.
- If the user has chosen to create file back-ups on the Properties Tab and an actual replace operation is being performed (as opposed to a simple search), the system checks the current file extension. If it is the same as the extension specified for back-up files (on the Properties Tab), the system skips the file.
- If the current file resides in a folder whose name is the same as the one specified for back-up folders on the Storage Folders Tab, and an actual replace (not just a search) operation is being performed, the system skips the file.
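The tests above can be sketched as a single predicate. This is a minimal illustration and not the program's actual code: the names Candidate, Filters and should_process are hypothetical, mask matching is assumed to behave like shell-style wildcards, and the back-up folder test is assumed to look only at the immediate parent folder.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from pathlib import PurePath
from typing import Optional, Sequence, Tuple


@dataclass
class Candidate:
    """Hypothetical record describing one file found during traversal."""
    path: str
    size: int
    mtime: float
    read_only: bool


@dataclass
class Filters:
    """Hypothetical container for the settings named in the text."""
    include_masks: Sequence[str] = ("*",)
    exclude_masks: Sequence[str] = ()
    mtime_range: Optional[Tuple[float, float]] = None  # Date Tab
    allow_read_only: bool = True                       # Properties Tab
    size_range: Optional[Tuple[int, int]] = None       # Properties Tab
    replacing: bool = False                            # Replace, not Search
    backup_ext: Optional[str] = None                   # e.g. ".bak"
    backup_folder: Optional[str] = None                # Storage Folders Tab


def should_process(f: Candidate, flt: Filters) -> bool:
    p = PurePath(f.path)
    # Search masks: at least one must match.
    if not any(fnmatch(p.name, m) for m in flt.include_masks):
        return False
    # Exclude masks: any match skips the file.
    if any(fnmatch(p.name, m) for m in flt.exclude_masks):
        return False
    # Date and time restriction.
    if flt.mtime_range is not None:
        if not (flt.mtime_range[0] <= f.mtime <= flt.mtime_range[1]):
            return False
    # Read-only attribute.
    if f.read_only and not flt.allow_read_only:
        return False
    # File size restriction.
    if flt.size_range is not None:
        if not (flt.size_range[0] <= f.size <= flt.size_range[1]):
            return False
    # Back-up checks apply only to an actual replace operation.
    if flt.replacing:
        if flt.backup_ext and p.suffix.lower() == flt.backup_ext.lower():
            return False
        # Assumption: only the immediate parent folder name is compared.
        if flt.backup_folder and p.parent.name.lower() == flt.backup_folder.lower():
            return False
    return True
```

Each test returns early, so a file is rejected by the first restriction it fails, which mirrors the "skips the file" wording of the list.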
- If all of the above tests are passed and the option Replace text in file names, not in file content (rename files)
is not set, the actual preprocessing is performed.
- The system checks the file extension.
- If it is .doc, .dot or .rtf, the file is a Microsoft
Word document.
- If the extension is .xls, the file is a Microsoft
Excel document.
- If the extension is .bmp, .gif, .png, .ico, .jpg, .jpeg, .tif
or .wmf, the file is an image.
- If the extension belongs to a file type that is binary by
definition (.exe, .dll, .vxd, .obj
etc.), the file is marked as binary.
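The extension check can be pictured as a simple table lookup. The sketch below is an illustration only: the function name and the returned labels are hypothetical, the comparison is assumed to be case-insensitive, and the binary set contains just the four extensions named above (the text's "etc." is left open).

```python
import os

WORD_EXTS = {".doc", ".dot", ".rtf"}
EXCEL_EXTS = {".xls"}
IMAGE_EXTS = {".bmp", ".gif", ".png", ".ico", ".jpg", ".jpeg", ".tif", ".wmf"}
# Only the extensions listed in the text; the full list is not given there.
BINARY_EXTS = {".exe", ".dll", ".vxd", ".obj"}


def nature_from_extension(filename):
    """Return a file nature label, or None if the extension decides nothing."""
    # Assumption: extensions are compared without regard to case.
    ext = os.path.splitext(filename)[1].lower()
    if ext in WORD_EXTS:
        return "word"
    if ext in EXCEL_EXTS:
        return "excel"
    if ext in IMAGE_EXTS:
        return "image"
    if ext in BINARY_EXTS:
        return "binary"
    return None
```

A result of None means the extension alone is not conclusive and the content-based checks that follow are needed.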
- The system assumes a priori that a file may be non-ANSI. This is why the
system checks the first few bytes of the file (in fact, it checks for the BOM, the Unicode
byte order mark):
- If the first three bytes are EF, BB
and BF, the file is a UTF-8 file.
- If the first two bytes are FE and FF,
the file is a Unicode file.
- If the first two bytes identify the file as some other, more exotic
Unicode format, the file nature is set to binary.
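The BOM check can be sketched as a comparison of the file's leading bytes against known signatures. The byte values for UTF-8 and Unicode are taken as written in the list above. Which signatures count as "exotic" is not stated, so the UTF-32 marks used here are placeholders chosen purely for illustration, and the function name is hypothetical.

```python
UTF8_BOM = b"\xef\xbb\xbf"   # EF BB BF, as listed in the text
UNICODE_BOM = b"\xfe\xff"    # FE FF, as listed in the text

# Assumption: the text does not say which formats are "exotic".
# These two UTF-32 byte order marks are illustrative placeholders.
EXOTIC_BOMS = (b"\x00\x00\xfe\xff", b"\xff\xfe\x00\x00")


def nature_from_bom(head: bytes):
    """Classify a file by its first bytes; None means no BOM was recognised."""
    if head.startswith(UTF8_BOM):
        return "utf-8"
    if head.startswith(EXOTIC_BOMS):
        return "binary"
    if head.startswith(UNICODE_BOM):
        return "unicode"
    return None
```

Reading only the first four bytes of a file is enough for this sketch, for example `nature_from_bom(open(path, "rb").read(4))`.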
- If the option Analyse input and exclude binary and image files
is not set (Options Dialog - Processing),
the system handles the current file as an ANSI file. The system
determines the real file nature only if a user attempts to view the
file.
- Otherwise (if the option Analyse input and exclude binary and image files
is set), the programme applies statistical analysis to
the file content. Any byte below 0x20 other than CR, LF and the tab
is counted as a byte that may indicate a binary file.
If the percentage of suspect
bytes is above the value specified in the Options Dialog - Processing,
the file is marked as binary.
On the other hand, if the
deterministic analysis of the byte stream indicates that the found
sequences (lead bytes followed by continuation bytes) are characteristic of a UTF-8 file, the current file nature is set to UTF-8
even if the file does not have a BOM at the start.
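The content analysis described in this step can be sketched as follows. This is a hedged illustration, not the program's implementation: the function names are hypothetical, the threshold is a parameter because its default value is not given here, Python's strict UTF-8 decoder stands in for the deterministic analysis, and the binary test is assumed to take precedence over the UTF-8 test.

```python
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}  # tab, LF, CR


def suspect_percentage(data: bytes) -> float:
    """Percentage of bytes below 0x20 that are not tab, LF or CR."""
    if not data:
        return 0.0
    suspects = sum(1 for b in data if b < 0x20 and b not in ALLOWED_CONTROLS)
    return 100.0 * suspects / len(data)


def looks_like_utf8(data: bytes) -> bool:
    """True if data has multi-byte sequences and all of them are valid UTF-8."""
    if all(b < 0x80 for b in data):
        return False  # pure 7-bit text carries no UTF-8 evidence
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True


def nature_from_content(data: bytes, threshold_percent: float) -> str:
    # Assumption: the binary test is evaluated before the UTF-8 test.
    if suspect_percentage(data) > threshold_percent:
        return "binary"
    if looks_like_utf8(data):
        return "utf-8"
    return "ansi"
```

For example, a file containing "café" encoded as UTF-8 without a BOM is reported as UTF-8, while the same word with a single 0xE9 byte fails strict decoding and falls back to ANSI.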
- If the option Replace text in file names, not in file content (rename files)
is set, the TW tries to rename the found files. If the user has
clicked Replace, the actual renaming takes place. If the user has
clicked Search, the TW suggests and displays new file names.
If this option is unchecked, HandyFile Find and Replace
processes (searches or replaces) the file contents.