Text Workbench Online Help

File Nature Determination: How the Preprocessor Works

HandyFile Find and Replace can process files in many different formats: ANSI text files, Unicode files, UTF-8 files, and Microsoft Word and Microsoft Excel documents. It can also find and display image files in the most common formats (GIF, JPEG, PNG, BMP, TIFF, Windows metafiles and icons).

Such flexibility requires that files be preprocessed first. All files that the TW software encounters while traversing the specified folder(s) are first sorted by type, and then a set of statistical and deterministic algorithms is applied to them.

The sequence below illustrates the preprocessing steps that the TW undertakes before starting the actual file search and/or replace operation.

  1. The finder encounters a new file, for example one named file.ext.

  2. A set of tests is applied to the file to determine whether it can be processed.
    1. If the file matches at least one of the supplied file search masks, the system continues processing the file.
    2. If the file matches at least one of the supplied file exclude masks, the system skips the file.
    3. The system checks the file timestamp. If any date and time restriction is set on the Date Tab, and the current file does not match the date and time criteria, the programme skips the file.
    4. The file read-only attribute is verified. If the Properties Tab settings do not allow processing read-only files, and the current file is read-only, the system skips the file.
    5. If the file size restriction is set on the Properties Tab, the system matches the current file size against the file size restriction criteria. If the current file does not match, the system skips the file.
    6. If the user has chosen to create file back-ups on the Properties Tab, and an actual replace operation is being performed (as opposed to a simple search), the system checks the current file extension. If it is the same as the extension specified for back-up files (on the Properties Tab), the system skips the file.
    7. If the current file resides in a folder whose name is the same as the one specified for back-up folders on the Storage Folders Tab, and an actual replace operation (not just a search) is being performed, the system skips the file.
  3. If all of the above tests are passed, and the option Replace text in file names, not in file content (rename files) is not set, the actual preprocessing is performed.
    1. The system checks the file extension.
      • If it is .doc, .dot or .rtf, the file is a Microsoft Word document.
      • If the extension is .xls, the file is a Microsoft Excel document.
      • If the extension is .bmp, .gif, .png, .ico, .jpg, .jpeg, .tif or .wmf, the file is an image.
      • If the extension belongs to a file type that is binary by definition (.exe, .dll, .vxd, .obj, etc.), the file is marked as binary.
    2. The system assumes that any file may be non-ANSI. It therefore checks the first few bytes of the file for a BOM (Unicode byte order mark):
      • If the first three bytes are EF, BB and BF, the file is a UTF-8 file.
      • If the first two bytes are FE and FF, the file is a Unicode file.
      • If the first two bytes identify the file as some other, less common Unicode format, the file nature is set to binary.
    3. If the option Analyse input and exclude binary and image files is not set (Options Dialog - Processing), the system handles the current file as an ANSI file. The system determines the real file nature only if a user attempts to view the file.
    4. Otherwise (if the option Analyse input and exclude binary and image files is set), the programme applies statistical analysis to the file content. Any byte below 0x20 other than CR, LF and tab is counted as a byte that may indicate a binary file. If the percentage of suspect bytes is above the value specified in the Options Dialog - Processing, the file is marked as binary.

      On the other hand, if the deterministic analysis of the byte stream indicates that the sequences found (lead bits followed by encoded symbols) are characteristic of a UTF-8 file, the current file nature is set to UTF-8 even if the file does not have a BOM at the start.
  4. If the option Replace text in file names, not in file content (rename files) is set, the TW tries to rename the found files. If the user has clicked Replace, the actual renaming takes place. If the user has clicked Search, the TW suggests and displays new file names.

    If this option is unchecked, the HandyFile Find and Replace processes (searches or replaces) the file contents.
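
The mask tests of step 2 and the nature determination of step 3 can be illustrated with a short Python sketch. This is not the TW implementation, only one plausible reading of the steps above. The function names, the case-insensitive mask matching, the default threshold of 5 percent, and the order in which the statistical and UTF-8 checks run are all assumptions made for illustration. BOMs of other Unicode formats (which the text marks as binary) are not modelled, because the text does not list them.

```python
import fnmatch
import os

# Extension tables from step 3.1. The binary list in the text ends in "etc.",
# so BINARY_EXTS holds only the four extensions that the text names.
WORD_EXTS = {".doc", ".dot", ".rtf"}
EXCEL_EXTS = {".xls"}
IMAGE_EXTS = {".bmp", ".gif", ".png", ".ico", ".jpg", ".jpeg", ".tif", ".wmf"}
BINARY_EXTS = {".exe", ".dll", ".vxd", ".obj"}

ALLOWED_CONTROL = {0x09, 0x0A, 0x0D}  # tab, LF, CR


def matches_masks(name, include_masks, exclude_masks):
    """Steps 2.1-2.2: keep a file that matches an include mask and no exclude mask.

    Case-insensitive matching is an assumption, not stated in the text.
    """
    lowered = name.lower()
    included = any(fnmatch.fnmatchcase(lowered, m.lower()) for m in include_masks)
    excluded = any(fnmatch.fnmatchcase(lowered, m.lower()) for m in exclude_masks)
    return included and not excluded


def suspect_percentage(data):
    """Step 3.4: percentage of bytes below 0x20 other than tab, LF and CR."""
    if not data:
        return 0.0
    suspects = sum(1 for b in data if b < 0x20 and b not in ALLOWED_CONTROL)
    return 100.0 * suspects / len(data)


def looks_like_utf8(data):
    """Deterministic check: valid UTF-8 with at least one non-ASCII byte."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return any(b >= 0x80 for b in data)


def detect_nature(name, data, analyse=True, threshold_percent=5.0):
    """Return 'word', 'excel', 'image', 'binary', 'utf-8', 'unicode' or 'ansi'.

    threshold_percent is a hypothetical stand-in for the value that the
    text says is configured in the Options Dialog.
    """
    ext = os.path.splitext(name)[1].lower()
    if ext in WORD_EXTS:
        return "word"
    if ext in EXCEL_EXTS:
        return "excel"
    if ext in IMAGE_EXTS:
        return "image"
    if ext in BINARY_EXTS:
        return "binary"
    # Step 3.2: BOM checks, using the byte values exactly as the text gives them.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xfe\xff"):
        return "unicode"
    # Step 3.3: without analysis the file is handled as ANSI.
    if not analyse:
        return "ansi"
    # Step 3.4: statistical test first, then the UTF-8 sequence test
    # (this ordering is an assumption).
    if suspect_percentage(data) > threshold_percent:
        return "binary"
    if looks_like_utf8(data):
        return "utf-8"
    return "ansi"
```

For example, detect_nature("notes.txt", data) with the analysis option on classifies a BOM-less file as UTF-8 only if its bytes decode as UTF-8 and contain at least one multi-byte sequence; pure ASCII content stays ANSI.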