Document Classification Engine 11 (Code Sample)

Language:
EN
Product-Line:
FineReader Engine
Version:
11
Platform:
Windows
Type:
Knowledge Base & Support
KB-Type:
Code Samples Collection
KB-Topic:
General
Category:
Document Classification

This sample demonstrates how ABBYY FineReader Engine can be used for document classification. You can use it to classify a batch of your own images. Pretrained classification models are available for the following languages: Chinese (PRC), English, French, German, Italian, Japanese, Korean, Portuguese (Brazil), Russian, Spanish.

Description

The sample classifies the selected documents using either a pretrained classifier or a user-trained classifier. Classified documents are displayed in groups by document type.

To see how it works:

  • If you want to classify documents using the pretrained classifier:
    1. Select the folder with the images to classify.
    2. Select the recognition language of your documents. You can view the list of available classes for that language in the Classified documents window.
    3. Run classification by clicking Run classification.
  • If you want to train FineReader Engine to classify your types of documents:
    1. Create a database that contains images of all the document types you want to classify. Group the images in folders named after the document types you want to train, and place all these folders in one parent folder. See details on database creation in Classifying Documents in the Developer's Help.
    2. Select the recognition language of your documents.
    3. Click Train.
    4. In the dialog box that opens, specify the path to the folder that contains your database.
    5. Check the list of classes and click Start training.
    6. After classification training is finished, you can classify images using your classifier. Select User-trained classifier and then click Run classification.
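
The folder layout from step 1 of the training procedure (one subfolder per document type, all inside one parent folder) can be checked before training with a small script. This is a minimal sketch in plain Python and does not call FineReader Engine. The function name collect_training_images and the list of image extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; adjust it to the formats you actually use.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".jpg", ".jpeg", ".png", ".bmp", ".pdf"}


def collect_training_images(database_root):
    """Map each class name (subfolder name) to a sorted list of its image paths.

    Subfolders without any image files are skipped, so the returned mapping
    lists only the classes that could actually be trained.
    """
    root = Path(database_root)
    classes = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        images = sorted(
            f for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
        if images:
            classes[class_dir.name] = images
    return classes
```

The keys of the returned dictionary should match the list of classes that the sample shows before you click Start training.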

Technical Notes

Comment to #1

  • If you run classification with FineReader Engine 11, the classification results (the document classes) are returned as an array.
  • There is also a suspicious flag (you might know it from the OCR results), which is either true or false. When it is set, the internal algorithms are not sure about the result.
  • You also get access to the most likely classes and their confidence values, so you can make your own decision based on additional metadata.
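
The points above suggest that the calling code makes the final decision from the suspicious flag and the confidence values. The following is a minimal sketch of such a decision rule in plain Python. The ClassCandidate and ClassificationResult types are hypothetical stand-ins for the values read from the engine, and the 0-100 confidence scale and both thresholds are assumptions, not documented FineReader Engine values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClassCandidate:
    name: str
    confidence: float  # assumed scale: 0 (no confidence) to 100 (certain)


@dataclass
class ClassificationResult:
    # Hypothetical container for the values returned by the classifier.
    candidates: List[ClassCandidate]
    is_suspicious: bool


def decide(result: ClassificationResult,
           min_confidence: float = 70,
           min_margin: float = 15) -> Optional[str]:
    """Return the accepted class name, or None to send the page to review."""
    if not result.candidates:
        return None
    ranked = sorted(result.candidates, key=lambda c: c.confidence, reverse=True)
    best = ranked[0]
    # Reject when the engine itself is unsure or the best score is too low.
    if result.is_suspicious or best.confidence < min_confidence:
        return None
    # Reject when the runner-up is too close to the best candidate.
    if len(ranked) > 1 and best.confidence - ranked[1].confidence < min_margin:
        return None
    return best.name
```

Returning None instead of a best guess keeps uncertain pages out of automated processing; the thresholds should be tuned on your own documents.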

Comment to #2

  • It is also possible to combine image-based and content-based classification in a multi-tier approach. If image classification does not deliver a reliable result, you can still apply content-based classification as a next step to improve confidence in the result.
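
The multi-tier approach can be sketched as a loop over classifiers that stops at the first reliable answer. This is an illustration in plain Python, not FineReader Engine code. Each tier is a hypothetical callable that takes a document and returns a class name plus a flag saying whether the result is reliable; how the image and content classifiers are actually invoked is left out.

```python
from typing import Callable, List, Optional, Tuple

# A tier is a (name, function) pair. The function is a hypothetical wrapper
# around one classification mode and returns (class_name, is_reliable).
Tier = Tuple[str, Callable[[str], Tuple[Optional[str], bool]]]


def classify_in_tiers(document: str,
                      tiers: List[Tier]) -> Tuple[Optional[str], bool, Optional[str]]:
    """Run the tiers in order and return (class_name, is_reliable, tier_name).

    The first reliable result wins. If no tier is reliable, the most recent
    unreliable guess is returned with is_reliable set to False.
    """
    fallback = (None, False, None)
    for tier_name, classify in tiers:
        class_name, reliable = classify(document)
        if class_name is None:
            continue  # this tier produced no answer at all
        if reliable:
            return (class_name, True, tier_name)
        fallback = (class_name, False, tier_name)
    return fallback
```

Placing the image tier first means the content tier runs only when the first tier is unsure.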

Comment to #3

  • As a developer, you can adjust the training database (Etalon file) with your own rules. This extra flexibility was added in Release 2 of FineReader Engine 11.
  • How does it work?
    • A feature is defined by its name. Any string can serve as a name, as long as different features have different names. You can add several features for one image by separating their names with semicolons.
    • An example of a case in which custom features improve classification:
      1. Some of your documents are not reliably distinguished by the standard classification methods. However, you know that the documents of one class always have the company logo in the upper left corner, while the documents of the other class do not have it.
      2. You implement a function which uses the results of layout analysis (e.g. obtained by the call to the Analyze method of the FRPage object) to look for a picture block in the upper left corner. It returns one string (e.g. “haslogo”) if a picture block was found and another string (e.g. “nologo”) if it was not found. These strings are the names of your custom features.
      3. During classification database training, you apply your function to every image that is added to the database and use its return value as the name of the feature that must be added, calling the AddFeaturesForPage method with the string name of the feature as the Features parameter.
      4. During classification, apply your function to every image that must be classified and use its return value as the name of the feature that distinguishes the image, calling the ClassifyEx method of the FRPage object with the string name of the feature as the Features parameter.
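
The feature function from step 2 and the semicolon-separated feature string can be sketched as follows. This is plain Python for illustration only. The Block type is a hypothetical stand-in for the layout blocks obtained from layout analysis, the coordinate origin is assumed to be the upper left corner of the page, and the size of the "corner" region (a quarter of the page in each direction) is an assumption. The calls that pass the resulting string to AddFeaturesForPage and ClassifyEx are not shown.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Block:
    # Hypothetical stand-in for one layout block; coordinates in pixels,
    # with (0, 0) assumed to be the upper left corner of the page.
    kind: str  # for example "picture" or "text"
    left: int
    top: int
    right: int
    bottom: int


def logo_feature(blocks: Iterable[Block], page_width: int, page_height: int,
                 corner_fraction: float = 0.25) -> str:
    """Return "haslogo" if a picture block starts in the upper left corner."""
    corner_right = page_width * corner_fraction
    corner_bottom = page_height * corner_fraction
    for block in blocks:
        if (block.kind == "picture"
                and block.left < corner_right
                and block.top < corner_bottom):
            return "haslogo"
    return "nologo"


def join_features(names: Iterable[str]) -> str:
    """Build the semicolon-separated feature string, dropping duplicates."""
    unique: List[str] = []
    for name in names:
        if ";" in name:
            raise ValueError("a feature name must not contain ';': " + name)
        if name and name not in unique:
            unique.append(name)
    return ";".join(unique)
```

The same function must be applied during training and during classification, so that an image receives the same feature name in both phases.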
