Classification Comparison: Smart Classifier, FineReader Engine & FlexiCapture (Engine)
General Notes on Document Classification
- Information and knowledge management is crucial for every company and organisation. For a business to succeed, the relevant documents must reach the proper department and the right process. At the same time, the information itself, as well as structures and processes, varies widely. Classification systems therefore have to be tuned to specific needs and document types.
- It is important to manage and automate the classification process.
Only then is it possible to get consistent classification results for large quantities of documents, such as those held in SharePoint, document management or search systems.
- This article compares different classification implementations that are available in ABBYY's products and toolkits.
Learn about the different flavours, technologies and use cases for detecting the category of a document. Since there are many different document types and formats, there are also multiple ways to classify them.
... for different Document Types
When talking about classification of documents, we should remember that the term “document” is not precise and is used in very different ways, for example:
- an image or scanned page(s)
- a simple text file
- an office document e.g. Word file or a PowerPoint presentation
- a PDF with vector graphics, images and possibly properly encoded text (in a usable form)
- an email with or without attachment
- a survey form
- an HTML page
- a JSON document
- a mixture of different “components” that belong to one logical document, e.g. an application for new insurance covering several members of your family
Not only the file format but also the content inside documents differs:
- highly structured = a strict order of the text or a fixed location of information; forms are a typical example.
- semi-structured = varying layout, but there is a high likelihood that certain information is available. The data has a particular, consistent structure. Rules can help to identify the type of a document.
- unstructured = content is written in “normal” natural language. The content follows the grammatical rules, but there is no fixed structure in the text/sentences that can easily be used to classify a document.
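The remark that rules can help to identify semi-structured documents can be illustrated with a minimal sketch. Everything in it is hypothetical: the class names, the keyword patterns, the `min_hits` threshold and the function `classify_by_rules` are made up for illustration and are not part of any ABBYY product or API.

```python
import re

# Hypothetical rules: each class lists patterns that are likely to appear
# in a semi-structured document of that type (illustrative only).
RULES = {
    "invoice": [r"\binvoice\s+(no|number)\b", r"\btotal\s+due\b", r"\bvat\b"],
    "receipt": [r"\breceipt\b", r"\bcash\b", r"\bchange\b"],
}


def classify_by_rules(text, rules=RULES, min_hits=2):
    """Return the class with the most matching patterns, or None.

    A class is only accepted if at least `min_hits` of its patterns match,
    so free-form (unstructured) text falls through to None.
    """
    lowered = text.lower()
    scores = {
        label: sum(1 for pattern in patterns if re.search(pattern, lowered))
        for label, patterns in rules.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None
```

Text that mentions an invoice number, a total due and VAT would be labelled `invoice`, while a free-form letter matches no rule set and returns `None`. This is why pure rule sets suit structured and semi-structured content better than natural-language text.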
For humans, it is easy to differentiate between document types. You can therefore expect consistent classification results when only one person has to choose between a few categories for a small number of documents. In this simple scenario it is easy to make a decision.
It gets tricky when there are many classes, several people involved and a large number of varying documents. The result is then very likely to be inconsistent and therefore not reliable enough for business processes.
With the right approach and technology, computers can be used to classify document repositories. Below is a short overview and comparison of the classification technologies available in
- ABBYY Smart Classifier
- ABBYY FineReader Engine and
- ABBYY FlexiCapture (Engine) in combination with FlexiLayout Studio.
As stated before, there are different document and content types, and ABBYY offers different approaches to identify the type of document/content to be able to assign the right document class.
| | Smart Classifier | FineReader Engine | FlexiCapture (Engine) with FlexiLayout Studio |
|---|---|---|---|
| Product Type | Scalable, server-based system with REST API for integration | Toolkit with libraries/DLLs for integration into other applications | FlexiCapture Engine: DLLs for integration into custom-developed applications and systems; FlexiCapture: standalone/distributed solution for classification and data extraction; Both: external development of the classification and extraction logic (FlexiLayout Studio) |
| Intended usage | Classification for broad information management and workflow scenarios; focus on unstructured text and documents; documents can be in a repository or in an incoming process | Classification is integrated into and used together with the full-text OCR engine; the main goal is to understand what kind of image/page is currently being processed, e.g. a receipt, a business card, an invoice, a letter or fax or a screenshot… | FlexiCapture (Engine) can extract information that is used in a business process. The technology is capable of processing complex multi-page documents. To do so, it is important that the document structure is known and can be set up; classification is the initial step for determining which capture logic has to be applied |
| Market Availability | New product, launched in 2016 | Added as an SDK feature add-on with Version 11 (2013) | Forms Classification and FlexiLayout Studio in one product have been available in FlexiCapture since 2007 1); in FC Engine since 2009; extended in every new technology cycle since then |
| Classification Technologies used | Content-based classification with linguistic features; semantic classification 2) | Image and content classification | Image, layout, rules and content classification |
| Supported File Formats | Plain text, Office formats, PDFs, images | Images, PDFs | Images, PDFs |
| Used for Classification | Plain text extracted from the processed documents; linguistic analysis; semantic features (EN, RU) | Visual: pixel/density distribution; headline texts (large fonts); OCRed full text and statistical analysis; no linguistic features used | Visual features; custom-created rules and decision trees |
| Training Interface | Web UI for model creation and tuning; REST API for coded setup and re-training; a customized UI can be developed | UI based on custom integration of the SDK in your application | FlexiLayout Studio is used to develop document descriptions, create rules and train classes |
| Classification Logic Creation | Machine learning on training sets; quality check with control sets | Machine learning on training sets; custom rules based on your own code | Custom document definitions and custom rules; training of certain classes is possible (ML) |
| Classification Range | One text = one document; a document can belong to one class only or to multiple classes | One image = one document; the document should be assigned to only one class so that it can be processed accordingly | Multiple images can be separated or merged into one or more documents; set of pages = one class per document |
| Classification Quality Evaluation | A control set for each document class is used to calculate precision and recall | No built-in quality evaluation; the calculation has to be implemented in your own code | Classification visualization and processing log |
| Related Articles | Smart Classifier Product Overview; Model Editor (WebUI) | FineReader Engine Product Overview; FineReader Engine Classification Code Sample | FlexiCapture Engine Product Overview; FlexiLayout Studio Overview |
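The quality evaluation row mentions precision and recall calculated from a control set, and notes that with FineReader Engine this calculation has to be implemented in your own code. The following is a minimal sketch of such a calculation using the standard definitions of precision and recall. The function name `precision_recall` and the sample labels are hypothetical; the code does not call any ABBYY API and only assumes that you already have the expected and the predicted class label for each control document.

```python
from collections import Counter


def precision_recall(expected, predicted):
    """Return {label: (precision, recall)} for two parallel label lists.

    expected  -- the correct class of each control-set document
    predicted -- the class assigned by the classifier, in the same order
    """
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")

    expected_count = Counter(expected)    # documents that truly belong to a class
    predicted_count = Counter(predicted)  # documents the classifier put in a class
    true_positives = Counter(
        exp for exp, pred in zip(expected, predicted) if exp == pred
    )

    result = {}
    for label in sorted(set(expected) | set(predicted)):
        tp = true_positives[label]
        # Precision: share of documents assigned to the class that belong there.
        precision = tp / predicted_count[label] if predicted_count[label] else 0.0
        # Recall: share of documents of the class that were found.
        recall = tp / expected_count[label] if expected_count[label] else 0.0
        result[label] = (precision, recall)
    return result


# Hypothetical control set with four documents.
expected = ["invoice", "invoice", "receipt", "letter"]
predicted = ["invoice", "receipt", "receipt", "letter"]
scores = precision_recall(expected, predicted)
```

In this made-up control set one invoice is mistaken for a receipt, so `invoice` keeps full precision but loses recall, while `receipt` keeps full recall but loses precision.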
All three ABBYY products contain technologies for classifying images, pages and documents. The products differ in their architecture and in the way classification is used.
- FineReader Engine: offers classification that detects the type of a document. Further processing can be selected based on this information, for example: apply business card recognition (single or multiple cards on the image), process a receipt or convert a contract into a PDF/A and a DOCX.
- FlexiCapture (Engine) and FlexiLayout Studio: are intended to provide powerful classification in projects where structured data also has to be extracted. Since the FineReader Engine API can also be licensed in FlexiCapture Engine, both classification approaches are available in an application or project.
- Smart Classifier: approaches classification from the linguistic side and was designed to process a large variety of formats. It makes machine learning very easy to use, so that content experts can set up and tune new classification projects. Example: classification for content repositories (DMS, SharePoint), data migration, emails, etc.
Since Smart Classifier is part of the ABBYY Compreno product line, it is also intended to classify unstructured content before entities, facts and the relationships between them are extracted with InfoExtractor.