Dictionaries and OCR

Language:
EN
Product-Line:
FlexiCapture Engine, FineReader Engine, Mobile OCR Engine, Cloud OCR SDK
Version:
10, 11
Type:
Technology & Features
Category:
Recognition, Languages & OCR
  • A dictionary is, roughly speaking, a list of words available in a specific language.
  • A recognition language can also contain a dictionary. Dictionaries can improve the recognition quality significantly.
  • ABBYY OCR SDKs already ship with dictionaries for certain languages. For custom implementations there are several other options, from a simple word list up to a powerful Dictionary API.
  • Extended & improved in Version 11 of the FineReader Engine: OCR Language Auto-Detection

Dictionary Types

ABBYY SDKs support different types of dictionaries:

Standard dictionary

  • The supported OCR languages with dictionary support are shipped as this type.
  • Standard dictionaries are represented by three or four files. Their names are usually the same as the full or short name of the language, with one of the following extensions:
    • .amd
    • .amm
    • .amt
  • Such dictionaries can be extended by the user: additional words can be added to the dictionary through the ABBYY FineReader interface.
    The file that stores this dictionary extension uses the .ame file extension.

User dictionary

  • This dictionary can be created through the ABBYY FineReader interface, which allows you to add, edit and remove words.
  • The dictionary can also be filled by importing any text file in Windows ANSI or Unicode encoding.

Regular-expression-based dictionary

  • This dictionary contains the rules that define which words are allowed in a language and which are not.
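
The idea of a rule-based dictionary can be illustrated with a minimal Python sketch. It uses Python's own `re` syntax, which is not the regular-expression syntax of the ABBYY SDKs, and the part-number rule and the helper name `is_allowed` are assumptions made up for the illustration.

```python
import re

# Hypothetical rule: a part number such as "AB-1234"
# (two capital letters, a dash, four digits).
PART_NUMBER = re.compile(r"[A-Z]{2}-\d{4}")


def is_allowed(word: str) -> bool:
    """Return True only if the whole word matches the rule."""
    return PART_NUMBER.fullmatch(word) is not None


# Candidate readings of the same field, as an OCR engine might propose them.
candidates = ["AB-1234", "A8-1234", "AB-12E4"]
allowed = [w for w in candidates if is_allowed(w)]
print(allowed)
```

Only the first candidate satisfies the rule, so a rule like this lets a recognizer prefer it over look-alike readings such as "A8-1234".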

External dictionary

  • This dictionary type is available only in the ABBYY SDKs and is implemented on the client side.
  • It allows you to implement your own type of dictionary and is useful in the On-The-Fly Recognition scenario.

Cache Dictionaries

  • A cache dictionary is a small dictionary (about a hundred words) which can be changed easily during processing.
  • Cache dictionaries can be used when it is possible to select a dictionary more precisely, e.g. if you find new information about the document during processing. Such dictionaries are suitable for field level recognition.

For example, suppose there are two fields on a form you need to recognize: the name of a city and the name of a street. You have recognized the name of the city and you have the list of streets in this city. In this case you may load the appropriate cache dictionary with the street names and thus recognize the name of the street more quickly and accurately.

The DocumentAnalyzer object of ABBYY FineReader Engine provides the following methods for working with cache dictionaries:

  • AddWordsToCacheDictionary
  • AddWordToCacheDictionary
  • CleanCacheDictionary
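
The city and street scenario can be sketched in plain Python. This is a conceptual illustration only: it does not call the FineReader Engine API, the `CacheDictionary` class and the street table are hypothetical, and snapping the raw OCR text to the closest word with `difflib` is a stand-in for whatever the engine does internally with a cache dictionary.

```python
import difflib

# Hypothetical lookup table: the streets known for each city.
STREETS_BY_CITY = {
    "Springfield": ["Main Street", "Oak Avenue", "Elm Street"],
    "Riverton": ["River Road", "Mill Lane"],
}


class CacheDictionary:
    """Small in-memory word list that is swapped often during processing."""

    def __init__(self):
        self._words = []

    def add_words(self, words):
        self._words.extend(words)

    def clean(self):
        self._words.clear()

    def best_match(self, raw_text, cutoff=0.6):
        # Return the closest dictionary entry, or the raw text if none is close.
        matches = difflib.get_close_matches(raw_text, self._words, n=1, cutoff=cutoff)
        return matches[0] if matches else raw_text


def recognize_street(city, raw_street, cache):
    # Reload the cache with the streets of the city recognized earlier.
    cache.clean()
    cache.add_words(STREETS_BY_CITY.get(city, []))
    return cache.best_match(raw_street)


cache = CacheDictionary()
street = recognize_street("Springfield", "Ma1n Strcet", cache)
print(street)
```

The point of the design is that the word list is tiny and field-specific, so it can be cleared and refilled for every form without noticeable cost.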

Dictionaries Influence on Recognition

  • Language dictionaries are intended to improve recognition quality by promoting word hypotheses that are found in a dictionary.
  • ABBYY SDKs include many built-in general vocabularies prepared by ABBYY, and custom/user dictionaries can be plugged in alongside them for special, dedicated text.
  • Custom dictionaries may help when the text contains many uncommon words.

  • However, when a text contains few or no words from a custom dictionary, using it may degrade recognition quality. This is due to the algorithm used to choose between word hypotheses.

Full word confidence is calculated by the formula:

  • Full confidence = Recognition confidence + Dictionary bonus, where
  • Dictionary bonus = Word length * Dictionary weight * Word weight in Dictionary
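
As a worked example, the two formulas can be written out in Python. The numeric values and their scale are assumptions chosen only to make the arithmetic visible; the source does not state the ranges that the engine actually uses.

```python
def dictionary_bonus(word, dictionary_weight, word_weight):
    """Dictionary bonus = Word length * Dictionary weight * Word weight in Dictionary."""
    return len(word) * dictionary_weight * word_weight


def full_confidence(recognition_confidence, word, dictionary_weight, word_weight):
    """Full confidence = Recognition confidence + Dictionary bonus."""
    return recognition_confidence + dictionary_bonus(word, dictionary_weight, word_weight)


# Illustrative numbers only: a 6-letter word, dictionary weight 2.0, word weight 1.0.
bonus = dictionary_bonus("street", 2.0, 1.0)
total = full_confidence(40.0, "street", 2.0, 1.0)
print(bonus, total)
```

Because the bonus grows with the word length, a long dictionary word receives a larger boost than a short one with the same weights.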


The comparison algorithm works as follows:

  • If both hypotheses are non-lexical, only the Recognition confidences are compared.
  • If both hypotheses are lexical, again only the Recognition confidences are compared.
  • If only one of the two hypotheses is lexical, the Full confidences are compared.
    There are two nuances:
    • Words are added to a custom dictionary in all possible capitalizations. For example, if you add the abbreviation “ABBYY”, the dictionary will contain 3 words: ABBYY, Abbyy, abbyy.
    • When the Full confidence is very low, the Recognizer does not rely on it and applies additional heuristics to choose the right hypothesis.

For lexical hypotheses with a very low Recognition confidence, ABBYY technology looks at their Full confidence. If it is still quite high thanks to the dictionary bonus, the additional heuristics are not started and, according to the rules above, the hypotheses are compared by their initial (very low) Recognition confidences.
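
The three comparison rules can be sketched as a small Python function. The `Hypothesis` type, the `pick` function and the numbers are hypothetical; the tie-breaking choice is an assumption, and the additional heuristics for very low Full confidence are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    recognition_confidence: float
    lexical: bool = False          # True if the word was found in a dictionary
    dictionary_bonus: float = 0.0  # only meaningful for lexical hypotheses

    @property
    def full_confidence(self):
        bonus = self.dictionary_bonus if self.lexical else 0.0
        return self.recognition_confidence + bonus


def pick(a, b):
    """Choose between two hypotheses following the three rules."""
    if a.lexical == b.lexical:
        # Both lexical or both non-lexical: Recognition confidence only.
        return a if a.recognition_confidence >= b.recognition_confidence else b
    # Exactly one is lexical: compare Full confidences.
    return a if a.full_confidence >= b.full_confidence else b


# One lexical hypothesis against one non-lexical hypothesis.
mixed = pick(
    Hypothesis("street", 40.0, lexical=True, dictionary_bonus=12.0),
    Hypothesis("strcet", 45.0),
)
# Two lexical hypotheses: the bonus plays no role.
both_lexical = pick(
    Hypothesis("the", 30.0, lexical=True, dictionary_bonus=6.0),
    Hypothesis("thc", 31.0, lexical=True, dictionary_bonus=6.0),
)
print(mixed.text, both_lexical.text)
```

In the first call the dictionary bonus lets "street" win despite its lower Recognition confidence; in the second call the slightly higher Recognition confidence alone decides.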

Real-world OCR & dictionary example

  • You have plugged in a custom dictionary containing the abbreviation “THC”.
  • English text is bound to contain the definite article “the”.
  • In some characters “e” the horizontal bar is badly printed, so they look like “c”.
  • In these cases you get a very low Recognition confidence, and both words (“the” and “thc”; remember the rule that custom dictionary words are added in all capitalizations) are lexical.
  • So no additional heuristics are used and the initial Recognition confidences are compared, which leaves the outcome to chance:
    • where “e” looks more like itself, you get “the”,
    • where it looks more like “c”, you get “thc”.
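
The example can be replayed in a few lines of Python. The word sets stand in for the real dictionaries, and the per-occurrence confidence values are invented for the illustration; only the capitalization rule and the comparison by Recognition confidence come from the text above.

```python
def capitalization_variants(word):
    """All capitalizations stored for a custom word: upper, capitalized, lower."""
    return {word.upper(), word.capitalize(), word.lower()}


general_words = {"the"}                       # stand-in for the built-in English dictionary
custom_words = capitalization_variants("THC")  # contains THC, Thc and thc


def is_lexical(word):
    return word in general_words or word in custom_words


# Hypothetical Recognition confidences for two badly printed occurrences of "the".
occurrences = [
    {"the": 31, "thc": 29},  # the "e" still looks more like an "e"
    {"the": 27, "thc": 30},  # the "e" looks more like a "c"
]

results = []
for confidences in occurrences:
    both_lexical = all(is_lexical(w) for w in confidences)
    if both_lexical:
        # Both hypotheses are lexical, so only Recognition confidence decides.
        results.append(max(confidences, key=confidences.get))
print(results)
```

Without the custom entry, "thc" would be non-lexical and the dictionary bonus of "the" would normally settle the comparison in its favour.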

Further Information

Further information on how to work with dictionaries can be found in the FineReader Engine documentation under Guided Tour: Advanced Techniques: Working with Dictionaries

