OCR Language Auto-Detection

Language: EN
Product-Line: FineReader Engine
Version: 9.x, 10, 11
Type: Technology & Features
Category: Recognition, Languages & OCR

ABBYY OCR technology makes heavy use of language information and dictionaries to achieve high recognition quality. Real documents can contain multiple languages on one page, or a document stream can contain a large number of different languages, for example:

  • a publication that presents the same content in two or more languages in different columns, such as an airline magazine, or
  • hundreds of thousands of documents from the European Union, which can contain 25 or more different languages, so that manual pre-sorting is probably not an option. The same applies to the internal business documents of an enterprise that operates worldwide.

Up to V10 Technologies

  • Up to and including the ABBYY technology cycle V10, the OCR engine can already process multi-language documents.
  • The technology selects the best-matching language from a pre-defined group of languages. This group has to be set, and can be edited, by the user or developer.
  • It is recommended to use at most 5 different languages in a group, because selecting more languages increases the number of internal OCR hypotheses. In most cases this decreases OCR quality and lengthens processing time.
  • If the input is very mixed and consists of many different languages, manual pre-sorting is often not an option. Instead, multiple OCR runs with different language settings have to be made, and the system then has to decide, based on the internal recognition statistics, which combination delivered the best results.
    This “brute-force” approach works, but takes time. In the end, CPU time is cheaper than labor time.
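
The brute-force approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FineReader Engine code: `recognize` is a hypothetical callable standing in for one OCR run, assumed to return the recognized text together with a numeric quality score derived from the recognition statistics, and `language_groups` and `best_run` are names invented for this sketch.

```python
from typing import Callable, List, Optional, Sequence, Tuple

MAX_GROUP_SIZE = 5  # upper bound recommended in the text above

# Hypothetical signature of one OCR run: (page, languages) -> (text, score).
# It does not correspond to any real FineReader Engine call.
Recognizer = Callable[[str, Tuple[str, ...]], Tuple[str, float]]


def language_groups(languages: Sequence[str],
                    size: int = MAX_GROUP_SIZE) -> List[Tuple[str, ...]]:
    """Split the candidate languages into consecutive groups of at most `size`."""
    return [tuple(languages[i:i + size]) for i in range(0, len(languages), size)]


def best_run(page: str,
             groups: Sequence[Tuple[str, ...]],
             recognize: Recognizer) -> Optional[Tuple[Tuple[str, ...], str, float]]:
    """Run OCR once per language group and keep the highest-scoring run.

    Returns (group, text, score), or None if no groups were given.
    """
    best = None
    for group in groups:
        text, score = recognize(page, group)
        if best is None or score > best[2]:
            best = (group, text, score)
    return best
```

Each group costs one full OCR pass, so the run time grows linearly with the number of groups; that is the time cost the text refers to.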

New in V11 Technologies

FineReader Engine 11 is the first SDK version to implement a new language detection feature. It is part of the FRDocument object.

  • The recognition language of a document can be detected automatically, but the developer has to specify at least 3 languages that might appear in the document.
  • The recognition language is detected for each word in the text.

The FRDocument object exposes the following properties for this purpose:

Name
    Description

BasicLanguage
    Returns the main language of the recognized document. The property contains the internal name of the first language in the collection of detected languages (DetectedLanguages property).
    This property has a meaningful value only if the IRecognizerParams::DetectLanguage property was set to TRUE during recognition; otherwise it is an empty string.

DetectedLanguages
    Provides access to the collection of recognition languages detected in the recognized document. Languages in the collection are sorted by frequency of occurrence, from the most frequent to the least frequent.
    This property has a meaningful value only if the IRecognizerParams::DetectLanguage property was set to TRUE during recognition.
    The list of languages is updated only after recognition, i.e. if you edit the layout of the document manually, the collection remains the same.
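
The relationship between per-word detection and the two properties above can be modelled with a small Python sketch. It does not call the FineReader Engine API; it only mimics the documented behavior. The function names, the `detect_language` flag (standing in for IRecognizerParams::DetectLanguage) and the tie-breaking rule (first occurrence wins) are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence


def detected_languages(word_languages: Sequence[str]) -> List[str]:
    """Rank languages by how many words were detected in each, most frequent first.

    Ties keep first-seen order; that rule is an assumption of this sketch.
    """
    counts = Counter(word_languages)
    return [language for language, _ in counts.most_common()]


def basic_language(word_languages: Sequence[str],
                   detect_language: bool = True) -> str:
    """Return the first entry of the ranked list, or "" when detection is off."""
    if not detect_language:
        return ""  # mirrors the documented empty string
    ranked = detected_languages(word_languages)
    return ranked[0] if ranked else ""
```

For example, a word sequence detected as English, German, English, French, German, English would give the ranking English, German, French, with English as the main language.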

The Object Model of FineReader Engine 11 gives an overview of what else is part of the FRDocument object.

Here is an illustration from the ABBYY desktop application FineReader 11; developers, of course, use the API in the SDK.
