"Text only" OCR

Language:
EN
Product-Line:
Mobile OCR Engine, Cloud OCR SDK
Version:
4.0, 9.x, 10, 11
Type:
Scenarios/Tasks
Category:
Recognition, Export

Optical Character Recognition (OCR) is a pipeline of processing steps whose ultimate goal is to convert scanned paper documents or image-only PDFs into new formats such as .DOC(X), .ODT, .HTML, .XML, .EPUB or a searchable PDF with a text layer under the image.

For all these export formats it is very important that

  • the original layout is analyzed,
  • the different elements are detected, e.g. text areas, images, tables and barcodes,
  • text recognition is executed on the blocks that contain readable data, and then
  • the original look and feel is reconstructed as closely as possible.

For “text only” OCR scenarios, the main goal is to get only the text that is on a page.
The export format is Unicode TXT, without any layout information, images or character coordinates. This “simple” export is, or was, often used in basic search scenarios, where only the text of an image or document is required.
ABBYY technology and products can, of course, also deliver text only. ;-)
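To make the difference from the layout-preserving exports concrete, here is a minimal Python sketch of what a “text only” export boils down to. The `Block` class and the `export_text_only` function are hypothetical names invented for this illustration; they are not part of any ABBYY SDK, and a real engine keeps far more information per block.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for one detected page element."""
    kind: str        # e.g. "text", "image", "table" or "barcode"
    text: str = ""   # recognized text; stays empty if nothing was read


def export_text_only(blocks):
    """Keep only the recognized text and drop everything else.

    Layout, images and coordinates are discarded; blocks are joined
    with a blank line so paragraphs stay separated in the TXT output.
    """
    parts = [b.text.strip() for b in blocks if b.text.strip()]
    return "\n\n".join(parts)


page = [
    Block("text", "Quarterly Report"),
    Block("image"),
    Block("text", "Revenue grew in all regions."),
    Block("barcode"),
]

plain = export_text_only(page)
print(plain)
# The result is an ordinary Unicode string and could be saved with
# open("page.txt", "w", encoding="utf-8").
```

The image and barcode blocks simply disappear from the output, which is exactly the information a search index built on plain text can no longer use.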

The following screenshots of ABBYY FineReader show the result for a formatted page, first as an “editable” version and then as a text only version.

Layout reconstructed

On the left side you can see the result of the document analysis; in the right pane of the window you can see the reconstructed result as it would be exported to Microsoft Word.

Text only result

On the left side you can see the result of the layout analysis; on the right you can see the “text only” result.

Further comments

  • ABBYY FineReader Engine SDKs have different document analysis routines that can also tell the engine to look for text on embedded images.
  • Even for text only scenarios, good document layout analysis is important: if you have text in 2 or 3 columns, you probably want to get the text in reading order and not across the columns. The lack of good layout analysis, even for text only scenarios, becomes obvious when you test an open source OCR engine. The layout analysis there is by far not as advanced as in commercial products.
    • Text result using the layout analysis
    • Text result not using the layout analysis
  • The importance of text only OCR is decreasing, because many search engines make use of additional information, such as
    • formatting of text (bold)
    • logical structure of a page, like the headlines or table information in an HTML file
  • When the layout structure of a document does matter, most developers prefer the ABBYY XML export, because the format contains all character coordinates as well as the formatting information. This makes it possible to implement many more features in an existing application than the “text only” approach.
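The point about column layouts and coordinates can be illustrated with a small Python sketch. It assumes that each recognized line comes with the coordinates of its top-left corner, and it compares a naive top-to-bottom sort with a simple column-aware ordering. The `Line` class, both functions and the `column_gap` threshold are assumptions made for this example; this toy heuristic is not how ABBYY or any other engine actually performs layout analysis.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical recognized text line with its top-left corner."""
    text: str
    left: int  # x coordinate of the left edge, in pixels
    top: int   # y coordinate of the top edge, in pixels


def naive_order(lines):
    """Sort by vertical position first, which reads across the columns."""
    return [l.text for l in sorted(lines, key=lambda l: (l.top, l.left))]


def column_order(lines, column_gap=100):
    """Group lines into columns by left edge, then read each column downwards.

    A line starts a new column when its left edge is at least
    `column_gap` pixels to the right of the current column's first line.
    """
    columns = []
    for line in sorted(lines, key=lambda l: l.left):
        if columns and line.left - columns[-1][0].left < column_gap:
            columns[-1].append(line)
        else:
            columns.append([line])
    ordered = []
    for column in columns:
        ordered.extend(l.text for l in sorted(column, key=lambda l: l.top))
    return ordered


lines = [
    Line("Column one,", 50, 100),
    Line("Column two,", 400, 100),
    Line("first part.", 50, 130),
    Line("second part.", 400, 130),
]

print(" ".join(naive_order(lines)))
# Column one, Column two, first part. second part.
print(" ".join(column_order(lines)))
# Column one, first part. Column two, second part.
```

Without the coordinates, the second ordering cannot be computed at all, which is one reason an export that keeps them is more flexible than plain text.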