What is OCR?

Language: EN
Product-Line: FineReader Engine, Mobile OCR Engine, Cloud OCR SDK
Version: 7.x, 8.x, 9.x, 10, 11
Type: Knowledge Base & Support
Category: General Features
KB-Type: Tips & How to Information

General

  • OCR is the abbreviation for “Optical Character Recognition”.
    In simple words: text that exists only as pixels within an image is found and “read” by a computer. The result is “real”, editable text. The image can be converted to a large variety of formats such as DOCX, ODT, RTF, TXT, HTML, XML, PDF and PDF/A. This lets humans and IT systems process and work with the information hidden in the documents.
  • The alternative to OCR would be manually keying in (re-typing) the required information.

But OCR does not mean “Only Character Recognition”, because modern OCR technology and products do much more:

OCR steps - more than character recognition

Steps from Image to Text

  • Opening images and PDFs
    • Enabling scanning, e.g. via TWAIN
    • Opening a large variety of different image formats and PDFs
  • Preparing the images
    • Split multi-page files into single pages to increase speed and scalability on multi-core machines
    • Rotate images so that the technology can read the text
    • Clean the images; e.g. remove scanning dust or ISO-noise from digital cameras
  • Analyze the layout: detect text, image, barcode and table areas
    • Detect the reading order of the texts
    • Analyze the text blocks and detect the lines and find/identify the individual characters
  • Read the individual characters, i.e. apply optical character recognition
    • Vote between different hypotheses for a single character, e.g. is it
      • “0” or “O” or “o” or a “Q” or “Ö”?
      • “I” or “1” or “!” or “|”?
    • more on: OCR Voting API
  • Rebuild the text at word level by using language information
    • Which characters are used and allowed in the language?
    • Which recognition settings are set up internally?
    • Are defined word lists available, or can some details be looked up in a database?
    • Use linguistic and morphology dictionaries
      more on: Dictionaries and OCR
  • Export the recognition results:
    • The recognized text in proper Unicode encoding
    • All the details found during recognition, in XML, e.g.
      • Character positions (original and after de-skewing of the page)
      • Fonts, formatting (normal/bold), color
      • Hypotheses
      • Whether words were found in dictionaries
  • Reconstruct and synthesize the original layout for the different output formats
  • Export/save the different formats to RAM or disk so that they comply with the format standards
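
The voting and word-level steps above can be illustrated with a small Python sketch. It is not how any particular OCR engine is implemented: the per-character confidences, the one-word dictionary, the helper name best_word and the dictionary_bonus factor are all made-up assumptions, chosen only to show how language information can overrule the best single-character guess.

```python
from itertools import product

# Hypothetical hypotheses for three character positions of one word.
# Each entry is (candidate character, confidence); the numbers are invented.
char_hypotheses = [
    [("0", 0.5), ("O", 0.4), ("Q", 0.1)],
    [("C", 0.9), ("G", 0.1)],
    [("R", 0.95), ("B", 0.05)],
]

# Stand-in for a real language dictionary or user-defined word list.
DICTIONARY = {"OCR"}


def best_word(hypotheses, dictionary, dictionary_bonus=2.0):
    """Return the highest-scoring (word, score) over all character combinations.

    The score is the product of the character confidences, multiplied by
    dictionary_bonus (an assumed weighting) if the word is in the dictionary.
    """
    best, best_score = None, 0.0
    for combo in product(*hypotheses):
        word = "".join(ch for ch, _ in combo)
        score = 1.0
        for _, confidence in combo:
            score *= confidence
        if word in dictionary:
            score *= dictionary_bonus
        if score > best_score:
            best, best_score = word, score
    return best, best_score


# With the dictionary, the letter "O" wins even though the digit "0"
# has the higher confidence as a single character.
with_dictionary, _ = best_word(char_hypotheses, DICTIONARY)

# Without any language information, the raw character confidences decide.
without_dictionary, _ = best_word(char_hypotheses, set())
```

In this toy example with_dictionary is "OCR" and without_dictionary is "0CR". Enumerating every combination is only practical for short words; it is used here purely to keep the sketch short.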

Summary

  • OCR is a set of very different, computationally intensive processes that involve a lot of mathematics, statistics and linguistics.
  • OCR is a fuzzy process, and its development and improvement need a lot of know-how and testing.
  • Each processing step for text recognition is already complex on its own, but in the end “the whole is greater than the sum of its parts”.
