What is OCR?
- OCR is the abbreviation for “Optical Character Recognition”.
In simple words: Text (as pixels) within images is found and “read” by a computer. The result is “real”, editable text. The image can be converted to a large variety of formats such as DOCX, ODT, RTF, TXT, HTML, XML and PDF(/A)s, so that humans and IT systems can process and work with the information hidden in the documents.
- The alternative to OCR would be manual keying / re-typing of the wanted information.
But OCR does not mean “Only Character Recognition”, because modern OCR technology and products do much more:
Steps from Image to Text
- Opening images and PDFs
- Enabling scanning, e.g. via TWAIN
- Opening a large variety of different image formats and PDFs
- more on: Import & Scanning
- Preparing the images
- Split multi-page files into single pages to increase speed and scalability on multi-core machines
- Rotate images so that the technology can read the text
- Clean the images; e.g. remove scanning dust or ISO-noise from digital cameras
- Analyze the layout: detect text, image, barcode and table areas
- Detect the reading order of the texts
- Analyze the text blocks and detect the lines and find/identify the individual characters
- more on: Document/Layout Analysis for OCR
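The cleaning step in the image preparation above can be illustrated with a small sketch. It assumes a 3x3 median filter, which is one common way to remove isolated specks; the source does not say which filter a given OCR product uses, and the function name `median_filter` and the tiny test page are purely illustrative.

```python
from statistics import median_low

def median_filter(image):
    """Return a filtered copy of a grayscale image given as a list of rows of ints.

    Each pixel is replaced by the median of its 3x3 neighbourhood
    (clipped at the image border). This is an assumed, generic technique.
    """
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            # Collect the pixel and its neighbours, staying inside the image.
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            # median_low always returns a value that occurs in the window.
            result[y][x] = median_low(window)
    return result

# A white page (255) with one isolated dark speck of "dust" in the middle.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 0
cleaned = median_filter(page)
```

A single dark pixel surrounded by white is outvoted by its neighbours and disappears, while larger connected shapes such as character strokes are mostly preserved. Real pipelines typically work on full-resolution bitmaps with optimized image libraries rather than nested lists.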
- Read the “individual” characters = Apply optical character recognition
- Vote between different hypotheses for a single character, e.g. is it
- “0” or “O” or “o” or a “Q” or “Ö”
- “I” or “1” or “!” or “|”
- more on: OCR Voting API
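The voting idea above can be sketched as follows. This is a minimal illustration that assumes several recognizers each return candidate characters with a confidence value and that the confidences are simply summed; the function `vote` and the sample scores are hypothetical and do not describe the OCR Voting API of any product.

```python
from collections import defaultdict

def vote(hypotheses):
    """Sum the confidence per candidate character and return (best_char, total).

    `hypotheses` is a list of dicts, one per recognizer, mapping a
    candidate character to that recognizer's confidence (assumed 0..1).
    """
    totals = defaultdict(float)
    for candidates in hypotheses:
        for char, confidence in candidates.items():
            totals[char] += confidence
    best = max(totals, key=totals.get)
    return best, totals[best]

# Hypothetical output of three recognizers for the same character image.
hypotheses = [
    {"0": 0.50, "O": 0.40, "Q": 0.10},
    {"O": 0.60, "0": 0.30, "o": 0.10},
    {"O": 0.45, "Ö": 0.35, "0": 0.20},
]
best_char, score = vote(hypotheses)
```

Here the first recognizer prefers “0”, but the combined evidence favours “O”. Summing is only one possible scheme; weighted or rank-based voting would fit the same interface.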
- Rebuild the text on a word level by using language information
- What characters are used and allowed in the language
- What recognition settings are set up internally
- Are defined word lists available or can some details be looked up in a database
- Use linguistic and morphology dictionaries
more on: Dictionaries and OCR
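How language information can help rebuild words is shown in the sketch below. It assumes a plain word list and two small groups of easily confused characters, taken from the examples earlier in this list; the names `CONFUSABLE`, `alternatives` and `correct_word` are illustrative, and real products use far richer linguistic and morphological dictionaries.

```python
from itertools import product

# Groups of characters that look alike in many fonts (illustrative, not exhaustive).
CONFUSABLE = [set("0Oo"), set("1lI|!")]

def alternatives(char):
    """Return all characters the given one could be confused with."""
    for group in CONFUSABLE:
        if char in group:
            return sorted(group)
    return [char]

def correct_word(raw, word_list):
    """Return a word from word_list reachable by swapping look-alike characters.

    If the raw reading is already known, or no variant is found,
    the raw reading is returned unchanged.
    """
    if raw in word_list:
        return raw
    for candidate in product(*(alternatives(c) for c in raw)):
        word = "".join(candidate)
        if word in word_list:
            return word
    return raw

word_list = {"cloud", "pixel", "Oslo"}
print(correct_word("c1oud", word_list))  # cloud
print(correct_word("0s1o", word_list))   # Oslo
print(correct_word("xyz", word_list))    # xyz
```

The number of candidates grows quickly with the word length, so this brute-force search only suits short words; it is meant to show the principle, not an efficient implementation.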
- Export the recognized text:
- Text in proper Unicode
- Provide all the details “found” in XML, e.g.
- Character Positions (original and after de-skewing of the page)
- Fonts, Formats (normal/bold), color
- Words found in dictionaries
- Reconstruct and synthesize the original layout for different output formats:
- Office formats
- PDFs with an image and text layer, text only, PDF/A
- Reconstruct the logical structure of a document (headers, footers, etc)
more on: Adaptive Document Recognition Technology (ADRT)
- Export/Save the different formats to RAM or disk so that they comply with the format standards.
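The XML export of recognition details mentioned above might look like the following sketch. The element and attribute names (`page`, `line`, `char`, `left`, `top`, `right`, `bottom`, `bold`) are assumptions made for this example and do not reproduce the XML schema of any particular OCR product.

```python
import xml.etree.ElementTree as ET

def chars_to_xml(chars):
    """Serialize recognized characters with their positions into an XML string.

    `chars` is a list of dicts with the keys text, left, top, right,
    bottom and bold (a hypothetical, simplified result structure).
    """
    page = ET.Element("page")
    line = ET.SubElement(page, "line")
    for ch in chars:
        element = ET.SubElement(line, "char", {
            "left": str(ch["left"]),
            "top": str(ch["top"]),
            "right": str(ch["right"]),
            "bottom": str(ch["bottom"]),
            "bold": "true" if ch["bold"] else "false",
        })
        element.text = ch["text"]
    # encoding="unicode" returns a str and keeps non-ASCII characters as they are.
    return ET.tostring(page, encoding="unicode")

# Hypothetical recognition result for two characters on one text line.
recognized = [
    {"text": "Ö", "left": 120, "top": 40, "right": 138, "bottom": 66, "bold": True},
    {"text": "1", "left": 142, "top": 42, "right": 150, "bottom": 66, "bold": False},
]
xml_text = chars_to_xml(recognized)
print(xml_text)
```

Keeping the coordinates next to each character lets downstream systems highlight or verify the text against the original image.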
- OCR is a set of very different, computationally intensive processes that involve a lot of mathematics, statistics and linguistics.
- OCR is a fuzzy process, and its development and improvement need a lot of know-how and testing.
- Each processing step for text recognition is already complex, but in the end “The whole is greater than the sum of its parts”.