Why OCR a PDF
One of the primary goals of the PDF format was to allow sharing documents across different computer systems without having the original software installed. The documents should
- look (ideally) exactly like the original
- be printable by the user
The portability goal was perfectly achieved, and PDFs are used in business and private communication all over the world.
But some of the original facts have been forgotten in the 20+ years since PDF was introduced:
- The roots of the PDF format are in printing; editing was not on the to-do list.
- A PDF page can be compared to a canvas. As on an artist's canvas, the elements on the page can be drawn very creatively (= no defined structure is required). This is okay from a technical point of view, as long as the PDF displays (= renders) correctly.
- This flexible approach comes with some negative side effects: only loose guidelines exist for how to embed text, and no rules are enforced for building a logical structure in the PDF. It only matters that the page looks good on screen.
This very open architecture of the PDF format made it a success, but at the same time this flexibility creates real “pains” for users and businesses. Why?
- Digital processes need proper data input, but PDFs cannot guarantee that.
- Since most of the information that IT systems require is based on words and numbers, it is important that these systems can access the textual information. It is not enough that users can read the document content on a screen.
Text in the PDF is not "real" Text
There are many PDF libraries on the market that can extract textual information from a “digital born” PDF. At the same time, there are several reasons why textual information in a PDF may not be accessible. For example:
- Scans and document images are embedded
- Text was converted to vectors (e.g. in Adobe Illustrator)
- The text encoding is not correct, so the computer does not “know” what characters are used
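The three cases above tend to show up in the same way when a text layer is extracted: either almost no text comes back (scans, vectorized text), or the text is full of characters that carry no meaning (broken encoding). The following is a minimal sketch of such a check in Python, not a definitive test. The function name `text_layer_looks_usable` and both thresholds are assumptions made for illustration, and the page text is assumed to have been extracted beforehand with a PDF library of your choice.

```python
import unicodedata


def text_layer_looks_usable(page_text: str,
                            min_chars: int = 20,
                            max_bad_ratio: float = 0.1) -> bool:
    """Heuristic: does the extracted text of one page look like real text?

    Both thresholds are illustrative assumptions, not established values.
    """
    # Ignore whitespace; only visible characters count.
    chars = [c for c in page_text if not c.isspace()]

    # Scanned pages and text converted to vectors yield little or no text.
    if len(chars) < min_chars:
        return False

    # A broken encoding often surfaces as replacement characters (U+FFFD),
    # private-use code points (Co), control characters (Cc) or
    # unassigned code points (Cn).
    bad = sum(
        1 for c in chars
        if c == "\ufffd" or unicodedata.category(c) in ("Co", "Cc", "Cn")
    )
    return bad / len(chars) <= max_bad_ratio
```

A page that fails this check would be a candidate for OCR; a page that passes may still have a broken reading order, which this sketch does not detect.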
How Text Recognition can help
Optical Character Recognition (OCR) technologies were initially developed to read text from scanned document images. But for reliable PDF text extraction, OCR is often the only way to access the information. That may sound strange, but internally every “canvas” (page) of a PDF document is rendered to a pixel-based representation before it can be displayed on a screen. OCR can read this rendered image in the same way it reads a scan.
OCR Technologies that make reliable PDF-OCR possible:
- Layout analysis
- Character recognition
- Word reconstruction with dictionary support
- Multi-language support
- Text-flow detection
- Layout reconstruction
- Flexible Export to text, XML, office formats or PDFs
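To make one item of this list more concrete, word reconstruction with dictionary support can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses fuzzy string matching from the standard library's `difflib` as a stand-in, which is not necessarily how a commercial OCR engine works. The function name `reconstruct_word`, the tiny dictionary and the cutoff value are hypothetical.

```python
from difflib import get_close_matches


def reconstruct_word(ocr_word: str, dictionary: list[str],
                     cutoff: float = 0.8) -> str:
    """Replace a misrecognized word with its closest dictionary entry.

    The word is returned unchanged if it is already in the dictionary
    or if no entry is similar enough (similarity below `cutoff`).
    """
    lowered = ocr_word.lower()
    if lowered in dictionary:
        return ocr_word
    matches = get_close_matches(lowered, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else ocr_word


# Hypothetical mini dictionary; a real one would hold a full word list
# for each supported language.
words = ["invoice", "total", "amount"]
print(reconstruct_word("lnvoice", words))  # prints "invoice"
```

The cutoff trades recall against false corrections: a lower value repairs more recognition errors but also risks "correcting" names or numbers that are simply not in the dictionary.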
Without a proper text layer, the information contained in PDFs cannot be used in business processes:
- When the encoding or the reading order of the text is broken, the “textual information” cannot be used reliably.
- PDFs without proper textual information should not be archived, because it is very hard to find them in the future, e.g., with search engines or classification tools.
- Watch how OCR can help you capture textual information within PDFs that cannot be extracted otherwise
- 30-minute video in English
- Recorded at the Technical PDF Conference 2015, Cologne, Germany
Direct link to the recording: https://youtu.be/Rn2AA_IupyQ