OCR Accuracy Measurement
Below is a brief introduction to the measurement of OCR accuracy. This seems to be a simple topic, but in fact it is not. Some general statements can be made:
- Optical Character Recognition (OCR) is executed in multiple steps, and every one of them influences the accuracy level that is achieved at the end of the process.
- Image quality does matter! “Garbage in - Garbage out” is true!
Measurement of OCR Accuracy
Accuracy on a Character Level
- OCR technology providers normally measure the accuracy of the optical character recognition results on a character level.
- 99% accuracy means that
- 1 out of 100 characters, or equivalently
- 10 out of 1000 characters, is recognized as “uncertain”
- 99.9% accuracy means that
- 1 out of 1000 characters is recognized as “uncertain”
- A character recognized as “uncertain” can be correct or not - the core OCR technology is not able to make a final decision, even after applying all built-in classifiers and internal voting algorithms.
- In a real-life scenario, a person might be the final authority who decides what is right or wrong.
- OCR scientists (including those at ABBYY) develop, change and optimize the recognition technology on an ongoing basis. An increase or decrease in the resulting recognition accuracy can therefore only be measured against a test set of images/documents for which the text is known and 100% correct.
- This means that in “real OCR” an absolute accuracy figure normally cannot be given, because there is normally no 100% correct ground truth data.
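The comparison against a ground-truth test set can be sketched as follows. This is a minimal illustration, not ABBYY's measurement method: it assumes that accuracy is defined as one minus the edit distance divided by the length of the ground truth, which is one common convention among several. Note that it counts actual errors against known text, which is different from the “uncertain” flags discussed above. All function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # character of a missing in b
                current[j - 1] + 1,      # extra character in b
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Assumed definition: 1 - (edit distance / ground-truth length)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = levenshtein(ground_truth, ocr_text)
    return max(0.0, 1.0 - errors / len(ground_truth))


def expected_uncertain(characters: int, accuracy: float) -> float:
    """Number of characters expected to be affected at a given accuracy."""
    return characters * (1.0 - accuracy)


if __name__ == "__main__":
    truth = "Optical Character Recognition"
    ocr = "Optica1 Character Recogmtion"
    print(f"edit distance: {levenshtein(truth, ocr)}")
    print(f"character accuracy: {character_accuracy(truth, ocr):.3f}")
    print(f"at 99% on 1000 characters: {expected_uncertain(1000, 0.99):.0f}")
```

Because the edit distance also counts characters that are missing from the OCR output, text that was lost entirely lowers the score instead of silently disappearing from the measurement.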
OCR Accuracy on a Word Level
- Instead of looking only at the character recognition quality, it is also possible to measure the accuracy on a word level.
- This approach is often used in environments where the proper words have to be found, for example when searching for a name in a book or in registration documents.
- ABBYY is fully aware of these requests, but there is no simple way to measure word level accuracy. Several things have to be considered here:
- Relevant words like: person names, city names, etc.
- Non-relevant words like: “the”, “and”, etc.
- It also has to be considered that
- Most of the time there is no ground truth data on a word level in full-text OCR scenarios. Only in data extraction scenarios (like forms processing) is it much easier to work with word lists and database lookups.
- “Simple” search algorithms that only find exact matches are not practical enough to deliver proper search results, no matter whether the document set contains OCR uncertainty or not. It is better to use a more intelligent, “fuzzy” search technology.
- In OCR scenarios for historic materials the challenge is that there is no unified grammar and spelling.
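The points above can be illustrated with a small sketch that uses only the Python standard library. It assumes a tiny, illustrative list of non-relevant words and measures how many relevant ground-truth words appear exactly in the OCR text. It then shows how a fuzzy lookup, here `difflib.get_close_matches`, still finds a name that an exact search misses. The function names, the stop-word list and the cutoff value are assumptions chosen for illustration, not an ABBYY method.

```python
import difflib
import re

# Assumption: a very small, illustrative list of non-relevant words.
STOP_WORDS = {"the", "and", "a", "of", "in"}


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def relevant_word_recall(ground_truth: str, ocr_text: str) -> float:
    """Fraction of relevant ground-truth words found exactly in the OCR text."""
    truth_words = [w for w in tokenize(ground_truth) if w not in STOP_WORDS]
    if not truth_words:
        raise ValueError("ground truth contains no relevant words")
    ocr_words = set(tokenize(ocr_text))
    found = sum(1 for word in truth_words if word in ocr_words)
    return found / len(truth_words)


def exact_find(query: str, ocr_text: str) -> bool:
    """“Simple” search: only an exact token match counts."""
    return query.lower() in tokenize(ocr_text)


def fuzzy_find(query: str, ocr_text: str, cutoff: float = 0.8) -> list[str]:
    """Fuzzy search: return OCR tokens that are similar enough to the query."""
    candidates = sorted(set(tokenize(ocr_text)))
    return difflib.get_close_matches(query.lower(), candidates, n=5, cutoff=cutoff)


if __name__ == "__main__":
    truth = "Anna Schneider and the family"
    ocr = "Anna Schne1der and the family"
    print(relevant_word_recall(truth, ocr))
    print(exact_find("Schneider", ocr))
    print(fuzzy_find("Schneider", ocr))
```

In this sample the exact search misses the misrecognized name, while the fuzzy search returns the damaged token. A lower cutoff finds more variants, which can help with historic spellings, but it also produces more false hits.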
Ways to improve the OCR Accuracy
… since we are talking about OCR accuracy measurement, here is an overview of the different areas where you can influence the accuracy of the recognition process within ABBYY OCR SDKs, and how:
- Image quality
- Images for OCR have to fulfill a certain quality level. In a nutshell:
- 300 DPI – more: OCR - Optimal Image Resolution
- grey-scale images are better than black and white
- color images can improve OCR, but color is mostly needed because the export documents, like searchable PDFs, should be in color
- The images of the documents should be sharp, flat and properly oriented, so that the text lines are straight.
- Image pre-processing
- It is important to prepare the images for the OCR process, otherwise the results will fall well short of what is achievable.
- Layout Analysis
- Before the characters can be recognized, it is important that the zones for OCR (regions of interest) are detected or defined. This process is very easy for a human, but a tough job for algorithms. If a text zone on a page is missed, it will not be OCRed. When “OCR accuracy” is measured afterwards, it also has to be considered that “lost” text cannot be wrong - yet losing text might be much worse than having the full text with a few more uncertain characters.
- Character Recognition
- This is an insider topic that only ABBYY can work on. Here are some more details on what it is about
- Language & Character Settings
- Knowing what languages and characters are used in the document helps to increase the accuracy rate. More on this topic: OCR Recognition Languages
- Use of word lists
- The ABBYY SDKs provide an API to use custom word lists, but in broad, mass OCR conversion, the use of dictionaries delivers better results.
- Use of (Morphology) Dictionaries
- ABBYY SDKs allow you to work with dictionaries: the ones that come with the SDK, and new custom ones that you create yourself. Some more insights: Dictionaries and OCR
- Most of the time verification involves human interaction
- Image quality can be checked during the scan process
- The results of the automatic layout analysis can be verified before the text recognition is performed. This is recommended when documents are to be transformed into editable office formats or e-books.
- The OCR results can be checked and corrected before the final document export takes place
- Post correction
- The ABBYY XML output gives “low level” access to the OCR results. They can be parsed, changed and then also be transformed into other formats - more details: ABBYY XML Export
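As a rough sketch of such post correction, the following example parses a simplified XML sample, lists the characters flagged as uncertain, applies a correction and serializes the result again. The element and attribute names (`document`, `line`, `charParams`, `suspicious`) and the structure of the sample are assumptions made for illustration. The real export uses its own schema, typically with XML namespaces and additional attributes, so check it before adapting this code.

```python
import xml.etree.ElementTree as ET

# Assumption: a simplified, hypothetical sample, not a real ABBYY XML export.
SAMPLE_XML = """
<document>
  <line>
    <charParams suspicious="true">0</charParams>
    <charParams suspicious="false">C</charParams>
    <charParams suspicious="false">R</charParams>
  </line>
</document>
"""


def line_texts(root: ET.Element) -> list[str]:
    """Join the characters of every line into a plain string."""
    return [
        "".join(char.text or "" for char in line.findall("charParams"))
        for line in root.iter("line")
    ]


def uncertain_characters(root: ET.Element) -> list[tuple[int, int, str]]:
    """Return (line index, character index, character) for flagged characters."""
    result = []
    for line_index, line in enumerate(root.iter("line")):
        for char_index, char in enumerate(line.findall("charParams")):
            if char.get("suspicious") == "true":
                result.append((line_index, char_index, char.text or ""))
    return result


def correct_character(root: ET.Element, line_index: int,
                      char_index: int, new_char: str) -> None:
    """Replace one character and clear its uncertainty flag."""
    line = list(root.iter("line"))[line_index]
    char = line.findall("charParams")[char_index]
    char.text = new_char
    char.set("suspicious", "false")


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE_XML.strip())
    print(line_texts(root))
    print(uncertain_characters(root))
    correct_character(root, 0, 0, "O")
    print(line_texts(root))
    print(ET.tostring(root, encoding="unicode"))
```

In a verification workflow, the list of uncertain characters would be shown to a person, and only the confirmed corrections would be written back before the final export.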