JBIG2 Compression, Image Errors and OCR
- JBIG2 is an image compression standard for bi-level images, developed by the Joint Bi-level Image Experts Group. It is suitable for both lossless and lossy compression.
- According to a press release from the Group, in its lossless mode JBIG2 typically generates files one third to one fifth the size of Fax Group 4 and one half to one quarter the size of JBIG, the previous bi-level compression standard released by the Group. JBIG2 was published in 2000 as the international standard ITU
- More details can be found here: JBIG2
Scans, Changed Numbers & JBIG2
In August 2013 the JBIG2 format received extensive media coverage because of a scanning problem with XEROX MFPs. When documents containing numbers (e.g. an invoice) were scanned, some digits were changed in the generated image; for example, a six became an eight: 6 → 8. The issue triggered an intense public discussion in the imaging industry. Here are just two links to articles on the topic:
Source: Blog D. Kiesel
Both the Xerox experts and the blog post author reached the same conclusion: the substitutions are a result of JBIG2 image compression.
Important: OCR is NOT involved in this issue!
JBIG2 and ABBYY OCR SDKs/Engines
- Although OCR wasn’t used in the XEROX case, the issue may still be relevant wherever JBIG2 is used to compress the image layer of PDF files. As an ABBYY technology user, developer or partner you might have concerns about this topic.
- Theoretically, when ABBYY technology (such as FineReader Engine) processes TIFF files and generates PDF files with text under the image, the OCR text result could be correct while the image contains errors caused by patch flips.
- Every technology/product implementation of the algorithm has its own threshold for combining similar image segments into a cluster.
- ABBYY uses a threshold even stricter than the one used by the Recognizer subsystem for caching recognition results. Substitutions are therefore possible, but very rare.
- If a customer or developer wants to keep the original image unchanged, CCITT4 compression should be used for the whole image or for the MRC text mask.
- ABBYY is not aware of any support requests related to this problem.
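The effect of a similarity threshold on symbol clustering can be illustrated with a small toy sketch. It is not the JBIG2 algorithm and not ABBYY's or Xerox's implementation: the 5x7 glyph bitmaps, the function names and the threshold values are all assumptions made purely for illustration. Each glyph is compared with the representatives already stored in a symbol dictionary; if the share of differing pixels is at or below the threshold, the stored representative is reused in place of the glyph.

```python
from typing import List, Tuple

Glyph = Tuple[str, ...]  # one string per pixel row, '#' = black, '.' = white

# Hypothetical 5x7 bitmaps; real scanned glyphs are larger and noisier.
EIGHT: Glyph = (".###.", "#...#", "#...#", ".###.", "#...#", "#...#", ".###.")
SIX: Glyph = (".###.", "#....", "#....", "####.", "#...#", "#...#", ".###.")


def mismatch_fraction(a: Glyph, b: Glyph) -> float:
    """Share of pixels that differ between two equally sized glyphs."""
    total = sum(len(row) for row in a)
    differing = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return differing / total


def encode(glyphs: List[Glyph], threshold: float) -> Tuple[List[Glyph], List[int]]:
    """Greedy clustering: reuse the first stored symbol that is similar enough."""
    dictionary: List[Glyph] = []
    indices: List[int] = []
    for glyph in glyphs:
        for i, representative in enumerate(dictionary):
            if mismatch_fraction(glyph, representative) <= threshold:
                indices.append(i)  # glyph is replaced by the representative
                break
        else:
            dictionary.append(glyph)  # no close match: store a new symbol
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def decode(dictionary: List[Glyph], indices: List[int]) -> List[Glyph]:
    """Rebuild the page from the dictionary and the per-glyph indices."""
    return [dictionary[i] for i in indices]


if __name__ == "__main__":
    page = [EIGHT, SIX, EIGHT]
    for threshold in (0.10, 0.0):
        restored = decode(*encode(page, threshold))
        print(threshold, "substitution" if restored != page else "unchanged")
```

In this sketch the 6 and the 8 differ in only 3 of 35 pixels, so a lenient threshold merges them into one cluster and the 6 is rendered as an 8, while a threshold of zero (only identical bitmaps are merged) leaves the page unchanged. A stricter threshold makes such merges less likely, but only a lossless setting excludes them.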
Coding Recommendations
There are two scenarios where developers can use JBIG2 compression in ABBYY SDKs:
- Saving images with JBIG2 compression: Lossless compression is used.
- Saving PDF files with MRC: JBIG2 is used for the text mask and is lossy.
- However, both FRE 9 and FRE 10 have the TextMaskQuality parameter, which can be used to avoid the substitution issue: set it to 100%.
- Another remedy is to use CCITT4 instead of JBIG2 for text mask compression. This remains the most reliable way to eliminate the substitution issue.
- FineReader Engine 11 Release 2 and FineReader Engine 10.5 R3 will also allow saving JBIG2 black-and-white images in lossless mode.