Cloud OCR Scalability

Language:
EN
Product-Line:
Cloud OCR SDK
Version:
11
Type:
Technology & Features, Scenarios/Tasks
Category:
Recognition, OCR: Speed & Quality

Service and Platform Scalability

  • ABBYY Cloud OCR is powered by Microsoft Windows Azure and both are designed to scale, for example:
    • Uploaded images/documents are stored as BLOBs - the data center can accept high loads and stores the data securely and in distributed form.
    • The submitted jobs are managed in the scalable, redundant Azure SQL database service.
    • OCR processing of the tasks is executed on dedicated “Azure worker roles”, of which a very high number can be started and used.
    • The results can be downloaded via a high-speed network.

Scalability in the OCR Cloud

The architecture and the hardware of a highly scalable data center are impressive, but the services have to be implemented correctly to make use of the infrastructure. Therefore ABBYY developed an intelligent algorithm that monitors the incoming load and forecasts the processing power required in the “near future”.

  • The prediction is based on different parameters, for example:
    • Number of incoming tasks
    • Number of different applications receiving tasks
    • Number of images per job (single file/multiple files)
    • Document types (image snippets, A4 or A3 pages)
    • Processing scenario (Business Card processing, zonal OCR or full-text conversion)
    • Priority of the application (internal parameter)
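To make the idea concrete, here is a minimal sketch of how some of these parameters could be combined into one workload estimate. It is an illustration only, not ABBYY's actual algorithm: the `Task` class, the cost tables and all numeric weights are hypothetical, and the application priority is left out for brevity.

```python
from dataclasses import dataclass

# Hypothetical relative cost of each processing scenario (full text = 1.0).
SCENARIO_COST = {"business_card": 0.25, "zonal": 0.5, "full_text": 1.0}
# Hypothetical size factor of each document type (A4 page = 1.0).
SIZE_FACTOR = {"snippet": 0.25, "A4": 1.0, "A3": 2.0}


@dataclass
class Task:
    pages: int      # number of images in the job
    doc_type: str   # key of SIZE_FACTOR
    scenario: str   # key of SCENARIO_COST


def estimated_work(task: Task) -> float:
    """Estimate the work of one task in 'A4 full-text page' units."""
    return task.pages * SIZE_FACTOR[task.doc_type] * SCENARIO_COST[task.scenario]


def queue_work(tasks) -> float:
    """Sum the estimated work of all tasks waiting in the queue."""
    return sum(estimated_work(t) for t in tasks)


queue = [
    Task(pages=1, doc_type="snippet", scenario="business_card"),
    Task(pages=20, doc_type="A4", scenario="zonal"),
    Task(pages=300, doc_type="A4", scenario="full_text"),
]
print(queue_work(queue))
```

With these assumed weights, a 300-page book dominates the estimate, while a single business card contributes almost nothing.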

Here are some “real cases” that have to be considered to achieve good, dynamic scalability:

  • How many jobs are coming in?
    • Is there a trend?
    • How long will it take the currently running instances to work through the job queue?
  • What kind of processing tasks are in the queue? How much load will they require?
    • Processing a business card is faster than OCRing an A4 page
    • A book with x-hundred pages will require more time than 20 incoming invoices
      • the book result needs to be delivered as a unit
      • independent jobs can be processed in parallel
    • Reading just the defined zones on an image is faster than “full processing with layout analysis”.
  • What are the customer/business requirements?
    • Business cards that were submitted from a mobile app should be processed quickly
    • Archiving projects with thousands or millions of pages will run for days, weeks or months - a very fast response time during business hours might not be so relevant as long as the overall throughput is high enough. In the end, the average value is important, not the speed for one document.
  • The ABBYY Cloud OCR back-end will automatically scale up when the task queue gets longer,
    …and down when there is nothing to do.
  • ABBYY also monitors the usage peaks of all active applications. When it becomes clear that the load on weekdays grows between 7:00 and 10:00, the service will be pre-scaled for this period.
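The scale-up, scale-down and pre-scaling behavior described above can be sketched as a small decision function. This is a simplified illustration under stated assumptions, not the real back-end logic: the function names, the throughput per instance, the target drain time, the instance limits and the pre-scale floor are all hypothetical values.

```python
import math
from datetime import datetime


def instances_for_queue(queued_work, work_per_instance_per_min,
                        target_drain_min, min_instances=1, max_instances=100):
    """Instances needed to drain the queue within the target time (clamped)."""
    capacity_per_instance = work_per_instance_per_min * target_drain_min
    raw = math.ceil(queued_work / capacity_per_instance)
    return max(min_instances, min(max_instances, raw))


def in_prescale_window(now: datetime) -> bool:
    """True on weekdays between 7:00 and 10:00 (the observed peak period)."""
    return now.weekday() < 5 and 7 <= now.hour < 10


def target_instances(queued_work, now, work_per_instance_per_min=10.0,
                     target_drain_min=5.0, prescale_floor=8):
    """Combine reactive scaling with a pre-scaled floor during peak hours."""
    n = instances_for_queue(queued_work, work_per_instance_per_min,
                            target_drain_min)
    if in_prescale_window(now):
        n = max(n, prescale_floor)  # keep capacity ready before the peak
    return n


saturday_noon = datetime(2024, 1, 6, 12, 0)
monday_morning = datetime(2024, 1, 8, 8, 0)
print(target_instances(0, saturday_noon))      # empty queue, off-peak
print(target_instances(0, monday_morning))     # empty queue, pre-scaled
print(target_instances(1000, saturday_noon))   # long queue, off-peak
```

An empty queue outside the peak window falls back to the minimum instance count, while the same empty queue on a weekday morning keeps the assumed pre-scale floor running.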

Example

Further Technical Notes

  • A direct speed comparison between a local machine and a cloud instance is not “fair”, especially when only a small number of pages is tested.
    Why? A modern, standard PC or server will probably be faster than a virtualized instance in the data center, because hardware in data centers is optimized for energy consumption and cost efficiency, whereas most new PCs are tuned for peak performance regardless of energy use. The main differences in terms of scalability:
    • The local machine will reach a physical limit and will not deliver faster results when the load continues to grow
    • A properly designed cloud service will scale up so that the overall response time stays stable or even improves
  • Data centers provide different virtual machines - from small instances with one virtual CPU core up to instances with 8 cores. Depending on the OCR scenario, there can be a significant difference if a 50-page document job is OCRed on a 1, 2, 4 or 8 core instance. ABBYY internally adjusts the types of VM instances to provide a good service.
  • Massive peak uploads of documents/images for a custom speed test will not deliver realistic performance numbers at the beginning. Why?
    • The peak wave will trigger the upscale process of the system. So bombarding a service with a flood of tasks will test only the upscale speed, not the average response time of the service once it is scaled up for production.
      • The upscale process could require several minutes to react, because it takes some time to request more instances in the data center, assign hardware and start the instances that will take the jobs.
    • In real-life processing the load of the overall system will grow in a statistically distributed way, so the upscale will be rather smooth.
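The effect of the startup delay can be illustrated with a toy minute-by-minute simulation. It is a deliberately simplified model, not a description of the real service: the function name, the processing rate per instance, the five-minute startup delay and the rule "request enough instances to clear the queue within one minute" are assumptions, and scaling down is not modeled.

```python
from collections import deque


def simulate(arrivals, rate=10, startup_delay=5, start_instances=1):
    """Return the backlog (unprocessed tasks) at the end of each minute.

    arrivals[i] is the number of tasks arriving in minute i.
    rate is the number of tasks one instance finishes per minute.
    Newly requested instances become usable startup_delay minutes later.
    """
    running = start_instances
    pending = deque()  # minutes at which requested instances become ready
    queue = 0
    backlog = []
    minute = 0
    while minute < len(arrivals) or queue > 0:
        # Activate instances whose startup has completed.
        while pending and pending[0] <= minute:
            pending.popleft()
            running += 1
        if minute < len(arrivals):
            queue += arrivals[minute]
        # Request instances that are still missing (ceiling division).
        wanted = -(-queue // rate)
        for _ in range(max(0, wanted - running - len(pending))):
            pending.append(minute + startup_delay)
        # Process as much as the running instances can handle.
        queue = max(0, queue - running * rate)
        backlog.append(queue)
        minute += 1
    return backlog


burst = simulate([600])          # 600 tasks uploaded at once
steady = simulate([10] * 60)     # the same 600 tasks spread over an hour
print(max(burst), len(burst))
print(max(steady), len(steady))
```

In this model the burst leaves a backlog of several hundred tasks until the new instances are ready, whereas the evenly spread load never builds up a backlog. The burst test therefore mostly measures the assumed startup delay.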

Summary

  • ABBYY scales the Cloud OCR Service on demand to ensure fast response times for all users.
    • Different machines will work on different tasks
    • If the overall load grows, more cloud capacity will be assigned to the service
    • The speed will vary between document types, but will probably be consistent for any given type