Automating data extraction from passports using OCR

Extracting data from passports

Our client is a Dutch-based market leader in the worldwide recruitment and human resource management of aviation personnel for operators like Airbus, Boeing, Fokker and ATR. Operating across many markets, they need to give both their airline customers and their candidates confidence and peace of mind about compliance. The client wanted to automate the end-to-end workflow of entering data from passports, flight licenses and resumes into their database, as well as verifying that data.

Existing Process

The existing data-entry and review process was fraught with challenges that increased cost and processing time and reduced accuracy:
  • Some of the passports had originally been faxed in and arrived in different orientations
  • Some were low-resolution images that the existing data extraction technology was unable to read and process
  • The client hired 60 data keyers to manually index 10,000 documents a month, extract the necessary information and get it into the correct system of record

Challenges with Traditional Data Capture

The client soon realised that manual data capture was fraught with high cost, low accuracy and inefficiency. Data entry staff spent hours typing, checking and re-typing information from multi-lingual documents into data systems, a dull and repetitive task. The client team also tried traditional OCR data capture solutions such as Amazon Textract, ABBYY and Google Vision. These had their own limitations, as the output was highly accurate only when processing documents with low variability.
  • They didn’t support custom data, even though the client’s use case had its own nuances and different kinds of data
  • They required a considerable amount of post-processing due to inconsistencies in the input images and lack of organization in the extracted text
  • They worked well only within specific constraints. They were unable to handle images of text in multiple languages at once, images with low resolution, images with new fonts and varying font sizes, or images with shadowy text, so they made a lot of errors and left the client with poor accuracy.
  • OCR models provided by Google and Microsoft work well on English but do not perform well with other languages.
  • They didn’t make it easy to scale up data capture operations
None of these solutions provided extensive pre-processing in their pipeline. The client needed a highly accurate and highly secure solution that was tailored to their data requirements and could handle a large variety in formats and languages.
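As one illustration of the kind of pre-processing meant here, the orientation problem with faxed passports could be handled by trying each quarter-turn rotation and keeping the one that scores best. This is only a minimal sketch, not the method actually used: the names `rotate_90_clockwise`, `best_orientation` and `top_row_ink` are hypothetical, the image is a plain list of pixel rows, and the toy scoring function stands in for whatever signal a real system would use (for example, an OCR engine's confidence).

```python
from typing import Callable, List, Tuple

Image = List[List[int]]  # grayscale pixels, row-major (illustrative type)


def rotate_90_clockwise(image: Image) -> Image:
    """Rotate a row-major image a quarter turn clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def best_orientation(
    image: Image, score: Callable[[Image], float]
) -> Tuple[int, Image]:
    """Try 0/90/180/270 degree rotations and keep the highest-scoring one.

    `score` is an assumption: any function that returns a larger number
    for an image that looks more "upright" to the downstream reader.
    """
    best_angle, best_image, best_score = 0, image, score(image)
    candidate = image
    for angle in (90, 180, 270):
        candidate = rotate_90_clockwise(candidate)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_angle, best_image, best_score = angle, candidate, candidate_score
    return best_angle, best_image


def top_row_ink(image: Image) -> float:
    """Toy score for the example only: total pixel value in the top row."""
    return float(sum(image[0]))


# A 2x2 "scan" whose ink sits in the left column instead of the top row.
sideways = [[9, 0], [9, 0]]
angle, upright = best_orientation(sideways, top_row_ink)
```

Keeping the scoring function as a parameter means the same loop works whether the signal comes from an OCR engine, a small classifier, or a heuristic like the toy one above.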

Optical Character Recognition with Nanonets

We proposed an Optical Character Recognition (OCR) based passport data recognition system to completely automate their process and remove the need for human intervention. We provided a unique solution that:
  • Can leverage the existing data to provide a quick, ready-to-use solution
  • Can be integrated seamlessly with their existing database
  • Can be deployed on-premise with Docker, is flexible enough to work on a local private network, and is highly secure
The holistic solution allowed the user to upload a PDF or image file, and a combination of multiple machine learning models was run in real time to tackle the challenges posed by the large variety of data. No human effort was required; machine learning automated the process.
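The idea of chaining several models over one uploaded file can be sketched as a list of stages applied in order. This is a hedged illustration, not the deployed system: `Document`, `classify`, `extract_fields` and `run_pipeline` are hypothetical names, and both stages are trivial placeholders standing in for real models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """An uploaded file plus whatever the stages have learned about it."""
    filename: str
    doc_type: str = "unknown"
    fields: Dict[str, str] = field(default_factory=dict)


Stage = Callable[[Document], Document]


def classify(doc: Document) -> Document:
    # Placeholder for a document-type model; here it only looks at the name.
    doc.doc_type = "passport" if "passport" in doc.filename.lower() else "other"
    return doc


def extract_fields(doc: Document) -> Document:
    # Placeholder for an OCR and field-extraction model; values are dummies.
    if doc.doc_type == "passport":
        doc.fields = {"surname": "EXAMPLE", "given_names": "EXAMPLE"}
    return doc


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply each stage in order, passing the document along."""
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline(Document("passport_scan.pdf"), [classify, extract_fields])
```

Because each stage has the same signature, a stage can be swapped or added without touching the others.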

OCR API solution overview

Productising a computer vision based solution was fraught with a number of challenges. Our team’s unique expertise in data science, machine learning, deep learning, research, engineering and architecture enabled us to tackle them in the following ways:
  • The lack of good-quality, usable data was a huge challenge. The images were often oriented differently or were blurred, making data capture extremely difficult. To tackle this, we deployed an on-premise smart data processing pipeline to capture only the relevant information.
  • Developing a solid technical architecture involved an extensive study of data flow in the system and its different parts, such as the data processors and the Docker API.
  • In order to convert the raw data into a format that is easily digestible by machine learning modules, we needed to annotate a huge amount of data. To ensure data privacy, we leveraged our in-house annotation team.
  • Once the models were built, we packaged them as Docker containers in line with the architecture so that they could be deployed on the system. In this stage, we also wrote the code for all other parts of the system so that the solution is modular and can be deployed at scale.
  • Since this solution involved on-premise deployment, once testing was done on a cloud-based deployment, we set up the system with the required operating system, the drivers needed for the GPUs, and the dependencies needed to run the code.
  • We re-collected the data on which the models made errors and built retraining processes to redeploy the updated models. The system learns over time and its accuracy keeps improving.
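The last step, re-collecting the data where the models made an error, can be sketched as a comparison between predicted and corrected field values. This is a minimal sketch under stated assumptions: the source does not say how errors were detected, so the `corrections` input (known-good values for some documents) is an assumption, and `RetrainingExample` and `collect_retraining_examples` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List

FieldValues = Dict[str, str]  # field name -> value, for one document


@dataclass(frozen=True)
class RetrainingExample:
    """One field the model got wrong, kept for the next training run."""
    document_id: str
    field_name: str
    predicted: str
    corrected: str


def collect_retraining_examples(
    predictions: Dict[str, FieldValues],
    corrections: Dict[str, FieldValues],
) -> List[RetrainingExample]:
    """Return every field whose prediction differs from the corrected value.

    A field missing from the predictions counts as an error with an
    empty predicted value.
    """
    examples = []
    for document_id, corrected_fields in corrections.items():
        predicted_fields = predictions.get(document_id, {})
        for field_name, corrected in corrected_fields.items():
            predicted = predicted_fields.get(field_name, "")
            if predicted != corrected:
                examples.append(
                    RetrainingExample(document_id, field_name, predicted, corrected)
                )
    return examples


# Dummy values for illustration only.
predictions = {"doc-1": {"surname": "EXAMPEL", "nationality": "NLD"}}
corrections = {"doc-1": {"surname": "EXAMPLE", "nationality": "NLD"}}
errors = collect_retraining_examples(predictions, corrections)
```

The collected examples would then feed whatever annotation and retraining process is in place; that part is outside this sketch.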

Impact of passport data capture automation
