OCR Document Classification and Data Extraction with online / incremental deep learning

For our fintech start-up, we are looking to take the next step in delivering our service in a better and smarter way. We first approached the challenge with traditional expression-based document categorisation and template-based data extraction, which is painful to set up and manage. Besides the setup hassle, the error rate of these traditional methods is unacceptably high for the service we want to offer our clients.

We aim to process a limited number of document categories / types. Some will be documents with a relatively static layout, like identity documents (passports, ID cards, driver's licences). Other documents will be more dynamic in layout and content, for example payment slips and bank statements, yet the information we need and the validations we want performed remain the same for these documents. It is important to mention that the source of the documents will differ: usually it will be either a photo taken by the end user or a scan made by an MFP.

We’d like to approach this as a pilot project in which we can iteratively expand functionality and eventually set up the tools to add and train new document categories / types. For the pilot version, we want the application to receive an image, preferably through an internal API, and OCR the document to fully unstructured text (preferably with Google Vision). From that text we want to determine the document type, structure the text into an organised JSON format, and answer the API call with this JSON.
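
To make the intended flow concrete, here is a rough Python sketch of the pilot pipeline (image in, OCR text, document type, JSON out). Everything in it is an illustrative assumption rather than a requirement: the function names, the keyword lists, the field names, and the OCR step, which is a stand-in stub and not a real Google Vision call. The keyword-based classifier is only a placeholder for the learned model we are actually looking for.

```python
import json
import re

# Hypothetical keyword sets per document type (placeholder for a learned classifier).
KEYWORDS = {
    "passport": {"passport", "nationality", "surname"},
    "bank_statement": {"statement", "balance", "iban"},
    "payment_slip": {"salary", "gross", "net", "employer"},
}


def run_ocr(image_bytes: bytes) -> str:
    """Stub OCR step: assumes the bytes already contain text.

    In the real pilot this would call an OCR service such as Google Vision.
    """
    return image_bytes.decode("utf-8", errors="ignore")


def classify(text: str) -> str:
    """Pick the document type whose keywords overlap most with the OCR text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    scores = {doc_type: len(tokens & words) for doc_type, words in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def extract_fields(doc_type: str, text: str) -> dict:
    """Extract a few illustrative fields; only bank statements are handled here."""
    fields = {}
    if doc_type == "bank_statement":
        # Loose IBAN-like pattern, for illustration only (no checksum validation).
        match = re.search(r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b", text)
        if match:
            fields["iban"] = match.group(0)
    return fields


def handle_request(image_bytes: bytes) -> str:
    """Run the whole pipeline and return the JSON body for the API response."""
    text = run_ocr(image_bytes)
    doc_type = classify(text)
    response = {
        "document_type": doc_type,
        "fields": extract_fields(doc_type, text),
        "raw_text": text,
    }
    return json.dumps(response)
```

In this sketch the internal API endpoint would simply pass the uploaded bytes to `handle_request` and return the resulting string as the response body.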

Ideally the application should be able to receive any document, determine what document it is and act accordingly. There are several reasons why we see added value in online incremental deep learning.

1) Due to the nature of the information we process, the number of initial sample files will be limited.

2) Some documents have an unlimited number of variations.

3) It provides a solid base for future iterations.
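
To illustrate what we mean by online / incremental learning, here is a minimal Python sketch of a classifier that is updated one labelled document at a time and can pick up new document types without retraining from scratch. Note that this is a simple multiclass perceptron over word counts, not a deep learning model; it only demonstrates the update-per-sample idea. The class name, method names, and sample texts are hypothetical.

```python
import re
from collections import Counter


def featurize(text: str) -> Counter:
    """Turn OCR text into a sparse bag-of-words vector (token -> count)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


class OnlinePerceptron:
    """Multiclass perceptron trained one example at a time (illustrative only)."""

    def __init__(self):
        # label -> {token: weight}; new labels are added as they are first seen.
        self.weights = {}

    def _score(self, label: str, features: Counter) -> float:
        w = self.weights[label]
        return sum(w.get(token, 0.0) * count for token, count in features.items())

    def predict(self, text: str):
        """Return the best-scoring known label, or None if nothing was learned yet."""
        if not self.weights:
            return None
        features = featurize(text)
        return max(self.weights, key=lambda label: self._score(label, features))

    def learn_one(self, text: str, label: str) -> None:
        """Update the model with a single labelled document."""
        self.weights.setdefault(label, {})
        features = featurize(text)
        predicted = max(self.weights, key=lambda l: self._score(l, features))
        if predicted != label:
            # Perceptron rule: reward the true label, penalise the wrong guess.
            for token, count in features.items():
                self.weights[label][token] = self.weights[label].get(token, 0.0) + count
                self.weights[predicted][token] = self.weights[predicted].get(token, 0.0) - count
```

A corrected result coming back from a human reviewer could be fed straight into `learn_one`, which is the kind of feedback loop we have in mind for documents with many variations and few initial samples.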

Please outline your experience with machine learning and online / incremental deep learning to assist in candidate selection for this project. If the pilot project is successful, there may be an opportunity to build a larger project.

Questions to respond to in your proposal:

- How would you approach this project?

- What roadblocks do you see in this project?

- What tools do you recommend to make this project a success?

