
How to Extract Data From a Document?

Truth is, we’re not good enough yet. Documents are not standardized, and we lose a great amount of time reading them, finding the information we need, and entering it into our databases: bills, legal papers, ID cards… Extracting data from a document isn’t as simple as it seems.

Even if the documents are not homogeneous from one to another, the relevant information is quite often the same. We look for names, company names, dates, localities, addresses, prices, or identification numbers. Second issue: we still print documents. Even though there are technologies to sign documents digitally, and even though we carry more screens than papers in our pockets, we still print documents. We ask for them, we share them, and we expect someone to process them. So what can be done?

Step #1: Read the Documents

A document has to be dematerialized at some point if you want to automatically extract information from it, and there are many ways for this to happen. Was it scanned, which yields a good-quality image, or is it a picture someone took with a smartphone? Is it an old ID card falling to pieces or a brand new one? Is the background colored, or are there symbols on it? The diversity of documents is well established, and it determines how well you will be able to read them.

Improve the Image Quality: Image Processing

First, the image should be processed. Classic tasks to improve the quality and readability of an image include keeping only colors above a threshold, converting it to a binary black-and-white image, straightening it so the text is horizontal, and selecting the part of the image you want to read. There are many techniques available, and they have to be adapted to the documents.
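The thresholding and binarization step can be sketched with NumPy alone; a real pipeline would typically use OpenCV or scikit-image for deskewing and cropping as well. The tiny array below is an invented stand-in for a loaded grayscale scan:

```python
import numpy as np

def binarize(gray: np.ndarray, threshold: int = 128) -> np.ndarray:
    """Keep pixels above the threshold white (255), everything else black (0)."""
    return np.where(gray > threshold, 255, 0).astype(np.uint8)

# Synthetic 4x4 grayscale "scan": light background, dark "ink".
page = np.array([[200, 210,  30, 220],
                 [ 40, 215, 205,  35],
                 [210,  25, 220, 210],
                 [220, 205,  45, 215]], dtype=np.uint8)

binary = binarize(page)
print(binary)
```

After this step every pixel is either pure black or pure white, which is the kind of input OCR engines handle best; straightening and cropping would follow the same pattern of small, composable array operations.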

Convert an Image to Text: Optical Character Recognition

Optical Character Recognition (OCR) is a way to convert characters, either printed or handwritten, into machine-encoded text. OCR techniques are not new, but they have drastically improved with the use of machine learning and deep learning. The latest OCR software developed by Google includes deep learning, which has improved accuracy even more.
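As a minimal sketch, one common open-source option is the Tesseract engine (maintained by Google), reached from Python through the `pytesseract` and Pillow packages; none of these tools are named in the article, so treat them as one possible choice among several:

```python
def ocr_image(path: str, lang: str = "eng") -> str:
    """Run Tesseract OCR on an image file and return the extracted text."""
    # Imports are deferred so the sketch can be read without the
    # dependencies (pytesseract, Pillow, the tesseract binary) installed.
    import pytesseract
    from PIL import Image
    return pytesseract.image_to_string(Image.open(path), lang=lang)

# Usage, assuming a scanned page saved as a hypothetical "page.png":
# print(ocr_image("page.png"))
```

The better the preprocessing from the previous step, the better the text this call returns.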

Step #2: Recognize Entities and Extract Them

There are two ways to recognize entities in a document: a rule-based one and a machine learning one. The two can also be combined. The difficulty in entity recognition is basically the variability of words. Names change constantly, appear in various forms, and may be abbreviated. Creating dictionaries is therefore too complex.

Rule-based entity extraction is based on the prior knowledge we have of a language and its rules. Basically, an address will almost always look the same way: a number, the mention of a street, road, or avenue, a name, a postcode, and a city. A set of regular expressions can be created to detect parts of the text that respect this frame. Luckily, the word “address” will often be present right above it in the document as well.
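A minimal sketch of such a rule, assuming a five-digit postcode format; the pattern and the sample sentence are invented for the example, and real documents would need a much broader set of patterns:

```python
import re

# Rule-based extraction: an address pattern built from the frame above
# (number, street keyword, name, five-digit postcode, city).
ADDRESS = re.compile(
    r"(?P<number>\d+)\s+"
    r"(?P<street>(?:rue|road|street|avenue)\s+[\w' -]+?),?\s+"
    r"(?P<postcode>\d{5})\s+"
    r"(?P<city>[A-Z][\w-]+)",
    re.IGNORECASE,
)

text = "Please ship to 12 rue de la Paix, 75002 Paris before Friday."
match = ADDRESS.search(text)
if match:
    print(match.group("number"), match.group("postcode"), match.group("city"))
```

Named groups make each extracted field directly usable, ready to be inserted into a database column.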

Today, machine-learning-based approaches already show near-human performance in identifying lexical and phrasal information in text that expresses references to named entities:

  • Person names
  • Company/organization names
  • Locations
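
To illustrate the supervised approach, here is a toy sketch: a tiny feature-count classifier trained on a handful of labeled tokens. Every sentence, name, and label below is invented for the example, and real systems use far richer features and models (CRFs, neural taggers):

```python
from collections import defaultdict

# Invented toy training data: tokens paired with entity labels
# (PER = person, ORG = organization, LOC = location, O = other).
TRAIN = [
    (["John", "Smith", "works", "at", "Acme", "Corp", "in", "Paris"],
     ["PER", "PER", "O", "O", "ORG", "ORG", "O", "LOC"]),
    (["Marie", "Curie", "joined", "Sorbonne", "University", "in", "Paris"],
     ["PER", "PER", "O", "ORG", "ORG", "O", "LOC"]),
]

def features(tokens, i):
    """Simple features: the word itself, capitalization, previous word."""
    prev = tokens[i - 1].lower() if i > 0 else "<s>"
    return [("word", tokens[i].lower()), ("cap", tokens[i][:1].isupper()), ("prev", prev)]

# "Training": count how often each feature co-occurs with each label.
counts = defaultdict(lambda: defaultdict(int))
for tokens, labels in TRAIN:
    for i, label in enumerate(labels):
        for f in features(tokens, i):
            counts[f][label] += 1

def predict(tokens):
    """Label each token with the highest-scoring entity class."""
    out = []
    for i in range(len(tokens)):
        scores = defaultdict(int)
        for f in features(tokens, i):
            for label, c in counts[f].items():
                scores[label] += c
        out.append(max(scores, key=scores.get) if scores else "O")
    return out

print(predict(["John", "lives", "in", "Paris"]))
```

Even this crude model picks up useful signals, such as capitalization and the words that typically precede a location, which is exactly the kind of context a real named entity recognizer learns at scale.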

Several methods are used to create such named entity recognizers, from supervised to unsupervised, although even the unsupervised ones need a small set of labeled data.

Different models have been trained on several languages and are open source, so today, for non-specific uses, models are readily available online. The corpora used are quite often the same: they are taken from Wikipedia or newspapers, which introduces bias in the language used. So most of the time, we will need to train those models further to extract the entities we need. Unlike the rule-based method, the machine learning approach is language independent: once prepared for one language, it can be replicated for others.

To Go Further

The CoNLL conferences (Computational Natural Language Learning) bring together the main actors of this field and frequently make annotated datasets available. They also publish research papers in this field. The task of automatically annotating text and understanding the links within it still has to be improved and explored. I’ll dig deeper into Natural Language Processing tasks in another article.