We turn your ideas into math and software

How does Digisalix extract documents?

The in-house technology we have built at Digisalix makes extracting information from documents easy. In this post we show a couple of examples of how to define a field, that is, a piece of information we want to look up in a set of documents.

The process starts with creating a Python file in which we define the fields we want to extract from the documents. For each field, a few lines of Python describe what we are looking for and give hints to the Doclib engine, our software library for structured data extraction from documents. The Doclib engine then calibrates the model for the field using the hints and a set of training documents.

Below is an example of how the company name and the company identification number are specified.

We start by defining the company ID variable. Then we state that the text “company ID” probably appears nearby and give the regular expression for the ID format as a hint. Finally we add a parser that extracts the ID from a string. The same parser is also registered in a Parseable structural prior.

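# Declare the field to extract.
company_id = model.add_var("company_id")
# Hint: the text "company ID" probably appears nearby.
model.add_prior(company_id, FuzzyText(["company ID"]))
# Hint: the value matches the regular expression for the ID format.
model.add_prior(company_id, RegEx(company_id_regex))
# Extractor: parse_company_id turns the found text into the ID.
model.add_extractor(
    SingleBoxTextExtractor(
        company_id,
        parse_company_id
    )
)
# Structural prior built on the same parser.
model.add_structural_prior(
    energy_functions.Parseable(
        f"{company_id.name}_parser",
        company_id,
        parse_company_id
    )
)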
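
The snippet above uses company_id_regex and parse_company_id without showing them. As a minimal sketch, and assuming the ID follows the format of a Finnish Business ID (seven digits, a hyphen and a check digit), they might look something like the following. Both definitions are illustrative, not the actual Digisalix code, and returning None for an unparseable string is an assumption about what the parser interface expects.

```python
import re
from typing import Optional

# Assumed format: seven digits, a hyphen and one check digit, e.g. "1234567-1".
company_id_regex = r"\b\d{7}-\d\b"

# Check digit weights for the seven leading digits.
_WEIGHTS = (7, 9, 10, 5, 8, 4, 2)


def parse_company_id(text: str) -> Optional[str]:
    """Return the first ID in text whose check digit is valid, else None."""
    for match in re.finditer(company_id_regex, text):
        candidate = match.group(0)
        digits = [int(ch) for ch in candidate[:7]]
        remainder = sum(w * d for w, d in zip(_WEIGHTS, digits)) % 11
        if remainder == 1:
            continue  # no valid check digit exists for this remainder
        expected = 0 if remainder == 0 else 11 - remainder
        if expected == int(candidate[-1]):
            return candidate
    return None


print(parse_company_id("Company ID: 1234567-1"))  # 1234567-1
print(parse_company_id("Company ID: 1234567-2"))  # None
```

Validating the check digit in the parser means that a string which merely looks like an ID, such as a misread digit from OCR, is rejected rather than extracted.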

Next the housing corporation name variable (company_name in the code) is defined. We then add a hint that the text “company” or “name” likely appears nearby. Because the information in this case can be in Finnish or Swedish, we simply add the relevant keywords in those languages as well. Finally we add an extractor that cleans up any extra characters from the string. For example, the string “name: ACME Ltd.” would be cleaned to “ACME Ltd.”.

company_name = model.add_var("company_name")
# Weighted keywords in English, Finnish and Swedish.
model.add_prior(
    company_name,
    FuzzyText(
        {
            "name": 1.0,
            "company": 1.0,
            "As Oy": 1.0,
            "asunto oy": 1.0,
            "bost ab": 0.8,
            "bostads ab": 0.9,
            "Bab": 0.8
        },
        case_sensitive=False
    )
)
model.add_prior(company_name, FuzzyText({"service": -1.0}))
model.add_extractor(
    SingleBoxTextExtractor(
        company_name, 
        parse_company_name
    )
)
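
The clean-up step described above could be sketched roughly as follows. This parse_company_name is a hypothetical stand-in for the real function, which the post does not show. It assumes the label is one of a few known words followed by a colon or hyphen, and that returning None signals an empty result.

```python
import re
from typing import Optional

# Assumed label words; a label only counts if a colon or hyphen follows it.
_LABEL_PREFIX = re.compile(
    r"^\s*(?:company\s+name|company|name)\s*[:\-]\s*",
    re.IGNORECASE,
)


def parse_company_name(text: str) -> Optional[str]:
    """Strip a leading label and stray separators from a company name."""
    cleaned = _LABEL_PREFIX.sub("", text)
    # Collapse runs of whitespace, then trim separators left at the edges.
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" \t:;,|")
    return cleaned or None


print(parse_company_name("name: ACME Ltd."))  # ACME Ltd.
```

The trailing period is deliberately left out of the stripped characters so that abbreviations such as “Ltd.” survive the clean-up.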

To improve the confidence of the system, we can perform checks between variables. For example, the list of housing corporations and their IDs is public information. In the code below we check that the extracted name and ID match what is listed in the official registry. We also check that the extracted company ID appears in the list. Checks like these, which compare against external information sources, are easy to add.

model.add_structural_prior(
    company_id_and_name_match(
        "company_id_and_company_name",
        company_id, 
        company_name
    )
)
model.add_structural_prior(
    company_id_in_list("company_id_in_list", company_id)
)
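
The post does not show what happens inside company_id_and_name_match and company_id_in_list, or how their results feed into the model. The lookups themselves, however, could be as simple as the sketch below. The REGISTRY dictionary and both function names are hypothetical, and the entries are made-up placeholders rather than real registry data.

```python
from typing import Dict, Optional

# Hypothetical registry excerpt mapping company ID to registered name.
# Real data would be loaded from the public registry.
REGISTRY: Dict[str, str] = {
    "1234567-1": "As Oy Esimerkki",
    "7654321-2": "Bostads Ab Exempel",
}


def _normalize(name: str) -> str:
    """Lower-case the name and collapse whitespace before comparing."""
    return " ".join(name.casefold().split())


def id_in_registry(company_id: Optional[str]) -> bool:
    """True if the extracted ID is listed in the registry."""
    return company_id in REGISTRY


def id_and_name_match(company_id: Optional[str],
                      company_name: Optional[str]) -> bool:
    """True if the registry lists this name under this ID."""
    registered = REGISTRY.get(company_id)
    if registered is None or company_name is None:
        return False
    return _normalize(registered) == _normalize(company_name)
```

A check of this kind ties two independently extracted fields together, so an error in either one is likely to show up as a mismatch.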

The description of the fields should be based on observable properties of the documents, such as keywords and relations between variable values. All the hard parts, like machine learning, are encapsulated in the Doclib engine.
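
To give a feel for what a keyword hint means as an observable property, here is a toy scoring function built on Python's standard difflib module. It is not how Doclib implements FuzzyText, whose internals are not described in this post. The function name, the 0.8 similarity threshold and the way weights are summed are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict


def keyword_score(nearby_text: str,
                  weighted_keywords: Dict[str, float],
                  threshold: float = 0.8) -> float:
    """Sum the weights of keywords that approximately occur in nearby_text."""
    words = nearby_text.lower().split()
    score = 0.0
    for keyword, weight in weighted_keywords.items():
        n = len(keyword.split())
        # Compare the keyword with every run of n consecutive words.
        windows = [" ".join(words[i:i + n])
                   for i in range(len(words) - n + 1)]
        best = max(
            (SequenceMatcher(None, keyword.lower(), w).ratio()
             for w in windows),
            default=0.0,
        )
        if best >= threshold:
            score += weight * best
    return score


hints = {"asunto oy": 1.0, "service": -1.0}
print(keyword_score("Asumto Oy Esimerkki", hints) > 0)  # typo still matches
print(keyword_score("Service charge", hints) < 0)       # negative keyword
```

In this toy version a positive weight raises the score when the keyword is found nearby despite small spelling errors, and a negative weight lowers it.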

To read more about our document extraction platform, please refer to our recent case study on automating the processing of building certificates for a Finnish commercial bank.

More posts by Tommi Pesu