In this post, Digisalix machine-learning guru Tomi explains why appreciating different levels of structure can help reduce the data hunger of the machine-learning pipeline in intelligent document automation, leaving more time for people to enjoy the better things in life – like chocolate.
Structure is relative
Structured documents, like invoices or purchase orders, are designed with visual layouts that make them easy to read and fill in. When receiving an unexpected invoice, we can quickly scan through it for key pieces of information. It is enough for our eyes to fixate on “ACME Chocolate Webstore” to wonder where our credit card is and whether the kids have a tummy ache.
Even if the document is in a foreign language, we can often still identify relations and make educated guesses: reading, say, “Kokku: 1,000 dollarit” (Estonian for “Total: 1,000 dollars”) might raise the same unnerving feeling.
For computers, however, such seemingly structured documents are actually unstructured data, especially if relayed as images through mobile phone cameras or scanners. An image is a grid of numeric values representing the color at each location, and fails to elicit a shock reaction even if some of the pixels happen to form the text “$1,000”.
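To make this concrete, here is a minimal sketch of what the computer actually receives; the file name invoice.jpg is a hypothetical stand-in for any scanned invoice:

```python
from PIL import Image
import numpy as np

# Load the scan and convert it to grayscale: one brightness value per pixel.
image = np.asarray(Image.open("invoice.jpg").convert("L"))

print(image.shape)   # e.g. (2339, 1654) - just the grid dimensions
print(image[0, :5])  # e.g. [255 255 254 255 255] - numbers, no "$1,000" in sight
```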
To take appropriate actions – for example, categorizing the invoice line items to the correct accounts – computers need help in extracting the structured information.
How computers read
Like young children, computers start by recognising the letters and words. Technically, this happens through a process called optical character recognition (OCR), which identifies sequences of characters from the pixels of the image, turning them into snippets of text together with their positions. This is a fairly established technology, although the progress made with deep learning in visual perception tasks has revamped the methodology and improved accuracy.
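As an illustration, the open-source Tesseract engine can be called from Python via the pytesseract wrapper. A minimal sketch (again with the hypothetical file name invoice.jpg):

```python
import pytesseract
from PIL import Image

# Run OCR and get back, for each recognized word, its text and position.
data = pytesseract.image_to_data(
    Image.open("invoice.jpg"), output_type=pytesseract.Output.DICT
)

for text, left, top in zip(data["text"], data["left"], data["top"]):
    if text.strip():
        print(f"{text!r} at x={left}, y={top}")
```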
Unlike children, computers lack the shared frames of reference that people learn in everyday life. OCR turns pixels into text, but the text is still rather unstructured for a computer. “$1,000” is just a sequence of characters, even if a human would immediately associate it with a price, and possibly further with enough milk chocolate to fill a bathtub. Before being able to act on the data, the computer needs more lessons in the curriculum to graduate to recognizing higher-level structure.
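One small step up that curriculum is turning a character sequence like “$1,000” into a number the computer can actually compute with. A toy sketch, using a deliberately simplified pattern – real invoices involve many currencies, decimal conventions, and locales:

```python
import re

# Simplified: matches dollar amounts like "$1,000" or "$12,345.67".
AMOUNT = re.compile(r"\$\s*(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)")

def parse_amount(snippet: str) -> float | None:
    """Turn an OCR snippet like '$1,000' into a machine-usable number."""
    match = AMOUNT.search(snippet)
    return float(match.group(1).replace(",", "")) if match else None

print(parse_amount("$1,000"))  # 1000.0 - now a value the computer can act on
```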
Jigsaw puzzles and levels of structure
Some years ago, on a conference trip, I bought a jigsaw puzzle as a gift for the kids. It has 54 large pieces, illustrating the alphabet from A to Z and occupations starting with the corresponding letter. The pieces have small overlaps in color and texture with neighboring ones, so even young children who don’t know the alphabet can, with somewhat laborious work, put the puzzle together. Knowing your ABCs, however, you immediately know which piece goes where.
The current trend in machine learning is to use large quantities of data and computation together with flexible deep learning models to learn statistical relationships. The recipe is:
- Collect a large amount of raw data, like text documents.
- Invent ways to computationally generate tasks for the computer. For example, randomly mask a word in a sentence and ask the computer to predict the word given the context (a toy sketch of this masking step follows the list). Or give the computer the beginning of a sentence and ask it to complete it.
- Fit the parameters of the statistical model in small steps to improve the answers for the generated tasks.
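Here is the promised toy sketch of the masking step; the example sentence is our own:

```python
import random

def make_masked_task(sentence: str) -> tuple[str, str]:
    """Generate a self-supervised task: hide one word, keep it as the answer."""
    words = sentence.split()
    i = random.randrange(len(words))
    target = words[i]
    words[i] = "[MASK]"
    return " ".join(words), target

task, answer = make_masked_task("Total sum due on this invoice is $1,000")
print(task)    # e.g. "Total sum due on this [MASK] is $1,000"
print(answer)  # e.g. "invoice"
```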
We can map this approach of extracting structure from documents onto the jigsaw puzzle. The relatively easily learnable statistical structure of text is analogous to the color and texture of the pieces, while learning higher-level semantic or logical structure, such as knowing that the alphabet goes from A to Z, is more difficult: it requires longer context, higher-level abstraction, or knowledge outside the text. And it is possible at all only to the extent that the statistical structure is related to the semantic structure.
We teach computers to read better
Jigsaw puzzles can be notoriously difficult. Yet, after a few key pieces are provided, the big picture starts to become visible and the puzzle becomes markedly easier to solve. Our approach to document automation at Digisalix is to develop software that makes it easy to provide the crucial key pieces.
So while the computer struggles to assemble the pieces purely by looking at the raw data, the task is not hopeless:
- in document automation, the statistical and semantic structure are often relatively strongly related (“Total sum” is often immediately to the left of a corresponding “$1,000”), and
- semantic or logical structure can be encoded into statistical models through structural constraints or external knowledge – for example, the prices of line items must add up to the total sum (a sketch of such a consistency check follows below).
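A toy sketch of that second point, with hypothetical field names: reject any extraction whose line items do not explain the total.

```python
def totals_consistent(line_items: list[float], total: float,
                      tolerance: float = 0.01) -> bool:
    """Accept an extraction only if the line items add up to the total."""
    return abs(sum(line_items) - total) <= tolerance

extraction = {"line_items": [250.0, 750.0], "total": 1000.0}
print(totals_consistent(extraction["line_items"], extraction["total"]))  # True
```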
Machine learning can then complete the picture, using smaller, well-curated datasets that demand less human labor to collect and maintain at high quality. Small datasets also speed up development iterations and make intelligent document processing a convenient tool instead of a complicated AI project.
In the end, when all pieces fit snugly together, we can be confident about the final result and enjoy the downstream benefits of full automation (for example, take some time to help the kids with the chocolate).
–
To dive deeper into the history of natural language processing and for a balanced perspective on large language models, we recommend these two articles: