Spec Sheet Processing: Extracting Data from Semi-Structured Documents

Natural Language Processing (NLP) and Machine Learning (ML) are two of the most prominent branches of artificial intelligence today, and they are used across many industries to make interaction between people and computers more meaningful. For example, NLP can help make sense of unstructured or semi-structured data from various sources and formats. In this case study, we describe our approach to extracting fields from spec sheets for the manufacturing industry.

Challenge

Our task was to automatically extract data fields from spec sheets that often contained multiple model numbers with shared details and a table detailing the differences between them. Complicating matters further, the data in these sheets had been generated using an OCR service and machine learning algorithms, resulting in heterogeneous data and tables that weren't always in a clear format.

Solution

To address this challenge, we developed two distinct approaches:

Plan A

For the scenario when the client provides texts and categories, we match the manually filled text with the corresponding locations in the spec sheet. Our solution is a script that takes a PDF document and the manually filled data as input and outputs bounding boxes and page numbers for each manually filled value. The pipeline for text processing is as follows:

Step 1. Extract text with bounding boxes using custom OCR for scanned docs and a PDF reader for text-based docs.

Step 2. Select candidates for matching manual data (spec param names, values, and model names) with corresponding outputs from the OCR.

Step 3. Rank candidates for matches using Levenshtein distance and other methods.
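
The ranking in Step 3 can be illustrated with a minimal sketch. It covers only the Levenshtein part, not the "other methods", and it is not the production script: the function names levenshtein and rank_candidates, the lowercasing, and the length normalization are assumptions made for this example.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def rank_candidates(manual_text: str, candidates: List[str]) -> List[Tuple[str, float]]:
    """Sort OCR candidates by length-normalized edit distance (0.0 = identical)."""
    target = manual_text.strip().lower()
    scored = []
    for cand in candidates:
        text = cand.strip().lower()
        longest = max(len(target), len(text), 1)  # avoid division by zero
        scored.append((cand, levenshtein(target, text) / longest))
    return sorted(scored, key=lambda pair: pair[1])
```

Normalizing by the longer string keeps scores comparable between short model names and long parameter descriptions, so a single OCR error weighs less in a long string than in a short one.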

Plan B

For the scenario when the client provides only the expected spec categories for each document, we automatically fill in values and units of measurement for specs and predict their locations. Our solution is a script that takes a PDF document and the specified spec categories as input and outputs bounding boxes, page numbers, spec values, and units of measurement for each spec category. If the script is not confident enough, it outputs several candidates for human review. The pipeline for this scenario is as follows:

Step 1. Extract text with bounding boxes with the custom OCR for scanned docs and a PDF reader for text-based docs.

Step 2. Find approximate locations of spec values by extracting and clustering the locations of numbers, units of measurement, names of the categories, and their synonyms. This step requires several operations to find the most meaningful clusters:

First, we post-process the OCR output to form an index of words and map them to their bounding boxes (OCR provides both the texts and the respective bounding boxes).
Next, we search for numbers, units of measurement, spec tags, and synonyms of these words, which we look up in WordNet.
A cluster is a group of boxes that are close to each other. To detect clusters, we can choose one of two options:

Option a: Build a graph in which nodes are words and edges are weighted by the distances between their bounding boxes, then cluster it with Chinese whispers or another graph clustering algorithm.

Option b: For each word, count the words that lie within a predefined range (in pixels) of it. The words with the highest counts become the centers of clusters.

To limit the number of clusters and keep them close to the categories, we eliminate small clusters. Bounding boxes for clusters are the approximate locations for specs we are looking for.
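
The clustering idea in Step 2 can be sketched as follows. This is a simplified stand-in, not the Chinese whispers algorithm: boxes whose gap is below a threshold are linked, and the connected components of that graph are taken as clusters. The names box_gap and cluster_boxes, the (x0, y0, x1, y1) box layout, and the max_gap and min_size parameters are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def box_gap(a: Box, b: Box) -> float:
    """Gap between two axis-aligned boxes; 0.0 if they touch or overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return (dx * dx + dy * dy) ** 0.5


def cluster_boxes(boxes: List[Box], max_gap: float, min_size: int = 2) -> List[Box]:
    """Link boxes closer than max_gap, drop small groups, return group bounding boxes."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Union every pair of boxes that are close enough to each other.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if box_gap(boxes[i], boxes[j]) <= max_gap:
                parent[find(i)] = find(j)

    groups: Dict[int, List[Box]] = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)

    result = []
    for members in groups.values():
        if len(members) >= min_size:  # eliminate small clusters
            result.append((min(b[0] for b in members), min(b[1] for b in members),
                           max(b[2] for b in members), max(b[3] for b in members)))
    return sorted(result)
```

The pairwise loop is quadratic in the number of boxes, which is usually acceptable for a single page; a spatial index would be the natural next step for very dense documents.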

Step 3. Associate clusters with products. For this, we need to perform the following operations:

We start by exploiting the periodicity of the document structure to find a period, measured in lines of text (or pixels).
We collect distances between clusters by searching for pairs of element pairs with similar relative distances (e.g. for elements A, B, C, D with distance A-B similar to distance C-D).
We then extract a period by averaging distance A-C across all pairs of pairs. One period contains data about one product.
We can group clusters by the obtained periods and return results. To enhance results, we similarly exploit periodicity in types of words (numbers, units of measurement, or spec categories).
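
A minimal one-dimensional sketch of this period search is shown below, using only the vertical positions of clusters. The function names estimate_period and group_by_period and the tolerance parameter are hypothetical. The sketch also makes two simplifying assumptions: only neighboring elements form a pair, and the gaps inside one product block differ from each other, so the nearest repeated gap marks the start of the next product.

```python
from statistics import mean
from typing import List, Optional


def estimate_period(positions: List[float], tolerance: float = 2.0) -> Optional[float]:
    """Average the A-C offset over pairs (A, B), (C, D) whose gaps A-B and C-D match."""
    ys = sorted(positions)
    gaps = [b - a for a, b in zip(ys, ys[1:])]
    offsets = []
    for i, gap_ab in enumerate(gaps):
        for j in range(i + 1, len(gaps)):
            if abs(gaps[j] - gap_ab) <= tolerance:
                offsets.append(ys[j] - ys[i])  # distance A-C
                break  # keep only the nearest repetition
    return mean(offsets) if offsets else None


def group_by_period(positions: List[float], period: float) -> List[List[float]]:
    """Assign each position to a product block of height `period` (must be > 0)."""
    ys = sorted(positions)
    groups: List[List[float]] = []
    for y in ys:
        index = int((y - ys[0]) // period)
        while len(groups) <= index:
            groups.append([])
        groups[index].append(y)
    return groups
```

For example, positions 0, 12, 40, 100, 112, 140, 200, 212, 240 contain the gap sequence 12, 28, 60 three times, so every matched pair of pairs votes for an offset of 100, and grouping by that period yields one block per product.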

Step 4. Select clusters that potentially contain necessary specs and extract information from them:

As a first step, we search for a number, a unit of measurement, or the spec category within the same cluster or its neighbor clusters.
If possible, we detect if the cluster is a table and change the extraction algorithm for it.
To enhance the extraction quality, we compare the product specs to other products in the document.
We use textual patterns, such as "number x number x number", which commonly denotes the dimensions of a product.
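
The dimension pattern mentioned in Step 4 can be sketched with a regular expression. The name parse_dimensions, the short unit list, and the treatment of a comma as a decimal separator are assumptions made for this example; real spec sheets would likely need a longer unit list and locale handling.

```python
import re
from typing import Optional, Tuple

NUMBER = r"(\d+(?:[.,]\d+)?)"
SEPARATOR = r"\s*[x\u00d7]\s*"  # the letter "x" or the multiplication sign

DIMENSIONS = re.compile(
    NUMBER + SEPARATOR + NUMBER + SEPARATOR + NUMBER + r"\s*(mm|cm|m|in)?\b",
    re.IGNORECASE,
)


def parse_dimensions(text: str) -> Optional[Tuple[Tuple[float, ...], Optional[str]]]:
    """Return ((a, b, c), unit) for the first 'a x b x c' pattern, or None."""
    match = DIMENSIONS.search(text)
    if match is None:
        return None
    # Assumption: a comma inside a number is a decimal separator.
    values = tuple(float(g.replace(",", ".")) for g in match.groups()[:3])
    unit = match.group(4).lower() if match.group(4) else None
    return values, unit
```

The unit group is optional, so the same expression also matches compact forms such as 600X400X1200, where the unit has to be recovered from a neighboring cluster or a table header.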

Additional useful tricks

To increase the accuracy of clustering and information extraction, we use several tricks, such as extracting important words for each document and generating distributions of spec values to detect anomalies. Overall, our approach allowed us to extract data from semi-structured spec sheets in a fast, reliable, and scalable way.
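
The distribution-based anomaly check mentioned above could look like the following sketch. The case study does not say which statistic was used; this example assumes a robust z-score based on the median and the median absolute deviation (MAD), and the name flag_anomalies and the threshold of 3.5 are hypothetical choices.

```python
from statistics import median
from typing import List


def flag_anomalies(values: List[float], threshold: float = 3.5) -> List[bool]:
    """Flag values whose robust z-score (median/MAD based) exceeds the threshold."""
    center = median(values)  # raises StatisticsError on an empty list
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # Degenerate case: more than half of the values are identical.
        return [v != center for v in values]
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [0.6745 * abs(v - center) / mad > threshold for v in values]
```

Applied to the extracted values of one spec category across many products, a flagged value such as 1250.0 among values near 12 would point to a likely OCR or unit error and could be routed to human review.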

Impact

We've developed a system that can extract fields from spec sheets using an algorithmic pipeline. Although a pure ML solution based on attention models is possible, it may be less reliable and more time-consuming.

Tech Stack

Python 3.6
NumPy
SciPy
Scikit-learn
Gensim
PostgreSQL
Docker
AWS
