Natural Language Processing (NLP) and Machine Learning (ML) are among the most widely applied branches of artificial intelligence today. They appear wherever people interact with computers, and they make that interaction more meaningful. For example, NLP can help make sense of unstructured or semi-structured data from a variety of sources and formats. In this case study, we describe our approach to extracting fields from spec sheets for the manufacturing industry.
Our task was to automatically extract data fields from spec sheets that often contained multiple model numbers with shared details and a table detailing the differences between them. Complicating matters further, the data in these sheets had been generated using an OCR service and machine learning algorithms, resulting in heterogeneous data and tables that weren't always in a clear format.
To address this challenge, we developed two distinct approaches:
Plan A
For the scenario where the client provides texts and categories, we match each manually filled text with its location in the spec sheet. Our solution is a script that takes a PDF document and the manually filled data as input and outputs a bounding box and page number for each manually filled value. The pipeline for text processing is as follows:
Step 1. Extract text with bounding boxes using custom OCR for scanned docs and a PDF reader for text-based docs.
Step 2. Select candidates for matching manual data (spec param names, values, and model names) with corresponding outputs from the OCR.
Step 3. Rank candidates for matches using Levenshtein distance and other methods.
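The ranking in Step 3 can be illustrated with a minimal sketch. It assumes that OCR output arrives as a list of dictionaries with hypothetical `text`, `page`, and `bbox` keys, and it scores candidates with a length-normalized Levenshtein similarity only; the "other methods" mentioned above are not shown, and the sample data is invented for illustration.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def rank_candidates(manual_value: str, ocr_words: List[dict]) -> List[Tuple[float, dict]]:
    """Sort OCR words by similarity to the manually filled value, best first."""
    target = manual_value.strip().lower()
    scored = []
    for word in ocr_words:
        text = word["text"].strip().lower()
        longest = max(len(target), len(text)) or 1
        similarity = 1.0 - levenshtein(target, text) / longest
        scored.append((similarity, word))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical OCR output: text, page number, and (x0, y0, x1, y1) box in pixels.
ocr_words = [
    {"text": "Voltage", "page": 1, "bbox": (40, 100, 95, 112)},
    {"text": "23OV", "page": 1, "bbox": (120, 100, 150, 112)},  # OCR misread of "230V"
    {"text": "Weight", "page": 2, "bbox": (40, 300, 90, 312)},
]

best_score, best_word = rank_candidates("230V", ocr_words)[0]
print(best_word["page"], best_word["bbox"], round(best_score, 2))
```

Normalizing by the longer string keeps scores comparable between short values such as "230V" and longer parameter names, which makes a single acceptance threshold easier to choose.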
Plan B
For the scenario where the client provides only the expected spec categories for each document, we automatically fill in values and units of measurement for the specs and predict their locations. Our solution is a script that takes a PDF document and the specified spec categories as input and outputs bounding boxes, page numbers, spec values, and units of measurement for each spec category. If the script is not confident enough, it outputs several candidates for human review. The pipeline for this scenario is as follows:
Step 1. Extract text with bounding boxes using custom OCR for scanned docs and a PDF reader for text-based docs.
Step 2. Find approximate locations of spec values by extracting and clustering the locations of numbers, units of measurement, category names, and their synonyms. Finding the most meaningful clusters can be done in one of two ways:
— Option a: Build a graph in which nodes are words and edge weights are derived from the distances between their bounding boxes, then cluster it with Chinese Whispers or another graph clustering algorithm.
— Option b: For each word, count the words that lie within a predefined range (in pixels) of it. The words with the highest counts become the centers of clusters.
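Option b above can be sketched in a few lines. This is an illustration under stated assumptions rather than the production code: boxes are hypothetical `(x0, y0, x1, y1)` tuples in pixels, distance is measured between box centers, and clusters are formed greedily by letting the densest unassigned word claim every unassigned word within the radius. The function names and the sample coordinates are invented for the example.

```python
from math import hypot
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def center(box: Box) -> Tuple[float, float]:
    """Return the midpoint of a bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def distance(p: Tuple[float, float], q: Tuple[float, float]) -> float:
    return hypot(p[0] - q[0], p[1] - q[1])


def neighbor_counts(boxes: List[Box], radius: float) -> List[int]:
    """For each word, count the other words whose centers lie within `radius`."""
    centers = [center(b) for b in boxes]
    return [
        sum(1 for j, q in enumerate(centers) if i != j and distance(p, q) <= radius)
        for i, p in enumerate(centers)
    ]


def cluster_by_density(boxes: List[Box], radius: float) -> Dict[int, List[int]]:
    """Map each cluster-center index to the sorted indices of its member words."""
    centers = [center(b) for b in boxes]
    counts = neighbor_counts(boxes, radius)
    # Visit the densest words first; break ties by original order.
    order = sorted(range(len(boxes)), key=lambda i: (-counts[i], i))
    assigned: Dict[int, int] = {}
    for i in order:
        if i in assigned:
            continue
        assigned[i] = i  # word i becomes a cluster center
        for j in range(len(boxes)):
            if j not in assigned and distance(centers[i], centers[j]) <= radius:
                assigned[j] = i
    clusters: Dict[int, List[int]] = {}
    for word_index, center_index in assigned.items():
        clusters.setdefault(center_index, []).append(word_index)
    return {c: sorted(members) for c, members in clusters.items()}


# Hypothetical word boxes: three words on one row, two words far below them.
boxes = [
    (40, 100, 95, 112),
    (120, 100, 150, 112),
    (160, 100, 180, 112),
    (40, 400, 90, 412),
    (100, 400, 140, 412),
]
counts = neighbor_counts(boxes, radius=80)
clusters = cluster_by_density(boxes, radius=80)
print(counts, clusters)
```

The radius is the main tuning knob: too small and a value gets separated from its unit, too large and neighboring table rows merge into one cluster.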
Step 3. Associate clusters with products. For this, we need to perform the following operations:
Step 4. Select clusters that potentially contain necessary specs and extract information from them:
Additional useful tricks
To increase the accuracy of clustering and information extraction, we use several tricks, such as extracting important words for each document and generating distributions of spec values to detect anomalies. Overall, our approach allowed us to extract data from semi-structured spec sheets in a fast, reliable, and scalable way.
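As an illustration of the anomaly check on spec value distributions, here is a minimal sketch. The article does not say which statistic is used, so this example assumes a modified z-score based on the median and the median absolute deviation (MAD), with the conventional 0.6745 scaling factor and a 3.5 cutoff; the function names and the sample voltages are hypothetical.

```python
from statistics import median
from typing import List


def robust_z_scores(values: List[float]) -> List[float]:
    """Modified z-scores: 0.6745 * (value - median) / MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All typical values are identical; anything different is infinitely far away.
        return [0.0 if v == med else float("inf") for v in values]
    return [0.6745 * (v - med) / mad for v in values]


def flag_anomalies(values: List[float], threshold: float = 3.5) -> List[int]:
    """Return the indices of values whose modified z-score exceeds the threshold."""
    return [i for i, z in enumerate(robust_z_scores(values)) if abs(z) > threshold]


# Hypothetical voltages extracted for one spec category across several documents.
# The last value looks like an OCR slip (an extra digit).
voltages = [220.0, 230.0, 230.0, 240.0, 225.0, 2300.0]
suspicious = flag_anomalies(voltages)
print(suspicious)
```

Median-based statistics are used here because a single badly extracted value would inflate a mean and standard deviation enough to hide itself; flagged values can then be routed to human review.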
We've developed a system that can extract fields from spec sheets using an algorithmic pipeline. Although a pure ML solution based on attention models is possible, it may be less reliable and more time-consuming.