Reading and Categorizing Scanned Documents using Deep Learning

Shairoz Sohail
Aug 4, 2020


To many people’s dismay, there is still a giant wealth of paper documents floating around out there in the world. Tucked into corner drawers, stashed in filing cabinets, overflowing from cubicle shelves — these are headaches to keep track of, keep updated, and even just store. What if there existed a system where you could scan these documents, generate plain-text files from their contents, and automatically categorize them into high-level topics? Well, the technology to do all of this exists; it’s simply a matter of stitching the pieces together and getting them to work as a cohesive system, which is what we’ll be going through in this article. The main technologies used will be OCR (Optical Character Recognition) and topic modeling. Let’s get started!

The scariest thing I’ve seen (credit: Telegraph UK)

Collecting Data

The first thing we’re going to do is create a simple dataset so that we can test each portion of our workflow and make sure it’s doing what it’s supposed to. Ideally, our dataset will contain scanned documents of various levels of legibility and time periods, along with the high-level topic each document belongs to. I couldn’t locate a dataset with these exact specifications, so I got to work building my own. The high-level topics I decided on were government, letters, smoking, and patents. Random? Well, these were mainly chosen because a good variety of scanned documents was available for each of these areas. The wonderful sources below were used to extract the scanned documents for each of these topics:

Government/Historical: OurDocuments

Letters: LettersofNote

Patents: The Portal to Texas History (University of North Texas)

Smoking: Tobacco 800 Dataset

From each of these sources I picked 20 or so documents that were of a good size and legible to me, and put them into individual folders named after their topic.

After almost a full day of searching for and cataloging all the images, I resized them all to 600x800 and converted them into .PNG format. The finished dataset is available for download here.

Some of the scanned documents we will be analyzing

The simple resizing and conversion script is below:
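A minimal sketch of what that script can look like, using Pillow; the raw_documents and dataset folder names are placeholders for wherever the images actually live:

```python
import os
from PIL import Image

INPUT_DIR = "raw_documents"   # placeholder: one sub-folder per topic
OUTPUT_DIR = "dataset"

for topic in os.listdir(INPUT_DIR):
    topic_in = os.path.join(INPUT_DIR, topic)
    topic_out = os.path.join(OUTPUT_DIR, topic)
    os.makedirs(topic_out, exist_ok=True)
    for fname in os.listdir(topic_in):
        # Normalize every image to 600x800 and save it as a .png
        img = Image.open(os.path.join(topic_in, fname)).convert("RGB")
        img = img.resize((600, 800))
        base, _ = os.path.splitext(fname)
        img.save(os.path.join(topic_out, base + ".png"), "PNG")
```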

Building the OCR Pipeline

Optical Character Recognition is the process of extracting written text from images. This is usually done via machine learning models, most often through pipelines incorporating convolutional neural networks. While we could train a custom OCR model for our application, it would require far more training data and compute resources. Instead, we will utilize the fantastic Microsoft Computer Vision API, which includes a module specifically for OCR. You will need to register for a free-tier account (sufficient for use with document scanning); the API call consumes an image (as a PIL image) and outputs several bits of information, including the location/orientation of the text on the image as well as the text itself. The following function will take in a list of PIL images and output an equal-sized list of extracted texts:
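A sketch of that function, using the REST form of the API; the region in OCR_URL and the SUBSCRIPTION_KEY are placeholders you’d fill in from your own Azure account:

```python
import io
import requests
from PIL import Image

SUBSCRIPTION_KEY = "<your-subscription-key>"   # placeholder credentials
OCR_URL = "https://<your-region>.api.cognitive.microsoft.com/vision/v3.2/ocr"

def image_to_text(pil_image):
    """Send a single PIL image to the OCR endpoint and return its text."""
    buf = io.BytesIO()
    pil_image.save(buf, format="PNG")
    response = requests.post(
        OCR_URL,
        headers={
            "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
            "Content-Type": "application/octet-stream",
        },
        data=buf.getvalue(),
    )
    response.raise_for_status()
    # The JSON response nests the text as regions -> lines -> words
    analysis = response.json()
    words = [
        word["text"]
        for region in analysis.get("regions", [])
        for line in region["lines"]
        for word in line["words"]
    ]
    return " ".join(words)

def images_to_text(pil_images):
    """Take a list of PIL images, return an equal-sized list of texts."""
    return [image_to_text(img) for img in pil_images]
```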

Post-processing

Since we might want to end our workflow here in some instances, instead of just holding onto the extracted text as a giant list in memory, we can also write out the extracted texts into individual .txt files with the same names as the original input files. While the OCR technology from Microsoft is good, it will occasionally make mistakes. We can mitigate some of these using the SpellChecker module. The following script accepts an input and output folder, reads in all the scanned documents in the input folder, extracts their text using our OCR script, runs a spell check to correct misspelled words, and finally writes out the raw .txt files into the output folder.
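A sketch of that script, reusing the images_to_text() helper above; the spell check here uses the pyspellchecker package, and the function names are illustrative:

```python
import os
from PIL import Image
from spellchecker import SpellChecker

spell = SpellChecker()

def correct_text(text):
    """Replace words the spell checker doesn't recognize with its best guess."""
    corrected = []
    for word in text.split():
        fixed = spell.correction(word)
        corrected.append(fixed if fixed else word)
    return " ".join(corrected)

def ocr_folder(input_folder, output_folder):
    """OCR every image in input_folder into a matching .txt in output_folder."""
    os.makedirs(output_folder, exist_ok=True)
    fnames = sorted(os.listdir(input_folder))
    images = [Image.open(os.path.join(input_folder, f)) for f in fnames]
    texts = images_to_text(images)
    for fname, text in zip(fnames, texts):
        base, _ = os.path.splitext(fname)
        with open(os.path.join(output_folder, base + ".txt"), "w",
                  encoding="utf-8") as f:
            f.write(correct_text(text))
```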

Preparing Text for Topic Modeling

If our set of scanned documents is large enough, writing them all into one large folder can make them hard to sort through, and we likely already have some kind of implicit grouping in the documents (especially if they came from something like a filing cabinet). If we have a rough idea of how many different “types” or topics of documents we have, we can use topic modeling to help identify these automatically. This will give us the infrastructure to split the text identified by OCR into individual folders based on document content. The topic model we will be using is LDA (Latent Dirichlet Allocation), and there’s a great introduction to this type of model here. Running this model requires a bit more pre-processing and organizing of our data, so to prevent our scripts from getting too long and congested we will assume the scanned documents have already been read and converted to .txt files using the above workflow. The topic model will then read in these .txt files, classify them into however many topics we specify, and place them into the appropriate folders.

We’ll start off with a simple function that reads all the outputted .txt files in our folder into a list of (filename, text) tuples. This will help us keep track of the original filenames after we categorize the documents into topics.
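Something like the following does the trick (load_texts is an illustrative name):

```python
import os

def load_texts(txt_folder):
    """Read every .txt file in txt_folder into (filename, text) tuples."""
    docs = []
    for fname in sorted(os.listdir(txt_folder)):
        if fname.endswith(".txt"):
            with open(os.path.join(txt_folder, fname), encoding="utf-8") as f:
                docs.append((fname, f.read()))
    return docs
```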

Next, we will need to make sure all useless words (ones that don’t help us distinguish the topic of a particular document) are removed. We will do this using three different methods:

  1. Remove stopwords
  2. Strip tags, punctuations, numbers, and multiple whitespaces
  3. TF-IDF filtering

To achieve all of this (and train our topic model) we will use the Gensim package. The script below runs the necessary pre-processing steps on a list of texts (the output from the function above) and trains an LDA model.
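A sketch of that script is below; the low_value TF-IDF cutoff of 0.03 is an arbitrary assumption you’d want to tune for your own corpus:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, TfidfModel
from gensim.parsing.preprocessing import (
    preprocess_string, remove_stopwords, strip_multiple_whitespaces,
    strip_numeric, strip_punctuation, strip_tags,
)

# Lowercase, strip tags/punctuation/numbers/extra whitespace, drop stopwords
FILTERS = [lambda s: s.lower(), strip_tags, strip_punctuation,
           strip_numeric, strip_multiple_whitespaces, remove_stopwords]

def train_lda(named_texts, num_topics):
    """named_texts: list of (filename, text) tuples from load_texts()."""
    tokenized = [preprocess_string(text, FILTERS) for _, text in named_texts]
    dictionary = Dictionary(tokenized)
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

    # TF-IDF filtering: drop words that carry little weight in a document
    tfidf = TfidfModel(corpus, id2word=dictionary)
    low_value = 0.03   # assumption: tune this threshold for your corpus
    filtered_corpus = []
    for bow in corpus:
        weights = dict(tfidf[bow])
        filtered_corpus.append(
            [(wid, count) for wid, count in bow
             if weights.get(wid, 0) >= low_value])

    lda = LdaModel(filtered_corpus, num_topics=num_topics,
                   id2word=dictionary, passes=20)
    return lda, dictionary
```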

Using the Topic Model to Categorize Documents

Once we have our LDA model trained, we can use it to categorize our set of training documents (and future documents that might come in) into topics and then place them into the appropriate folders.

Using the trained LDA model on a new text string requires some fiddling (in fact, I needed some help figuring it out myself; thank god for SO). All of the complication is contained in the function below:
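A sketch of that function, reusing the FILTERS list and the dictionary from the training step above:

```python
from gensim.parsing.preprocessing import preprocess_string

def get_topic(text, lda, dictionary):
    """Return the index of the most probable topic for a raw text string."""
    bow = dictionary.doc2bow(preprocess_string(text, FILTERS))
    # get_document_topics returns (topic_id, probability) pairs
    topic_probs = lda.get_document_topics(bow)
    best_topic, _ = max(topic_probs, key=lambda pair: pair[1])
    return best_topic
```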

Finally, we’ll need another method to get the actual name of the topic based on the topic index.
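One simple approach, sketched below, is to name each topic after its few highest-weight words:

```python
def topic_name(lda, topic_index, n_words=3):
    """Build a folder-friendly name from a topic's top words."""
    top_words = lda.show_topic(topic_index, topn=n_words)
    return "_".join(word for word, _ in top_words)
```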

Putting it All Together

Now, we can stick all of the functions we wrote above into a single script that accepts an input folder, output folder, and topic count. The script will read all the scanned document images in the input folder, write them into .txt files, build an LDA model to find high level topics in the documents, and organize the outputted .txt files into folders based on document topic.
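A sketch of that driver script, wiring together the illustrative helpers defined above (ocr_folder, load_texts, train_lda, get_topic, topic_name):

```python
import os
import shutil

def process_documents(input_folder, output_folder, num_topics):
    # 1. OCR every scanned image into a spell-checked .txt file
    ocr_folder(input_folder, output_folder)

    # 2. Train an LDA topic model on the extracted texts
    named_texts = load_texts(output_folder)
    lda, dictionary = train_lda(named_texts, num_topics)

    # 3. Move each .txt file into a folder named after its topic
    for fname, text in named_texts:
        topic = topic_name(lda, get_topic(text, lda, dictionary))
        topic_dir = os.path.join(output_folder, topic)
        os.makedirs(topic_dir, exist_ok=True)
        shutil.move(os.path.join(output_folder, fname),
                    os.path.join(topic_dir, fname))

if __name__ == "__main__":
    process_documents("scans", "extracted_texts", num_topics=4)
```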

Demo

To prove all of the above wasn’t just long-winded gibberish, here’s a video demo of the system. There are many things that can be improved (most notably keeping track of line breaks from the scanned documents, handling special characters and languages other than English, and making requests to the Computer Vision API in batches instead of one by one), but we have a solid foundation to build on. For more information, check out the associated Github repo.

Thanks for reading!


Shairoz Sohail

AI scientist and researcher developing methods for automated visual recognition/understanding and detecting patterns in geospatial phenomena.