Teaching computers to see, read, and understand
November 3, 2020
Full-Stack Optical Character Recognition
Advancing into the future sometimes means diving into the past. When ATB sought to find ways to extract information from a deluge of analog data hidden in decades’ worth of historical paperwork, we taught machines to read - shrinking years of work down into a matter of hours. In this whitepaper, Rhys Chouinard and Tyler Dauphinee of the ATB AI Guild break down the process we undertook to develop and test Optical Character Recognition.
ATB Financial has been providing banking services to Albertans for over 80 years. As one can imagine, in a long-standing financial institution, there is a lot of legacy data. The main method for tracking interactions with customers in the past was paper documents. Lots and lots of paper documents.
Most of these paper documents were signed, scanned, and stored by ATB Financial in massive digital data warehouses. Occasionally, these files need to be accessed to extract some relevant piece of information. Perhaps a customer closes an account, and then returns later, and the original account information needs to be retrieved. Maybe a regulation change requires the retrieval and examination of old account files to extract data about a transaction or loan.
In any case, having a human find the right scanned document, open it, read through it, extract some data points, and enter them into a new system is not a challenging task. But it is time consuming, and usually boring. Even though a person is well equipped for these tasks, their simple and repetitive nature means a computer may be better suited to the job. Delegating such arduous tasks frees up human team members to spend their time on other (more complex) tasks, such as building stronger relationships with customers.
While searching for information in legacy documents may not be an issue for a one-off request, a challenge arises when this task needs to be done many times. Take for example the business problem posed to ATB’s Enterprise Data Science (EDS) team (within the Artificial Intelligence Guild) by another group at ATB Financial. This group needed to retrieve information that was locked away in a legacy datastore on a massive scale.
The challenge involved sifting through roughly 50,000 scanned documents containing customer account details such as customer name, account number, address, document date, and so on. Each scanned document had an unknown length, containing anywhere from one page to over one hundred pages. Each document had an unknown scan quality - it may have been a recent scan of perfect quality, or a 15-year-old scan from a poor-quality flatbed scanner.
The individual pages themselves could have been scanned right-side up, upside down, sideways, or at any angle in between. Many pages in a document are irrelevant to the task at hand - standard terms-of-service paragraphs, for example - with the key account information appearing on only one or two pages of the entire report. In order to determine if ATB could provide a new type of banking product to its customers, these 50,000 scanned documents needed to be opened, read through, and examined closely for specific information. Once the relevant information was located (account numbers, dates, names, addresses, and dollar values), it needed to be extracted and stored in a flat file for analysis.
An additional challenge was that not every document was the same file type. Many files were TIFFs, but some were PDFs or PNG files. Making the situation even more challenging, none of these documents had “embedded text” - all were essentially pictures of the text.
The group in charge of extracting this information began by having ATB employees manually search the data store, find the right file, open the file, read through it, and extract the data, typing it into a new database. But, it quickly became apparent that this was inefficient, unenjoyable, and … it was going to take a looong time.
It was estimated (using the rate at which staff were able to complete the task on a small sample) that it would take 10 full-time employees roughly 4 years to accomplish this task:
- Open 50,000 documents
- Sift through 600,000 pages
- Look over 300,000,000 words
- Extract about 250,000 key data points
Not a very fun task for a group of employees to find the needles in the haystack. This is when EDS was enlisted to find a better way. At its core, this is a classic information retrieval problem.
The Problem, Distilled: Information retrieval is a class of problems where a small amount of relevant data needs to be extracted from a document or database. In this example, the task involved detecting what are called “key value pairs” (KVPs). A KVP is a pair of data points such as:
Key: Property Address
Value: 123 Jasper Ave, Edmonton, AB
The ‘Key’ is the phrase that identifies the meaning of the corresponding value. Figure 1.1 shows an example of Key Value Pairs, along with the information retrieval steps to be completed:
- open the document
- find the Keys
- extract the Values
- move them to a database where they can be acted upon by either people or by other machines.
To solve this daunting technical challenge, a full-stack optical character recognition solution was created on Google Cloud Platform.
A “stack” is just a number of algorithms or systems connected together to accomplish a multi-step task. Optical character recognition (OCR) is key to this particular stack. OCR refers to a class of algorithms/applications that extract text from images, in essence “reading” the words off a picture.
Page Classification (i.e. narrowing the search)
To begin, each of the 50,000 scanned documents was loaded into memory and split into its constituent pages. Every page was examined by a machine learning classifier to determine whether it was a KVP-containing page. If not, it was ignored for the remainder of the process.
To develop this page classifier, the team first generated a training set by extracting a numerical feature vector from each page, between 100 and 500 features long depending on the parameters chosen. In short, each page image was distilled down to a single vector describing its characteristics. Each feature may not carry much significance on its own, but as a whole, the feature vector describes whether the page contained tables, boxes, pictures and text, or only text, whether it used various sizes of text versus only one size, and so on.
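To make this concrete, here is a minimal sketch of what distilling a page image into a feature vector might look like, assuming the pages are available as image files. The features below (ink density, a count of connected components, and a count of long straight lines as a rough proxy for tables and boxes) are illustrative stand-ins, not the team's actual feature set:

```python
import cv2
import numpy as np

def page_features(image_path):
    """Distill one scanned page into a small numeric feature vector (illustrative only)."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # Binarize so that "ink" is white (255) on a black background.
    binary = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY_INV, 25, 15)

    # Fraction of inked pixels: dense text pages score higher than mostly-blank ones.
    ink_density = binary.mean() / 255.0

    # Rough count of distinct marks (characters, checkboxes, signatures, noise).
    n_components, _ = cv2.connectedComponents(binary)

    # Long straight lines hint at tables, boxes, and form fields.
    lines = cv2.HoughLinesP(binary, 1, np.pi / 180, 200,
                            minLineLength=img.shape[1] // 3, maxLineGap=5)
    n_long_lines = 0 if lines is None else len(lines)

    return np.array([ink_density, n_components, n_long_lines], dtype=float)
```

In the production stack, a vector like this (with far more features) would be computed for every page and stacked into a training matrix.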
Once this training set of feature vectors was generated, the machine learning classifier for differentiating pages could be trained. Utilizing the powerful open source library scikit-learn, a variety of machine learning models were tested: random forests, support vector machines (SVM), logistic regressions, and a few neural networks (fully connected, with varying depths).
In the end, it was a combination of a few classifiers that yielded the best results, based on plain (vanilla) accuracy. Building one model out of a combination of classifiers is known as an ‘ensemble classifier,’ where the whole is greater than the sum of its parts.
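As a rough illustration of how such an ensemble could be assembled with scikit-learn, the following sketch combines three of the model families mentioned above in a soft-voting ensemble; the synthetic data, model choices, and hyperparameters are assumptions for the example, not ATB's actual configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data: one ~200-feature vector per page, labelled 1 if the page contains KVPs.
X, y = make_classification(n_samples=2000, n_features=200, n_informative=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average predicted probabilities rather than taking hard votes
)
ensemble.fit(X_train, y_train)
print("page-classifier accuracy:", accuracy_score(y_test, ensemble.predict(X_test)))
```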
Image Cleaning (i.e. getting a better view)
At this point in the stack, there was a subset of the initial 600,000 pages (those classified as containing KVPs) that now needed to be “read”. Before that could be done, these images needed to be cleaned. A number of OpenCV submodules were leveraged for de-skewing (fixing slightly crooked scans), cleaning (removing artifacts from the scanning process) and sharpening (bringing the image into focus). The cleaning stage is critical for OCR performance - feeding upside-down scans with lots of scanner artifacts through an OCR algorithm will net poor results, or no results at all!
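A minimal OpenCV sketch of that kind of clean-up is shown below; the de-skew recipe (estimating the dominant angle of the inked pixels), the median-blur de-noising, and the sharpening kernel are generic examples, not the exact routines or parameters used in the production stack:

```python
import cv2
import numpy as np

def clean_page(gray):
    """De-skew, de-noise, and sharpen one grayscale page image (a rough sketch)."""
    # De-skew: estimate the dominant angle of the inked pixels and rotate back.
    binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:  # minAreaRect's angle convention varies between OpenCV versions
        angle -= 90
    h, w = gray.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    deskewed = cv2.warpAffine(gray, M, (w, h),
                              flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)

    # De-noise: a median blur knocks out salt-and-pepper scanner artifacts.
    denoised = cv2.medianBlur(deskewed, 3)

    # Sharpen: a simple unsharp-style kernel to crisp up character edges for OCR.
    kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
    return cv2.filter2D(denoised, -1, kernel)
```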
Optical Character Recognition (i.e. a lifetime of reading)
The next step in the stack was the “reading” part - the Optical Character Recognition. The data science team utilized the open source “Tesseract OCR” package, which was birthed as a sponsored PhD project in 1984(!) at HP Labs, Bristol. It was open sourced in 2005, and researchers at Google began supporting and updating the package. The repository itself is still very active, and had a major update in 2016 to include an LSTM (Long Short-Term Memory) line recognizer. The inner workings of this OCR engine are extremely interesting, and truly doing them justice would require a paper of its own (in fact, Google’s own Ray Smith has written just that).
The steps that the Tesseract package takes to ‘read text’ are very similar to how humans read and understand text, although Tesseract involves quite a bit more math behind the scenes…
How Machines “See”
The first step in reading a document is to parse the characters; in layman’s terms, this is equivalent to the computer learning its ABCs. Just like a human, Tesseract has to learn what each letter in the alphabet is before it can read. It does this by approximating the overall shape of a character image with a series of line segments. It then compares this character outline to “ideal characters” and finds the best match. This match needs to be done not only on characters but also on fonts, since different fonts can be visually very different.
- character image to line segments (3D vector <x,y,angle>)
- line segments to idealized edges (4D vector <x,y,angle,length>)
- matching of idealized edges to a character pattern

This intermediate 4D vector is where the OCR package gets its name, derived from a 4D “cube” known as a Tesseract.
Recognizing Words:
After learning their ABCs, humans begin to learn that letters string together to form words, and eventually we begin to understand what words should look like given how the words sound when spoken.
Tesseract also has to learn this, and it does so in a very elegant way. In an ideal world, every character would be printed perfectly and one could slide a “window” over the image capturing all the characters. However, the world of printed text is not perfect; characters are often incomplete, distorted, or squeezed together. Just like a human, the Tesseract package uses the context of nearby letters to derive meaning and guess at the right identity of characters.
It does this by first segmenting the image based on visual content into lines, then for each of those lines it traces a sliding window over the candidate characters, getting all the possible character combinations corresponding to that line. This results in a directed acyclic graph (DAG) which can then be analyzed to determine the most likely word represented by the lines.
KVP Extraction (i.e. Reading is easy, understanding is hard)
Utilizing Tesseract, the team extracted the text from the relevant pages, reducing the problem of searching through images to a problem of searching through raw text.
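The whitepaper does not specify how Tesseract was invoked, but in Python this is commonly done through the pytesseract wrapper. The sketch below (with a hypothetical cleaned-page filename) shows both plain-text extraction and word-level confidences, the latter being useful when positional ‘landmarks’ matter later in the stack:

```python
import pytesseract           # Python wrapper around the Tesseract OCR engine
from PIL import Image

# A hypothetical cleaned page image produced by the earlier preprocessing step.
page = Image.open("cleaned_page_001.png")

# Plain text, good enough for downstream regex / NER searching.
text = pytesseract.image_to_string(page)

# Word-level boxes and confidences, useful when page position matters.
data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)
for word, conf in zip(data["text"], data["conf"]):
    if word.strip():
        print(f"{word}\t(confidence {conf})")
```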
Making Sense of it All

Even though it is straightforward to extract text from images given the current ubiquity of OCR tools (like Tesseract), the challenge lies in making sense of the mountain of data that is extracted.
To overcome this challenge, one must understand the process that humans undertake to find the right data, and then translate that process into instructions for a computer. The process for finding KVPs can be described abstractly as a landmark search. When opening a document with the express purpose of locating a single KVP, humans typically look for ‘landmarks’, either visual or textual, that quickly narrow the search to a specific location on a page.
A simple example of this is reading a receipt from a store. An adult human knows from years of reading receipts that there must be a total, and there must be an item listing with corresponding amounts, so when a human wants to know the cost of a single item, they narrow their search to the item list on the receipt, then look for the word identifying their item of interest.
This same ‘landmark identification’ process as a way of narrowing down a text search was utilized for KVP extraction.
Regular Expressions
The KVP extraction process began with another ubiquitous data science tool, known as regular expressions (regex). Regular expressions are pattern-searching instructions used to hunt through lines of text and identify unique words or character strings.
The KVP search did not require machine learning; instead, the data scientists combed through a subset of the documents and carefully crafted custom patterns (sketched after the list below) to capture a number of fields from the documents, such as:
- Appraisal Value (dollar amount)
- Legal address
- Dates
- Customer name
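The patterns and the sample line in the sketch below are purely illustrative; the hand-tuned production patterns were built against the real documents and are not reproduced here:

```python
import re

# Hypothetical patterns illustrating the regex approach to KVP extraction.
PATTERNS = {
    "appraisal_value": re.compile(r"appraised\s+value[:\s]*\$?\s*([\d,]+(?:\.\d{2})?)", re.IGNORECASE),
    "document_date":   re.compile(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b"),
    "account_number":  re.compile(r"account\s*(?:no\.?|number)[:\s]*(\d{6,12})", re.IGNORECASE),
}

def extract_kvps(page_text):
    """Return a dict of {key: first matching value} for one page of OCR'd text."""
    found = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(page_text)
        if match:
            found[key] = match.group(1)
    return found

sample = "Account No: 00123456   Appraised Value: $312,500.00   Date: 04/15/2008"
print(extract_kvps(sample))
# {'appraisal_value': '312,500.00', 'document_date': '04/15/2008', 'account_number': '00123456'}
```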
Named Entity Recognition

Some KVPs were hard to capture with the regex technique, so a slower but more sophisticated approach was taken. Utilizing the open source Natural Language ToolKit (NLTK) and SpaCy, the data science team was again able to ‘stand on the shoulders of giants,’ leveraging pre-existing toolkits to quickly extract the right KVPs from the dataset. Within both libraries there are a number of submodules for a process known as Named Entity Recognition (NER), which essentially takes a sequence of words in a sentence and attempts to tag the words that correspond to a given type of entity. An entity in this context is a person, place, or thing, which translates to the automatic tagging of names, companies, dates, and dollar amounts.
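SpaCy makes this quite compact. The following is a minimal sketch, assuming the small English model en_core_web_sm is installed; the sentence is an invented example, not real customer data:

```python
import spacy

# Assumes the small English model has been installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = ("This mortgage agreement is made on June 3, 2004 between Jane Doe "
        "and ATB Financial for the amount of $250,000.")

doc = nlp(text)
for ent in doc.ents:
    # Labels such as PERSON, ORG, DATE, and MONEY map nicely onto the KVPs of interest.
    print(f"{ent.text:<20} {ent.label_}")
```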
Using these two disparate approaches (regex and NER) the team was able to extract most of the required KVPs from the 50,000 documents. EDS also developed some custom self-validation strategies to provide an accompanying confidence metric for each of the extracted KVPs.
Full-Stack Parallelization (i.e. Walking is fun but running is faster)
Once the full stack had been developed, it was functioning very efficiently, but in a serial fashion: the entire stack had to be run on one document, then the next, then the next, and so on.
This resulted in a long computational runtime for the entire 50,000 documents. A quick analysis of the entire stack shows that the overall process is “embarrassingly parallel”, meaning that the analysis of one document has absolutely zero impact on the analysis of the other 49,999. This makes the stack a good candidate for parallel computing. The goal of parallelizing the stack was to get as many documents as possible running through the above process at the same time.
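On a single machine, that idea can be sketched with Python's multiprocessing module, as below; the document paths and the placeholder process_document function are hypothetical, and the team's actual parallelization fanned the work out across a fleet of GCP virtual machines rather than across the cores of one box:

```python
from multiprocessing import Pool, cpu_count

def process_document(path):
    """Run the full stack on one document: split, classify, clean, OCR, extract KVPs."""
    # ... page classification -> image cleaning -> Tesseract -> regex / NER ...
    return path, {}  # placeholder for the extracted KVPs

if __name__ == "__main__":
    document_paths = [f"docs/document_{i:05d}.tif" for i in range(50_000)]

    # Embarrassingly parallel: each document is independent, so the pool simply
    # fans the paths out across every available core.
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(process_document, document_paths, chunksize=32)
```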
Utilizing the power of Google Cloud Platform, the data scientists were able to create and fully manage their own collection of powerful virtual machines (VMs), and even without optimization the team was able to speed up the processing of the entire document set by approximately a factor of 60.
This meant the full process for analyzing all 50,000 documents went from about 3 weeks of processing time to around 10 hours. A key enabler of this parallelization is that, in leveraging GCP, the team did not require the assistance of another group to create these machines and continue the processing. There was no need for a lengthy requisition process; the team was never held up waiting for access to the necessary computing hardware, and as a result was able to deliver a high quality product in a short amount of time.
At the end of the project, for the cost of a few thousand dollars in virtual machines and storage in Google Cloud Platform, a full-stack OCR solution was created entirely in-house, using open-source libraries and a lot of research, development, and experimentation.
Instead of investing the equivalent of 40 full time employee years of work, a cluster of computers was able to complete the job in under a day (although, to be fair, the actual training of that cluster of computers took a few months).
This stack is a huge win for artificial intelligence in banking - using a combination of machine learning, computer vision, optical character recognition, and text analytics to save hundreds of thousands of dollars in employee costs. Better yet, it saved staff members from thousands of long hours of a tedious task that was much better suited to a computer. If ATB Financial decides to use this validated dataset as part of the process for creating new financial products, it could open up a new revenue stream for the bank, increasing the value delivered by the project.
Out of all of the benefits of this data science work, two stand out as being the hardest to quantify, but arguably the most important:
First is the fact that this stack (or individual components of it) is already being leveraged on other projects, quickly creating a multiplier effect for the value delivered. In this way, the Enterprise Data Science team is building a “product line” of sorts, composed of readily available algorithms, processes, and techniques for solving the next generation of banking problems.
Second is the invaluable experience gained by the data scientists working on the project. Developing the in-house expertise to solve large-scale computational challenges is a much wiser investment in the long run than hiring out the task to an external consulting firm. In the end, the team was able to teach a computer how to see, read, and understand. In the process, we freed up humans to do the tasks they are great at (such as developing relationships with our customers), while putting electrons and microchips to work doing the tasks that they are great at - retrieving valuable information buried in a mountain of unstructured data.