From Sandra Carrico, VP Engineering and Chief Data Scientist, WattzOn
It is well known that using ad hoc code to extract data from documents is extremely expensive. Data sources change periodically without notice, silently breaking best efforts at robust code. The more data sources you add, the more people you need, the more your cost grows. There is no efficiency gain.
We built Mr Bill to solve a problem. Extracting data from documents with regex code or OCR is costly and brittle from start to finish. We asked ourselves: “Can machine learning be used to make this process not only more robust, but cheaper and more scalable, too?” The answer is yes.
Mr Bill is an advanced machine learning system that extracts data from PDFs, scans, faxes, and document images, delivering structured data to customers’ systems. We set out to create a single, scalable code base that works across those document types and improves with better machine learning algorithms and techniques. We addressed constantly changing document presentations with training data, not code, which is a very practical result. And we wanted small training sets, the only way to lower the data extraction costs that arise from the huge number of document publishers. Mr Bill was born. The end of brittle code!
The secret sauce of Mr Bill is the way we structured the problem, making it amenable to a machine learning approach. (Heads up: We filed a patent.) At a high level Mr Bill consists of two basic systems:
- A sophisticated ensemble of machine learning algorithms
- A pipelining and orchestration system that trains, tunes and runs Mr Bill
THE CORE MR BILL
The core Mr Bill system ingests either PDFs or images of documents that have been preprocessed by OCR. Mr Bill is not an OCR system. It is a smart system designed to find the data a customer wants once the document’s characters are readable by a computer. It supports both statistical machine learning ensembles and arbitrarily architected neural network solutions, used independently or as part of an ensemble. In addition, the two types of solutions can help each other via embeddings that broker information between them, or by contributing different results to a larger ensemble. The architecture is not restricted.
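To make the ensemble idea concrete, here is a minimal sketch of one common way to combine several independent extractors: a weighted vote over their candidate values. Everything in it is an assumption for illustration (the `Candidate` type, the `combine` function, the model names, the weights, and the sample values); it is not Mr Bill’s actual design.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One extractor's guess for a field (hypothetical structure)."""
    value: str
    score: float  # the extractor's confidence, assumed to lie in [0, 1]


def combine(candidates_per_model, weights):
    """Weighted vote: sum weight * score per distinct value, return the top value."""
    totals = defaultdict(float)
    for model_name, candidates in candidates_per_model.items():
        for cand in candidates:
            totals[cand.value] += weights.get(model_name, 1.0) * cand.score
    if not totals:
        return None  # no extractor proposed anything
    return max(totals, key=totals.get)


# Made-up example: two extractors propose values for the same field.
votes = {
    "statistical": [Candidate("$104.27", 0.7), Candidate("$98.10", 0.2)],
    "neural": [Candidate("$104.27", 0.6)],
}
best = combine(votes, weights={"statistical": 1.0, "neural": 1.0})
```

A vote like this is only one option. An ensemble could just as well feed each model’s output into a further learned model.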
The core Mr Bill engine begins operation by training the system to extract particular fields from documents. The trainer simply identifies which fields they want extracted. This is typical data marking, or creation of ground truth. For example, if the desired field is “Amount Due”, 10 or 20 sample documents will be marked with that field. WattzOn has built an easy-to-use data marking tool for quick setup of training data, with features that increase the accuracy of results.
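For readers who have not done data marking before, the sketch below shows what a ground-truth record for a marked field might look like, and how marks could be grouped into one small training set per field. The `FieldMark` schema, the `to_training_records` helper, and the sample values are all hypothetical; the format used by WattzOn’s marking tool is not described here.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class FieldMark:
    """One marked field on one sample document (hypothetical schema)."""
    document_id: str
    field_name: str
    value: str
    page: int


def to_training_records(marks):
    """Group marks by field name so each field gets its own small training set."""
    by_field = {}
    for mark in marks:
        by_field.setdefault(mark.field_name, []).append(asdict(mark))
    return by_field


# Made-up marks for the "Amount Due" field on two sample bills.
marks = [
    FieldMark("bill-001", "Amount Due", "$104.27", page=1),
    FieldMark("bill-002", "Amount Due", "$98.10", page=1),
]
records = to_training_records(marks)
print(json.dumps(records, indent=2))
```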
In practice, across numerous different bill types and fields, we found that the knee in the curve for extracting billing data from similar bills occurs at about seven training examples. It’s not perfect at seven, but it’s pretty good, with about 90% accuracy. After 20 examples, Mr Bill often produces the correct answer 95% of the time or more. Of course there are examples which are more difficult, but these are the typical results for typical data requests. And we continue to implement new features in Mr Bill to drive up our accuracy and recall rates. As those capabilities come online, retraining on the same data sets delivers that benefit to our customers.
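One way to locate a knee like this is to retrain on progressively larger sets of marked examples and score each model on held-out documents. The sketch below assumes exact-match scoring and a hypothetical `train` callable that returns a predictor. It illustrates the measurement only and is not WattzOn’s evaluation code.

```python
def field_accuracy(predicted, expected):
    """Fraction of documents whose extracted value exactly matches ground truth."""
    if not expected:
        raise ValueError("expected must not be empty")
    correct = sum(
        1 for doc_id, value in expected.items() if predicted.get(doc_id) == value
    )
    return correct / len(expected)


def learning_curve(train, examples, held_out, sizes):
    """Train on the first n examples for each n, then score on held-out documents.

    `train` is any callable (hypothetical here) that takes a list of training
    examples and returns a predictor mapping a document id to an extracted value.
    `held_out` maps document ids to their ground-truth values.
    """
    curve = []
    for n in sizes:
        predict = train(examples[:n])
        predicted = {doc_id: predict(doc_id) for doc_id in held_out}
        curve.append((n, field_accuracy(predicted, held_out)))
    return curve
```

Plotting the resulting `(n, accuracy)` pairs shows where additional training examples stop paying off.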
Mr Bill addresses two challenges that plague traditional OCR and ad hoc code solutions, with their ever-growing data extraction costs: changes in the positions of data fields on a page, and new data fields appearing over time. Mr Bill was designed to be quite resilient to changes in presentation; most layout variations don’t have the slightest effect on Mr Bill’s ability to extract data accurately. What’s more, we’ll be introducing a Mr Bill feature later this year called Discovery. Discovery exploits the fact that Mr Bill learns over time and knows what fields to extract. When a new field appears, Mr Bill will be able to suggest a reasonable field name and data value.
PIPELINE AND ORCHESTRATION
As we rolled out Mr Bill, we soon realized that there was no off-the-shelf pipelining and orchestration system to handle our large, complex ensemble solution. We scoured the market, but in the end found that we had to create our own solution. Since we knew we would be building more ensemble applications, we developed a general production solution hosted on AWS, one ready to run Mr Bill and other machine learning applications.
Our production workflow and orchestration system has four key features:
- We know our customers have surges in volume and frequently need to run heavy loads through Mr Bill. So Mr Bill is fully elastic, growing as demand rises and shrinking as demand abates. It’s capable of supporting high availability via execution in multiple AWS zones.
- Customers come to us with a variety of technical capabilities. The pipelining system uses SFTP as a simple interface to deliver files to Mr Bill’s orchestration system. A separate folder returns results to the user. The pipelining system is separate from general orchestration, and the orchestration layer easily supports a RESTful API.
- Much of the accuracy and precision in today’s machine learning applications comes from hyperparameter tuning. Our pipelining and orchestration system allows the customer to indicate the level of tuning required by their application.
- In addition, we know that speed can be valuable. Customers that need rapid turnaround at scale can authorize and provision an AWS account that arbitrarily expands compute capability for prediction, or even for training or hyperparameter tuning. Mr Bill and the pipeline and orchestration systems that support it have fully automated provisioning, allowing custom deployments of the system to fit a customer’s needs, including deployment behind customer firewalls to comply with data governance and security requirements.
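The folder-based interface described above (files delivered to one folder, results returned in another) follows a familiar drop-folder pattern. The sketch below shows that pattern using local directories as a stand-in for the SFTP folders. The function names, the `.json` result format, and the placeholder `extract_fields` are assumptions for illustration, not the actual Mr Bill interface.

```python
import json
import tempfile
from pathlib import Path


def extract_fields(document_bytes):
    """Placeholder for the real extraction call (hypothetical)."""
    return {"size_bytes": len(document_bytes)}


def process_inbox(inbox, outbox):
    """Process every file in `inbox` and write one JSON result per file to `outbox`."""
    inbox, outbox = Path(inbox), Path(outbox)
    outbox.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(inbox.iterdir()):
        if not path.is_file():
            continue  # skip subfolders
        result = extract_fields(path.read_bytes())
        target = outbox / (path.name + ".json")
        target.write_text(json.dumps(result))
        written.append(target.name)
    return written


# Usage with throwaway local folders standing in for the SFTP directories.
with tempfile.TemporaryDirectory() as root:
    inbox = Path(root) / "inbox"
    inbox.mkdir()
    (inbox / "bill-001.pdf").write_bytes(b"%PDF-sample")
    written = process_inbox(inbox, Path(root) / "outbox")
```

Keeping delivery and results in separate folders means a customer needs nothing more than an SFTP client to use the system.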
Clearly we’re proud of Mr Bill. But we’re also very proud of the team that built it. It has been an amazing journey.