Mar 15–18, 2020

Beyond OCR: Using deep learning to understand documents

Eitan Anzenberg
4:15pm–4:55pm Tuesday, March 17, 2020
Location: 210 E

Who is this presentation for?

  • Data scientists, machine learning (ML) engineers, AI or ML directors, and heads of data science




Extracting key fields from a variety of document types remains a challenging problem. Cloud providers such as AWS and Google Cloud offer text extraction services that digitize images or PDFs. These services use OCR techniques and return phrases, words, and characters with their corresponding coordinate locations. Working with these outputs remains challenging and does not scale: different document types require different heuristics, and new types are uploaded daily. Additionally, OCR doesn’t attempt to understand the document; for example, dollar amounts need to be numerical, yet OCR may read a “1” as a lowercase “L”. Furthermore, performance hits a ceiling even when the parsing algorithms work perfectly: third-party OCR is excellent, but it isn’t perfectly accurate.
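To make the heuristic problem concrete, here is a minimal sketch of the kind of post-processing rule that tends to pile up on top of raw OCR output. It is not taken from the talk: the `LOOKALIKES` table and the `parse_amount` helper are hypothetical names, and the character substitutions are assumptions about common OCR confusions.

```python
import re
from typing import Optional

# Hypothetical look-alike table: letters that OCR may return in place of digits.
LOOKALIKES = str.maketrans({"l": "1", "I": "1", "O": "0", "o": "0"})


def parse_amount(ocr_text: str) -> Optional[float]:
    """Coerce an OCR'd dollar string to a number, or return None if it does not parse."""
    # Swap digit look-alikes, then drop currency symbols, separators, and spaces.
    cleaned = ocr_text.strip().translate(LOOKALIKES)
    cleaned = cleaned.replace("$", "").replace(",", "").replace(" ", "")
    # Accept only plain decimals with at most two fractional digits.
    if not re.fullmatch(r"\d+(\.\d{1,2})?", cleaned):
        return None
    return float(cleaned)


print(parse_amount("$l,2O4.50"))  # look-alikes repaired before parsing
print(parse_amount("Total"))      # not an amount, so None
```

A rule like this only covers one field on one family of documents, which is the scaling problem the abstract describes: each new layout or field type tends to need its own set of rules.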

Eitan Anzenberg proposes a scalable end-to-end solution: a deep learning architecture in which a computer vision component feeds a sequence generation component. Through training on millions of documents, the model learns document trends and characteristics and extracts the important fields directly from raw documents. The result is a marked improvement in accuracy compared to third-party OCR services. Additional benefits include character-level probabilities that serve as confidence scores, and the ability to use explainability algorithms such as LIME to determine which “hot pixels” in the document are responsible for the predictions.

Eitan's company is working to build a paperless future. It parses 60M documents per year, including invoices, contracts, receipts, and a variety of other types. Understanding those documents is critical to building intelligent products for its users.
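As a rough illustration of the character-level confidence scores mentioned in the abstract, the sketch below combines per-character probabilities into a single score for an extracted field. The aggregation rule (a product of probabilities), the `field_confidence` and `needs_review` helpers, and the 0.9 threshold are all assumptions made for illustration; the abstract does not say how the scores are actually combined.

```python
import math
from typing import List, Tuple


def field_confidence(char_probs: List[Tuple[str, float]]) -> Tuple[str, float]:
    """Join predicted characters and score the field as the product of their probabilities."""
    text = "".join(ch for ch, _ in char_probs)
    # A zero-probability character makes the whole field impossible.
    if any(p <= 0.0 for _, p in char_probs):
        return text, 0.0
    # Sum log-probabilities instead of multiplying, for numerical stability.
    log_conf = sum(math.log(p) for _, p in char_probs)
    return text, math.exp(log_conf)


def needs_review(confidence: float, threshold: float = 0.9) -> bool:
    """Flag low-confidence fields for a human check (the threshold is an assumed value)."""
    return confidence < threshold


# Hypothetical decoder output for an amount field: (character, probability) pairs.
decoded = [("4", 0.99), ("2", 0.98), (".", 0.99), ("5", 0.97), ("0", 0.60)]
text, conf = field_confidence(decoded)
print(text, round(conf, 3), needs_review(conf))
```

With a score like this, a single uncertain character lowers the confidence of the whole field, so doubtful extractions can be sent for review rather than accepted silently.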

Prerequisite knowledge

  • General knowledge of machine learning
  • A basic understanding of deep learning

What you'll learn

  • Learn how to use deep learning for complex document understanding, design machine learning architectures, build machine learning projects, and deploy machine learning models to production

Eitan Anzenberg

Eitan Anzenberg is a director of data science and has many years of experience as a scientist and researcher. His recent focus is on machine learning, deep learning, applied statistics, and engineering. Previously, Eitan was a postdoctoral scholar at Lawrence Berkeley National Lab. He received his PhD in physics from Boston University and his BS in astrophysics from the University of California, Santa Cruz. Eitan has 2 patents and 11 publications to date and has spoken about data at various conferences around the world.
