Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Spark NLP in action: Intelligent, high-accuracy fact extraction from long financial documents

David Talby (Pacific AI), Saif Addin Ellafi (John Snow Labs), Paul Parau (UiPath)
14:05–14:45 Thursday, 24 May 2018
Data science and machine learning, Expo Hall
Location: Expo Hall
Level: Intermediate
Secondary topics: Financial Services, Text and Language processing and analysis
Average rating: 4.50 (4 ratings)

Who is this presentation for?

  • Data scientists, architects, and engineering leaders

Prerequisite knowledge

  • Basic familiarity with machine learning

What you'll learn

  • Learn best practices and a reference architecture for intelligent fact extraction from complex free-text documents


Answering questions accurately based on information from financial documents, which can be a hundred or more pages long, is a challenge even for human domain experts. While traditional rule-based or expression-matching techniques work for simple fields in templated documents, it is harder to infer facts based on implied statements, on the absence of certain statements, or on the combination of other facts. Also, often the most interesting questions are not factual (e.g., Who is the CFO of this company?) but fuzzy (e.g., Are there suspect equity transactions disclosed here?). Answering such questions at a very high level of accuracy requires state-of-the-art deep learning techniques applied to NLP.

Spark NLP, John Snow Labs’s NLP library for Apache Spark, is an open source library that natively extends Spark ML to provide natural language understanding at a performance and scale that were not previously possible. It provides advanced NLP algorithms such as named entity recognition, fact extraction, spell checking, sentiment analysis, and assertion status detection. It also enables training domain-specific machine learning and deep learning NLP models at a performance and scale that are one to two orders of magnitude better than existing alternatives.

David Talby, Saif Addin Ellafi, and Paul Parau explain how Spark NLP was used to augment the Recognos smart data extraction platform in order to automatically infer fuzzy, implied, and complex facts from long financial documents, covering the technical challenges, the architecture of the full solution, and lessons learned that you can directly apply to your next data extraction project.

David Talby

Pacific AI

David Talby is chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, agile, distributed teams. Previously, he led business operations for Bing Shopping in the US and Europe at Microsoft’s Bing Group, and built and ran distributed teams that helped scale Amazon’s financial systems in both Seattle and the UK. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Saif Addin Ellafi

John Snow Labs

Saif Addin Ellafi is a software developer at John Snow Labs, where he’s the main contributor to Spark NLP. A data scientist, forever student, and an extreme sports and gaming enthusiast, Saif has wide experience in problem solving and quality assurance in the banking and finance industry.

Paul Parau


Paul Parau is a researcher and technical lead of the Recognos Smart Data Platform. He specializes in image processing, document layout analysis, and data extraction algorithms. Previously, Paul conducted research in the field of network science, with an emphasis on applications in social networks. He is currently focusing his research on brain network analysis.

Comments on this page are now closed.


David Talby
13/06/2018 16:14 BST

Jonas, I’m glad the tutorial was helpful! The slides and all the notebooks we went over are publicly available here:

13/06/2018 11:45 BST

Thank you for the great talk. Do you plan on sharing your slides?