Answering questions accurately based on information from financial documents, which can be a hundred or more pages long, is a challenge even for human domain experts. While traditional rule-based or expression-matching techniques work for simple fields in templated documents, it is harder to infer facts based on implied statements, on the absence of certain statements, or on the combination of other facts. Moreover, the most interesting questions are often not factual (e.g., "Who is the CFO of this company?") but fuzzy (e.g., "Are there suspect equity transactions disclosed here?"). Answering such questions at a very high level of accuracy requires state-of-the-art deep learning techniques applied to NLP.
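To make the contrast concrete, here is a minimal sketch of the expression-matching approach, using only Python's standard library. The field label, the pattern, and the sample sentences are hypothetical illustrations, not taken from any real filing or from the platform discussed in this session.

```python
import re

# Hypothetical pattern for a templated field such as
# "Chief Financial Officer: Jane Doe". Real filings vary widely.
CFO_PATTERN = re.compile(
    r"Chief Financial Officer\s*[:\-]\s*(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)+)"
)


def extract_cfo(text):
    """Return the CFO name if the templated field is present, else None."""
    match = CFO_PATTERN.search(text)
    return match.group("name") if match else None


# A templated document states the fact in a predictable form.
templated = "Chief Financial Officer: Jane Doe\nFiscal year end: December 31"

# An implied statement carries similar information without the field label.
implied = "Ms. Doe, who has overseen the company's finances since 2015, signed the filing."

print(extract_cfo(templated))  # Jane Doe
print(extract_cfo(implied))    # None
```

The pattern succeeds on the templated field and finds nothing in the implied statement, which is the gap that motivates the learned models described here.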
Spark NLP, John Snow Labs’s NLP library for Apache Spark, is an open source library that natively extends Spark ML to provide natural language understanding capabilities at a performance and scale that were not previously possible. It provides advanced NLP algorithms such as named entity recognition, fact extraction, spell checking, sentiment analysis, and assertion status detection, and it enables training domain-specific machine learning and deep learning NLP models at a performance and scale that are one to two orders of magnitude better than existing alternatives.
David Talby, Saif Addin Ellafi, and Paul Parau explain how Spark NLP was used to augment the Recognos smart data extraction platform in order to automatically infer fuzzy, implied, and complex facts from long financial documents, covering the technical challenges, the architecture of the full solution, and lessons learned that you can directly apply to your next data extraction project.
David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, agile, distributed teams. Previously, he led business operations for Bing Shopping in the US and Europe with Microsoft’s Bing Group, and at Amazon he built and ran distributed teams in both Seattle and the UK that helped scale the company’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.
Saif Addin Ellafi is a software developer at John Snow Labs, where he’s the main contributor to Spark NLP. A data scientist, forever student, and extreme sports and gaming enthusiast, Saif has wide experience in problem solving and quality assurance in the banking and finance industry.
Paul Parau is a researcher and technical lead of the Recognos Smart Data Platform. He specializes in image processing, document layout analysis, and data extraction algorithms. Previously, Paul conducted research in the field of network science, with an emphasis on applications in social networks. He is currently focusing his research on brain network analysis.
©2018, O’Reilly UK Ltd