Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Semantic natural language understanding at scale using Spark, machine-learned annotators, and deep-learned ontologies

David Talby (Atigeo), Claudiu Branzan (G2 Web Services)
1:50pm–2:30pm Wednesday, March 15, 2017
Data science & advanced analytics
Location: 230 C
Level: Intermediate
Secondary topics: Deep learning, Healthcare, Text
Average rating: 4.14 (7 ratings)

Who is this presentation for?

  • Data scientists, machine-learning engineers, NLP engineers, and software architects

Prerequisite knowledge

  • A basic understanding of and experience with Spark and machine learning

What you'll learn

  • Understand three key skills needed for natural language understanding: building an annotation pipeline, training and using machine-learned annotators, and applying deep learning to learn new ontologies and the relationships between concepts (see the pipeline sketch below)
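
As a starting point for the first of these skills, here is a minimal annotation-pipeline sketch in Python. It assumes spaCy's v3 component API, which postdates this talk; the component name and document attribute are hypothetical, and the hand-written rule stands in for the machine-learned annotators the session actually covers.

    # Minimal annotation-pipeline sketch using spaCy's v3 component API
    # (which postdates this 2017 talk). "flu_annotator" and the Doc
    # attribute are hypothetical; this rule-based annotator stands in
    # for the machine-learned annotators covered in the session.
    import spacy
    from spacy.language import Language
    from spacy.tokens import Doc

    # Register a custom document attribute to hold the annotation.
    Doc.set_extension("flu_mentioned", default=False)

    @Language.component("flu_annotator")
    def flu_annotator(doc):
        doc._.flu_mentioned = any(tok.lemma_.lower() == "flu" for tok in doc)
        return doc

    nlp = spacy.load("en_core_web_sm")  # assumes the model is installed
    nlp.add_pipe("flu_annotator", last=True)

    print(nlp("Jane may have the flu.")._.flu_mentioned)  # True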

Description

A text-mining system must go way beyond indexing and search to appear truly intelligent. First, it should understand language beyond keyword matching. (For example, distinguishing between “Jane has the flu,” “Jane may have the flu,” “Jane is concerned about the flu,” “Jane’s sister has the flu, but she doesn’t,” or “Jane had the flu when she was 9” is of critical importance.) This is a natural language processing problem. Second, it should “read between the lines” and make likely inferences even if they’re not explicitly written. (For example, if Jane has had a fever, a headache, fatigue, and a runny nose for three days, not as part of an ongoing condition, then she likely has the flu.) This is a semisupervised machine-learning problem. Third, it should automatically learn the right contextual inferences to make. (For example, learning on its own that fatigue is sometimes a flu symptom—only because it appears in many diagnosed patients—without a human ever explicitly stating that rule.) This is an association-mining problem, which can be tackled via deep learning or via more guided machine-learning techniques.
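
To make the first of these three problems concrete, here is a naive Python sketch of assertion-status classification built from hand-written cue lists. Both the cue words and the function are hypothetical, not from the talk:

    # Naive assertion-status classifier built from hand-written cue
    # lists (hypothetical). The session's point is that real systems
    # learn these distinctions rather than hard-coding them.
    HEDGE_CUES = {"may", "might", "possibly", "concerned", "worried"}
    NEGATION_CUES = {"no", "not", "doesn't", "denies", "without"}

    def assertion_status(text, concept="flu"):
        # Crude whitespace tokenization; a real system would use a
        # proper NLP tokenizer and negation-scope detection.
        tokens = set(text.lower().replace(",", " ").replace(".", " ").split())
        if concept not in tokens:
            return "absent"
        if tokens & NEGATION_CUES:
            return "negated"
        if tokens & HEDGE_CUES:
            return "hedged"
        return "asserted"

    print(assertion_status("Jane has the flu"))                 # asserted
    print(assertion_status("Jane may have the flu"))            # hedged
    print(assertion_status("Jane is concerned about the flu"))  # hedged

A bag-of-cues rule like this breaks on harder cases such as “Jane’s sister has the flu, but she doesn’t,” where “doesn’t” refers to Jane while the flu is asserted of her sister; getting that scope right is exactly why such rules give way to machine-learned annotators.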

David Talby and Claudiu Branzan lead a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records and provides real-time inferencing at scale. The architecture is built out of open source big data components: Kafka and Spark Streaming for real-time data ingestion and processing, Spark for modeling, and Elasticsearch for enabling low-latency access to results. The data science components include spaCy, a pipeline with custom annotators, machine-learning models for implicit inferences, and dynamic ontologies for representing and learning new relationships between concepts. Source code will be made available after the talk to enable you to hack away on your own.
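
For orientation, here is a rough Python sketch of that ingestion path. It is a sketch under stated assumptions, not the presenters' implementation: it uses Spark's newer Structured Streaming API rather than the DStream-based Spark Streaming named above, the Kafka topic, hosts, and Elasticsearch index names are hypothetical, and the elasticsearch-hadoop connector JAR is assumed to be on the Spark classpath.

    # Hypothetical sketch of the ingestion path: Kafka -> Spark ->
    # Elasticsearch, via the Structured Streaming API (the talk used
    # DStream-based Spark Streaming). Names and hosts are made up.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("clinical-nlp-ingest").getOrCreate()

    # Read free-text patient records from a Kafka topic.
    notes = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("subscribe", "patient-notes")
             .load()
             .select(col("value").cast("string").alias("note")))

    # A real pipeline would apply the NLP annotators here, e.g. in a
    # UDF wrapping the spaCy pipeline; this sketch passes text through.

    # Write results to Elasticsearch for low-latency access
    # (requires the elasticsearch-hadoop connector on the classpath).
    query = (notes.writeStream
             .format("org.elasticsearch.spark.sql")
             .option("es.nodes", "localhost:9200")
             .option("checkpointLocation", "/tmp/es-checkpoint")
             .start("clinical-notes"))

    query.awaitTermination()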

David Talby

Atigeo

David Talby is Atigeo’s chief technology officer, working to evolve its big data analytics platform to solve real-world problems in healthcare, energy, and cybersecurity. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing group, where he led business operations for Bing Shopping in the US and Europe. Earlier, he worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Claudiu Branzan

G2 Web Services

Claudiu Branzan is the director of data science at G2 Web Services, where he designs and implements data science solutions to mitigate merchant risk, leveraging his 10+ years of machine-learning and distributed-systems experience. Previously, Claudiu worked for Atigeo, building big data and data-science-driven products for various customers.

Comments

David Talby | CTO
03/09/2017 4:00am PST

A.J., thank you for your interest in the session – knowing what attendees are interested in greatly helps us decide what to focus on. The focus of the session is on building NLP pipelines and then applying machine learning and deep learning for more advanced annotators. If you want to talk more specifically about making the data available for visualization and further analysis, we can reserve time to talk right after the session. Please let me know if that works for you. Thanks!

A.J. Ferrara | DATA INTELLIGENCE MANAGER
03/07/2017 10:34pm PST

I will be at this session. While I am interested in all aspects of the session, I am most interested in how best to get the data out of Elasticsearch/NoSQL, integrate it with the rest of our data, and make it available to any tool (Tableau, SQL, SAS, etc.).