Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Deep learning for domain-specific entity extraction from unstructured text

Mohamed AbdelHady (Microsoft), Zoran Dzunic (Microsoft)
2:40pm–3:20pm Wednesday, March 7, 2018
Average rating: 3.50 (2 ratings)

Who is this presentation for?

  • Data scientists and engineers

Prerequisite knowledge

  • A basic understanding of machine learning concepts

What you'll learn

  • Understand the steps required to build an entity extraction system
  • Learn how to train a neural word embedding model on a large dataset (20 million records) using an Azure HDInsight cluster, train an LSTM recurrent neural network model using Keras and TensorFlow, evaluate the quality of the trained models, and visualize the word embeddings
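
As an illustration of the model evaluation step, entity extraction quality is commonly measured at the entity level: a predicted entity counts as correct only if its type and boundaries both match the gold annotation. The sketch below assumes BIO-style tags (B-Type, I-Type, O). The session description does not state the tagging scheme or metric the speakers use, so the scheme, the function names, and the sample tags here are all hypothetical.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive); hypothetical representation
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence.

    An I- tag that does not continue an open entity of the same type
    is ignored (treated like O).
    """
    spans: Set[Span] = set()
    start, etype = None, None
    # The trailing "O" sentinel flushes an entity that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        if tag.startswith("I-") and etype == tag[2:]:
            continue  # still inside the current entity
        if etype is not None:
            spans.add((etype, start, i))  # close the open entity
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]  # open a new entity
    return spans


def entity_prf(gold: List[List[str]],
               pred: List[List[str]]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 over a list of sentences."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = bio_to_spans(g), bio_to_spans(p)
        tp += len(gold_spans & pred_spans)  # exact type and boundary match
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: the second entity is predicted with the wrong boundary.
gold_tags = [["B-Drug", "O", "B-Disease", "I-Disease"]]
pred_tags = [["B-Drug", "O", "B-Disease", "O"]]
print(entity_prf(gold_tags, pred_tags))
```

Exact-match scoring is strict: a prediction that finds only part of a multi-token disease name earns no credit, which is why it is often reported alongside more lenient token-level scores.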

Description

Biomedical named entity recognition is a critical step for complex biomedical NLP tasks such as understanding the interactions between different entity types (for example, drug-disease or gene-protein relationships). Feature generation for such tasks is often complex and time-consuming. However, neural networks can obviate the need for feature engineering and use the original data as input.

Mohamed AbdelHady and Zoran Dzunic demonstrate how to build a domain-specific entity extraction system from unstructured text using deep learning. In the model, domain-specific word embedding vectors are trained with the word2vec algorithm on a Spark cluster using millions of Medline PubMed abstracts. These vectors are then used as features to train an LSTM recurrent neural network for entity extraction, using Keras with TensorFlow or CNTK on a GPU-enabled Azure Data Science Virtual Machine (DSVM). Results show that training a domain-specific word embedding model boosts performance when compared to embeddings trained on generic data such as Google News.
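
To make the hand-off between the two stages concrete, the sketch below shows one plausible way pretrained word vectors can be turned into an embedding matrix and padded integer sequences, which is the kind of input an LSTM tagger typically consumes. It is a minimal illustration under stated assumptions, not the speakers' pipeline: the function names, the `<PAD>` and `<UNK>` tokens, and the tiny two-dimensional vectors are hypothetical stand-ins for embeddings trained on PubMed abstracts.

```python
import random
from typing import Dict, List, Tuple


def build_embedding_matrix(
    vocab: List[str],
    vectors: Dict[str, List[float]],
    dim: int,
    seed: int = 0,
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Map words to row indices and stack their pretrained vectors.

    Row 0 is an all-zero padding vector; row 1 is a small random vector
    shared by every out-of-vocabulary word (an assumed convention).
    """
    rng = random.Random(seed)
    word_to_id = {"<PAD>": 0, "<UNK>": 1}
    matrix = [[0.0] * dim, [rng.uniform(-0.1, 0.1) for _ in range(dim)]]
    for word in vocab:
        # Only words that have a pretrained vector get their own row.
        if word in vectors and word not in word_to_id:
            word_to_id[word] = len(matrix)
            matrix.append(list(vectors[word]))
    return word_to_id, matrix


def encode(tokens: List[str], word_to_id: Dict[str, int], max_len: int) -> List[int]:
    """Convert tokens to ids, truncating or right-padding to max_len."""
    ids = [word_to_id.get(t, word_to_id["<UNK>"]) for t in tokens[:max_len]]
    return ids + [word_to_id["<PAD>"]] * (max_len - len(ids))


# Hypothetical toy vectors standing in for domain-specific embeddings.
toy_vectors = {"aspirin": [0.1, 0.2], "headache": [0.3, 0.4]}
word_to_id, matrix = build_embedding_matrix(
    ["aspirin", "headache", "xyz"], toy_vectors, dim=2
)
print(encode(["aspirin", "treats", "headache"], word_to_id, max_len=5))
```

In a Keras model, a matrix like this would usually initialize the weights of an embedding layer placed in front of the LSTM, so that each integer id is looked up as its pretrained vector.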

Mohamed AbdelHady

Microsoft

Mohamed AbdelHady is a senior data scientist on the algorithms and data science (ADS) team within the AI+R Group at Microsoft, where he focuses on machine learning applications for text analytics and natural language processing. Mohamed works with Microsoft product teams and external customers to deliver advanced technologies that extract useful and actionable insights from unstructured free text such as search queries, social network messages, product reviews, and customer feedback. Previously, he spent three years at Microsoft Research’s Advanced Technology Labs. He holds a PhD in machine learning from the University of Ulm in Germany.

Zoran Dzunic

Microsoft

Zoran Dzunic is a data scientist on the algorithms and data science (ADS) team within the AI+R Group at Microsoft, where he focuses on machine learning applications for text analytics and natural language processing. He holds a PhD and a master’s degree from MIT, where he focused on Bayesian probabilistic inference, and a bachelor’s degree from the University of Nis in Serbia.