Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Learning meaning from web-scale big data

Gerard de Melo (Rutgers University)
Data science & advanced analytics, Machine Learning & Data Science
Location: 1A 06/07 Level: Intermediate
Secondary topics: Deep learning, Text
Average rating: 4.00 (4 ratings)

Although we now have vast amounts of data available to us on the web and elsewhere, it is not obvious how to leverage all of this data to enable more intelligent applications. Gerard de Melo shares results on applying deep learning techniques to web-scale amounts of data to learn neural representations of language and world knowledge.

Regular word2vec-style word vectors already allow us to capture basic aspects of word meanings from large corpora. However, by developing novel neural learning models and building on a Spark-based data processing infrastructure, we are additionally able to exploit massive amounts of structured data (from large knowledge graphs) as well as relational information extracted from unstructured data (in particular large web text and image archives). This in turn enables us to induce substantially more detailed semantic representations, capturing more phenomena and explicit knowledge. As a result, we are able to capture word and text semantics more adequately and in over 300 languages, capture millions of entities and their relationships, and move toward the goal of commonsense knowledge modeling (e.g., the fact that birds tend to be able to fly and that grass tends to be green).
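The core idea behind word2vec-style vectors is that words are mapped to dense numeric vectors so that semantically related words end up close together, typically measured by cosine similarity. A minimal sketch of that idea (the tiny hand-written vectors below are illustrative assumptions, not trained output; real models are trained on large corpora, e.g. with Spark MLlib's Word2Vec):

```python
import math

# Hypothetical 3-dimensional embeddings for illustration only;
# trained models typically use 100+ dimensions.
vectors = {
    "bird":  [0.9, 0.1, 0.3],
    "eagle": [0.8, 0.2, 0.4],
    "grass": [0.1, 0.9, 0.2],
}

def cosine(u, v):
    """Cosine similarity: near 1.0 for similar directions, near 0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Semantically related words should score higher than unrelated ones.
assert cosine(vectors["bird"], vectors["eagle"]) > cosine(vectors["bird"], vectors["grass"])
```

The same similarity computation scales to vocabularies of millions of words and, as the talk describes, to entities and relations drawn from knowledge graphs rather than text alone.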

Gerard concludes by demonstrating how easy it is to use these resources in Spark (and without it) to enable novel data-driven and intelligent applications.


Gerard de Melo

Rutgers University

Gerard de Melo is an assistant professor of computer science at Rutgers University, where he heads a team of researchers working on big data analytics, natural language processing, and web mining. Gerard’s research projects include UWN/MENTA, one of the largest multilingual knowledge bases and an important hub in the web of data. Previously, he was a faculty member at Tsinghua University, one of China’s most prestigious universities, where he headed the Web Mining and Language Technology Group, and a visiting scholar at UC Berkeley, where he worked in the ICSI AI Group. He serves as an editorial board member for Computational Intelligence, the Journal of Web Semantics, the Springer Language Resources and Evaluation journal, and the Language Science Press TMNLP book series. Gerard has published over 80 papers, with best paper or demo awards at WWW 2011, CIKM 2010, ICGL 2008, and the NAACL 2015 Workshop on Vector Space Modeling, as well as an ACL 2014 best paper honorable mention, a best student paper award nomination at ESWC 2015, and a thesis award for his work on graph algorithms for knowledge modeling. He holds a PhD in computer science from the Max Planck Institute for Informatics.