Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

PyTextRank: Graph algorithms for enhanced natural language processing

Paco Nathan (derwen.ai)
11:20am–12:00pm Thursday, September 28, 2017
Machine Learning & Data Science
Location: 1A 06/07 Level: Intermediate
Secondary topics: Text
Average rating: *****
(5.00, 3 ratings)

Who is this presentation for?

  • Data scientists, machine learning and AI researchers, and product managers working on NLP and text analytics use cases

Prerequisite knowledge

  • Familiarity with Python programming, machine learning, and natural language processing

What you'll learn

  • Learn how PyTextRank provides advanced NLP that can run on a single server
  • Explore techniques for preparing raw text for use with deep learning

Description

PyTextRank is an open source Python implementation of TextRank, a graph algorithm for NLP introduced in the Mihalcea 2004 paper. The package is intended to complement other machine learning approaches, specifically deep learning used in custom search and recommendations, by generating enhanced feature vectors from raw texts. PyTextRank builds on spaCy, datasketch, NetworkX, and other popular Python libraries. Results include a full parse of raw texts, vectors of ranked keyphrases, and adjustable autosummarization. PyTextRank is used in production at scale by O’Reilly Media and is available on PyPI and GitHub.
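As a rough illustration of the graph-ranking idea behind TextRank, the sketch below links words that co-occur within a small window and scores them with a PageRank-style power iteration. It is a minimal sketch, not PyTextRank's code or API: the function name `textrank_keywords`, the `window`, `damping`, and `iterations` parameters, and the hand-prepared token list are all assumptions made for illustration, and it uses only the standard library rather than spaCy or NetworkX.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50):
    """Rank tokens by a PageRank-style score over a co-occurrence graph.

    Hypothetical helper for illustration only; not part of PyTextRank.
    """
    # Build an undirected graph: link each token to the tokens that
    # follow it within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Power iteration: each node shares its score equally among its
    # neighbors, with the usual damping term.
    n = len(neighbors)
    scores = {word: 1.0 / n for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) / n
            + damping * sum(scores[v] / len(neighbors[v]) for v in neighbors[word])
            for word in neighbors
        }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Assumed input: content words that were already lemmatized by hand.
sample = ["graph", "algorithm", "rank", "keyphrase", "graph",
          "node", "word", "graph", "edge", "link", "word"]
for word, score in textrank_keywords(sample)[:3]:
    print(f"{word}\t{score:.3f}")
```

In a real pipeline the token list would come from a statistical parser, filtered by part of speech and lemmatized, and adjacent high-ranking words would then be merged into keyphrases.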

Previous generations of NLP used shortcuts such as stemming, bag of words, and n-grams, which tend to degrade results. In contrast, PyTextRank uses lemmatization, named entity resolution, hypernyms, and graph-based semantic analysis. Advances in popular Python libraries for statistical parsing, graph analytics, and probabilistic data structures, as well as the availability of multicore processors with large memory spaces, make possible more effective approaches to NLP that do not require clusters. Resulting keyphrase vectors are significantly more useful than simple keyword extraction, especially for vector embedding. Moreover, this approach allows import of an ontology to help refine results. In other words, inference extends the parsing capabilities into natural language understanding.

Paco Nathan illustrates PyTextRank use cases in media and learning to enable semisupervised word sense disambiguation, move from natural language parsing to natural language understanding, and implement AI-based video search and approximation algorithms for content recommendation based on semantic similarity.
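One common approximation technique for similarity-based recommendation is MinHash, a probabilistic data structure that estimates the Jaccard similarity of two sets from short signatures. The sketch below is a standard-library illustration of that idea, applied to sets of keyphrases; it is an assumption that something like this fits the use case described, and it is not PyTextRank's or datasketch's code. The function names, the `num_perm` parameter, and the sample keyphrases are hypothetical. Note that overlap between keyphrase sets is only a rough proxy for semantic similarity.

```python
import hashlib


def minhash_signature(items, num_perm=64):
    """Return a MinHash signature for a non-empty collection of strings.

    Each of the `num_perm` positions holds the minimum value of one
    salted hash function over all items. Hypothetical helper.
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(
                        item.encode("utf-8"), digest_size=8, salt=salt
                    ).digest(),
                    "big",
                )
                for item in items
            )
        )
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Assumed keyphrase sets for two documents (made up for illustration).
doc_a = {"graph algorithm", "keyphrase extraction", "natural language"}
doc_b = {"graph algorithm", "natural language", "video search"}
similarity = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
print(f"estimated similarity: {similarity:.2f}")
```

Signatures have a fixed length regardless of document size, so candidate documents can be compared cheaply on a single server; more hash functions give a tighter estimate at the cost of more computation.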

Paco Nathan

derwen.ai

Paco Nathan is known as a “player/coach” with core expertise in data science, natural language processing, machine learning, and cloud computing. He has 35+ years of experience in the tech industry, at companies ranging from Bell Labs to early-stage startups. His recent roles include director of the Learning Group at O’Reilly and director of community evangelism at Databricks and Apache Spark. Paco is the cochair of the Rev conference and an advisor for Amplify Partners, Deep Learning Analytics, Recognai, and Primer. He was named one of the “top 30 people in big data and analytics” in 2015 by Innovation Enterprise.