PyTextRank is an open source Python implementation of TextRank, a graph algorithm for NLP based on the Mihalcea 2004 paper. The package is intended to complement other machine learning approaches, specifically deep learning used in custom search and recommendations, by generating enhanced feature vectors from raw texts. PyTextRank builds on spaCy, datasketch, NetworkX, and other popular Python libraries. Results include a full parse from raw texts, vectors of ranked keyphrases, and adjustable autosummarization. PyTextRank is used in production at scale by O’Reilly Media and is available on PyPI and GitHub.
Previous generations of NLP used shortcuts such as stemming, bag of words, and n-grams, which tend to degrade results. In contrast, PyTextRank uses lemmatization, named entity resolution, hypernyms, and graph-based semantic analysis. Advances in popular Python libraries for statistical parsing, graph analytics, and probabilistic data structures, as well as the availability of multicore processors with large memory spaces, make more effective approaches to NLP possible without requiring clusters. The resulting keyphrase vectors are significantly more useful than simple keyword extraction, especially for vector embedding. Moreover, this approach allows import of an ontology to help refine results. In other words, inference extends the parsing capabilities into natural language understanding.
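To make the graph-based idea concrete, here is a minimal sketch of TextRank-style keyword ranking in plain Python. It is an illustration under stated assumptions, not PyTextRank's actual code or API: the function names, the tiny stopword list, the window size, and the damping value are all hypothetical choices, and simple regex tokenization stands in for the spaCy parsing and lemmatization described above.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to",
             "is", "are", "by", "with", "as", "from"}

def tokenize(text):
    """Lowercase word tokens minus stopwords (no lemmatization or POS filtering)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def build_graph(tokens, window=3):
    """Undirected co-occurrence graph: link words that fall within `window` tokens."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph

def rank(graph, damping=0.85, iterations=50):
    """PageRank by power iteration over the unweighted graph."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        # Each node receives an equal share of every neighbor's current score.
        scores = {
            node: (1 - damping)
            + damping * sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            for node in graph
        }
    return scores

def top_keywords(text, k=5):
    """Return the k highest-ranked words, ties broken alphabetically."""
    scores = rank(build_graph(tokenize(text)))
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```

Calling `top_keywords(raw_text, k=10)` returns single words only. A fuller pipeline along the lines described in this abstract would rank lemmas rather than surface forms and merge adjacent high-ranking words into keyphrases.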
Paco Nathan illustrates PyTextRank use cases in media and learning to enable semisupervised word sense disambiguation, move from natural language parsing to natural language understanding, and implement AI-based video search and approximation algorithms for content recommendation based on semantic similarity.
Paco Nathan is known as a “player/coach” with core expertise in data science, natural language processing, machine learning, and cloud computing. He has 35+ years of experience in the tech industry, at companies ranging from Bell Labs to early-stage startups. His recent roles include director of the Learning Group at O’Reilly and director of community evangelism at Databricks and Apache Spark. Paco is the cochair of the Rev conference and an advisor for Amplify Partners, Deep Learning Analytics, Recognai, and Primer. He was named one of the "top 30 people in big data and analytics" in 2015 by Innovation Enterprise.
©2017, O'Reilly Media, Inc.