Presented By O'Reilly and Cloudera
Make Data Work
Dec 4–5, 2017: Training
Dec 5–7, 2017: Tutorials & Conference

AI within O'Reilly Media

Paco Nathan
11:15am–11:55am Thursday, December 7, 2017
Average rating: 4.60 (5 ratings)

Who is this presentation for?

  • Developers, data scientists, and product managers working with text, video, and audio content

Prerequisite knowledge

  • Familiarity with use cases for machine learning, especially in natural language processing

What you'll learn

  • Learn how O'Reilly, a media company focused on learning, leverages available AI technologies


Paco Nathan explains how O’Reilly Media employs AI, from the obvious (chatbots, case studies about other firms) to the less so (using AI to show the structure of content in detail, enhance search and recommendations, and guide editors for gap analysis, assessment, pathing, etc.). Approaches include vector embedding search, summarization, TDA for content gap analysis, and speech-to-text to index video. Paco offers an overview of AI resources available through O’Reilly Media before taking a detailed look at how O’Reilly itself has undergone a transformation to leverage AI and deep learning both for customer needs and to augment editors’ work in curation.
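As a rough illustration of the vector embedding search mentioned above, the following minimal sketch ranks items by cosine similarity to a query vector. It is not O’Reilly’s implementation: the titles, the three-dimensional vectors, and the names `cosine`, `search`, and `INDEX` are all hypothetical, and a real system would presumably use learned embeddings with many more dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def search(query_vec, index, top_k=2):
    """Return the top_k titles whose vectors are most similar to query_vec."""
    scored = [(cosine(query_vec, vec), title) for title, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:top_k]]


# Hypothetical toy embeddings, made up purely for illustration.
INDEX = {
    "Intro to Spark": [0.9, 0.1, 0.0],
    "Deep Learning Basics": [0.1, 0.9, 0.2],
    "NLP with spaCy": [0.0, 0.7, 0.7],
}

results = search([0.0, 0.8, 0.6], INDEX)
print(results)
```

The query vector here sits closest to the "NLP with spaCy" entry, so that title ranks first, followed by "Deep Learning Basics".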

The foundation of this work is O’Reilly’s ontology, also known as its knowledge graph, which complements what deep learning can provide. That graph describes the semantics of O’Reilly’s content areas, its audience interactions, vendor and sponsor relations, etc. One lesson learned quickly was the importance of maintaining integrity between the human-scale ontology graph and the large-scale data products produced by ML automation. Two open source projects support this work: PyTextRank, which builds atop spaCy, NetworkX, and datasketch for graph-based NLP, and nbtransom, which enables people and machines to collaborate on ML pipelines that support “human-in-the-loop” as a design pattern for management using Project Jupyter.
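To give a feel for what graph-based NLP can mean in practice, here is a minimal sketch of the general TextRank idea: build a word co-occurrence graph, then score the nodes with a PageRank-style iteration. This is not PyTextRank’s API and does not use spaCy, NetworkX, or datasketch. The function names `cooccurrence_graph` and `rank_nodes`, the parameter defaults, and the toy token list are assumptions made for illustration only.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Link each token to the tokens that follow it within the window."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def rank_nodes(graph, damping=0.85, iterations=50):
    """Score nodes with a simple PageRank power iteration; best first."""
    nodes = list(graph)
    scores = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            # Each neighbor shares its score equally among its own neighbors.
            incoming = sum(scores[m] / len(graph[m]) for m in graph[n])
            updated[n] = (1 - damping) / len(nodes) + damping * incoming
        scores = updated
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical, pre-tokenized input; a real pipeline would tokenize and
# filter by part of speech first.
tokens = ["graph", "ranking", "graph", "search", "graph", "nlp"]
ranked = rank_nodes(cooccurrence_graph(tokens))
print(ranked[0])
```

In this toy input every other word is adjacent only to "graph", so "graph" becomes the hub of the co-occurrence graph and receives the highest score.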

Some of these experiences at O’Reilly are unusual, since the company’s content comes from many different publishers (all on Safari), spans a broad range of disciplines and content types, and is served to thousands of enterprise organizations. Overall, this work reflects recent major changes in industry away from “reference” content, with substantially more emphasis now placed on learning: that is, less about topics and keywords and more about job roles and skills.

Paco Nathan

Paco Nathan is known as a “player/coach” with core expertise in data science, natural language processing, machine learning, and cloud computing. He has 35+ years of experience in the tech industry, at companies ranging from Bell Labs to early-stage startups. His recent roles include director of the Learning Group at O’Reilly and director of community evangelism at Databricks and Apache Spark. Paco is the cochair of Rev conference and an advisor for Amplify Partners, Deep Learning Analytics, Recognai, and Primer. He was named one of the "top 30 people in big data and analytics" in 2015 by Innovation Enterprise.