Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Applying petabyte-scale analytics and machine learning to billions of news reading sessions

Andrew Montalenti (Parse.ly)
11:20am–12:00pm Thursday, 09/13/2018
Data science and machine learning
Location: 1A 06/07 Level: Intermediate
Secondary topics: Media, Marketing, Advertising, Text and Language processing and analysis
Average rating: *****
(5.00, 1 rating)

Who is this presentation for?

  • CTOs, VPs of engineering, data engineers, data scientists, machine learning engineers, and web analysts

What you'll learn

  • Explore real-world production applications of open source and cloud technology
  • Learn how web analytics data enables insights into consumer behavior and what modern machine learning and natural language processing techniques work well against web content

Description

Parse.ly runs a real-time web and content analytics platform that serves 350+ enterprise clients, 30,000+ site operators, and thousands of high-traffic sites. Customers use the platform to understand audience, content, and attention at a granular level, and the aggregate data exhaust from these integrations provides a front-row seat to what the internet is looking at today.

Andrew Montalenti explains how consumer attention in the web era really works (e.g., to what degree Facebook and Google dominate consumer web attention versus more niche platforms). Andrew also showcases how Parse.ly recently applied modern natural language processing and machine learning techniques to better understand its evolving dataset of more than a million unique pieces of content per day, including how the company classified all web pages into a structured content taxonomy and automatically extracted relevant topics and entities.

Alongside some of these network data findings related to news trends, social networks, search engines, and device usage patterns, Andrew also digs into the technology running under the hood, particularly multicloud setups (in the hundreds) with Elasticsearch, Cassandra, Kafka, Storm, and Spark, and discusses open source projects the company has built and released, such as PyKafka and streamparse. Andrew even talks about Parse.ly’s recent adoption of serverless cloud tooling, which makes machine learning easier.

Andrew concludes by explaining how Parse.ly’s web-wide trend data has been used so far, such as for content strategy inside major newsrooms as well as for predicting offline consumer behavior (e.g., which movies would win at the box office based on the web attention those movies received in the weeks prior).

Photo of Andrew Montalenti

Andrew Montalenti

Andrew Montalenti is the cofounder and CTO of Parse.ly, a widely used real-time web content analytics platform. The product is trusted daily by editors at HuffPost, Time, TechCrunch, Slate, Quartz, the Wall Street Journal, and over 350 other leading digital companies. Andrew is a dedicated Pythonista and has presented his team’s work at the PyCon and PyData conferences. He is also the cohost of the web data and analytics podcast The Center of Attention. For more information, check out Parse.ly’s research on internet attention via @parsely.