Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Document vectors in the wild: Building a content recommendation system for

James Dreiss (Reuters)
1:15pm–1:55pm Wednesday, 09/12/2018
Data science and machine learning
Location: 1A 06/07
Level: Intermediate
Secondary topics: Media, Marketing, Advertising, Recommendation Systems, Text and Language processing and analysis
Average rating: 3.67 (3 ratings)

Who is this presentation for?

  • Data scientists, data engineers, data analysts, journalists, and editors

Prerequisite knowledge

  • A basic understanding of machine learning and natural language processing

What you'll learn

  • Explore best practices for implementing content-to-content recommendation systems at a news organization


In the summer of 2017, embarked on an ambitious redesign of its article pages, specifically a scroll design in which articles that users request to read are immediately followed by related (or possibly unrelated) articles. The initial launch of the scroll model made recommendations based on content alone, independent of user behavior. Given the advantages of word and document embedding models and the particularities of content, the system was designed to use document vectors to determine article similarity. Because they are unsupervised, document vectors need some supervised learning assistance if they are to be used in a production system.
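As a rough illustration of what "using document vectors to determine article similarity" can look like, here is a minimal sketch that ranks articles by cosine similarity. The article IDs, the three-dimensional vectors, and the function names are all hypothetical; the abstract does not say which embedding model or similarity measure the production system uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def most_similar(query_id, vectors, top_n=3):
    """Return up to top_n (article_id, score) pairs, most similar first."""
    query = vectors[query_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in vectors.items()
        if other_id != query_id  # never recommend the article being read
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical toy vectors; a real document embedding would have far
# more dimensions and would be learned from the article text.
ARTICLE_VECTORS = {
    "fed-rates": [0.9, 0.1, 0.0],
    "ecb-rates": [0.8, 0.2, 0.1],
    "world-cup": [0.0, 0.1, 0.9],
}

recommendations = most_similar("fed-rates", ARTICLE_VECTORS, top_n=2)
```

With these toy vectors, the other interest-rate article ranks well above the sports article for a reader of "fed-rates".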

James Dreiss discusses the development of the supervised topic filtering model that sits on top of the document vector model, as well as additional filtering strategies. Measuring performance of word and document vectors is notoriously difficult, but some heuristics have been developed. James offers a brief overview of measuring word and document vector performance and explains how he ultimately tackled the problem. James also details how he tested a pet theory that users would want diversity in content, especially given the wall-to-wall coverage of certain subjects, such as Donald Trump, and shares the results of serving both similarly and dissimilarly related content to users. James concludes by covering the cookie-based personalization system that was later implemented for content recommendation on article scrolls, including test results comparing the two systems.
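One way to picture a topic filter sitting on top of a document vector model is sketched below: candidates produced by vector similarity are kept only if they share at least one topic label with the source article. This is an assumption-laden illustration, not the talk's actual method. The topic labels are hard-coded here, whereas in practice they would come from a trained classifier, and all names and data are hypothetical.

```python
def filter_by_topic(source_id, candidates, topics):
    """Keep candidates that share at least one topic with the source.

    candidates: list of (article_id, similarity_score) pairs, assumed to
                come from a document-vector similarity step.
    topics:     dict mapping article_id to a set of topic labels, assumed
                to be the output of a supervised topic classifier.
    """
    source_topics = topics.get(source_id, set())
    return [
        (article_id, score)
        for article_id, score in candidates
        if source_topics & topics.get(article_id, set())
    ]

# Hypothetical similarity output and topic labels for illustration only.
CANDIDATES = [("ecb-rates", 0.98), ("celebrity-loan", 0.91), ("bank-merger", 0.75)]
TOPICS = {
    "fed-rates": {"economy", "central-banks"},
    "ecb-rates": {"economy", "central-banks"},
    "celebrity-loan": {"entertainment"},
    "bank-merger": {"economy", "deals"},
}

filtered = filter_by_topic("fed-rates", CANDIDATES, TOPICS)
```

Here "celebrity-loan" is dropped despite its high vector similarity because it shares no topic label with the source article, which is the kind of correction a supervised layer can provide over an unsupervised one.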

James Dreiss


James Dreiss is a senior data scientist at Reuters. Previously, he worked at the Metropolitan Museum of Art in New York. He studied at New York University and the London School of Economics.

Comments on this page are now closed.


James Dreiss
09/26/2018 7:39am EDT

thanks jay! looks like the slides are just available on this page

Jay Urbain | PROFESSOR
09/25/2018 3:16pm EDT

Hi James,
Caught your talk on the TWIML podcast and really enjoyed it. I went to: but your slides are not posted.

James Dreiss
09/18/2018 6:46am EDT

thanks! slides should be posted here:

Shashank Shashikant Rao | DATA SCIENTIST
09/17/2018 10:33am EDT

Nice talk! Was interesting to see how document vectors are used @ Reuters. Can you please share the slides?