The Model and the Train Wreck: A Training Data How-to

Monica Rogati (Data Natives)
Deep Data, A-B

Getting training data for a recommender system is easy: if users clicked it, it’s a positive – if they didn’t, it’s a negative.

… Or is it? Users can only click on what your current system chose to show them, so you’ve probably just learned a model layered on top of your existing algorithm, now and every time you re-train. And what do you do when the data product you’re building doesn’t have any users yet? Do you really launch with random results, hand-label 50K examples, or ask a Turker to pretend they’re User #1337?

Unlike having a better algorithm, having better training data can improve your results by orders of magnitude. Yet training data generation is often an afterthought — a footnote in a formula-filled publication.

In this talk, we use examples from production recommender systems to bring training data to the forefront: from overcoming presentation bias, to the art of crowdsourcing subjective judgments, to creative exploitation of data exhaust and feature creation.
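As a rough illustration of the presentation-bias point (not taken from the talk itself), the sketch below contrasts the naive "clicked is positive, everything else is negative" rule with a "skip-above" style heuristic, a commonly used alternative that only labels items the user plausibly looked at. The function names and the toy impression are hypothetical, and a production system would likely need a more careful model of what users actually examined.

```python
from typing import List, Set, Tuple

Label = Tuple[str, int]  # (item id, 1 = positive / 0 = negative)


def naive_labels(shown: List[str], clicked: Set[str]) -> List[Label]:
    """Naive rule: clicked items are positives, all other shown items negatives."""
    return [(item, 1 if item in clicked else 0) for item in shown]


def skip_above_labels(shown: List[str], clicked: Set[str]) -> List[Label]:
    """Label only items at or above the lowest clicked position.

    Assumption: the user scanned the ranked list top-down, so unclicked
    items below the last click may never have been seen and are dropped
    rather than counted as negatives.
    """
    clicked_positions = [i for i, item in enumerate(shown) if item in clicked]
    if not clicked_positions:
        return []  # no click, so no evidence about what was examined
    last = max(clicked_positions)
    return [(item, 1 if item in clicked else 0) for item in shown[: last + 1]]


# Hypothetical impression: five ranked results, the user clicked the third.
shown = ["a", "b", "c", "d", "e"]
clicked = {"c"}

naive = naive_labels(shown, clicked)
skip_above = skip_above_labels(shown, clicked)
```

In this toy case the naive rule marks "d" and "e" as negatives even though the user may never have scrolled to them, while the skip-above variant keeps only "a", "b", and "c".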

Monica Rogati

Data Natives

As one of the founding members of the LinkedIn data science team, Monica turns data into products, actionable insights and (news) stories.

Monica obtained her PhD in Computer Science from Carnegie Mellon, where she focused on text mining and applied machine learning. At LinkedIn, she pioneered data-driven products with multimillion-dollar business impact and is currently building mathematical models that power LinkedIn’s personalized recommendations. When she isn’t naming projects after Harry Potter, Monica finds stories in the LinkedIn data about the most overused buzzwords, trending job titles, entrepreneur DNA, promotion cycles for Millennials and first names that tend to succeed. Her stories have appeared in thousands of media outlets – from the Wall Street Journal & The Economist to NPR & CNN to Real Simple & (yes!) Howard Stern.

