Deep Data

Deep Data, Ballroom AB

Deep Data is a no-holds-barred program for data scientists. The advanced technical content will keep you up to speed with the latest techniques, and give you the opportunity to debate and network with the most skilled data scientists in our industry.


9:00am – 9:45am
SQL and NoSQL Are Two Sides Of The Same Coin

Michael Rys

Contrary to popular belief, SQL and NoSQL are not at odds with each other; they are duals. In fact, NoSQL should really be called coSQL. Recognizing this duality can change the way we think about which technology to use when, and what we need to invest in next.

9:45am – 10:30am
From Knowing ‘What’ To Understanding ‘Why’

Claudia Perlich

With the collection of almost every piece of information about your customers comes the ability to start asking your data the right question: why do they do what they do? And even more: what would they do if I could interact with them? We show, for the case of online display advertising, how causal analysis gives interesting new answers about the right (and wrong) ways of spending your money.

10:30am – 11:00am Break

11:00am – 11:30am
The Model and the Train Wreck: A Training Data How-to

Monica Rogati

Getting training data for a recommender system is easy: if users clicked it, it’s a positive – if they didn’t, it’s a negative. … Or is it? In this talk, we use examples from production recommender systems to bring training data to the forefront: from overcoming presentation bias to the art of crowdsourcing subjective judgments to creative data exhaust exploitation and feature creation.

11:30am – 12:00pm
Corpus Bootstrapping with NLTK

Jacob Perkins

Learn various ways to bootstrap a custom corpus for training highly accurate natural language processing models. Real world examples will be presented with Python code samples using NLTK. Each example will show you how, starting from scratch, you can rapidly produce a highly accurate custom corpus for training the kinds of natural language processing models you need.

The Importance of Importance: An Introduction to Feature Selection
Ben Gimpert

Twenty-first century big data is being used to train predictive models of emotional sentiment, customer churn, patient health, and other behavioral complexities. Variable importance and feature selection reduce the dimensionality of our models, so an unfeasible and complex problem may become somewhat more predictable.
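As a rough illustration of the idea (not material from the talk), one simple filter-style form of feature selection ranks each variable by the absolute value of its correlation with the target and keeps the strongest few. The column names, the toy churn data, and the helper names `pearson`, `rank_features`, and `select_top_k` are all hypothetical.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (var_x * var_y) ** 0.5


def rank_features(columns, target):
    """Feature names sorted by |correlation with target|, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


def select_top_k(columns, target, k):
    """Keep only the k highest-ranked columns (reduces dimensionality)."""
    keep = rank_features(columns, target)[:k]
    return {name: columns[name] for name in keep}


# Hypothetical toy data: three candidate features and a churn label.
columns = {
    "visits": [1, 2, 3, 4, 5],
    "noise": [3, 1, 4, 1, 5],
    "constant": [7, 7, 7, 7, 7],
}
churn = [0, 0, 1, 1, 1]

ranking = rank_features(columns, churn)
reduced = select_top_k(columns, churn, k=1)
```

Correlation only captures linear, one-variable-at-a-time importance; model-based importance measures may rank the same features differently.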

12:30pm – 1:30pm Lunch

Social Network Analysis Isn’t Just For People

Matt Biddulph

The tools of social network analysis are based on mathematical network theory. There is very little in these techniques that actually requires that the data represents social activity. We’ll show how these techniques can be applied to data from areas such as geo, linguistics and the Wikipedia link graph. We’ll visualise and explore the data using Gephi, the “Photoshop for graphs”.
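To make the point concrete, here is a minimal sketch (an illustration only, not the speaker's code) of one standard network measure, in-degree centrality, applied to a link graph rather than to people. The page titles and edges are invented, and the helper name `in_degree_centrality` is hypothetical; the sketch assumes no duplicate edges or self-links.

```python
from collections import Counter


def in_degree_centrality(edges):
    """Fraction of the other nodes that link to each node.

    edges: iterable of (source, target) pairs in a directed graph.
    """
    edges = list(edges)
    nodes = {node for edge in edges for node in edge}
    incoming = Counter(target for _, target in edges)
    denom = max(len(nodes) - 1, 1)  # avoid dividing by zero for tiny graphs
    return {node: incoming[node] / denom for node in nodes}


# Hypothetical fragment of a link graph between encyclopedia pages.
links = [
    ("Graph theory", "Mathematics"),
    ("Set theory", "Mathematics"),
    ("Graph theory", "Set theory"),
]

centrality = in_degree_centrality(links)
most_linked = max(centrality, key=centrality.get)
```

Nothing in the function refers to people or friendships; it needs only nodes and directed edges.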

Array Theory vs. Set Theory in Managing Data

Robert Lefkowitz

Relational databases were based on set theory, which insists that the order of items does not matter. For many (most?) data problems, however, order does matter. By using array theory, a relational-like database gains a considerable advantage over set-theory-based engines.
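A small sketch of why order can matter, using Python's built-in `set` and `list` as stand-ins for the two models (an analogy chosen for illustration, not the speaker's database design). The sample values and the helper name `deltas` are hypothetical.

```python
# Hypothetical readings in the order they arrived.
readings = [3, 1, 3, 2]

# A set keeps membership only: duplicates and arrival order are gone.
as_set = set(readings)


def deltas(sequence):
    """Change between each consecutive pair; only defined for ordered data."""
    return [later - earlier for earlier, later in zip(sequence, sequence[1:])]


changes = deltas(readings)                 # uses the original ordering
same_members = as_set == set([1, 2, 3])    # True: order never entered into it
```

A question like "how much did the value change between consecutive readings" can be answered directly from the list, while the set would need an extra explicit ordering column to answer it at all.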

3:00pm – 3:30pm Break

3:30pm – 4:00pm
Survival Analysis for Cache Time-to-Live Optimization

Robert Lancaster

We examine the effectiveness of a statistical technique known as survival analysis to optimize the cache time-to-live for hotel rates in a hotel rate cache. We describe how we collect and prepare nearly a billion records per day utilizing MongoDB and Hadoop. Finally, we show how this analysis is improving the operation of our hotel rate cache.
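For readers unfamiliar with the technique, the sketch below shows one common survival-analysis tool, the Kaplan-Meier estimator, applied to a toy version of the problem. It is an assumption that this resembles the speaker's approach; the abstract does not say which estimator is used. Each observation is a hypothetical pair of (hours a cached rate was watched, whether the rate was seen to change), and `kaplan_meier` and `choose_ttl` are illustrative names.

```python
def kaplan_meier(observations):
    """Kaplan-Meier survival curve.

    observations: list of (duration, event_observed) pairs, where
    event_observed is False for censored records (the rate had not
    changed when observation stopped).
    Returns [(time, estimated P(rate unchanged beyond time)), ...]
    with one entry per distinct time at which a change was observed.
    """
    ordered = sorted(observations)
    at_risk = len(ordered)
    survival = 1.0
    curve = []
    i = 0
    while i < len(ordered):
        t = ordered[i][0]
        events = 0
        leaving = 0
        # Group all records that share the same duration.
        while i < len(ordered) and ordered[i][0] == t:
            if ordered[i][1]:
                events += 1
            leaving += 1
            i += 1
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def choose_ttl(curve, min_survival):
    """Longest event time whose estimated survival is still >= min_survival."""
    ttl = None
    for t, s in curve:
        if s < min_survival:
            break
        ttl = t
    return ttl


# Hypothetical observations: (hours observed, rate changed?).
observations = [(1, True), (2, True), (2, False), (3, True), (4, False)]
curve = kaplan_meier(observations)
ttl = choose_ttl(curve, min_survival=0.5)
```

The threshold expresses how much staleness risk is acceptable: a higher `min_survival` yields a shorter time-to-live and fresher, but more frequently refreshed, cache entries.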

The Data Science Debate

Peter Skomoroch, Michael Driscoll, DJ Patil, Amy Heineike, Pete Warden, Toby Segaran

End the day by joining leading data scientists in debating the hot issues in the profession.



