Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Data Science & Advanced Analytics conference sessions

Tuesday, March 29

9:00am–5:00pm Tuesday, 03/29/2016
Location: LL20 A
T.J. Alumbaugh (Continuum Analytics), James Powell (NumFOCUS), Bryan Van de Ven (Continuum Analytics), Sarah Bird (Continuum Analytics), Jake VanderPlas (eScience Institute, University of Washington), Katrina Riehl (Continuum Analytics)
Average rating: ****.
(4.33, 18 ratings)
Python has become an increasingly important part of the data-engineer and analytic-tool landscapes. PyData at Strata provides in-depth coverage of the tools and techniques gaining traction with the data audience, including IPython Notebook, NumPy/matplotlib, SciPy, and scikit-learn, and explores how to scale Python performance, including handling large, distributed datasets.
9:00am–5:00pm Tuesday, 03/29/2016
Location: LL20 C
Garrett Grolemund (RStudio), Nina Zumel (Win-Vector LLC), John Mount (Win-Vector LLC), Stephen Elston (Quantia Analytics, LLC)
Average rating: ***..
(3.88, 8 ratings)
From advanced visualization, collaboration, and reproducibility to big data, R Day at Strata covers a raft of current topics that analysts and R users need to pay attention to. The R Day tutorials come from leading luminaries and R committers, the folks keeping the R ecosystem apace with the challenges facing analysts and others who work with data.
9:00am–5:00pm Tuesday, 03/29/2016
Location: 210 B/F
Chris DuBois (Dato), Brian Kent (Dato), Srikrishna Sridhar (Dato), Piotr Teterwak (Dato)
Average rating: ***..
(3.21, 29 ratings)
This hands-on tutorial provides a quick start to building intelligent business applications using machine learning. Learn about machine-learning basics, feature engineering, anomaly detection, recommender systems, and deep learning as you are guided through all the steps of prototyping and production: data cleaning, feature engineering, model building and evaluation, and deployment.

Wednesday, March 30

11:00am–11:40am Wednesday, 03/30/2016
Location: LL20 D
Chris Sanden (Netflix), Christopher Colburn (Netflix)
Average rating: ****.
(4.50, 20 ratings)
Chris Sanden and Christopher Colburn outline a shared infrastructure for doing anomaly detection. Chris and Christopher explain how their solution addresses both real-time and batch use cases and offer a framework for performance evaluation.
11:50am–12:30pm Wednesday, 03/30/2016
Location: LL20 A
Tags: ai, ecommerce
Eric Colson (Stitch Fix)
Average rating: ****.
(4.42, 24 ratings)
Recommender systems use machine-learning algorithms to surface relevant products to consumers. While they are extremely effective, they cannot fully replace human interpretation. The two have very different capabilities that are additive. Eric Colson shows what's possible when the unique contributions of machines are combined with those of human experts to create a truly personalized experience.
11:50am–12:30pm Wednesday, 03/30/2016
Location: LL20 D
Average rating: ***..
(3.23, 13 ratings)
If you consider user click paths a process, you can apply process mining. Process mining models users based on their actual behavior, which allows us to compare new clicks with modeled behavior and report any inconsistencies. Bolke de Bruin and Hylke Hendriksen explain how ING implemented process mining on Spark Streaming, enabling real-time fraud detection.
1:50pm–2:30pm Wednesday, 03/30/2016
Location: LL20 A
John Berryman (Eventbrite)
Average rating: ****.
(4.00, 7 ratings)
At Eventbrite, users can serendipitously discover events they will love. But making this possible isn't easy. Events are short lived, and by the time Eventbrite can build an adequate collaborative-filtering model, the event is already over. John Berryman explains how Eventbrite overcomes these technical challenges with a combination of collaborative-filtering and content-based methods.
1:50pm–2:30pm Wednesday, 03/30/2016
Location: LL20 D
Tags: featured
Average rating: ****.
(4.57, 23 ratings)
Data scientists inhabit such an ever-changing landscape of languages, packages, and frameworks that it can be easy to succumb to tool fatigue. If this sounds familiar, you may have missed the increasing popularity of Linux containers in the DevOps world, in particular Docker. Michelangelo D'Agostino demonstrates why Docker deserves a place in every data scientist’s toolkit.
2:40pm–3:20pm Wednesday, 03/30/2016
Location: LL20 A
Robert Grossman (University of Chicago)
Average rating: ***..
(3.86, 14 ratings)
There is a big difference between running a machine-learning algorithm manually from time to time and building a production system that runs thousands of machine-learning algorithms each day on petabytes of data, while also dealing with all the edge cases that arise. Robert Grossman discusses some of the lessons learned when building such a system and explores the tools that made the job easier.
2:40pm–3:20pm Wednesday, 03/30/2016
Location: LL20 D
Tags: featured
Average rating: ****.
(4.78, 9 ratings)
BayesDB enables rapid prototyping and incremental refinement of statistical models by combining a model-independent declarative query language, BQL, with machine-assisted modeling and compositional models. Richard Tibbetts and Vikash Mansinghka explore the applications of BayesDB for analyzing and understanding developmental economics data in collaboration with the Gates Foundation.
2:40pm–3:20pm Wednesday, 03/30/2016
Location: 210 A/E
Sandy Ryza (Clover Health)
Average rating: ***..
(3.48, 27 ratings)
Want to build models over data every second from millions of sensors? Dig into the histories of millions of financial instruments? Sandy Ryza discusses the unique challenges of time series data and explains how to work with it at scale. Sandy then introduces the open source Spark-Timeseries library, which provides a natural way of munging, manipulating, and modeling time series data.
2:40pm–3:20pm Wednesday, 03/30/2016
Location: 230 C
Wes McKinney (Two Sigma Investments), Jacques Nadeau (Dremio)
Average rating: ****.
(4.07, 15 ratings)
Hadoop’s traditional batch technologies are quickly being supplanted by in-memory columnar execution to deliver value from data faster. Wes McKinney and Jacques Nadeau provide an overview of in-memory columnar execution, survey key related technologies, including Kudu, Ibis, Impala, and Drill, and cover a sample use case using Ibis in conjunction with Apache Drill to deliver real-time conclusions.
4:20pm–5:00pm Wednesday, 03/30/2016
Location: LL20 A
Erik Andrejko (The Climate Corporation)
Average rating: ****.
(4.50, 4 ratings)
Best practices from scientific research can significantly increase the pace and quality of data science projects. Erik Andrejko discusses the benefits and challenges of reproducibility and collaboration, including review and inter-team communication, for data science work at the Climate Corporation.
4:20pm–5:00pm Wednesday, 03/30/2016
Location: LL20 D
Brandon Ballinger (Cardiogram), Johnson Hsieh (Cardiogram)
Average rating: ****.
(4.67, 9 ratings)
Each year, 15 million people suffer strokes, and at least a fifth of those are due to atrial fibrillation, the most common heart arrhythmia. Brandon Ballinger reports on a collaboration between UCSF cardiologists and ex-Google data scientists that detects atrial fibrillation with deep learning.
4:20pm–5:00pm Wednesday, 03/30/2016
Location: 210 C/G
Tags: real-time, ai
Alex Ingerman (Amazon Web Services)
Average rating: ***..
(3.62, 8 ratings)
Alex Ingerman explains how several AWS services, including Amazon Machine Learning, Amazon Kinesis, AWS Lambda, and Amazon Mechanical Turk, can be tied together to build a predictive application to power a real-time customer-service use case.
5:10pm–5:50pm Wednesday, 03/30/2016
Location: LL20 A
Moderated by:
Michael Dauber (Amplify Partners)
Panelists:
Yael Garten (LinkedIn), Monica Rogati (Data Natives), Daniel Tunkelang (Various)
Average rating: ****.
(4.14, 7 ratings)
We’ve all heard that rare breed, the data scientist, described as a unicorn. In building your DS team, should you hold out for that unicorn or create groups of specialists who can work together? Michael Dauber, Yael Garten, Monica Rogati, and Daniel Tunkelang discuss the pros and cons of various team models to help you decide what works best for your particular situation and organization.
5:10pm–5:50pm Wednesday, 03/30/2016
Location: LL20 D
Josh Patterson (Skymind), Dave Kale (Skymind), Zachary Lipton (University of California, San Diego)
Average rating: ****.
(4.00, 11 ratings)
Time series data is increasingly ubiquitous with both the adoption of electronic health record (EHR) systems in hospitals and clinics and the proliferation of wearable sensors. Josh Patterson, David Kale, and Zachary Lipton bring the open source deep learning library DL4J to bear on the challenge of analyzing clinical time series using recurrent neural networks (RNNs).

Thursday, March 31

11:00am–11:40am Thursday, 03/31/2016
Location: LL20 A
Chi-Yi Kuan (LinkedIn), Weidong Zhang (LinkedIn), Tiger Zhang (LinkedIn)
Average rating: ****.
(4.29, 24 ratings)
Chi-Yi Kuan, Weidong Zhang, and Yongzheng Zhang explain how LinkedIn has built a "voice of member" platform to analyze hundreds of millions of text documents. Chi-Yi, Weidong, and Yongzheng illustrate the critical components of this platform and showcase how LinkedIn leverages it to derive insights such as customer value propositions from an enormous amount of unstructured data.
11:50am–12:30pm Thursday, 03/31/2016
Location: LL20 A
Travis Oliphant (Continuum Analytics)
Average rating: ****.
(4.19, 21 ratings)
Despite Python's popularity throughout the data-engineering and data science workflow, the principles behind its performance and scaling behavior are less understood. Travis Oliphant explains best practices and modern tools to scale Python to larger-than-memory and distributed workloads without sacrificing its ease of use or being forced to adopt heavyweight frameworks.
1:50pm–2:30pm Thursday, 03/31/2016
Location: LL20 A
Marcel Kornacker (Cloudera), Alexander Behm (Cloudera)
Average rating: ***..
(3.86, 7 ratings)
Marcel Kornacker explains how to use nested data structures to increase analytic productivity. Marcel uses the well-known TPC-H schema to demonstrate how to simplify analytic workloads with nested schemas.
2:40pm–3:20pm Thursday, 03/31/2016
Location: LL20 A
Tags: science
Siddha Ganju (Nvidia)
Average rating: ***..
(3.64, 11 ratings)
Siddha Ganju explains how CERN uses machine-learning models to predict which datasets will become popular over time. This helps to replicate the datasets that are most heavily accessed, which improves the efficiency of physics analysis in CMS. Analyzing this data leads to useful information about the physical processes.
4:20pm–5:00pm Thursday, 03/31/2016
Location: LL20 A
Scott Draves (Two Sigma Open Source)
Average rating: ***..
(3.40, 5 ratings)
Scott Draves gives an overview of the Beaker notebook, a new open source tool for data scientists. Beaker was designed to be polyglot: a single notebook may contain cells from multiple languages that communicate with one another through a unique feature called autotranslation. Scott discusses motivations for the design, reviews the architecture, and gives a demo of Beaker in action.
4:20pm–5:00pm Thursday, 03/31/2016
Location: LL20 C
Jeroen Janssens (Data Science Workshops)
Average rating: ****.
(4.50, 4 ratings)
Vowpal Wabbit (VW) is a fast out-of-core learning system that pushes the frontier of machine learning. Jeroen Janssens offers a practical introduction to VW from both RStudio and the Unix command line and demonstrates how it can be used to perform tasks such as classification, regression, matrix factorization, and topic modeling.
4:20pm–5:00pm Thursday, 03/31/2016
Location: LL21 B
Sreeni Iyer (Quad Analytix), Anurag Bhardwaj (Quad Analytix)
Average rating: *****
(5.00, 6 ratings)
Typically, 8–10% of product URLs in ecommerce sites are misclassified. Sreeni Iyer and Anurag Bhardwaj discuss a machine-learning-based solution that relies on an innovative fusion of classifiers that are both text- and image-based, along with human touch to handle edge cases, to automatically classify product URLs according to a canonical taxonomic organization with a high F-score.