Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Schedule: Data science & advanced analytics sessions

Inside the world of data practitioners—from the hard science of the latest algorithms and advances in machine learning, to the thorny issues of cultural change and team-building.

9:00am - 5:00pm Monday, March 13 & Tuesday, March 14
Location: 212 B
Bruce Martin (Cloudera)
Average rating: ****.
(4.00, 1 rating)
Bruce Martin walks you through applying data science methods to real-world challenges in different industries, offering preparation for data scientist roles in the field. Join in to learn how Spark and Hadoop enable data scientists to help companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities.
9:00am - 5:00pm Monday, March 13 & Tuesday, March 14
Location: 212D
Secondary topics:  Deep learning
Robert Schroll (The Data Incubator)
Average rating: ***..
(3.20, 5 ratings)
Robert Schroll demonstrates TensorFlow's capabilities through its Python interface and explores TFLearn, a high-level deep learning library built on TensorFlow. Join in to learn how to use TFLearn and TensorFlow to build machine-learning models on real-world data.
9:00am - 12:30pm Tuesday, March 14, 2017
Location: LL21 C/D Level: Intermediate
Secondary topics:  R
Vanja Paunic (Microsoft), Robert Horton (Microsoft), Hang Zhang (Microsoft), Srini Kumar (LevaData, Inc.), Mengyue Zhao (Microsoft), John-Mark Agosta (Microsoft), Mario Inchiosa (Microsoft), Debraj GuhaThakurta (Microsoft Corporation)
Average rating: **...
(2.50, 4 ratings)
Join in to learn how to do scalable, end-to-end data science in R on single machines as well as on Spark clusters. You'll be assigned an individual Spark cluster with all contents preloaded and software installed and use it to gain experience building, operationalizing, and consuming machine-learning models using distributed functions in R.
9:00am - 12:30pm Tuesday, March 14, 2017
Location: LL21 E/F Level: Intermediate
Secondary topics:  Deep learning
Amy Unruh (Google), Yufeng Guo (Google)
Average rating: ***..
(3.69, 16 ratings)
Amy Unruh and Yufeng Guo walk you through training and deploying a machine-learning system using TensorFlow, a popular open source library. Amy and Yufeng begin by giving an overview of TensorFlow and demonstrating some fun, already-trained TensorFlow models.
9:00am - 12:30pm Tuesday, March 14, 2017
Location: 210 C/G Level: Intermediate
Secondary topics:  R
Stephen Elston (Quantia Analytics, LLC), Ryan Hafen (Hafen Consulting)
Average rating: ****.
(4.12, 8 ratings)
Divide and recombine techniques provide scalable methods for exploration and visualization of otherwise intractable datasets. Stephen Elston and Ryan Hafen lead a series of hands-on exercises to help you develop skills in exploration and visualization of large, complex datasets using R, Hadoop, and Spark.
9:00am - 5:00pm Tuesday, March 14, 2017
Location: LL20 B
Michael Abbott (Kleiner Perkins Caufield & Byers), Christopher Pouliot (Nio), Jennifer Anderson, Renee DiResta (Haven), Coco Krumme (Haven | UC Berkeley), Ryan Baumann (Mapbox), Jay White Bear (IBM), Andre Luckow (BMW Group), Rajiv Paul (Yakit), Evangelos Simoudis (Synapse Partners), Roland Major (Transport for London), Rodrigo Fontecilla (Unisys), Lloyd Palum (Vnomics), Andreas Ribbrock (#zeroG, A Lufthansa Systems Company)
Data, Transportation, and Logistics Day offers a daylong deep-dive into how data science is changing transportation and logistics. We’ll investigate the latest advances in and applications of self-driving vehicles, automated drones, and embedded sensors and explore how new uses of data are challenging the industry to evolve infrastructure for the future.
1:30pm - 5:00pm Tuesday, March 14, 2017
Location: LL20 D Level: Intermediate
Secondary topics:  Deep learning
Dave Kale (Skymind), Susan Eraly (Skymind), Josh Patterson (Skymind)
Average rating: ***..
(3.33, 3 ratings)
Dave Kale, Susan Eraly, and Josh Patterson explain how to build, train, and deploy neural networks using Deeplearning4j. Topics include the fundamentals of deep learning, ND4J and DL4J, and scalable training using GPUs and Apache Spark. You'll gain hands-on experience with several models, including convolutional and recurrent neural nets.
1:30pm - 5:00pm Tuesday, March 14, 2017
Location: LL21 B Level: Intermediate
Secondary topics:  Pydata
Juliet Hougland (Cloudera)
Average rating: ****.
(4.00, 2 ratings)
Using an interactive demo format with accompanying online materials and data, data scientist Juliet Hougland offers a practical overview of the basics of using Python data tools with a Hadoop cluster.
1:30pm - 5:00pm Tuesday, March 14, 2017
Location: LL21 C/D Level: Intermediate
Secondary topics:  R
John Mount (Win-Vector LLC)
Average rating: ****.
(4.83, 6 ratings)
Sparklyr provides an R interface to Spark. With sparklyr, you can manipulate Spark datasets, bring them into R for analysis and visualization, and orchestrate distributed machine learning in Spark from R with the Spark MLlib and H2O Sparkling Water libraries. John Mount demonstrates how to use sparklyr to analyze big data in Spark.
10:10am - 10:25am Wednesday, March 15, 2017
Location: Grand Ballroom
Secondary topics:  Geospatial, Sports
Rajiv Maheswaran (Second Spectrum)
Average rating: ****.
(4.91, 35 ratings)
What happens when machines understand sports? As Rajiv Maheswaran demonstrates, everything changes, from how coaches coach and how players play to how storytellers tell stories and how fans experience the game.
11:00am - 11:40am Wednesday, March 15, 2017
Location: LL20 A
Secondary topics:  Data Platform, Logistics
Peng Du (Uber Inc.), Randy Wei (Uber Inc.)
Average rating: ***..
(3.11, 9 ratings)
Peng Du and Randy Wei offer an overview of Uber’s data science workbench, which provides a central platform for data scientists to perform interactive data analysis through notebooks, share and collaborate on scripts, and publish results to dashboards. The workbench is integrated with other Uber services, providing convenient features such as task scheduling, model publishing, and job monitoring.
11:00am - 11:40am Wednesday, March 15, 2017
Location: 230 C Level: Intermediate
Sean Kandel (Trifacta), Karthik Sethuraman (Trifacta)
Average rating: ***..
(3.60, 5 ratings)
It's well known that data analysts spend 80% of their time preparing data and only 20% analyzing it. In order to change that ratio, organizations must build tools specifically designed for working with ad hoc (semistructured) data. Sean Kandel and Karthik Sethuraman explore a new technique leveraging machine learning to discover and profile the inherent structure in ad hoc datasets.
11:00am - 11:40am Wednesday, March 15, 2017
Location: 210 C/G Level: Intermediate
Secondary topics:  Deep learning, ecommerce, Retail
Feng Zhu (Microsoft), Valentine Fontama (Microsoft)
Average rating: ****.
(4.71, 7 ratings)
Although deep learning has proved to be very powerful, few results are reported on its application to business-focused problems. Feng Zhu and Val Fontama explore how Microsoft built a deep learning-based churn prediction model and demonstrate how to explain the predictions using LIME, a novel algorithm published in KDD 2016, to make black box models more transparent and accessible.
11:50am - 12:30pm Wednesday, March 15, 2017
Location: 230 C
Secondary topics:  Data Platform, ecommerce, Hardcore Data Science, Media
Jure Leskovec (Pinterest)
Average rating: ****.
(4.82, 11 ratings)
Pinterest built a flexible, graph-based system for making recommendations to users in real time. The system uses random walks on a user-and-object graph in order to make personalized recommendations to 100+ million Pinterest users out of a catalog of over a billion items. Jure Leskovec explains how Pinterest built its modern recommendation engine and the lessons learned along the way.
11:50am - 12:30pm Wednesday, March 15, 2017
Location: 210 C/G Level: Intermediate
Secondary topics:  Deep learning, Hardcore Data Science, Mobile
Anirudh Koul (Microsoft)
Average rating: ****.
(4.20, 5 ratings)
Over the last few years, convolutional neural networks (CNNs) have risen in popularity, especially in computer vision. Anirudh Koul explains how to bring the power of deep learning to memory- and power-constrained devices like smartphones and drones.
1:50pm - 2:30pm Wednesday, March 15, 2017
Location: 230 C Level: Intermediate
Secondary topics:  Deep learning, Healthcare, Text
David Talby (Pacific AI), Claudiu Branzan (G2 Web Services)
Average rating: ****.
(4.14, 7 ratings)
David Talby and Claudiu Branzan offer a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records. Infrastructure components include Kafka, Spark Streaming, Spark, and Elasticsearch; data science components include spaCy, custom annotators, curated taxonomies, machine-learned dynamic ontologies, and real-time inferencing.
1:50pm - 2:30pm Wednesday, March 15, 2017
Location: 210 C/G Level: Advanced
Secondary topics:  Deep learning, Healthcare
Michael Dusenberry (IBM Spark Technology Center), Frederick Reiss (IBM Spark Technology Center)
Average rating: *****
(5.00, 2 ratings)
Estimating the growth rate of tumors is a very important but very expensive and time-consuming part of diagnosing and treating breast cancer. Michael Dusenberry and Frederick Reiss describe how to use deep learning with Apache Spark and Apache SystemML to automate this critical image classification task.
2:40pm - 3:20pm Wednesday, March 15, 2017
Location: 230 C Level: Intermediate
Secondary topics:  Hardcore Data Science, Healthcare
Robert Grossman (University of Chicago)
Average rating: ***..
(3.73, 11 ratings)
When there is a strong signal in a large dataset, many machine-learning algorithms will find it. On the other hand, when the effect is weak and the data is large, there are many ways to discover an effect that is in fact nothing more than noise. Robert Grossman shares best practices so that you will not be accused of p-hacking.
2:40pm - 3:20pm Wednesday, March 15, 2017
Location: 210 C/G
Secondary topics:  AI, Deep learning
James Bradbury (Salesforce Research)
Average rating: ****.
(4.00, 8 ratings)
James Bradbury offers an overview of PyTorch, a brand-new deep learning framework from developers at Facebook AI Research that's intended to be faster, easier, and more flexible than alternatives like TensorFlow. James makes the case for PyTorch, focusing on the library's advantages for natural language processing and reinforcement learning.
4:20pm - 5:00pm Wednesday, March 15, 2017
Location: 230 C Level: Advanced
Secondary topics:  Hardcore Data Science
Ted Dunning (MapR Technologies)
Average rating: ****.
(4.50, 6 ratings)
Ted Dunning offers an overview of tensor computing, covering in practical terms the high-level principles behind tensor computing systems, and explains how it can be put to good use in a variety of settings beyond training deep neural networks (the most common use case).
4:20pm - 5:00pm Wednesday, March 15, 2017
Location: 210 C/G Level: Intermediate
Secondary topics:  Deep learning, Streaming
Shivnath Babu (Duke University | Unravel Data Systems)
Average rating: ***..
(3.33, 3 ratings)
Shivnath Babu offers an introduction to using deep learning to solve complex problems in IT operations analytics. Shivnath focuses on how deep learning can derive operations insights automatically for the complex big data application stack composed of systems such as Hadoop, Spark, Cassandra, Elasticsearch, and Impala, using examples of open source tools for deep learning.
5:10pm - 5:50pm Wednesday, March 15, 2017
Location: 230 C
Secondary topics:  Hardcore Data Science
Alice Zheng (Amazon)
Average rating: ****.
(4.50, 6 ratings)
In the machine-learning pipeline, feature engineering takes up the majority of the time yet is seldom discussed. Alice Zheng leads a tour of popular feature engineering methods for text, logs, and images, giving you an intuitive and actionable understanding of tricks of the trade.
5:10pm - 5:50pm Wednesday, March 15, 2017
Location: 210 C/G Level: Intermediate
Secondary topics:  Deep learning, Hardcore Data Science
Stephen Merity (Salesforce Research)
Average rating: ****.
(4.67, 3 ratings)
While attention and memory have become important components in many state-of-the-art deep learning architectures, it's not always obvious where they may be most useful. Even more challenging, such models can be very computationally intensive for production. Stephen Merity discusses the most recent techniques, what tasks they show the most promise in, and when they make sense in production systems.
5:10pm - 5:50pm Wednesday, March 15, 2017
Location: 212 A-B Level: Intermediate
Secondary topics:  Financial services
Matar Haller (Winton Capital)
Average rating: *****
(5.00, 2 ratings)
With the exploding growth of video and audio content online, there's an increasing need for indexable and searchable audio. Matar Haller demonstrates how to automatically identify who is speaking when in a recorded conversation using machine learning applied to a corpus of audio recordings. Matar shares how she approached the problem, the algorithms used, and steps taken to validate the results.
11:00am - 11:40am Thursday, March 16, 2017
Location: 230 A
Secondary topics:  Hardcore Data Science, Media, Text
Dorna Bandari (Pinterest Inc.)
Average rating: ****.
(4.00, 2 ratings)
Most internet companies record a constant stream of logs as a user interacts with their application. Depending on the complexity of the application, the logs can be extremely difficult to decipher. Dorna Bandari presents a novel NLP-based method for clustering user sessions in consumer internet applications, which has proved to be extremely effective in both driving strategy and personalization.
11:00am - 11:40am Thursday, March 16, 2017
Location: 230 C
Secondary topics:  Cloud, Deep learning, Hardcore Data Science
Anima Anandkumar (UC Irvine)
Average rating: ****.
(4.67, 3 ratings)
Anima Anandkumar demonstrates how to use preconfigured Deep Learning AMIs and CloudFormation templates on AWS to help speed up deep learning development and shares use cases in computer vision and natural language processing.
11:00am - 11:40am Thursday, March 16, 2017
Location: 210 A/E
Secondary topics:  AI, Deep learning
Rajat Monga (Google)
Average rating: ***..
(3.86, 7 ratings)
Rajat Monga offers an overview of TensorFlow progress and adoption in 2016 before looking ahead to the areas of importance in the future (performance, usability, and ubiquity) and the efforts TensorFlow is making in those areas.
11:00am - 11:40am Thursday, March 16, 2017
Location: LL21 A Level: Beginner
June Andrews (Wise / GE Digital), Frances Haugen (Pinterest)
Average rating: *****
(5.00, 5 ratings)
An experiment at Pinterest revealed somewhat shocking results. When nine data scientists and ML engineers were asked the same constrained question, they gave nine spectacularly different answers. The implications for business are astronomical. June Andrews and Frances Haugen explore the aspects of analysis that cause differences in conclusions and offer some solutions.
11:50am - 12:30pm Thursday, March 16, 2017
Location: 230 A Level: Beginner
Secondary topics:  AI, Hardcore Data Science
Michael Lee Williams (Fast Forward Labs)
Average rating: ***..
(3.80, 5 ratings)
Real-world data is incomplete and imperfect. The right way to handle it is with Bayesian inference. Michael Williams demonstrates how probabilistic programming languages hide the gory details of this elegant but potentially tricky approach, making a powerful statistical method easy and enabling rapid iteration and new kinds of data-driven products.
11:50am - 12:30pm Thursday, March 16, 2017
Location: 230 C
Secondary topics:  Hardcore Data Science
Carlos Guestrin (University of Washington | Apple)
Average rating: *****
(5.00, 4 ratings)
Carlos Guestrin offers an overview of anchors and aLIME, a novel, high-precision technique that explains the predictions of any classifier in an interpretable and faithful manner. Carlos demonstrates the flexibility of these methods by explaining different models for text, image classification, and visual question answering and explores the usefulness of explanations via novel experiments.
11:50am - 12:30pm Thursday, March 16, 2017
Location: 210 C/G Level: Non-technical
Secondary topics:  Media
Brian Lange (Datascope)
Average rating: ****.
(4.00, 2 ratings)
The goal of RCSA's Scialog conferences is to foster collaboration between scientists with different specialties and approaches, and, working with Datascope, RCSA has been doing so in a quantitative way for the last six years. Brian Lange discusses how Datascope and RCSA arrived at the problem, the design choices made in the survey and optimization, and how the results were visualized.
11:50am - 12:30pm Thursday, March 16, 2017
Location: LL21 A
Shoumik Palkar (Stanford University)
Modern data applications combine functions from many libraries and frameworks and cannot achieve peak hardware performance due to data movement across functions. Shoumik Palkar offers an overview of Weld, an optimizing runtime that enables optimizations across disjoint libraries, and explains how to integrate it into frameworks such as Spark SQL for performance gains with no changes to user code.
1:50pm - 2:30pm Thursday, March 16, 2017
Location: LL21 E/F Level: Advanced
Secondary topics:  Hardcore Data Science, Media
Chao Zhong (Microsoft)
Average rating: ****.
(4.67, 3 ratings)
Chao Zhong offers an overview of a new predictive model for customer lifetime value (LTV) in a cloud-computing business. This model is also the first known application of the Fader RFM approach to a cloud business. It is a Bayesian approach that predicts a customer's LTV with a symmetric absolute percentage error (SAPE) of only 3% on an out-of-time testing dataset.
1:50pm - 2:30pm Thursday, March 16, 2017
Location: 230 A Level: Intermediate
Secondary topics:  ecommerce, Media
Michelangelo D'Agostino (Civis Analytics), Bill Lattner (Civis Analytics)
Average rating: ****.
(4.00, 2 ratings)
How do we know that an advertisement or promotion truly drives incremental revenue? Michelangelo D'Agostino and Bill Lattner share their experience developing machine-learning techniques for predicting treatment responsiveness from randomized controlled experiments and explore the use of these “persuasion” models at scale in politics, social good, and marketing.
1:50pm - 2:30pm Thursday, March 16, 2017
Location: 230 C
Secondary topics:  Media, Text
Grace Huang (Pinterest)
Average rating: ***..
(3.33, 3 ratings)
With over 75 billion pins, the Pinterest content corpus is one of the largest human-curated collections of ideas. Grace Huang walks you through the lifecycle of a piece of content in Pinterest, a portfolio of metrics developed to monitor the health of the content corpus, and the story of creating a cross-functional initiative to preserve a healthy, sustainable content ecosystem.
1:50pm - 2:30pm Thursday, March 16, 2017
Location: 210 A/E Level: Intermediate
Matt Brandwein (Cloudera), Tristan Zajonc (Cloudera)
Average rating: ***..
(3.33, 3 ratings)
Self-service data science is easier said than delivered, especially on Apache Hadoop. Most organizations struggle to balance the diverging needs of the data scientist, data engineer, operator, and architect. Matt Brandwein and Tristan Zajonc cover the underlying root causes of these challenges and introduce new capabilities being developed to make self-service data science a reality.
1:50pm - 2:30pm Thursday, March 16, 2017
Location: LL21 A
Ira Cohen (Anodot)
Average rating: ***..
(3.50, 2 ratings)
Apps have so many moving parts that a simple change to one element can cause havoc somewhere else. The resulting issues annoy users and cause revenue leaks. Ira Cohen outlines ways to use anomaly detection to monitor all areas of an app, from the code to the user behavior to partner integrations and more, to fully optimize your mobile app.
2:40pm - 3:20pm Thursday, March 16, 2017
Location: 230 A Level: Intermediate
Secondary topics:  Media
Michelle Casbon (Qordoba)
Average rating: *****
(5.00, 1 rating)
Supporting multiple locales involves the maintenance and generation of localized strings. Michelle Casbon explains how machine learning and natural language processing are applied to the underserved domain of localization using primarily open source tools, including Scala, Apache Spark, Apache Cassandra, and Apache Kafka.
2:40pm - 3:20pm Thursday, March 16, 2017
Location: 230 C Level: Beginner
Secondary topics:  Architecture, Data Platform, ecommerce
Gleicon Moraes (luc.id), Arthur Grava (Luizalabs)
Average rating: ****.
(4.00, 3 ratings)
Gleicon Moraes and Arthur Grava share war stories about developing and deploying a cloud-based large-scale recommender system for a top-three Brazilian ecommerce company. The system, which uses Cassandra and graph traversal, led to a more than 15% increase in sales.
2:40pm - 3:20pm Thursday, March 16, 2017
Location: 210 C/G Level: Intermediate
Secondary topics:  ecommerce, Retail
Eric Colson (Stitch Fix)
Average rating: ****.
(4.36, 14 ratings)
Data scientists blend the skills of statisticians, software engineers, and domain experts to create new roles. Data science isn't merely an amalgam of disciplines but rather a gestalt that synthesizes the ethos of various fields. This merits new thinking when it comes to organization. Eric Colson explores some novel (and often unintuitive) ways to unleash the value of your data science team.
2:40pm - 3:20pm Thursday, March 16, 2017
Location: LL21 A Level: Intermediate
Christopher Bergh (DataKitchen), Gil Benghiat (DataKitchen)
Average rating: ****.
(4.50, 2 ratings)
Data analysts, data scientists, and data engineers are already working on teams delivering insight and analysis, but how do you get the team to support experimentation and insight delivery without ending up in an IT versus data engineer versus data scientist war? Christopher Bergh and Gil Benghiat present the seven shocking steps to get these groups of people working together.
4:20pm - 5:00pm Thursday, March 16, 2017
Location: 230 A Level: Intermediate
Eduardo Arino de la Rubia (Domino Data Lab)
Average rating: ****.
(4.71, 7 ratings)
The promise of the automated statistician is as old as statistics itself. Eduardo Arino de la Rubia explores the tools created by the open source community to free data scientists from tedium, enabling them to work on the high-value aspects of insight creation. Along the way, Eduardo compares open source tools such as TPOT and auto-sklearn and discusses their place in the data science workflow.
4:20pm - 5:00pm Thursday, March 16, 2017
Location: 230 C Level: Advanced
Secondary topics:  Hardcore Data Science
Frederick Reiss (IBM Spark Technology Center), Arvind Surve (IBM)
Many iterative machine-learning algorithms can only operate efficiently when a large matrix of training data fits in main memory. Frederick Reiss and Arvind Surve offer an overview of compressed linear algebra, a technique for compressing training data and performing key operations in the compressed domain that lets you build models over big data with small machines.