Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Schedule: Hardcore Data Science sessions

11:50am–12:30pm Wednesday, March 15, 2017
Data science & advanced analytics
Location: 210 C/G Level: Intermediate
Anirudh Koul (Microsoft)
Average rating: ****.
(4.20, 5 ratings)
Over the last few years, convolutional neural networks (CNNs) have risen in popularity, especially in computer vision. Anirudh Koul explains how to bring the power of deep learning to memory- and power-constrained devices like smartphones and drones.
11:50am–12:30pm Wednesday, March 15, 2017
Jure Leskovec (Pinterest)
Average rating: ****.
(4.82, 11 ratings)
Pinterest built a flexible, graph-based system for making recommendations to users in real time. The system uses random walks on a user-and-object graph in order to make personalized recommendations to 100+ million Pinterest users out of a catalog of over a billion items. Jure Leskovec explains how Pinterest built its modern recommendation engine and the lessons learned along the way.
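To make the random-walk idea in this abstract concrete, here is a minimal sketch in Python. It is not Pinterest's system: the function names, the restart probability, and the toy data are all assumptions chosen for illustration. The walk alternates between users and items on a bipartite graph, occasionally restarts at the query user, and ranks unseen items by how often the walk visits them.

```python
import random
from collections import Counter, defaultdict


def build_graph(interactions):
    """Build a bipartite adjacency map from (user, item) pairs."""
    user_items = defaultdict(list)
    item_users = defaultdict(list)
    for user, item in interactions:
        user_items[user].append(item)
        item_users[item].append(user)
    return user_items, item_users


def recommend(user, user_items, item_users,
              steps=2000, restart=0.3, top_k=3, seed=0):
    """Rank unseen items by visit count of a random walk with restart.

    All parameter values are illustrative assumptions, not tuned settings.
    """
    if not user_items.get(user):
        return []  # no interactions to start a walk from
    rng = random.Random(seed)
    visits = Counter()
    current = user
    for _ in range(steps):
        if rng.random() < restart:
            current = user  # jump back to the query user
        item = rng.choice(user_items[current])  # user -> item hop
        visits[item] += 1
        current = rng.choice(item_users[item])  # item -> user hop
    seen = set(user_items[user])
    ranked = [item for item, _ in visits.most_common() if item not in seen]
    return ranked[:top_k]
```

With the toy interactions `[("u1", "a"), ("u1", "b"), ("u2", "b"), ("u2", "c"), ("u3", "d")]`, a walk from `u1` can reach item `c` through the shared item `b`, but it can never reach `d`, so `d` is never recommended to `u1`.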
2:40pm–3:20pm Wednesday, March 15, 2017
Data science & advanced analytics
Location: 230 C Level: Intermediate
Robert Grossman (University of Chicago)
Average rating: ***..
(3.73, 11 ratings)
When there is a strong signal in a large dataset, many machine-learning algorithms will find it. On the other hand, when the effect is weak and the data is large, there are many ways to discover an effect that is in fact nothing more than noise. Robert Grossman shares best practices so that you will not be accused of p-hacking.
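The point about discovering effects that are only noise can be illustrated with a small simulation. This is a sketch and not taken from the talk: it runs many z-tests on pure noise, assumes unit variance, and uses the Bonferroni correction as one common safeguard. The function names and default values are hypothetical.

```python
import math
import random


def z_test_p_value(sample):
    """Two-sided p-value for H0: mean == 0, assuming unit variance."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def count_discoveries(num_tests=1000, sample_size=50, alpha=0.05, seed=0):
    """Test many pure-noise samples; return (naive, Bonferroni) hit counts."""
    rng = random.Random(seed)
    p_values = [
        z_test_p_value([rng.gauss(0.0, 1.0) for _ in range(sample_size)])
        for _ in range(num_tests)
    ]
    # Every "discovery" here is false, because the data is noise by design.
    naive = sum(p < alpha for p in p_values)
    bonferroni = sum(p < alpha / num_tests for p in p_values)
    return naive, bonferroni
```

With 1,000 tests at a threshold of 0.05, about 50 false discoveries are expected from the uncorrected count, while the Bonferroni-corrected count is expected to stay near zero.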
4:20pm–5:00pm Wednesday, March 15, 2017
Data science & advanced analytics
Location: 230 C Level: Advanced
Ted Dunning (MapR)
Average rating: ****.
(4.50, 6 ratings)
Ted Dunning offers an overview of tensor computing, covering in practical terms the high-level principles behind tensor computing systems, and explains how it can be put to good use in a variety of settings beyond training deep neural networks (the most common use case).
5:10pm–5:50pm Wednesday, March 15, 2017
Data science & advanced analytics
Location: 210 C/G Level: Intermediate
Stephen Merity (Salesforce Research)
Average rating: ****.
(4.67, 3 ratings)
While attention and memory have become important components in many state-of-the-art deep learning architectures, it's not always obvious where they may be most useful. Even more challenging, such models can be very computationally intensive in production. Stephen Merity discusses the most recent techniques, the tasks in which they show the most promise, and when they make sense in production systems.
5:10pm–5:50pm Wednesday, March 15, 2017
Alice Zheng (1977)
Average rating: ****.
(4.50, 6 ratings)
In the machine-learning pipeline, feature engineering takes up the majority of the time yet is seldom discussed. Alice Zheng leads a tour of popular feature engineering methods for text, logs, and images, giving you an intuitive and actionable understanding of the tricks of the trade.
11:00am–11:40am Thursday, March 16, 2017
Dorna Bandari (Jetlore)
Average rating: ****.
(4.00, 2 ratings)
Most internet companies record a constant stream of logs as a user interacts with their application. Depending on the complexity of the application, the logs can be extremely difficult to decipher. Dorna Bandari presents a novel NLP-based method for clustering user sessions in consumer internet applications, which has proved extremely effective in driving both strategy and personalization.
11:00am–11:40am Thursday, March 16, 2017
Anima Anandkumar (UC Irvine)
Average rating: ****.
(4.67, 3 ratings)
Anima Anandkumar demonstrates how to use preconfigured Deep Learning AMIs and CloudFormation templates on AWS to help speed up deep learning development and shares use cases in computer vision and natural language processing.
11:50am–12:30pm Thursday, March 16, 2017
Data science & advanced analytics
Location: 230 A Level: Beginner
Mike Lee Williams (Cloudera Fast Forward Labs)
Average rating: ***..
(3.80, 5 ratings)
Real-world data is incomplete and imperfect. The right way to handle it is with Bayesian inference. Mike Lee Williams demonstrates how probabilistic programming languages hide the gory details of this elegant but potentially tricky approach, making a powerful statistical method easy and enabling rapid iteration and new kinds of data-driven products.
11:50am–12:30pm Thursday, March 16, 2017
Spark & beyond
Location: 210 A/E
Joseph Bradley (Databricks), Tim Hunter (Databricks, Inc.)
Average rating: ***..
(3.75, 4 ratings)
Joseph Bradley and Tim Hunter share best practices for building deep learning pipelines with Apache Spark, covering cluster setup, data ingest, tuning clusters, and monitoring jobs, all demonstrated using Google’s TensorFlow library.
11:50am–12:30pm Thursday, March 16, 2017
Carlos Guestrin (University of Washington | Apple)
Average rating: *****
(5.00, 4 ratings)
Carlos Guestrin offers an overview of anchors and aLIME, a novel, high-precision technique for explaining the predictions of any classifier in an interpretable and faithful manner. He demonstrates the flexibility of these methods by explaining different models for text, image classification, and visual question answering and explores the usefulness of explanations via novel experiments.
1:50pm–2:30pm Thursday, March 16, 2017
Data science & advanced analytics
Location: LL21 E/F Level: Advanced
Chao Zhong (Microsoft)
Average rating: ****.
(4.67, 3 ratings)
Chao Zhong offers an overview of a new predictive model for customer lifetime value (LTV) in a cloud-computing business. The model is also the first known application to a cloud business of the Fader RFM approach, a Bayesian method, and it predicts a customer's LTV with a symmetric absolute percentage error (SAPE) of only 3% on an out-of-time testing dataset.
2:40pm–3:20pm Thursday, March 16, 2017
Platform Security and Cybersecurity, Spark & beyond
Location: LL21 C/D Level: Advanced
Alexander Ulanov (Hewlett Packard Labs), Manish Marwah (Hewlett Packard Labs)
Alexander Ulanov and Manish Marwah explain how they implemented a scalable version of loopy belief propagation (BP) for Apache Spark and applied it to large web-crawl data to infer the probability that a website is malicious. Applications of BP include fraud detection, malware detection, computer vision, and customer retention.
2:40pm–3:20pm Thursday, March 16, 2017
Real-time applications
Location: 210 A/E Level: Advanced
Jeffrey Yau (Silicon Valley Data Science)
Average rating: ***..
(3.20, 5 ratings)
Thanks to frameworks such as Spark's GraphX and GraphFrames, graph-based techniques are increasingly applicable to anomaly, outlier, and event detection in time series. Jeffrey Yau offers an overview of applying graph-based techniques in fraud detection, IoT processing, and financial data and outlines the benefits of graphs relative to other techniques.
4:20pm–5:00pm Thursday, March 16, 2017
Data science & advanced analytics
Location: 230 C Level: Advanced
Many iterative machine-learning algorithms can operate efficiently only when a large matrix of training data fits in main memory. Frederick Reiss and Arvind Surve offer an overview of compressed linear algebra, a technique for compressing training data and performing key operations in the compressed domain that lets you build models over big data with small machines.
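As a rough illustration of what "performing key operations in the compressed domain" can mean, the sketch below stores each matrix column as run-length-encoded (value, run length) pairs and computes a matrix-vector product without decompressing. Run-length encoding is an assumption made here for simplicity; the abstract does not say which encodings the speakers use, and the function names are hypothetical.

```python
def rle_encode(column):
    """Run-length encode a column as a list of (value, run_length) pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, length) for value, length in runs]


def compressed_matvec(rle_columns, weights, num_rows):
    """Compute X @ w where X is stored column-wise in run-length form."""
    result = [0.0] * num_rows
    for runs, weight in zip(rle_columns, weights):
        row = 0
        for value, length in runs:
            if value != 0 and weight != 0:
                # One multiplication per run instead of one per cell.
                contribution = value * weight
                for i in range(row, row + length):
                    result[i] += contribution
            row += length
    return result
```

For columns `[1, 1, 1, 0, 0]` and `[2, 2, 3, 3, 3]` with weights `[10, 1]`, the compressed product matches the dense result `[12.0, 12.0, 13.0, 3.0, 3.0]`, while runs of zeros are skipped entirely.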