Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Schedule: Data science & advanced analytics sessions

9:00am–5:00pm Tuesday, 09/27/2016
Location: 1 E 10/1 E11
Michael Li (The Data Incubator), Robert Schroll (The Data Incubator)
Average rating: ***..
(3.00, 6 ratings)
Tianhui Li and Robert Schroll of the Data Incubator offer a foundation in building intelligent business applications using machine learning, walking you through all the steps to prototyping and production—data cleaning, feature engineering, model building and evaluation, and deployment—and diving into an application for anomaly detection and a personalized recommendation engine. Read more.
9:00am–12:30pm Tuesday, 09/27/2016
Location: 3D 12 Level: Intermediate
Tags: pydata
Andreas Mueller (Columbia University)
Average rating: ****.
(4.00, 6 ratings)
Scikit-learn, which provides easy-to-use interfaces to perform advanced analysis and build powerful predictive models, has emerged as one of the most popular open source machine-learning toolkits. Using scikit-learn and Python as examples, Andreas Mueller offers an overview of basic concepts of machine learning, such as supervised and unsupervised learning, cross-validation, and model selection. Read more.
9:00am–12:30pm Tuesday, 09/27/2016
Location: 3D 10 Level: Intermediate
Tags: r-lang
Garrett Grolemund (RStudio), Nathan Stephens (RStudio, Inc.)
Average rating: ****.
(4.20, 5 ratings)
Garrett Grolemund and Nathan Stephens explore the new sparklyr package by RStudio, which provides a familiar interface between the R language and Apache Spark and communicates with the Spark SQL and the Spark ML APIs so R users can easily manipulate and analyze data at scale. Read more.
1:30pm–5:00pm Tuesday, 09/27/2016
Location: 1 E 07/1 E 08 Level: Intermediate
Martin Wicke (Google), Joshua Gordon (Google)
Average rating: ***..
(3.47, 15 ratings)
Martin Wicke and Josh Gordon offer hands-on experience training and deploying a machine-learning system using TensorFlow, a popular open source library. You'll learn how to build machine-learning systems from simple classifiers to complex image-based models as well as how to deploy models in production using TensorFlow Serving. Read more.
1:30pm–5:00pm Tuesday, 09/27/2016
Location: 1 E 09 Level: Intermediate
Tags: pydata
Juliet Hougland (Cloudera), Sean Owen (Cloudera)
Average rating: ***..
(3.67, 3 ratings)
Juliet Hougland and Sean Owen offer a practical overview of the basics of using Python data tools with a Hadoop cluster, covering HDFS connectivity and dealing with raw data files, running SQL queries with a SQL-on-Hadoop system like Apache Hive or Apache Impala (incubating), and using Apache Spark to write more complex analytical jobs. Read more.
1:30pm–5:00pm Tuesday, 09/27/2016
Location: 3D 12 Level: Intermediate
Bryan Van de Ven (Continuum Analytics), Sarah Bird (Continuum Analytics)
Average rating: ****.
(4.00, 4 ratings)
Bryan Van de Ven and Sarah Bird demonstrate how to build intelligent apps in a week with Bokeh, Python, and optimization. Read more.
11:20am–12:00pm Wednesday, 09/28/2016
Location: Hall 1C Level: Intermediate
Carlos Guestrin (Apple | University of Washington)
Average rating: ****.
(4.45, 20 ratings)
Despite widespread adoption, machine-learning models remain mostly black boxes, making it very difficult to understand the reasons behind a prediction. Such understanding is fundamentally important to assess trust in a model before we take actions based on a prediction or choose to deploy a new ML service. Carlos Guestrin offers a general approach for explaining predictions made by any ML model. Read more.
1:15pm–1:55pm Wednesday, 09/28/2016
Location: Hall 1C Level: Intermediate
Average rating: ****.
(4.38, 8 ratings)
Data science has always been a focus at eHarmony, but recently more business units have needed data-driven models. Jonathan Morra introduces Aloha, an open source project that allows the modeling group to quickly deploy type-safe accurate models to production, and explores how eHarmony creates models with Apache Spark and how it uses them. Read more.
2:05pm–2:45pm Wednesday, 09/28/2016
Location: Hall 1C Level: Beginner
June Andrews (Wise / GE Digital)
Average rating: ****.
(4.31, 13 ratings)
Clustering algorithms produce vectors of information, which are almost surely difficult to interpret. These are then laboriously translated by data scientists into insights for influencing product and executive decisions. June Andrews offers an overview of a human-in-the-loop method used at Pinterest and LinkedIn that has led to fast, accurate, and pertinent human-readable insights. Read more.
2:55pm–3:35pm Wednesday, 09/28/2016
Location: Hall 1C Level: Beginner
Eui-Hong Han (The Washington Post), Shuguang Wang (The Washington Post)
Average rating: ****.
(4.20, 5 ratings)
Predicting which stories will become popular is an invaluable tool for newsrooms. Eui-Hong Han and Shuguang Wang explain how the Washington Post predicts what stories on its site will be popular with readers and share the challenges they faced in developing the tool and metrics on how they refined the tool to increase accuracy. Read more.
4:35pm–5:15pm Wednesday, 09/28/2016
Location: 3D 10 Level: Beginner
Crystal Valentine (MapR Technologies)
Average rating: ****.
(4.50, 4 ratings)
Crystal Valentine explains how the large graph-processing frameworks that run on Hadoop can be used to detect significantly mutated protein signaling pathways in cancer genomes through a probabilistic analysis of large protein-protein interaction networks, using techniques similar to those used in social network analysis algorithms. Read more.
4:35pm–5:15pm Wednesday, 09/28/2016
Location: Hall 1C Level: Intermediate
Josh Patterson (Skymind), Dave Kale (Skymind)
Average rating: *****
(5.00, 1 rating)
Can machines be creative? Josh Patterson and David Kale offer a practical demonstration—an interactive Twitter bot that users can ping to receive a response dynamically generated by a conditional recurrent neural net implemented using DL4J—that suggests the answer may be yes. Read more.
5:25pm–6:05pm Wednesday, 09/28/2016
Location: 3D 10 Level: Non-technical
Tags: ai
Mike Lee Williams (Cloudera Fast Forward Labs)
Average rating: ****.
(4.80, 10 ratings)
Our ability to extract meaning from unstructured text data has not kept pace with our ability to produce and store it, but recent breakthroughs in recurrent neural networks are allowing us to make exciting progress in computer understanding of language. Building on these new ideas, Michael Williams explores three ways to summarize text and presents prototype products for each approach. Read more.
5:25pm–6:05pm Wednesday, 09/28/2016
Location: Hall 1C Level: Intermediate
Martin Wicke (Google)
Average rating: ***..
(3.50, 2 ratings)
Much of the success of deep learning in recent years can be attributed to scale—bigger datasets and more computing power—but scale can quickly become a problem. Distributed, asynchronous computing in heterogeneous environments is complex, hard to debug, and hard to profile and optimize. Martin Wicke demonstrates how to automate or abstract away such complexity, using TensorFlow as an example. Read more.
11:20am–12:00pm Thursday, 09/29/2016
Location: 3D 12 Level: Intermediate
Ihab Ilyas (University of Waterloo)
Average rating: *****
(5.00, 2 ratings)
Machine-learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas explains why leveraging data semantics and domain-specific knowledge is key in delivering the optimizations necessary for truly scalable ML curation solutions. Read more.
11:20am–12:00pm Thursday, 09/29/2016
Location: Hall 1C Level: Beginner
Yishay Carmiel (IntelligentWire)
Average rating: ****.
(4.43, 7 ratings)
Deep learning has taken us a few steps further toward achieving AI for a man-machine interface. However, deep learning technologies like speech recognition and natural language processing remain a mystery to many. Yishay Carmiel reviews the history of deep learning, the impact it's made, recent breakthroughs, interesting solved and open problems, and what's in store for the future. Read more.
11:20am–12:00pm Thursday, 09/29/2016
Location: 3D 08 Level: Non-technical
Tags: iot
Mike Stringer (Datascope Analytics)
Average rating: *....
(1.89, 9 ratings)
We're likely just at the beginning of data science. The people and things that are starting to be equipped with sensors will enable entirely new classes of problems that will have to be approached more scientifically. Mike Stringer outlines some of the issues that may arise for business, for data scientists, and for society. Read more.
1:15pm–1:55pm Thursday, 09/29/2016
Location: 3D 10 Level: Intermediate
Tags: r-lang
Xiangrui Meng (Databricks)
Average rating: ****.
(4.00, 2 ratings)
Xiangrui Meng explores recent community efforts to extend SparkR for scalable advanced analytics—including summary statistics, single-pass approximate algorithms, and machine-learning algorithms ported from Spark MLlib—and shows how to integrate existing R packages with SparkR to accelerate existing R workflows. Read more.
1:15pm–1:55pm Thursday, 09/29/2016
Location: Hall 1C Level: Intermediate
Tags: ai
David Talby (Pacific AI), Claudiu Branzan (Accenture)
Average rating: ****.
(4.00, 1 rating)
David Talby and Claudiu Branzan lead a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records. Infrastructure components include Kafka, Spark Streaming, Spark, Titan, and Elasticsearch; data science components include custom UIMA annotators, curated taxonomies, machine-learned dynamic ontologies, and real-time inferencing. Read more.
2:05pm–2:45pm Thursday, 09/29/2016
Location: 1 E 10/1 E11 Level: Beginner
Amit Kapoor (narrativeVIZ)
Average rating: ****.
(4.67, 3 ratings)
Though visualization is used in data science to understand the shape of the data, it's not widely used for statistical models, which are evaluated based on numerical summaries. Amit Kapoor explores model visualization, which aids in understanding the shape of the model, the impact of parameters and input data on the model, the fit of the model, and where it can be improved. Read more.
2:05pm–2:45pm Thursday, 09/29/2016
Location: 3D 10 Level: Beginner
Amitai Armon (Intel), Nir Lotan (Intel)
Average rating: ****.
(4.50, 2 ratings)
Amitai Armon and Nir Lotan outline a new, free software tool that enables the creation of deep learning models quickly and easily. The tool is based on existing deep learning frameworks and incorporates extensive optimizations that provide high performance on standard CPUs. Read more.
2:05pm–2:45pm Thursday, 09/29/2016
Location: Hall 1C Level: Intermediate
Tags: media, politics
Amir Hajian (Thomson Reuters), Khaled Ammar (Thomson Reuters), Alex Constandache (Thomson Reuters)
Average rating: ***..
(3.75, 4 ratings)
Amir Hajian, Khaled Ammar, and Alex Constandache offer an approach to mining a large dataset to predict the electability of hypothetical candidates in the US presidential election race, using machine learning, natural language processing, and deep learning on an infrastructure that includes Spark and Elasticsearch, which serves as the backbone of the mobile game White House Run. Read more.
2:55pm–3:35pm Thursday, 09/29/2016
Location: 3D 10 Level: Intermediate
Brendan Herger (Capital One)
Average rating: ****.
(4.80, 5 ratings)
Many areas of applied machine learning require models optimized for rare occurrences (class imbalances) and for users actively attempting to subvert the system (adversaries). Brendan Herger offers an overview of multiple published techniques that specifically attempt to address these issues and discusses lessons learned by the Data Innovation Lab at Capital One. Read more.
2:55pm–3:35pm Thursday, 09/29/2016
Location: Hall 1C Level: Intermediate
Danielle Dean (iRobot), Shaheen Gauher (Microsoft)
Average rating: ****.
(4.20, 5 ratings)
In the realm of predictive maintenance, the event of interest is an equipment failure. In real scenarios, this is usually a rare event. Unless the data collection has been taking place over a long period of time, the data will have very few of these events or, in the worst case, none at all. Danielle Dean and Shaheen Gauher discuss the various ways of building and evaluating models for such data. Read more.
2:55pm–3:35pm Thursday, 09/29/2016
Location: 3D 08 Level: Intermediate
Kaz Sato (Google)
Average rating: *****
(5.00, 4 ratings)
The largest challenge for deep learning is scalability. Google has built a large-scale neural network in the cloud and is now sharing that power. Kazunori Sato introduces pretrained ML services, such as the Cloud Vision API and the Speech API, and explores how TensorFlow and Cloud Machine Learning can accelerate custom model training 10x–40x with Google's distributed training infrastructure. Read more.
4:35pm–5:15pm Thursday, 09/29/2016
Location: Hall 1C Level: Intermediate
Josh Lemaitre (Thomson Reuters)
Average rating: *****
(5.00, 1 rating)
How can the value of a patent be quantified? Josh Lemaitre explores how Thomson Reuters Labs approached this problem by applying machine learning to the patent corpus in an effort to predict those most likely to be enforced via litigation. Josh covers infrastructure, methods, challenges, and opportunities for future research. Read more.