Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY

Data Science & Advanced Analytics conference sessions

Inside the world of data practitioners, from the hard science of the latest algorithms and advances in machine learning to the thorny issues of cultural change and team-building.

Tuesday, September 29

9:00am–5:00pm Tuesday, 09/29/2015
Location: 1 E16 / 1 E17
Garrett Grolemund (RStudio), Yihui Xie (RStudio, Inc.), Nathan Stephens (RStudio, Inc.), Randall Pruim (Calvin College)
Average rating: ****.
(4.20, 15 ratings)
From advanced visualization, collaboration, and reproducibility to data manipulation, R Day at Strata covers a raft of current topics that analysts and R users need to pay attention to. The R Day tutorials come from leading luminaries and R committers, the folks keeping the R ecosystem abreast of the challenges facing analysts and others who work with data.
9:00am–12:30pm Tuesday, 09/29/2015
Location: 3D 03/10 Level: Advanced
Sean Owen (Cloudera), Juliet Hougland (Cloudera), Sandy Ryza (Clover Health)
Average rating: **...
(2.96, 24 ratings)
In this tutorial, attendees will get a taste of how large-scale data science techniques and technologies developed for the consumer internet can be applied in the world of finance. We will guide an exploration of the relationship between traffic on Wikipedia pages and the movement of stock prices.
9:00am–5:00pm Tuesday, 09/29/2015
Location: 1 E6 / 1 E7 Level: Intermediate
Average rating: ***..
(3.63, 19 ratings)
This hands-on, beginner-friendly tutorial provides a quick start to building intelligent business applications using machine learning. Learn about machine learning basics, feature engineering, recommender systems, and deep learning. The program includes hands-on portions to build and deploy large-scale machine learning applications.
9:00am–5:00pm Tuesday, 09/29/2015
Location: 1 E12/ 1 E13
Travis Oliphant (Anaconda), Peter Wang (Anaconda), Kyle Kelley (Netflix), Andrew Odewahn (O'Reilly Media), Paige Bailey (Microsoft), Jeff Reback (Continuum Analytics), Andy Terrel (NumFOCUS), Bryan Van de Ven (Continuum Analytics), Sarah Bird (Aptivate), James Powell (NumFOCUS), Phil Cloud (Continuum), Jason Grout (Bloomberg LP), Chris Colbert (Anaconda Powered by Continuum Analytics), Owen Zhang (DataRobot), Peter Prettenhofer (DataRobot), Damon McDougall (UT Austin), Michael Droettboom (Space Telescope Science Institute), Jim Crist (Continuum Analytics), Benjamin Zaitlen (Anaconda), Andreas Mueller (NYU, scikit-learn)
Average rating: ***..
(3.50, 10 ratings)
Python has become an increasingly important part of the data engineering and analytics tool landscape. PyData at Strata provides in-depth coverage of the tools and techniques gaining traction with the data audience, including IPython Notebook, NumPy/matplotlib for visualization, SciPy, scikit-learn, and how to scale Python performance, including how to handle large, distributed data sets.

Wednesday, September 30

11:20am–12:00pm Wednesday, 09/30/2015
Location: 1 E8 / 1 E9 Level: Intermediate
Wes McKinney (Two Sigma Investments)
Average rating: ***..
(3.70, 10 ratings)
Many data science and data analytics applications are written in Python or R, but developing and deploying these applications at scale or in production is a pain point for many users. We will discuss our new efforts to bridge the gap between familiar in-memory data tools and distributed data management systems using Python and Impala.
1:15pm–1:35pm Wednesday, 09/30/2015
Location: 1 E8 / 1 E9 Level: Non-technical
Russell Jurney (Data Syndrome)
Average rating: ***..
(3.50, 2 ratings)
The talk covers the development of the O'Reilly Media Report, "Mapping big data: A data driven market report."
1:35pm–1:55pm Wednesday, 09/30/2015
Location: 1 E8 / 1 E9 Level: Non-technical
Lauralea Banks Edwards (Washington State University)
Average rating: ****.
(4.00, 2 ratings)
This presentation identifies some of the areas in data creation and analytics where we perpetuate the simplistic representation of the world. It uses queer theory to demonstrate alternative ways of creating and analyzing data to take non-normative cases into consideration.
2:05pm–2:25pm Wednesday, 09/30/2015
Location: 1 E8 / 1 E9 Level: Intermediate
Joy Thomas (Apigee), Jagdish Chand (Apigee)
Average rating: ***..
(3.30, 10 ratings)
Customer journey analytics systems of large corporations must handle a great volume of events on a daily basis. A priori aggregation used by early systems often caused signal loss due to ever-changing customer activity rates. We will present a new method that identifies paths inherent in raw cross-channel data, and that captures traffic patterns via nodes of interest across all channels of data.
2:05pm–2:45pm Wednesday, 09/30/2015
Location: Hall B
DJ Patil (White House Office of Science and Technology Policy)
Average rating: **...
(2.81, 16 ratings)
DJ Patil, U.S. Chief Data Scientist at the White House Office of Science and Technology Policy.
2:25pm–2:45pm Wednesday, 09/30/2015
Location: 1 E8 / 1 E9 Level: Intermediate
Albert Bifet (Télécom ParisTech), Silviu Maniu (Huawei)
Average rating: ***..
(3.62, 16 ratings)
Real-time analytics are becoming increasingly important due to the large amount of data that is being created continuously. Drawing from our experiences at Huawei Noah's Ark Lab, we present StreamDM, a new open source data mining and machine learning library designed on top of Spark Streaming. We will show its advanced methods, and how easily it can be used and extended.
2:55pm–3:35pm Wednesday, 09/30/2015
Location: 1 E8 / 1 E9 Level: Non-technical
Tags: media
Juan Huerta (Dow Jones)
Average rating: ****.
(4.25, 20 ratings)
In this presentation I will describe how data science is helping the Wall Street Journal produce better journalism strategies, personalize our subscribers’ experience, and optimize revenue and overall customer engagement.
2:55pm–3:35pm Wednesday, 09/30/2015
Location: Hall B
DJ Patil (White House Office of Science and Technology Policy)
Average rating: **...
(2.71, 7 ratings)
DJ Patil, U.S. Chief Data Scientist at the White House Office of Science and Technology Policy.
4:35pm–5:15pm Wednesday, 09/30/2015
Location: 1 E8 / 1 E9 Level: Intermediate
Marcel Kornacker (Cloudera), Josh Wills (Cloudera), Alexander Behm (Cloudera)
Average rating: ***..
(3.25, 12 ratings)
In this talk, we will explain how data scientists use nested data structures to increase analytic productivity. We will use two well-known relational schemas, TPC-H and Twitter, to demonstrate how to simplify data science workloads with nested schemas. Also, we will outline best practices for converting flat relational schemas into nested ones, and give examples of data science-style analysis.
5:25pm–6:05pm Wednesday, 09/30/2015
Location: 1 E8 / 1 E9 Level: Intermediate
Raphael Lee (Airbnb), Victor Vazquez (Airbnb)
Average rating: ****.
(4.00, 8 ratings)
More users than ever are accessing web applications from multiple devices. When logged-out users receive mixed experiment treatments, weird and wacky results can start appearing in your experiment analyses. Find out what we've learned about this problem at Airbnb and how our data scientists and engineers teamed up to solve it.

Thursday, October 1

11:20am–12:00pm Thursday, 10/01/2015
Location: 1 E8 / 1 E9 Level: Intermediate
Robert Grossman (University of Chicago)
Average rating: ****.
(4.25, 16 ratings)
Large datasets have large numbers of anomalies, and the challenge is not just identifying anomalies but rank ordering them to create alerts, so that data scientists can examine the most interesting ones. We discuss three case studies that integrate machine learning and data engineering, and extract six techniques for identifying anomalies and rank ordering them by their potential significance.
1:15pm–1:35pm Thursday, 10/01/2015
Location: 1 E8 / 1 E9 Level: Intermediate
Average rating: ***..
(3.18, 11 ratings)
Reaching 100,000,000 antivirus users was a big challenge for Avira, but we managed to achieve the goal. The challenge that arises now is to convince our users to stay with us, by offering the best possible experience to each one of them. In this presentation we will share the entire user churn prevention flow, from building custom surveys to using machine learning algorithms.
1:35pm–1:55pm Thursday, 10/01/2015
Location: 1 E8 / 1 E9 Level: Intermediate
Thomas Wiecki (Quantopian)
Average rating: ***..
(3.94, 16 ratings)
Probabilistic programming has already revolutionized machine learning and will have a similar impact on the emerging field of data science. By automating the inference process, it dramatically increases the number of people who can build complex Bayesian models custom-made for the specific problem at hand, and it makes experts vastly more effective in devising new machine learning methods.
2:05pm–2:25pm Thursday, 10/01/2015
Location: 1 E8 / 1 E9 Level: Intermediate
Ihab Ilyas (University of Waterloo | Tamr)
Average rating: ***..
(3.50, 8 ratings)
Machine learning tools offer promise in helping solve data curation problems. While the principles are well-understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Leveraging data semantics and domain-specific knowledge is key in delivering the optimizations necessary for truly scalable ML curation solutions.
2:25pm–2:45pm Thursday, 10/01/2015
Location: 1 E8 / 1 E9 Level: Non-technical
Tags: featured
Allen Downey (Olin College of Engineering)
Average rating: ****.
(4.69, 13 ratings)
Bayesian methods are well-suited for business applications because they provide concrete guidance for decision-making under uncertainty. But many data science teams lack the background to take advantage of these methods. In this presentation I will explain the advantages and suggest ways for teams to develop skills and add Bayesian methods to their toolkit.
2:55pm–3:35pm Thursday, 10/01/2015
Location: 1 E8 / 1 E9 Level: Non-technical
Tags: geospatial
Brett Goldstein (University of Chicago)
Average rating: ****.
(4.80, 5 ratings)
Spatial analytics is often hampered by the arbitrary choice of units, allowing local heterogeneity to obscure true patterns. A new “smart clustering” technique lets us use large quantities of open municipal data to literally redraw city maps to reflect facts on the ground, not administrative boundaries. This talk will explain what smart clusters are and the promise they hold for urban science.
4:35pm–5:15pm Thursday, 10/01/2015
Location: 1 E8 / 1 E9 Level: Intermediate
Bar Ifrach (Airbnb)
Average rating: ****.
(4.00, 5 ratings)
This talk describes the development of a machine learning model that infers Airbnb host preferences for accommodation requests based on their past behavior. The model is used to surface likely matches more prominently in Airbnb’s search results. In our A/B testing the model showed about a 3.75% increase in booking conversion, resulting in many more trips on Airbnb.
4:35pm–5:15pm Thursday, 10/01/2015
Location: 1 E16 / 1 E17 Level: Non-technical
Vasant Dhar (NYU)
Average rating: ***..
(3.50, 2 ratings)
Financial markets generate massive amounts of data from which machines can, in principle, learn to invest with minimal initial guidance from humans. I contrast human and machine strengths and weaknesses in making investment decisions.