Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Schedule: Automation in data science and big data sessions

As the use of machine learning and analytics becomes more widespread, we’re beginning to see tools that allow data scientists and data engineers to scale and tackle many more problems and maintain more systems. This includes automation tools for the many stages involved in data science, including data preparation, feature engineering, model selection, and hyperparameter tuning, as well as in data engineering and data operations.

9:00am-5:00pm Tuesday, March 26, 2019
Location: 2022
Alex Kudriashova (Astro Digital), Jonathan Francis (Starbucks), JoLynn Lavin (General Mills), Robin Way (Corios), June Andrews (GE), Kyungtaak Noh (SK Telecom), Taposh DuttaRoy (Kaiser Permanente), Sabrina Dahlgren (Kaiser Permanente), Craig Rowley (Columbia Sportswear), Ambal Balakrishnan (IBM), Benjamin Glicksberg (UCSF), Patrick Lucey (Stats Perform), Rhonda Textor (True Fit)
Hear practical insights from household brands and global companies: the challenges they tackled, approaches they took, and the benefits, and drawbacks, of their solutions.
11:00am-11:40am Wednesday, March 27, 2019
Diego Oppenheimer (Algorithmia)
Average rating: 4.00 (11 ratings)
You've invested heavily in cleaning your data, feature engineering, training, and tuning your model, but now you have to deploy your model into production, and you discover it's a huge challenge. Diego Oppenheimer shares common architectural patterns and best practices from the most advanced organizations that are deploying models for scalability and accessibility.
11:00am-11:40am Wednesday, March 27, 2019
Jian Chang (Alibaba Group), Sanjian Chen (Alibaba Group)
Average rating: 4.50 (4 ratings)
Jian Chang and Sanjian Chen outline the design of the AI engine on Alibaba's TSDB service, which enables fast and complex analytics of large-scale retail data. They then share a successful case study of the Fresh Hema Supermarket, a major “new retail” platform operated by Alibaba Group, highlighting solutions to the major technical challenges in data cleaning, storage, and processing.
11:50am-12:30pm Wednesday, March 27, 2019
Sarah Aerni (Salesforce)
Average rating: 4.25 (4 ratings)
How does Salesforce make data science an Agile partner to over 100,000 customers? Sarah Aerni shares the nuts and bolts of the platform and details the Agile process behind it. From the open source AutoML library TransmogrifAI and experimentation to deployment and monitoring, Sarah covers the tools that make it possible for data scientists to rapidly iterate and adopt a truly Agile methodology.
11:50am-12:30pm Wednesday, March 27, 2019
Tobias Knaup (Mesosphere), Joerg Schad (Suki)
Average rating: 4.50 (2 ratings)
There are many great tutorials for training your deep learning models, but training is only a small part of the overall deep learning pipeline. Tobias Knaup and Joerg Schad offer an introduction to building a complete automated deep learning pipeline, from exploratory analysis through training, model storage, model serving, and monitoring.
4:20pm-5:00pm Wednesday, March 27, 2019
Kelley Rivoire (Stripe)
Average rating: 4.33 (3 ratings)
Production ML applications benefit from reproducible, automated retraining and deployment of ever-more predictive models trained on ever-increasing amounts of data. Kelley Rivoire explains how Stripe built a flexible API for training machine learning models that's used to train thousands of models per week on Kubernetes, supporting automated deployment of new models with improved performance.
5:10pm-5:50pm Wednesday, March 27, 2019
Ting-Fang Yen (DataVisor)
Average rating: 4.00 (3 ratings)
Ting-Fang Yen details an approach for monitoring production machine learning systems that handle billions of requests daily by discovering detection anomalies, such as spurious false positives, as well as gradual concept drift, in which the model no longer captures the target concept. Join in to explore new tools for detecting undesirable model behaviors early in large-scale online ML systems.
5:10pm-5:50pm Wednesday, March 27, 2019
Kevin Moore (Salesforce)
Average rating: 4.50 (2 ratings)
Kevin Moore walks you through how TransmogrifAI, Salesforce's open source AutoML library built on Spark, automatically generates models customized to a company's dataset and use case and provides insights into why a model makes the predictions it does.
11:50am-12:30pm Thursday, March 28, 2019
Alkis Simitsis (Micro Focus), Shivnath Babu (Unravel Data Systems | Duke University)
Average rating: 2.67 (3 ratings)
Alkis Simitsis and Shivnath Babu share an automated, deep learning-based technique for root cause analysis (RCA) of big data stack applications, using Spark and Impala. The concepts they discuss apply generally to the big data stack.
1:50pm-2:30pm Thursday, March 28, 2019
Arun Kumar (University of California, San Diego)
Average rating: 4.00 (2 ratings)
Arun Kumar details recent techniques to accelerate ML over data that is the output of joins of multiple tables. Using ideas from query optimization and learning theory, Arun demonstrates how to avoid joins before ML to reduce runtimes and memory and storage footprints. Along the way, he explores open source software prototypes and sample ML code in both R and Python.
2:40pm-3:20pm Thursday, March 28, 2019
Till Bergmann (Salesforce)
Average rating: 3.67 (6 ratings)
Label leakage is a problem in data used for predictive modeling. At enterprise companies such as Salesforce, this problem takes on monstrous proportions as the data is populated by diverse business processes, making it hard to distinguish cause from effect. Till Bergmann explains how Salesforce, which needs to churn out thousands of customer-specific models for any given use case, tackled this problem.
4:40pm-5:20pm Thursday, March 28, 2019
Yves Thibaudeau (US Census Bureau)
Average rating: 3.33 (3 ratings)
The US Census Bureau has been involved in record linkage projects for over 40 years. In that time, there's been a lot of change in computing capabilities and new techniques, and the Census Bureau is reviewing an inventory of linkage methodologies. Yves Thibaudeau describes the progress made so far in identifying specific record linkage techniques for specific applications.
4:40pm-5:20pm Thursday, March 28, 2019
Holden Karau (Google), Rachel Warren (Salesforce Einstein)
Average rating: 4.60 (5 ratings)
Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (a.k.a. tuning), or our jobs may be eaten by Cthulhu. Holden Karau and Rachel Warren explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads, including new settings in 2.4.