Presented By
O’Reilly + Cloudera

Make Data Work

March 25-28, 2019
San Francisco, CA

Schedule: Automation in data science and big data sessions

As the use of machine learning and analytics becomes more widespread, we’re beginning to see tools that allow data scientists and data engineers to scale and tackle many more problems and maintain more systems. This includes automation tools for the many stages involved in data science, including data preparation, feature engineering, model selection, and hyperparameter tuning, as well as in data engineering and data operations.

9:00am–5:00pm Tuesday, March 26, 2019

Data Case Studies

Location: 2022

Alex Kudriashova (Astro Digital), Jonathan Francis (Starbucks), JoLynn Lavin (General Mills), Robin Way (Corios), June Andrews (GE), Kyungtaak Noh (SK Telecom), Taposh DuttaRoy (Kaiser Permanente), Sabrina Dahlgren (Kaiser Permanente), Craig Rowley (Columbia Sportswear), Ambal Balakrishnan (IBM), Benjamin Glicksberg (UCSF), Patrick Lucey (Stats Perform), Rhonda Textor (True Fit)

Hear practical insights from household brands and global companies: the challenges they tackled, the approaches they took, and the benefits and drawbacks of their solutions.

11:00am–11:40am Wednesday, March 27, 2019

Automating DevOps for machine learning

Data Engineering & Architecture
Location: 2008

Diego Oppenheimer (Algorithmia)

Average rating: 4.00 (11 ratings)

You've invested heavily in cleaning your data, feature engineering, training, and tuning your model, but now you have to deploy that model into production, and you discover it's a huge challenge. Diego Oppenheimer shares common architectural patterns and best practices from the most advanced organizations that deploy models for scalability and accessibility.

11:00am–11:40am Wednesday, March 27, 2019

Building the AI engine for retail in the new era

Data Engineering & Architecture
Location: 2002

Jian Chang (Alibaba Group), Sanjian Chen (Alibaba Group)

Average rating: 4.50 (4 ratings)

Jian Chang and Sanjian Chen outline the design of the AI engine on Alibaba's TSDB service, which enables fast and complex analytics of large-scale retail data. They then share a successful case study of the Fresh Hema Supermarket, a major “new retail” platform operated by Alibaba Group, highlighting solutions to the major technical challenges in data cleaning, storage, and processing.

11:50am–12:30pm Wednesday, March 27, 2019

Automated machine learning for Agile data science at scale

Data Science, Machine Learning & AI
Location: 2011

Sarah Aerni (Salesforce)

Average rating: 4.25 (4 ratings)

How does Salesforce make data science an Agile partner to over 100,000 customers? Sarah Aerni shares the nuts and bolts of the platform and details the Agile process behind it. From the open source AutoML library TransmogrifAI and experimentation to deployment and monitoring, Sarah covers the tools that make it possible for data scientists to rapidly iterate and adopt a truly Agile methodology.

11:50am–12:30pm Wednesday, March 27, 2019

Deep learning beyond the learning

Data Engineering & Architecture
Location: 2008

Tobias Knaup (Mesosphere), Joerg Schad (ArangoDB)

Average rating: 4.50 (2 ratings)

There are many great tutorials for training your deep learning models, but training is only a small part of the overall deep learning pipeline. Tobias Knaup and Joerg Schad offer an introduction to building a complete automated deep learning pipeline, starting with exploratory analysis and moving through training, model storage, model serving, and monitoring.

4:20pm–5:00pm Wednesday, March 27, 2019

Scaling model training: From flexible training APIs to resource management with Kubernetes

Data Science, Machine Learning & AI
Location: 2011

Kelley Rivoire (Stripe)

Average rating: 4.33 (3 ratings)

Production ML applications benefit from reproducible, automated retraining and deployment of ever more predictive models trained on ever-increasing amounts of data. Kelley Rivoire explains how Stripe built a flexible API for training machine learning models that's used to train thousands of models per week on Kubernetes, supporting automated deployment of new models with improved performance.

5:10pm–5:50pm Wednesday, March 27, 2019

Talking to the machines: Monitoring production machine learning systems

Data Science, Machine Learning & AI
Location: 2011

Ting-Fang Yen (DataVisor)

Average rating: 4.00 (3 ratings)

Ting-Fang Yen details an approach for monitoring production machine learning systems that handle billions of requests daily. The approach discovers detection anomalies, such as spurious false positives, as well as gradual concept drift, in which the model no longer captures the target concept. Join in to explore new tools for detecting undesirable model behaviors early in large-scale online ML systems.

5:10pm–5:50pm Wednesday, March 27, 2019

Point, click, predict

Data Science, Machine Learning & AI, Expo Hall
Location: Expo Hall

Kevin Moore (Salesforce)

Average rating: 4.50 (2 ratings)

Kevin Moore walks you through how TransmogrifAI, Salesforce's open source AutoML library built on Spark, generates models that are automatically customized to a company's dataset and use case and provides insights into why a model makes the predictions it does.

11:50am–12:30pm Thursday, March 28, 2019

Automation of root cause analysis for big data stack applications

Data Engineering & Architecture
Location: 2024

Alkis Simitsis (Micro Focus), Shivnath Babu (Unravel Data Systems | Duke University)

Average rating: 2.67 (3 ratings)

Alkis Simitsis and Shivnath Babu share an automated technique for root cause analysis (RCA) of big data stack applications that is based on deep learning and illustrated with Spark and Impala. The concepts they discuss apply generally to the big data stack.

1:50pm–2:30pm Thursday, March 28, 2019

Faster ML over joins of tables

Data Engineering & Architecture
Location: 2008

Arun Kumar (University of California, San Diego)

Average rating: 4.00 (2 ratings)

Arun Kumar details recent techniques to accelerate ML over data that is the output of joins of multiple tables. Using ideas from query optimization and learning theory, Arun demonstrates how to avoid joins before ML to reduce runtimes and memory and storage footprints. Along the way, he explores open source software prototypes and sample ML code in both R and Python.

2:40pm–3:20pm Thursday, March 28, 2019

How to train your model (and catch label leakage)

Data Science, Machine Learning & AI
Location: 2010

Till Bergmann (Salesforce)

Average rating: 3.67 (6 ratings)

One problem in predictive modeling data is label leakage. At enterprise companies such as Salesforce, this problem takes on monstrous proportions because the data is populated by diverse business processes, making it hard to distinguish cause from effect. Till Bergmann explains how Salesforce, which needs to churn out thousands of customer-specific models for any given use case, tackled this problem.

4:40pm–5:20pm Thursday, March 28, 2019

New directions in record linkage

Data Engineering & Architecture
Location: 2024

Yves Thibaudeau (US Census Bureau)

Average rating: 3.33 (3 ratings)

The US Census Bureau has been involved in record linkage projects for over 40 years. In that time, there's been a lot of change in computing capabilities and new techniques, and the Census Bureau is reviewing an inventory of linkage methodologies. Yves Thibaudeau describes the progress made so far in identifying specific record linkage techniques for specific applications.

4:40pm–5:20pm Thursday, March 28, 2019

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am

Data Engineering & Architecture
Location: 2001

Holden Karau (Independent), Rachel Warren (Salesforce Einstein)

Average rating: 4.60 (5 ratings)

Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (a.k.a. tuning) or our jobs may be eaten by Cthulhu. Holden Karau and Rachel Warren explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads, including new settings in Spark 2.4.


©2019, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.