Presented By O’Reilly and Cloudera

March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Best practices for productionizing Apache Spark MLlib models

Joseph Bradley (Databricks)

2:40pm–3:20pm Wednesday, March 7, 2018

Big data and data science in the cloud, Data science and machine learning
Location: LL20 D

Who is this presentation for?

Data scientists, data engineers, and Apache Spark users

Prerequisite knowledge

A basic understanding of Apache Spark and MLlib (useful but not required)

What you'll learn

Gain a high-level view of deploying Apache Spark ML models to production
Learn best practices for major deployment modes for Spark ML models

Description

Apache Spark has become a key tool for data scientists to explore, understand, and transform massive datasets and build and train advanced machine learning models. The question then becomes how to deploy these machine learning models in a production environment. How do you embed what you’ve learned into customer-facing data applications?

When companies begin to employ machine learning in actual production workflows, they encounter new sources of friction. Sharing models across teams can be challenging, especially when sharing means migrating to new deployment environments. Keeping a model identical across different systems is hard, especially when it depends on complex featurization logic, and any mismatch can cause subtle bugs and changes in behavior.
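One way to picture the featurization problem is the minimal sketch below. It is plain Python, not Spark's API and not code from the talk: the `ScoringPipeline` class, its field names, and the example numbers are all hypothetical. It keeps the featurization statistics and the model weights in a single serialized artifact, so every environment that loads the artifact scores records the same way. In Spark ML, saving and reloading a fitted `PipelineModel` plays a comparable role.

```python
import json
import math
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class ScoringPipeline:
    """Hypothetical artifact bundling featurization parameters with model weights."""

    feature_names: List[str]
    means: List[float]    # per-feature mean captured at training time
    stds: List[float]     # per-feature standard deviation captured at training time
    weights: List[float]  # logistic regression coefficients
    intercept: float

    def featurize(self, record: Dict[str, float]) -> List[float]:
        # Standardize raw inputs with the stored training-time statistics,
        # so no deployment target re-implements this step on its own.
        return [
            (record[name] - mean) / std
            for name, mean, std in zip(self.feature_names, self.means, self.stds)
        ]

    def predict_proba(self, record: Dict[str, float]) -> float:
        # Logistic regression score on the standardized features.
        features = self.featurize(record)
        z = self.intercept + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def to_json(self) -> str:
        # Serialize featurization and model parameters together.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "ScoringPipeline":
        # Rebuild the full pipeline from the serialized artifact.
        return cls(**json.loads(payload))


# "Training" side: illustrative, made-up parameters.
trained = ScoringPipeline(
    feature_names=["age", "income"],
    means=[40.0, 50000.0],
    stds=[10.0, 20000.0],
    weights=[0.5, -0.25],
    intercept=0.0,
)

# "Serving" side: load the artifact and score a record.
artifact = trained.to_json()
served = ScoringPipeline.from_json(artifact)
record = {"age": 50.0, "income": 50000.0}
score = served.predict_proba(record)
```

Because the means and standard deviations travel with the weights, the serving side cannot drift from the training side by standardizing inputs with different statistics.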

Joseph Bradley discusses common paths to productionizing Apache Spark MLlib models and shares engineering challenges and corresponding best practices. Along the way, Joseph covers several deployment scenarios, including batch scoring, Structured Streaming, and real-time low-latency serving. Joseph concludes with a demo that illustrates key parts of these workflows. You’ll leave with a high-level view of deployment modes as well as tips and resources for getting started with each mode.

Joseph Bradley

Databricks

Joseph Bradley is a software engineer working on machine learning at Databricks. Joseph is an Apache Spark committer and PMC member. Previously, he was a postdoc at UC Berkeley. Joseph holds a PhD in machine learning from Carnegie Mellon University, where he focused on scalable learning for probabilistic graphical models, examining trade-offs between computation, statistical efficiency, and parallelization.

Comments

Amy D | DATA SCIENTIST

03/07/2018 7:05am PST

Is this presentation available on GitHub?

Esteban Hernández | SOFTWARE ARCHITECT

03/07/2018 6:55am PST

Is it possible to download your presentation?
