Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Best practices for productionizing Apache Spark MLlib models

Joseph Bradley (Databricks)
2:40pm–3:20pm Wednesday, March 7, 2018

Who is this presentation for?

  • Data scientists, data engineers, and Apache Spark users

Prerequisite knowledge

  • A basic understanding of Apache Spark and MLlib (useful but not required)

What you'll learn

  • Gain a high-level view of deploying Apache Spark ML models to production
  • Learn best practices for major deployment modes for Spark ML models


Apache Spark has become a key tool for data scientists to explore, understand, and transform massive datasets and build and train advanced machine learning models. The question then becomes how to deploy these machine learning models in a production environment. How do you embed what you’ve learned into customer-facing data applications?

When companies begin to employ machine learning in actual production workflows, they encounter new sources of friction. Sharing models across teams can be challenging, especially when sharing means migrating to new deployment environments. Deploying the same model in different systems, especially while maintaining complex featurization logic, can introduce subtle bugs and changes in behavior.

Joseph Bradley discusses common paths to productionizing Apache Spark MLlib models and shares engineering challenges and corresponding best practices. Along the way, Joseph covers several deployment scenarios, including batch scoring, Structured Streaming, and real-time low-latency serving. Joseph concludes with a demo that illustrates key parts of these workflows. You’ll leave with a high-level view of deployment modes as well as tips and resources for getting started with each mode.

Joseph Bradley


Joseph Bradley is a software engineer working on machine learning at Databricks. Joseph is an Apache Spark committer and PMC member. Previously, he was a postdoc at UC Berkeley. Joseph holds a PhD in machine learning from Carnegie Mellon University, where he focused on scalable learning for probabilistic graphical models, examining trade-offs between computation, statistical efficiency, and parallelization.

Comments on this page are now closed.


03/07/2018 7:05am PST

Is this presentation available on GitHub?

Esteban Hernández | SOFTWARE ARCHITECT
03/07/2018 6:55am PST

Is it possible to download your presentation?