Apache Spark has become a key tool for data scientists to explore, understand, and transform massive datasets and to build and train advanced machine learning models. The question then becomes how to deploy these models in a production environment. How do you embed what you’ve learned into customer-facing data applications?
When companies begin to employ machine learning in actual production workflows, they encounter new sources of friction. Sharing models across teams can be challenging, especially when sharing means migrating to new deployment environments. Ensuring that identical models are deployed in different systems, especially while preserving complex featurization logic, can introduce subtle bugs and changes in behavior.
Joseph Bradley discusses common paths to productionizing Apache Spark MLlib models and shares engineering challenges and corresponding best practices. Along the way, Joseph covers several deployment scenarios, including batch scoring, Structured Streaming, and real-time low-latency serving. Joseph concludes with a demo that illustrates key parts of these workflows. You’ll leave with a high-level view of deployment modes as well as tips and resources for getting started with each mode.
Joseph Bradley is a software engineer working on machine learning at Databricks. Joseph is an Apache Spark committer and PMC member. Previously, he was a postdoc at UC Berkeley. Joseph holds a PhD in machine learning from Carnegie Mellon University, where he focused on scalable learning for probabilistic graphical models, examining trade-offs between computation, statistical efficiency, and parallelization.
©2018, O'Reilly Media, Inc.