Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

Spark machine-learning pipelines: The good, the bad, and the ugly

14:55–15:35 Wednesday, 24 May 2017
Spark & beyond
Location: Capital Suite 12
Level: Intermediate
Average rating: 3.31 (13 ratings)

Who is this presentation for?

  • Data scientists, data engineers, and anyone interested in machine-learning pipelines with Spark

What you'll learn

  • Learn why Spark is a great tool to build machine-learning pipelines
  • Explore two real-world applications that use Spark to build functional machine-learning pipelines


Spark is now the de facto engine for big data processing. Vincent Van Steenbergen walks you through two real-world applications that use Spark to build functional machine-learning pipelines (wine price prediction and malware analysis), discussing the architecture and implementation and sharing the good, the bad, and the ugly experiences he had along the way.

Topics include:

  • The concepts and principles behind machine-learning pipelines
  • The algorithms Spark MLlib covers (regression, classification, clustering)
  • The core concepts behind transformers, estimators, and model selection through hyperparameter tuning
  • An overview of a pipeline to ingest time series data of fine wine prices, cluster them, and predict market price variation for the next five months
  • An overview of Karu Anti-Malware, a pipeline to ingest binary files, analyze them, detect malware signatures, and enrich the model
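
As a rough illustration of the transformer, estimator, and hyperparameter-tuning concepts listed above, here is a minimal, dependency-free sketch in plain Python. It is not Spark's API and not the speaker's code: the class names (Scaler, RidgeThroughOrigin, Pipeline), the column names, and the toy data are all hypothetical, chosen only to mirror the general fit/transform contract that pipeline libraries such as Spark ML follow.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, float]


class Scaler:
    """Transformer (hypothetical): a stateless stage that maps rows to rows."""

    def __init__(self, input_col: str, output_col: str, factor: float) -> None:
        self.input_col, self.output_col, self.factor = input_col, output_col, factor

    def transform(self, rows: Sequence[Row]) -> List[Row]:
        return [{**r, self.output_col: r[self.input_col] * self.factor} for r in rows]


class LinearModel:
    """Transformer produced by fitting: appends a 'prediction' column."""

    def __init__(self, slope: float) -> None:
        self.slope = slope

    def transform(self, rows: Sequence[Row]) -> List[Row]:
        return [{**r, "prediction": self.slope * r["x"]} for r in rows]


class RidgeThroughOrigin:
    """Estimator (hypothetical): fit() learns a slope and returns a LinearModel."""

    def __init__(self, reg: float) -> None:
        self.reg = reg  # hyperparameter: L2 penalty strength

    def fit(self, rows: Sequence[Row]) -> LinearModel:
        # Closed form for min_w sum((y - w*x)^2) + reg * w^2
        sxy = sum(r["x"] * r["y"] for r in rows)
        sxx = sum(r["x"] ** 2 for r in rows)
        return LinearModel(sxy / (sxx + self.reg))


class PipelineModel:
    """A fitted pipeline: a chain of transformers applied in order."""

    def __init__(self, stages: list) -> None:
        self.stages = stages

    def transform(self, rows: Sequence[Row]) -> List[Row]:
        current = list(rows)
        for stage in self.stages:
            current = stage.transform(current)
        return current


class Pipeline:
    """An estimator made of stages; fitting it yields a PipelineModel."""

    def __init__(self, stages: list) -> None:
        self.stages = stages

    def fit(self, rows: Sequence[Row]) -> PipelineModel:
        fitted, current = [], list(rows)
        for stage in self.stages:
            # Estimators are fitted on the data seen so far; transformers pass through.
            model = stage.fit(current) if hasattr(stage, "fit") else stage
            fitted.append(model)
            current = model.transform(current)
        return PipelineModel(fitted)


def mse(rows: Sequence[Row]) -> float:
    """Mean squared error between the 'prediction' and 'y' columns."""
    return sum((r["prediction"] - r["y"]) ** 2 for r in rows) / len(rows)


def grid_search(train: Sequence[Row], valid: Sequence[Row],
                regs: Sequence[float]) -> Tuple[float, PipelineModel]:
    """Model selection: keep the hyperparameter with the lowest validation error."""
    best = None
    for reg in regs:
        model = Pipeline([Scaler("x_raw", "x", 0.5), RidgeThroughOrigin(reg)]).fit(train)
        score = mse(model.transform(valid))
        if best is None or score < best[0]:
            best = (score, reg, model)
    return best[1], best[2]


# Toy data (made up): after scaling, y is exactly 2 * x.
train = [{"x_raw": 2.0, "y": 2.0}, {"x_raw": 4.0, "y": 4.0}, {"x_raw": 6.0, "y": 6.0}]
valid = [{"x_raw": 8.0, "y": 8.0}]
best_reg, best_model = grid_search(train, valid, [0.0, 1.0, 10.0])
```

The point of the sketch is the division of labour: a transformer only maps data to data, an estimator learns from data and returns a transformer, and a pipeline chains both so that model selection can refit the whole chain once per candidate hyperparameter.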

Vincent Van Steenbergen

w00t data

Vincent Van Steenbergen is a certified Spark consultant and trainer at w00t data. He helps companies scale big data and machine-learning solutions into production-ready applications and provides Spark training and consulting to a broad range of companies across Europe and the US. Vincent is a coorganizer of the international conference.



18/07/2017 12:50 BST

Hi Vincent,

Great session! Could you please post the slides and resources used?