
Data Workflows for Machine Learning

Paco Nathan
Computational Thinking
Portland 255
Average rating: 4.33 (6 ratings)
Slides: 1-PDF

A variety of tools and frameworks for large-scale data workflows have emerged, with substantial impact on industry practice. Meanwhile, the use of Machine Learning in production apps has become less about algorithms (even though that work is fun and vital) and more about: socializing a problem within an organization; feature engineering; tournaments in continuous integration / continuous deployment environments; and operationalizing high-ROI apps at scale. In other words, leveraging great frameworks to build data workflows has become more important than chasing diminishing returns on highly nuanced algorithms.

This talk compares and contrasts these different workflow approaches, with perspectives on use cases and indications, plus where they appear to be heading. Summary points build toward a scorecard for evaluating workflow frameworks based on your needs and use cases.

Sub-topics include:

  • What is Machine Learning? What kinds of team processes are needed to deploy high-ROI apps at scale?
  • What is a Data Workflow? How does that differ from simply a dataflow pipeline?
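One way to make the second distinction concrete: a dataflow pipeline is a linear chain of transformations, while a data workflow is a dependency DAG that can branch and join. A minimal sketch in Python (the task names and toy data here are illustrative, not from the talk):

```python
# A dataflow *pipeline* is the special case of a workflow where each step
# has exactly one predecessor; a data *workflow* is a general DAG.
from graphlib import TopologicalSorter  # Python 3.9+

def run_workflow(tasks, deps):
    """Run callables in an order that respects the dependency DAG."""
    results = {}
    for name in TopologicalSorter(deps).static_order():
        inputs = [results[d] for d in deps.get(name, [])]
        results[name] = tasks[name](*inputs)
    return results

tasks = {
    "extract":  lambda: [1, 2, 3, 4],
    "clean":    lambda xs: [x for x in xs if x % 2 == 0],
    "features": lambda xs: [x * 10 for x in xs],
    "report":   lambda cleaned, feats: (sum(cleaned), sum(feats)),
}
deps = {
    "clean":    ["extract"],
    "features": ["extract"],            # branch: two consumers of one source
    "report":   ["clean", "features"],  # join
}

results = run_workflow(tasks, deps)
print(results["report"])  # -> (6, 100)
```

The branch and join in `deps` are exactly what a plain pipeline cannot express, and what workflow frameworks like Cascading manage at cluster scale.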

Building on the topic of “Enterprise Data Workflows”, with the Cascading open source project as a key example, this talk surveys a variety of other available open source frameworks for building data workflows. Each brings its own strengths and weaknesses, and addresses particular kinds of environments and use cases. Examples include:

  • Cascading, Cascalog, Scalding, and related projects
  • KNIME (which integrates R, Weka, Eclipse, Hadoop, etc.)
  • IPython Notebook, and related Py frameworks for ML: scikit-learn, Pandas, Augustus, etc.
  • Summingbird and related libraries from Twitter which integrate Scalding, Storm, Spark, etc.
  • MBrace for .NET and F#
  • Titan
  • Spark, MLBase, and related projects from Berkeley AMPLab
  • Julia
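To ground the Python entries in that list, a minimal sketch of a feature-engineering and model-fitting step using pandas and scikit-learn (the column names and toy data are illustrative, not from the talk):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy training set: hours studied vs. pass/fail outcome.
df = pd.DataFrame({
    "hours":  [1, 2, 3, 4, 5, 6, 7, 8],
    "passed": [0, 0, 0, 0, 1, 1, 1, 1],
})
df["hours_sq"] = df["hours"] ** 2  # a simple engineered feature

X = df[["hours", "hours_sq"]]
y = df["passed"]

model = LogisticRegression().fit(X, y)
preds = model.predict(X)
print((preds == y).mean())  # training accuracy on this toy set
```

In the framing above, the interesting work is in the `hours_sq` line and in how such steps get versioned and deployed, not in the particular estimator chosen.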

In addition, a review of the PMML open standard shows how some frameworks leverage it to make ML models portable across platforms and teams, while also helping to structure metadata in data workflows.
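For concreteness, a PMML document is plain XML. A hand-written sketch of a one-variable regression model might look like the following; the field names and coefficients are invented for illustration, and the snippet is abridged rather than a validated document:

```xml
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="Toy linear model exported from a training workflow"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression" modelName="toy_model">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="x" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
```

Because the model is declarative markup rather than code, any PMML-aware scorer (for example Augustus on the Python side, mentioned above) can evaluate it without access to the framework that trained it.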

In summary, several “best of breed” observations provide a basis for evaluating open source frameworks for data workflows: which to consider first, based on your needs and use cases, leading up to a scorecard. Suggestions point to where some of these projects could be improved, including better integration with the PMML open standard for machine learning model portability across platforms.


Paco Nathan

O’Reilly author (Enterprise Data Workflows with Cascading) and a “player/coach” who has led innovative data teams building large-scale apps for 10+ years. Expert in machine learning, cluster computing, and enterprise use cases for Big Data. Interests: Mesos, PMML, Open Data, Cascalog, Scalding, Python for analytics, NLP.