A variety of tools and frameworks for large-scale data workflows have emerged, with substantial impact on industry practice. Meanwhile, the use of Machine Learning in production apps has become less about algorithms (even though that work is fun and vital) and more about: socializing a problem within an organization; feature engineering; tournaments in continuous integration / continuous deployment environments; and operationalizing high-ROI apps at scale. In other words, leveraging great frameworks to build data workflows has become more important than chasing after diminishing returns on highly nuanced algorithms.
This talk compares and contrasts these different workflow approaches, along with perspectives on use cases and indications, plus where they appear to be heading. Summary points build a scorecard for how to evaluate workflow frameworks based on your needs and use cases.
Building on the topic of “Enterprise Data Workflows”, using the Cascading open source project as a key example, this talk considers a variety of other available open source frameworks for building data workflows. Each brings its own strengths and weaknesses, and addresses particular kinds of environments and use cases. Examples include:
In addition, a review of the PMML open standard shows how some frameworks leverage it to make ML models portable across platforms and teams, while helping to structure metadata in data workflows.
In summary, several “best of breed” points propose a basis for evaluating open source frameworks for data workflows: which to consider first, based on your needs and use cases, leading up to a scorecard. Suggestions point to where some of these projects could be improved, including better integration with the PMML open standard for machine learning model portability across platforms.
O’Reilly author (Enterprise Data Workflows with Cascading) and a “player/coach” who’s led innovative Data teams building large-scale apps for 10+ yrs. Expert in machine learning, cluster computing, and Enterprise use cases for Big Data. Interests: Mesos, PMML, Open Data, Cascalog, Scalding, Python for analytics, NLP.