Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Leveraging open source automated data science tools

Eduardo Arino de la Rubia (Domino Data Lab)
11:20am–12:00pm Thursday, September 28, 2017
Data science & advanced analytics, Machine Learning & Data Science
Location: 1A 08/10
Level: Intermediate
Average rating: *****
(5.00, 5 ratings)

Who is this presentation for?

  • Data scientists and software engineers getting their start in machine learning

Prerequisite knowledge

  • Familiarity with machine learning and the goals of data science in organizations

What you'll learn

  • Understand the current state of the art in open source automated model building and data science, the limitations of these approaches, and where the industry and community are likely to go next


The data science process seeks to transform and empower organizations by finding and exploiting market inefficiencies and potentially hidden opportunities, but this is often an expensive, tedious process. However, many steps can be automated to provide a streamlined experience for data scientists. Eduardo Arino de la Rubia explores the tools being created by the open source community to free data scientists from tedium, enabling them to work on the high-value aspects of insight creation and impact validation.

The promise of the automated statistician is almost as old as statistics itself. From the creation of vast tables that saved the labor of calculation to modern tools that automatically mine datasets for correlations, the field has advanced considerably. Eduardo compares and contrasts a number of open source tools, including TPOT and auto-sklearn for automated model generation and scikit-feature for feature generation and other aspects of the data science workflow, evaluates their results, and discusses their place in the modern data science workflow. Along the way, Eduardo outlines the pitfalls of automated data science and applications of the “no free lunch” theorem and dives into alternative approaches, such as end-to-end deep learning, which seek to leverage massive-scale computing and architectures to handle automatic generation of features and advanced models.
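
As a rough illustration of the idea behind automated model generation, the sketch below scores a few candidate models with cross-validation and keeps the best one. It is a toy under stated assumptions, not how TPOT or auto-sklearn work internally: the dataset is synthetic, and the three candidate models and every function name are hypothetical, written in plain Python so the example is self-contained.

```python
import random
from statistics import mean


def make_data(n=200, seed=0):
    """Synthetic data (an assumption): one feature whose mean depends on a binary label."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        rows.append((rng.gauss(2.0 * y, 1.0), y))
    return rows


def majority_class(train):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    winner = max((0, 1), key=labels.count)
    return lambda x: winner


def nearest_neighbor(train):
    """1-nearest-neighbor on the single feature."""
    return lambda x: min(train, key=lambda row: abs(row[0] - x))[1]


def threshold_rule(train):
    """Predict 1 above the midpoint of the class means (class 1 has the higher mean here)."""
    zeros = [x for x, y in train if y == 0]
    ones = [x for x, y in train if y == 1]
    if not zeros or not ones:  # degenerate fold: fall back to the baseline
        return majority_class(train)
    cut = (mean(zeros) + mean(ones)) / 2
    return lambda x: int(x > cut)


def cross_val_accuracy(fit, data, k=5):
    """Mean held-out accuracy over k interleaved folds."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        predict = fit(train)
        scores.append(sum(predict(x) == y for x, y in test) / len(test))
    return mean(scores)


def auto_select(data, candidates):
    """Score every candidate and return the best name plus all scores."""
    scores = {name: cross_val_accuracy(fit, data) for name, fit in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


CANDIDATES = {
    "majority_class": majority_class,
    "nearest_neighbor": nearest_neighbor,
    "threshold_rule": threshold_rule,
}

data = make_data()
best, scores = auto_select(data, CANDIDATES)
print(best, scores)
```

Which candidate wins depends on the data fed in, which is one informal way to read the “no free lunch” caveat: a search like this can only rank the models it was given against the dataset at hand.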

Eduardo Arino de la Rubia

Domino Data Lab

Eduardo Arino de la Rubia is chief data scientist at Domino Data Lab. Eduardo is a lifelong technologist with a passion for data science who thrives on effectively communicating data-driven insights throughout an organization. He is a graduate of the MTSU Computer Science Department, General Assembly’s Data Science Program, and the Johns Hopkins Coursera Data Science Specialization. Eduardo is currently pursuing a master’s degree in negotiation, conflict resolution, and peacebuilding from CSUDH. You can follow him on Twitter at @earino.