Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

From two weeks in Python to two hours in Pentaho: Building modern big data pipelines for machine learning (sponsored by Hitachi Vantara)

Dave Huh (Hitachi Vantara), Kevin Haas (Hitachi Vantara)
1:10pm–1:50pm Thursday, 09/13/2018
Location: 1E 06

What you'll learn

  • Learn how to build a data pipeline that produces machine learning models and meets the stringent efficiency demands of today's big data environments
  • Explore a number of big data and machine learning cases


IoT devices, from cameras to advanced sensors, now emit tremendous amounts of data that stream into analytics environments. Healthcare providers and payers are collecting vast amounts of medical images and clinical, claims, and operational data. These are common big data cases in which both structured data (e.g., processed claims) and unstructured data (e.g., large stored images) accumulate. However, the massiveness and messiness of today's data environments make traditional ETL processes difficult.

Dave Huh details the end-to-end construction of a data pipeline that produces machine learning models and meets the stringent efficiency demands of today’s big data environments. He covers a number of big data and machine learning cases, including one where a team first spent two weeks building a data pipeline and machine learning model using Python and then built the same pipeline and model in two hours using Pentaho. Dave shares strategies for architecting a data pipeline in Pentaho Data Integration (PDI) to refine raw data for analysis. The pipeline employs a reusable process that extracts metadata from images and then passes that dynamic data into a pipeline through metadata injection, a process similar in concept to creating a template that can receive dynamic parameters in the ETL processes. Using the same PDI environment as an example, Dave explores cases of plug-in machine intelligence that extends machine learning capabilities, allowing seamless training, orchestrating, and outputting of machine learning models, including one that predicts complications from surgery.
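
The template analogy for metadata injection can be illustrated with a small plain-Python sketch. This is not PDI code and does not use any Pentaho API. The names `PipelineTemplate`, `select_fields`, `rename_fields`, and the sample image metadata are all hypothetical, chosen only to show a pipeline defined once and then configured at run time by metadata.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Step = Callable[[Record, Dict[str, Any]], Record]


@dataclass(frozen=True)
class PipelineTemplate:
    """Hypothetical reusable pipeline whose field names stay open until run time."""
    steps: List[Step]

    def inject(self, metadata: Dict[str, Any]) -> Callable[[List[Record]], List[Record]]:
        """Bind run-time metadata to the template and return a runnable pipeline."""
        def run(records: List[Record]) -> List[Record]:
            output = []
            for record in records:
                for step in self.steps:
                    record = step(record, metadata)
                output.append(record)
            return output
        return run


def select_fields(record: Record, metadata: Dict[str, Any]) -> Record:
    # Keep only the fields named in the injected metadata.
    return {name: record[name] for name in metadata["fields"]}


def rename_fields(record: Record, metadata: Dict[str, Any]) -> Record:
    # Rename fields according to the injected mapping, if one is given.
    renames = metadata.get("rename", {})
    return {renames.get(key, key): value for key, value in record.items()}


# The template is written once, with no field names hard-coded.
template = PipelineTemplate(steps=[select_fields, rename_fields])

# Assumed example of metadata extracted from an image source.
image_metadata = {
    "fields": ["width", "height"],
    "rename": {"width": "width_px", "height": "height_px"},
}

run_image_pipeline = template.inject(image_metadata)
result = run_image_pipeline([{"width": 640, "height": 480, "camera": "A"}])
print(result)
```

Injecting a different metadata dictionary into the same `template` would yield a pipeline for a different source without rewriting any step, which is the reuse the abstract attributes to metadata injection.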

This session is sponsored by Hitachi Vantara.


Dave Huh

Hitachi Vantara

Dave Huh is a data scientist in the Professional Services Group at Hitachi Vantara, where he works with healthcare and insurance companies to provide insights with advanced analytics. Dave is passionate about making analytics technologies accessible to the broader public.

Kevin Haas

Hitachi Vantara