Presented By O'Reilly and Cloudera
Make Data Work
31 May–1 June 2016: Training
1 June–3 June 2016: Conference
London, UK

A functional data integration pipeline using Scala

Johannes Bauer (IHS Markit)
14:55–15:35 Friday, 3/06/2016
Hardcore data science
Location: Capital Suite 17 Level: Intermediate
Average rating: ★★ (2.00, 1 rating)

Prerequisite knowledge

Attendees should have a basic knowledge of databases. Knowledge of Scala and functional programming will be useful but is not necessary.


Efficient, accurate, and robust ETL (extract, transform, load) pipelines are essential components for building successful data products. Johannes Bauer discusses the fundamental requirements for ETL pipelines that port information stored in large flat files into a suitable database representation, highlighting major guiding principles as well as challenges. Since the focus of the ETL process is data integrity and accuracy, a statically typed functional language like Scala is an excellent choice for accomplishing the task in a scalable fashion. For illustration, Johannes presents selected elements of ETL pipeline implementations, emphasizing particularly useful libraries.
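The point about static typing and data integrity can be sketched in a few lines of Scala: instead of passing raw strings through the pipeline, each flat-file record is parsed into a typed value, and malformed rows surface as explicit `Left` errors rather than silent corruption. This is only an illustrative sketch, not code from the talk; the `Trade` case class and `parseTrade` function are hypothetical names.

```scala
// Hypothetical sketch of typed record parsing in an ETL step.
// Trade and parseTrade are illustrative names, not from the talk.
case class Trade(id: Long, symbol: String, price: BigDecimal)

def parseTrade(line: String): Either[String, Trade] =
  line.split(",", -1) match {
    case Array(id, sym, price) =>
      // Any malformed numeric field yields a Left instead of a runtime crash.
      try Right(Trade(id.trim.toLong, sym.trim, BigDecimal(price.trim)))
      catch { case _: NumberFormatException => Left(s"Malformed record: $line") }
    case _ =>
      Left(s"Wrong field count: $line")
  }

// Good rows load; bad rows are collected for inspection instead of
// silently entering the database.
val (errors, trades) =
  Seq("1,ABC,10.50", "2,DEF,oops").map(parseTrade).partitionMap(identity)
```

Because the success and failure cases are both values, downstream stages can only consume well-formed `Trade` records, which is the kind of compile-time guarantee the abstract alludes to.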

Johannes Bauer

IHS Markit

Johannes Bauer is currently lead data scientist at IHS Markit. Johannes holds a PhD in theoretical condensed matter physics and has postdoctoral experience working with big data and parallel processing at the Max Planck Institute (Germany) and Harvard University.