Efficient, accurate, and robust ETL (extract, transform, load) pipelines are essential components for building successful data products. Johannes Bauer discusses the fundamental requirements for ETL pipelines that port information stored in large flat files into a suitable database representation, highlighting major guiding principles as well as challenges. Since the focus of an ETL process is data integrity and accuracy, a statically typed functional language like Scala is an excellent choice for accomplishing the task in a scalable fashion. For illustration, Johannes presents selected elements of ETL pipeline implementations, emphasizing particularly useful libraries.
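As a flavor of the approach, the sketch below (not taken from the talk; the record layout, names like `TradeRecord` and `parseLine`, and the CSV format are all hypothetical) shows how static typing can guard the extract/transform step: raw flat-file rows are parsed into a typed record, and malformed rows surface as explicit values rather than runtime surprises, so the load step only ever sees validated data.

```scala
import java.time.LocalDate
import scala.util.Try

// Hypothetical typed representation of one flat-file row.
final case class TradeRecord(id: Long, date: LocalDate, amount: BigDecimal)

// Parse a single comma-separated line into a TradeRecord,
// returning an error message instead of throwing on bad input.
def parseLine(line: String): Either[String, TradeRecord] =
  line.split(',') match {
    case Array(id, date, amount) =>
      Try(
        TradeRecord(id.trim.toLong,
                    LocalDate.parse(date.trim),
                    BigDecimal(amount.trim))
      ).toEither
        .left.map(e => s"Malformed row '$line': ${e.getMessage}")
    case _ =>
      Left(s"Unexpected column count in '$line'")
  }

// Usage: well-formed and malformed rows are separated explicitly
// before anything is loaded into the database.
val results = List("1,2016-05-01,100.25", "2,not-a-date,3.50").map(parseLine)
val records = results.collect { case Right(r) => r }
val errors  = results.collect { case Left(e)  => e }
```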
Johannes Bauer is currently a lead data scientist at IHS Markit. Johannes holds a PhD in theoretical condensed matter physics and has postdoctoral experience working with big data and parallel processing at the Max Planck Institute (Germany) and Harvard University.