Presented By O'Reilly and Cloudera
Make Data Work
Dec 4–5, 2017: Training
Dec 5–7, 2017: Tutorials & Conference
Singapore

Data production pipelines: Legacy, practices, and innovation

Natalino Busa (DBS), Matteo Pelati (DataRobot)
2:35pm–3:15pm Wednesday, December 6, 2017
Average rating: 4.00 (3 ratings)

Who is this presentation for?

  • Data engineers, machine learning engineers, and managers and analysts with an interest in modern data architectures

Prerequisite knowledge

  • Familiarity with the Jupyter Notebook, Spark SQL, and PySpark

What you'll learn

  • Learn how to quickly port ETL flows to Spark via a user-friendly web UI, robustly manage ETL and data science models from development to production, and design a CI/CD pipeline for data science models using Jupyter notebooks

Description

Modern engineering requires machine learning engineers who can implement and monitor ETL and machine learning models in production. Natalino Busa shares technologies, techniques, and blueprints for robustly and reliably managing data science and ETL flows from inception to production.

In particular, Natalino explains how to solve one of the most annoying problems in modern data pipelines, migrating and managing legacy ETL, by generating Spark jobs from a textual representation (NLP and SQL). He also demonstrates an open source web UI, implemented in React, that transforms high-level representations into Spark code and shows how users can capture and discover data in the organization by accessing a metadata service. Finally, he introduces the datalabframework, a lightweight Jupyter-powered framework that allows machine learning scientists and engineers to build a robust production ML system using only notebooks.

Natalino Busa

DBS

Natalino Busa is the chief data architect at DBS, where he leads the definition, design, and implementation of big, fast data solutions for data-driven applications, such as predictive analytics, personalized marketing, and security event monitoring. Natalino is an all-around technology manager, product developer, and innovator with a 15+-year track record in research, development, and management of distributed architectures and scalable services and applications. Previously, he was the head of data science at Teradata, an enterprise data architect at ING, and a senior researcher at Philips Research Laboratories on the topics of system-on-a-chip architectures, distributed computing, and parallelizing compilers.

Matteo Pelati

DataRobot

Matteo Pelati is the head of data engineering at DBS Bank, where he oversees the design and development of the entire DBS big data compute platform. He has more than 15 years of experience in software engineering. In recent years, he has focused on scalable big data platforms and machine learning, specifically using Hadoop and Spark. Matteo has previously held roles at startups and multinational companies (MNCs), leading engineering teams at DataRobot, Bubbly, Microsoft, and Nokia.