Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Enough data engineering for a data scientist; or, How I learned to stop worrying and love the data scientists

Stephen O'Sullivan (Data Whisperers)
11:50am–12:30pm Thursday, March 8, 2018
Average rating: 4.25 (4 ratings)

Who is this presentation for?

  • Data scientists and data scientists in training

Prerequisite knowledge

  • A basic understanding of SQL, data engineering design principles, core software engineering, and DevOps

What you'll learn

  • Gain an understanding of data engineering to improve productivity and the relationship between data scientists and data engineers


How much data engineering should a data scientist know? To get to the fun part of their job, data scientists normally have to do a bit of data engineering; in most cases, 50%–80% of their time is spent onboarding or wrangling data. Their work is then handed over to the data engineering team to put into production (via dev, test, and QA). However, in most cases, the data engineering team has to do some modifications, rewrites, head shaking, and hand wringing to make the code production ready and meet the SLAs defined by the business, because there is a disconnect in how data scientists and data engineers develop code and models.

Stephen O’Sullivan takes you along the data science journey, from onboarding data (using a number of data/object stores) to understanding and choosing the right data format for the data assets to using query engines (and basic query tuning). You’ll learn how a distributed streaming platform works and how to take advantage of it, and you’ll explore good coding practices. Along the way, you’ll learn some new skills to help you be more productive and reduce contention with the data engineering team.


Stephen O'Sullivan

Data Whisperers

A leading expert on big data architectures, Stephen O’Sullivan has 25 years of experience creating scalable, high-availability data and applications solutions. A veteran of Silicon Valley Data Science, @WalmartLabs, Sun, and Yahoo, Stephen is an independent adviser to enterprises on all things data.