Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

Automated data exploration: Building efficient analysis pipelines with dask

Victor Zabalza (ASI Data Science)
14:55–15:35 Wednesday, 24 May 2017
Data engineering and architecture
Location: Capital Suite 10/11
Level: Intermediate
Average rating: 3.75 (4 ratings)

Who is this presentation for?

  • Data scientists, data engineers, and project managers

Prerequisite knowledge

  • Intermediate knowledge about the Python scientific stack (useful but not required)

What you'll learn

  • Explore a Python package based on dask execution graphs and interactive visualization in Jupyter widgets that enables efficient data exploration


The first step in any data science project is understanding the available data. To this end, data scientists spend a significant part of their time carrying out data quality assessments and data exploration. In spite of this being a crucial step, carrying it out usually means repeating a series of menial tasks before the data scientist gains an understanding of the dataset and can progress to the next steps in the project.

Victor Zabalza shares a Python package, based on dask execution graphs and interactive visualization in Jupyter widgets, built to overcome this drudge work, enabling efficient data exploration and kickstarting data science projects. The tool generates a summary for each dataset that covers general information about the dataset, including the data quality of each column; the distribution of each column through statistics and plots (histogram, CDF, KDE), optionally grouped by other categorical variables; a 2D distribution between pairs of columns; and a correlation coefficient matrix for all numerical columns.
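To make the contents of such a summary concrete, here is a minimal sketch in plain pandas and NumPy. It is not the package presented in the talk: the function name `summarize`, its `bins` parameter, and the toy data frame are all hypothetical, and the KDE, grouped distributions, and 2D distributions mentioned above are left out for brevity.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame, bins: int = 10) -> dict:
    """Hypothetical summary helper; not the API of the package from the talk."""
    # Per-column data quality: type, missing values, distinct values.
    quality = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_frac": df.isna().mean(),
        "unique": df.nunique(),
    })

    # Histogram and empirical CDF for each numerical column.
    numeric = df.select_dtypes(include="number")
    distributions = {}
    for name in numeric.columns:
        values = numeric[name].dropna().to_numpy()
        if values.size == 0:
            continue  # nothing to bin for an all-missing column
        counts, edges = np.histogram(values, bins=bins)
        cdf = np.cumsum(counts) / counts.sum()
        distributions[name] = {"counts": counts, "edges": edges, "cdf": cdf}

    return {
        "n_rows": len(df),
        "quality": quality,
        "distributions": distributions,
        # Pairwise Pearson correlation between numerical columns.
        "correlation": numeric.corr(),
    }


# Toy data, invented for illustration only.
df = pd.DataFrame({
    "height": [1.60, 1.75, np.nan, 1.82, 1.68],
    "weight": [55.0, 72.0, 80.0, 90.0, 63.0],
    "group": ["a", "b", "a", "b", "a"],
})
summary = summarize(df)
print(summary["quality"])
print(summary["correlation"])
```

Each entry of the returned dictionary corresponds to one of the summary components listed above, so a front end such as a notebook widget could render them independently.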

Victor explains how building this tool has provided a unique view into the full Python data stack, from the parallelized analysis of a data frame within a custom dask execution graph to interactive visualization with Jupyter widgets and Plotly. He also explains why it will become essential in the first steps of every data science project, cutting down the time data scientists spend making single-use exploratory graphs so that they can move more quickly to deriving insights from the data.
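For readers unfamiliar with custom dask execution graphs, the following is a minimal sketch of the idea, assuming only dask's documented dictionary graph format and its threaded scheduler. The task names, helper functions, and toy data are hypothetical and use plain Python lists rather than a data frame; the actual graph built by the tool is not described in this abstract.

```python
from statistics import mean

from dask.threaded import get


def load():
    # Stand-in for reading a real dataset; values invented for illustration.
    return {"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 9.0]}


def column_mean(table, name):
    return mean(table[name])


def column_range(table, name):
    return (min(table[name]), max(table[name]))


def combine(mean_x, range_x, mean_y, range_y):
    return {
        "x": {"mean": mean_x, "range": range_x},
        "y": {"mean": mean_y, "range": range_y},
    }


# A dask graph is a dict mapping keys to tasks. A task is a tuple whose
# first element is a callable; string arguments that match a key in the
# graph are replaced by that key's result, other strings stay literal.
graph = {
    "table": (load,),
    "mean-x": (column_mean, "table", "x"),
    "range-x": (column_range, "table", "x"),
    "mean-y": (column_mean, "table", "y"),
    "range-y": (column_range, "table", "y"),
    "summary": (combine, "mean-x", "range-x", "mean-y", "range-y"),
}

# The scheduler computes "table" once and is free to run the four
# independent per-column tasks in parallel before combining them.
summary = get(graph, "summary")
print(summary)
```

Expressing the analysis as a graph lets shared intermediate results, here the loaded table, be computed once and reused by every downstream statistic.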

Victor Zabalza

ASI Data Science

Víctor Zabalza is a data engineer at ASI Data Science, where he is interested in building awesome Python tools for data science. He has a background in high-energy astrophysics, with 10 years of research experience that included work on the origin of gamma-ray emission from systems within our galaxy.