The first step in any data science project is understanding the available data. To this end, data scientists spend a significant part of their time carrying out data quality assessments and data exploration. Although this step is crucial, it usually means repeating a series of menial tasks before the data scientist understands the dataset and can move on to the next steps in the project.
Víctor Zabalza shares a Python package, based on dask execution graphs and interactive visualization in Jupyter widgets, built to overcome this drudge work, enable efficient data exploration, and kickstart data science projects. The tool generates a summary for each dataset that includes general information about the dataset, including the data quality of each column; the distribution of each column through statistics and plots (histogram, CDF, KDE), optionally grouped by other categorical variables; a 2D distribution between pairs of columns; and a correlation coefficient matrix for all numerical columns.
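To make the contents of such a summary concrete, the following is a minimal sketch in plain Python of a per-column data quality report and a correlation coefficient between numerical columns. It is not the package's implementation, which is built on dask and Jupyter widgets; the function names `summarize_column` and `pearson` and the small example dataset are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def summarize_column(values):
    """Return data-quality counts and basic statistics for one column."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if present and len(numeric) == len(present):
        # Numerical column: report range, mean and population std deviation.
        mean = sum(numeric) / len(numeric)
        variance = sum((v - mean) ** 2 for v in numeric) / len(numeric)
        summary.update(
            min=min(numeric), max=max(numeric),
            mean=mean, std=math.sqrt(variance),
        )
    else:
        # Categorical column: report the most frequent value and its count.
        summary["top"] = Counter(present).most_common(1)[0] if present else None
    return summary


def pearson(xs, ys):
    """Pearson correlation over rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical dataset: column name -> list of values, None marks a gap.
data = {
    "age": [23, 35, None, 41],
    "income": [30.0, 50.0, 45.0, 62.0],
    "city": ["London", "Leeds", "London", None],
}

summaries = {name: summarize_column(col) for name, col in data.items()}
numeric_columns = [name for name, s in summaries.items() if "mean" in s]
correlations = {
    (a, b): pearson(data[a], data[b])
    for a in numeric_columns for b in numeric_columns
}
```

Here `summaries["age"]` reports one missing value, and `correlations` holds one entry per pair of numerical columns. A real tool would add the plots (histogram, CDF, KDE) and the grouping by categorical variables described above.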
Víctor explains how building this tool has provided a unique view into the full Python data stack, from the parallelized analysis of a data frame within a custom dask execution graph to interactive visualization with Jupyter widgets and Plotly. He also explains why it will become essential in the first steps of every data science project: it cuts down the time data scientists spend making one-use exploratory graphs and gets them to deriving insights from the data more quickly.
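For readers unfamiliar with the idea of a custom execution graph, the sketch below shows the general shape of one: a dictionary that maps each key to a task tuple of the form `(function, *arguments)`, where an argument may be the key of another task. This is the dict-of-tasks format that dask schedulers such as `dask.get` accept, but the graph contents, the `load_column` helper, and the small serial `get` evaluator here are illustrative assumptions, not the tool's actual graph.

```python
from operator import truediv


def load_column():
    """Hypothetical stand-in for reading one column of a data frame."""
    return [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]


# Each key maps to a task: (callable, *args). String args that match a
# key refer to the result of that other task. "minimum", "maximum" and
# "mean" are independent of each other once "column" is available, which
# is what lets a parallel scheduler run them concurrently.
graph = {
    "column": (load_column,),
    "total": (sum, "column"),
    "count": (len, "column"),
    "mean": (truediv, "total", "count"),
    "minimum": (min, "column"),
    "maximum": (max, "column"),
}


def get(graph, key, cache=None):
    """Serial toy evaluator; assumes every graph value is a task tuple."""
    cache = {} if cache is None else cache
    if key not in cache:
        func, *args = graph[key]
        resolved = [
            get(graph, arg, cache)
            if isinstance(arg, str) and arg in graph else arg
            for arg in args
        ]
        cache[key] = func(*resolved)
    return cache[key]


# Share one cache so "column" is loaded only once across all statistics.
cache = {}
stats = {name: get(graph, name, cache) for name in ("mean", "minimum", "maximum")}
```

The evaluator above runs tasks one at a time. The point of expressing the analysis as a graph is that a scheduler can inspect the dependencies and run independent statistics in parallel while computing shared intermediate results only once.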
Víctor Zabalza is a data engineer at ASI Data Science who is interested in building awesome Python tools for data science. He has a background in high-energy astrophysics, with 10 years of research experience that included work on the origin of gamma-ray emission from systems within our galaxy.