Brought to you by NumFOCUS Foundation and O’Reilly Media Inc.
The official Jupyter Conference
August 22-23, 2017: Training
August 23-25, 2017: Tutorials & Conference
New York, NY

Analyst's Nightmare or Laundering Massive Spreadsheets

Moderated by: Feyzi Bagirov & Tatiana Yarmola

Who is this presentation for?

Data Analysts/Scientists, Beginner

Prerequisite knowledge

-Python 3
-Jupyter
-Regex

What you'll learn

-Attendees will learn how to spot and fix data quality issues in spreadsheet data using Python

Description

The spreadsheet lives on, especially in sectors that are slow to adopt new technology, such as medicine and finance. Not only is data frequently stored and passed around in spreadsheet formats, but analysis is also frequently performed without ever leaving Excel. When the data turns out to be less clean than you hoped, serious errors occur and propagate through the spreadsheet workflow. Data quality issues such as duplicates and nulls, common practices such as copy-pasting, VLOOKUPs, and manual imputation, and a failure to properly understand and clean the data before drawing conclusions frequently lead to significant errors.
The pandas library provides powerful tools for ingesting, cleaning, transforming, and visualizing spreadsheet data, capabilities that are either lacking in Excel or very painful to implement given the number of worksheets a task requires. This talk will demonstrate several frequently occurring data issues and show how they can be dealt with in pandas.
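
As a rough illustration of the kind of cleanup described above (not the material presented in the talk), the sketch below handles duplicates, nulls, and a VLOOKUP-style lookup in pandas. The tables, column names, and values are all hypothetical, and the data is built in memory rather than read from a workbook so the example runs on its own.

```python
import pandas as pd

# Hypothetical rows standing in for a worksheet. In practice they would
# usually be loaded with pd.read_excel(...) from a real workbook.
visits = pd.DataFrame({
    "patient_id": [" A01", "A02", "A02", "a03", "A04"],
    "visit_date": ["2017-01-05", "2017-01-06", "2017-01-06", "2017-01-09", None],
    "charge": ["100", "250", "250", "N/A", "75"],
})

# Hypothetical lookup table, the kind of sheet a VLOOKUP would point at.
patients = pd.DataFrame({
    "patient_id": ["A01", "A02", "A03"],
    "site": ["North", "South", "North"],
})


def clean(df):
    """Normalize keys and types, then drop exact duplicate rows."""
    out = df.copy()
    # Stray whitespace and inconsistent case make equal keys look different.
    out["patient_id"] = out["patient_id"].str.strip().str.upper()
    # Text placeholders such as "N/A" become NaN instead of staying strings.
    out["charge"] = pd.to_numeric(out["charge"], errors="coerce")
    out["visit_date"] = pd.to_datetime(
        out["visit_date"], format="%Y-%m-%d", errors="coerce"
    )
    return out.drop_duplicates().reset_index(drop=True)


cleaned = clean(visits)

# Count missing values per column before doing any analysis.
null_counts = cleaned.isna().sum()

# VLOOKUP-style join. validate= raises if the lookup table has duplicate
# keys, and indicator= records which rows found no match.
merged = cleaned.merge(
    patients, on="patient_id", how="left",
    validate="many_to_one", indicator=True,
)
unmatched = merged.loc[merged["_merge"] == "left_only", "patient_id"].tolist()

print(null_counts)
print("Keys with no match in the lookup table:", unmatched)
```

Unlike a VLOOKUP that quietly returns an error value in a single cell, the merge here reports unmatched keys explicitly, and the whole sequence can be rerun unchanged when the spreadsheet is updated.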