Brought to you by NumFOCUS Foundation and O’Reilly Media
The official Jupyter Conference
Aug 21-22, 2018: Training
Aug 22-24, 2018: Tutorials & Conference
New York, NY

Advanced data science, part 2: Five ways to handle missing data in Jupyter notebooks

Matt Brems (General Assembly)
3:30pm–5:00pm Wednesday, August 22, 2018

Who is this presentation for?

  • Academics, practitioners, aspiring data scientists, and casual enthusiasts

Prerequisite knowledge

  • Familiarity with linear regression, logistic regression, standard deviation and variance, confidence intervals, and histograms and scatterplots
  • A working knowledge of Jupyter notebooks and Python programming (statsmodels, NumPy, etc.)

Materials or downloads needed in advance

What you'll learn

  • Understand how to visualize and handle missing data
  • Learn the types of missing data, how to identify them, and how to attempt to fix each
  • Learn how to implement reweighting and imputation methods in Jupyter notebooks

Description

If you work with data, you’ve almost certainly encountered missing data. The most common approaches are to ignore or drop anything that’s missing, but doing so shrinks your sample and can bias your results.
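To see why dropping incomplete records can mislead, here is a minimal simulation sketch. It is not taken from the session materials. The distribution, the cutoff of 55, and the missingness rates are all made-up assumptions, chosen so that larger values are more likely to go missing.

```python
import random
import statistics


def simulate_complete_case_bias(n=10_000, seed=0):
    """Compare the true mean with the mean of the rows that survive dropping.

    Assumption for illustration only: values above 55 go missing 80% of
    the time, all other values go missing 10% of the time.
    """
    rng = random.Random(seed)
    true_values = [rng.gauss(50.0, 10.0) for _ in range(n)]

    observed = []
    for v in true_values:
        p_missing = 0.8 if v > 55.0 else 0.1
        # Keep the value only if it did not go missing.
        if rng.random() >= p_missing:
            observed.append(v)

    return statistics.fmean(true_values), statistics.fmean(observed), len(observed)


true_mean, complete_case_mean, n_kept = simulate_complete_case_bias()
print(f"true mean:          {true_mean:.2f}")
print(f"complete-case mean: {complete_case_mean:.2f} ({n_kept} rows kept)")
```

Because high values are lost more often, the mean of the remaining rows falls below the true mean, and no amount of extra data collected the same way would correct it.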

Matt Brems identifies the three types of missing data, explains how badly dropping or ignoring missing data can distort your results, and teaches you how to handle missing data the right way by using Jupyter notebooks to properly reweight or impute your data. Matt focuses on the following techniques: no imputation; deductive imputation; mean, median, and mode imputation; regression imputation; stochastic imputation; and multiply stochastic imputation. You’ll come away with a solid, intuitive understanding of how to handle missing data, practical tips for implementing these techniques, and recommendations for integrating them with your own or your company’s workflow.
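As a rough illustration of three of the techniques named above, the sketch below implements mean imputation, regression imputation, and stochastic regression imputation using only the Python standard library. It is not the presenter's code. The function names, the use of `None` to mark a missing value, and the single-predictor linear model are all assumptions made for the example; in a notebook you would more likely reach for NumPy, pandas, or statsmodels.

```python
import random
import statistics


def mean_impute(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def regression_impute(xs, ys, stochastic=False, rng=None):
    """Fill missing y values from a line fit on the complete (x, y) pairs.

    With stochastic=True, Gaussian noise scaled to the residual standard
    deviation is added to each prediction, so imputed points do not all
    sit exactly on the fitted line.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([p[0] for p in pairs], [p[1] for p in pairs])

    # Residual standard deviation with n - 2 degrees of freedom.
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    sigma = (sum(r * r for r in residuals) / (len(pairs) - 2)) ** 0.5

    rng = rng or random.Random(0)
    filled = []
    for x, y in zip(xs, ys):
        if y is None:
            y = intercept + slope * x
            if stochastic:
                y += rng.gauss(0.0, sigma)
        filled.append(y)
    return filled


# Hypothetical toy data: y is missing for two of the six rows.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, None, 8.2, None, 12.0]

print(mean_impute(ys))
print(regression_impute(xs, ys))
print(regression_impute(xs, ys, stochastic=True, rng=random.Random(42)))
```

Mean imputation ignores `xs` entirely and pulls every missing value toward the center, which understates the variance. Plain regression imputation uses `xs` but places every imputed point exactly on the line. Calling the stochastic version several times with different seeds yields several completed datasets, which is the rough idea behind repeating stochastic imputation multiple times.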

Matt Brems

General Assembly

Matt currently leads instruction for General Assembly’s Data Science Immersive in Washington, DC, where he helps bridge the gap between theoretical statistics and real-world insights. Matt is passionate about making data science more accessible and putting the revolutionary power of machine learning into the hands of as many people as possible. A recovering politico, Matt was a data scientist for a political consulting firm through the 2016 election. He holds a master’s degree in statistics from the Ohio State University. When he isn’t teaching, he’s thinking about how to be a better teacher, falling asleep to Netflix, or cuddling with his pug.