Sep 23–26, 2019

Working with time series: Denoising and imputation frameworks to improve data density

Anjali Samani (CircleUp)
11:20am–12:00pm Thursday, September 26, 2019
Location: 1A 08/10
Average rating: *****
(5.00, 3 ratings)

Who is this presentation for?

  • Data scientists, data analysts, and people in business intelligence

Level

Beginner

Description

Increasingly, organizations are looking beyond the conventional data provided by aggregators and vendors in their industry. But alternative data, because of the way it’s generated and collected, is typically noisy and often ephemeral. A model’s ability to learn and correctly predict future outcomes depends heavily on the underlying data. Clean, complete data can make the difference between correct and incorrect conclusions, while incomplete data limits the techniques that can be applied to it. And for alternative data sources, observations that were missed are almost impossible to recover.

Anjali Samani explains two simple frameworks for evaluating a dataset’s candidacy for smoothing and quantitatively determining the optimal imputation strategy and the number of consecutive missing values that can be imputed without material degradation in signal quality.

To extract meaningful signals from alternative data, it’s necessary to apply denoising and imputation to generate clean and complete time series. There are numerous ways to smooth a noisy data series and impute missing values, each with relative strengths and weaknesses. Smoothing removes noise from the data and allows patterns and trends to be identified more easily. It can, however, make a series appear less volatile than it is and may mask the very patterns you’re seeking to identify. So you have to know when you should and shouldn’t smooth a series, and if it is smoothed, what type of smoothing you should apply.
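
The trade-off described above can be illustrated with a minimal sketch. It is not taken from the talk: the synthetic series, the window sizes, the `alpha` value, and the `step_volatility` helper are all assumptions chosen for illustration. It applies a simple moving average, an exponentially weighted moving average, and a median filter to the same noisy series and compares how volatile each result looks.

```python
import random
import statistics


def simple_moving_average(values, window):
    """Trailing mean over the last `window` points (shorter at the start)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def exponentially_weighted_average(values, alpha):
    """EWMA: s[t] = alpha * x[t] + (1 - alpha) * s[t-1], seeded with x[0]."""
    out = [values[0]]
    for x in values[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out


def median_filter(values, window):
    """Centered rolling median; the window is truncated at the edges."""
    half = window // 2
    return [
        statistics.median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def step_volatility(values):
    """Standard deviation of one-step changes, a rough volatility proxy."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return statistics.pstdev(diffs)


# Synthetic series (an assumption): linear trend plus Gaussian noise and one spike.
random.seed(0)
noisy = [0.5 * t + random.gauss(0, 2.0) for t in range(60)]
noisy[30] += 25.0  # a single outlier

sma = simple_moving_average(noisy, window=5)
ewma = exponentially_weighted_average(noisy, alpha=0.3)
med = median_filter(noisy, window=5)

for name, series in [("raw", noisy), ("sma", sma), ("ewma", ewma), ("median", med)]:
    print(f"{name:>6}: step volatility = {step_volatility(series):.2f}")
```

Every smoothed series reports lower step volatility than the raw one, which is the point of the caution above: the same operation that suppresses noise also understates how much the series really moves. The median filter discards the single spike almost entirely, while the two averages spread it across neighboring points, so the right choice depends on whether such spikes are noise or signal.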

Similarly, missing observations in time series can be imputed in many ways, which are covered in detail in both the academic and practitioner literature. The cause of the missing values and the way the data will be used downstream can often point to the most appropriate imputation strategy. However, when several options remain, you need an objective way to choose between them and to identify how many consecutive missing values can be safely imputed.
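
One way to make that choice quantitative is sketched below. It is an illustrative approach under stated assumptions, not necessarily the framework presented in the session: take a complete series, hide runs of known values of increasing length, impute them with each candidate strategy, and score the result against the truth. The synthetic series, the `gap_rmse` helper, and the `tolerance` threshold are hypothetical.

```python
import math


def linear_interpolate(values):
    """Fill None gaps with a straight line between the nearest known neighbors.
    Gaps touching either end of the series are left as None."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (i - left) / (right - left)
            out[i] = out[left] + frac * (out[right] - out[left])
    return out


def forward_fill(values):
    """Fill None gaps with the last observed value."""
    out, last = [], None
    for v in values:
        last = v if v is not None else last
        out.append(last)
    return out


def gap_rmse(series, impute, gap_length):
    """Hide every interior run of `gap_length` points in turn, impute it, and
    return the RMSE between imputed and true values over all hidden points."""
    squared_errors = []
    for start in range(1, len(series) - gap_length):
        masked = list(series)
        masked[start:start + gap_length] = [None] * gap_length
        filled = impute(masked)
        for i in range(start, start + gap_length):
            squared_errors.append((filled[i] - series[i]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Synthetic complete series (an assumption): a smooth seasonal curve.
series = [10 + 5 * math.sin(2 * math.pi * t / 24) for t in range(96)]
strategies = {"linear": linear_interpolate, "ffill": forward_fill}
tolerance = 1.0  # hypothetical acceptable RMSE, in the units of the series

for name, impute in strategies.items():
    errors = {k: gap_rmse(series, impute, k) for k in range(1, 9)}
    # Largest k such that every gap length up to k stays within tolerance.
    max_gap = 0
    for k in sorted(errors):
        if errors[k] > tolerance:
            break
        max_gap = k
    print(name, {k: round(e, 3) for k, e in errors.items()}, "max gap:", max_gap)
```

The strategy with the lowest error at a given gap length is the better candidate for that data, and the largest gap length whose error stays under the tolerance gives a defensible limit on how many consecutive values to impute. In practice the tolerance would come from how sensitive the downstream application is to error.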

Prerequisite knowledge

  • A basic understanding of techniques such as simple and exponentially weighted moving averages, median filters, and linear interpolation, and of metrics such as root mean square error, mean absolute and percent errors, and relative error
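
As a refresher on the metrics listed above, here is a minimal sketch with a small worked example. The sample values are made up, and "relative error" has more than one definition in practice; the version here (norm of the error divided by norm of the actual values) is an assumption.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: penalizes large errors more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error, in the units of the series."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percent error; undefined when any actual value is zero."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def relative_error(actual, predicted):
    """Euclidean norm of the error divided by the norm of the actual values."""
    num = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)))
    den = math.sqrt(sum(a ** 2 for a in actual))
    return num / den


# Made-up values for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]

print("RMSE:", round(rmse(actual, predicted), 3))
print("MAE:", round(mae(actual, predicted), 3))
print("MAPE (%):", round(mape(actual, predicted), 3))
print("Relative error:", round(relative_error(actual, predicted), 4))
```

The first two errors have the same absolute size (10 units), but the percent error treats the miss on the smaller value as twice as bad, which is why the choice of metric matters when series differ in scale.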

What you'll learn

  • Gain frameworks for evaluating a dataset’s candidacy for smoothing and quantitatively determining the optimal imputation strategy and the number of consecutive missing values that can be imputed without material degradation in signal quality

Anjali Samani

CircleUp

Anjali Samani is a data science manager who leads the predictive modelling team at CircleUp, an innovative fintech company recently honored as one of the World’s Top 10 Most Innovative Companies in Data Science. Anjali has extensive experience in managing and delivering commercial data science projects and has worked with senior decision makers in startups, Financial Times Stock Exchange (FTSE) 100 businesses, and public sector organizations in the UK and US to help them develop their data strategy and execute data science projects. Her roles bridge technical data science and business to identify and execute innovative solutions that leverage proprietary and open data sources to deliver value and drive growth. In her former life, Anjali was a quantitative analyst in asset management, and she has a background in computer science, economics, and mathematics.

Comments

Anjali Samani | Data Science Manager
09/26/2019 10:06am EDT

Thank you to everyone who attended, and for all the questions! I will be uploading the slides shortly. And if you have additional questions, please feel free to reach out!
