Working with Time Series: Denoising & Imputation Frameworks to Improve Data Density
Who is this presentation for?
Data Scientists, Data Analysts, Business Intelligence
Prerequisite knowledge
- Basic understanding of techniques such as simple and exponentially weighted moving averages, median filters, and linear interpolation
- Basic understanding of metrics such as root mean square error, mean absolute and percentage errors, and relative error
What you'll learn
Increasingly, organisations are looking beyond conventional data provided by data aggregators and vendors in their industry. But alternative data, because of the way it is generated and collected, is typically noisy and often ephemeral. A model’s ability to learn and correctly predict future outcomes is greatly influenced by the underlying data. Clean, complete data can make the difference between deriving correct and incorrect conclusions from analyses conducted on this data. Incomplete data can restrict its application to only a small set of techniques. For alternative data sources, missed data is almost impossible to recover.
To extract meaningful signals from alternative data, it is necessary to apply de-noising and imputation to generate clean and complete time series. There are numerous ways to smooth a noisy data series and impute missing values, each with its own strengths and weaknesses. Smoothing removes noise from the data and allows patterns and trends to be identified more easily. It can, however, make a series appear less volatile than it is and may mask the very patterns a practitioner is seeking to identify. So when should one smooth a series, and when should one not? If it is smoothed, what type of smoothing should be applied?
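As a rough illustration of the trade-off described above, the following minimal sketch applies three of the prerequisite techniques (a simple moving average, an exponentially weighted moving average, and a median filter) to a short made-up series containing one spike. The series, the window sizes, the smoothing factor, and the function names are all assumptions chosen for illustration; this is not the framework presented in the session.

```python
import statistics


def simple_moving_average(values, window):
    """Trailing simple moving average; early points use a shorter window."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def exponential_moving_average(values, alpha):
    """Exponentially weighted moving average, seeded with the first value."""
    out = [values[0]]
    for x in values[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out


def median_filter(values, window):
    """Centred median filter (odd window); edges use a truncated window."""
    half = window // 2
    return [
        statistics.median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


# Made-up series: a slow upward drift with a single spike at index 5.
series = [10.0, 11.0, 10.5, 11.5, 11.0, 30.0, 12.0, 12.5, 12.0, 13.0]

sma = simple_moving_average(series, 3)        # window of 3 is an assumption
ewma = exponential_moving_average(series, 0.3)  # alpha of 0.3 is an assumption
med = median_filter(series, 3)

# Compare how much apparent volatility each method leaves in the series.
for name, s in [("raw", series), ("sma", sma), ("ewma", ewma), ("median", med)]:
    print(name, round(statistics.pstdev(s), 3))
```

Every smoothed series here has a lower standard deviation than the raw one, which is the "appears less volatile than it is" effect. The median filter removes the spike entirely, while the two averages smear it across neighbouring points; whether the spike was noise or signal is exactly the judgement the practitioner has to make.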
Similarly, missing observations in time series can be imputed in many ways. These are covered in detail in both academic and practitioner literature. What caused the missing values in the first place and how the data is going to be used in downstream applications can often inform the most appropriate strategy for imputation. However, when there are multiple options to choose from, how does one objectively choose between different strategies? Furthermore, how many consecutive missing values can be safely imputed?
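One common way to compare imputation strategies objectively is to hide values that are actually known, impute them, and measure the error against the hidden truth for gaps of increasing length. The sketch below does this for forward fill and linear interpolation using root mean square error. The synthetic sine series, the gap position, the tested gap lengths, the tolerance of 0.1, and all function names are hypothetical choices for illustration, and the approach is not necessarily the one presented in the session.

```python
import math


def forward_fill(values):
    """Carry the last observed value forward (leading gaps stay None)."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def linear_interpolate(values):
    """Fill interior None runs linearly between the nearest observed points."""
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)
    return out


def rmse(truth, estimate, indices):
    """Root mean square error over the given indices only."""
    return math.sqrt(
        sum((truth[i] - estimate[i]) ** 2 for i in indices) / len(indices)
    )


def score_gap(truth, start, gap_len, imputer):
    """Hide gap_len consecutive known values, impute them, return the RMSE."""
    hidden = range(start, start + gap_len)
    masked = list(truth)
    for i in hidden:
        masked[i] = None
    return rmse(truth, imputer(masked), hidden)


def max_safe_gap(errors_by_gap, tolerance):
    """Largest tested gap length before the error first exceeds tolerance."""
    safe = 0
    for gap_len in sorted(errors_by_gap):
        if errors_by_gap[gap_len] > tolerance:
            break
        safe = gap_len
    return safe


# Synthetic, fully observed series standing in for real data (an assumption).
truth = [math.sin(t / 3.0) for t in range(30)]
strategies = {"forward_fill": forward_fill, "linear": linear_interpolate}

results = {
    name: {k: score_gap(truth, 10, k, imputer) for k in (1, 2, 4, 8)}
    for name, imputer in strategies.items()
}

for name, by_gap in results.items():
    rounded = {k: round(e, 3) for k, e in by_gap.items()}
    print(name, rounded, "max safe gap:", max_safe_gap(by_gap, 0.1))
```

In practice the gap would be placed at many positions rather than one, and the tolerance would be set by what the downstream application can absorb; the single position used here keeps the sketch short.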
In this session, Anjali Samani – a Data Science Manager at CircleUp – will share two simple frameworks for 1) evaluating a dataset’s candidacy for smoothing; and 2) quantitatively determining the optimal imputation strategy and the number of consecutive missing values that can be imputed without material degradation in signal quality.
Anjali Samani leads the Predictive Modelling team at CircleUp, an innovative fintech company recently honored as one of the World’s Top 10 Most Innovative Companies in Data Science.
Anjali has extensive experience in managing and delivering commercial data science projects, and has worked with senior decision makers in startups, FTSE 100 businesses, and public sector organisations in the UK and US to help them develop their data strategy and execute data science projects. Her roles bridge technical data science and business to identify and execute innovative solutions that leverage proprietary and open data sources to deliver value and drive growth.
In her former life, Anjali was a quantitative analyst in asset management, and she has a background in computer science, economics, and mathematics.