Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Evaluating models for a needle in a haystack: Applications in predictive maintenance

Danielle Dean (Microsoft), Shaheen Gauher (Microsoft)
2:55pm–3:35pm Thursday, 09/29/2016
Data science & advanced analytics
Location: Hall 1C Level: Intermediate
Average rating: 4.20 (5 ratings)

Prerequisite knowledge

  • A basic understanding of data science

What you'll learn

  • Understand how to build and evaluate models for rare events, using an example application of predictive maintenance

Description

Predictive maintenance is about anticipating a failure and taking preemptive action. With recent advances in accessible machine learning and cloud storage, there is tremendous opportunity to use the entire gamut of data coming from factories, buildings, machines, and sensors not only to monitor the health of equipment but also to predict when it is likely to malfunction or fail. However, as simple as this sounds in principle, the data required to actually make a prediction in advance and in a timely manner is hard to come by. The data that is collected is often incomplete or simply insufficient, making it unsuitable for modeling.

In the realm of predictive maintenance, the event of interest is an equipment failure. In real scenarios, this is usually a rare event. Ideally, the data would contain hundreds or even thousands of failures, but unless data collection has been taking place over a long period of time, it will have very few of these events or, in the worst case, none at all. Even when failures are present, the distribution is highly skewed: failure records make up only a tiny fraction of the data.
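To make the skew concrete, here is a minimal sketch in open source R; the counts are illustrative and not from the talk, representing roughly a year of hourly readings from one machine with a handful of failure events:

    # Illustrative label column: ~1 year of hourly records, 6 failures
    labels <- factor(c(rep("non-failure", 8754), rep("failure", 6)),
                     levels = c("non-failure", "failure"))
    table(labels)               # raw counts per class
    prop.table(table(labels))   # failures are well under 0.1% of records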

Modeling for failure thus often falls under the classic problem of modeling with imbalanced data, where only a fraction of the data constitutes failures. Standard methods for feature selection and feature construction do not work well for imbalanced data. Moreover, the metrics used to evaluate the model can be misleading. Danielle Dean and Shaheen Gauher discuss the best ways to build and evaluate models, offering examples that reference sample code in regular open source R as well as Microsoft R Server, which allows the computations to be done on big data. Danielle and Shaheen explain why a clear understanding of business requirements and tolerance for false negatives and false positives is necessary. For some businesses, failing to predict a malfunction can be extremely detrimental (e.g., aircraft engine failure) or exorbitantly expensive (e.g., a production shutdown in a factory), while for others falsely predicting a failure when there is none leads to a significant loss of time and resources. In the language of statistics, this is the misclassification cost. Danielle and Shaheen conclude by illustrating how to deal with imbalanced data through two predictive maintenance case studies.
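As an illustration of why accuracy alone misleads on such data, the following sketch in open source R scores a naive model that never predicts a failure; the simulated labels and cost values are assumptions for illustration, not figures from the talk:

    # Simulated data: 1% of records are true failures
    set.seed(42)
    n <- 10000
    actual <- factor(rbinom(n, 1, 0.01),
                     levels = c(0, 1), labels = c("non-failure", "failure"))

    # A naive model that never predicts a failure
    naive_pred <- factor(rep("non-failure", n), levels = levels(actual))

    cm <- table(Predicted = naive_pred, Actual = actual)
    accuracy <- sum(diag(cm)) / sum(cm)
    recall   <- cm["failure", "failure"] / sum(cm[, "failure"])
    accuracy   # ~0.99, yet...
    recall     # 0: every true failure is missed

    # Weighting errors by assumed, asymmetric misclassification costs
    # exposes how expensive the missed failures really are
    cost_fn <- 500  # assumed cost of an undetected failure
    cost_fp <- 10   # assumed cost of a false alarm
    fn <- cm["non-failure", "failure"]   # failures predicted as non-failure
    fp <- cm["failure", "non-failure"]   # non-failures predicted as failure
    total_cost <- cost_fn * fn + cost_fp * fp
    total_cost

The specific cost numbers would come from the business context (e.g., the price of an engine failure versus a false alarm); the point is that the evaluation metric should reflect that asymmetry rather than treating all errors equally.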

Danielle Dean

Microsoft

Danielle Dean is a principal data scientist lead in AzureCAT within the Cloud AI Platform Division at Microsoft, where she leads an international team of data scientists and engineers to build predictive analytics and machine learning solutions with external companies utilizing Microsoft’s Cloud AI platform. Previously, she was a data scientist at Nokia, where she produced business value and insights from big data through data mining and statistical modeling on data-driven projects that impacted a range of businesses, products, and initiatives. Danielle holds a PhD in quantitative psychology from the University of North Carolina at Chapel Hill, where she studied the application of multilevel event history models to understand the timing and processes leading to events between dyads within social networks.

Shaheen Gauher

Microsoft

Shaheen Gauher is a data scientist in information management and machine learning at Microsoft, where she develops end-to-end, data-driven advanced analytics solutions for external customers. She is passionate about data and science and uses machine learning to generate insights that drive better decisions and better business performance. A climate scientist by training, Shaheen received her PhD in earth, ocean, and atmospheric sciences with a focus on satellite retrievals.