Predictive maintenance is about anticipating a failure and taking preemptive action. With recent advances in accessible machine learning and cloud storage, there is a tremendous opportunity to use the full range of data coming from factories, buildings, machines, and sensors, not only to monitor the health of equipment but also to predict when it is likely to malfunction or fail. However, as simple as this sounds in principle, the data needed to make a prediction far enough in advance to act on it is hard to come by. The data that is collected is often incomplete or simply too sparse, making it unsuitable for modeling.
In predictive maintenance, the event of interest is an equipment failure. In real scenarios, this is usually a rare event. Ideally, the data should contain hundreds or even thousands of failures. However, unless data collection has been taking place over a long period of time, the data will contain very few of these events or, in the worst case, none at all. Even when failures are present, the ratio of failure to nonfailure data is highly skewed.
Modeling for failure thus often falls under the classic problem of modeling with imbalanced data, where only a small fraction of the data constitutes failure. Standard methods for feature selection and feature construction do not work well on imbalanced data. Moreover, the metrics used to evaluate the model can be misleading. Danielle Dean and Shaheen Gauher discuss the best ways to build and evaluate models, offering examples that reference sample code in regular open source R as well as Microsoft R Server, which allows the computations to be done on big data. Danielle and Shaheen explain why a clear understanding of business requirements and of the tolerance for false negatives and false positives is necessary. For example, for some businesses, failing to predict a malfunction can be extremely detrimental (e.g., aircraft engine failure) or exorbitantly expensive (e.g., production shutdown in a factory), while for others, falsely predicting a failure when there is none leads to a significant loss of time and resources. In the language of statistics, this is what we call misclassification cost. Danielle and Shaheen conclude by illustrating how to deal with imbalanced data through two predictive maintenance case studies.
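To make the point about misleading metrics and misclassification cost concrete, here is a minimal sketch. It is written in plain Python rather than the R used in the session's sample code, and it is not taken from the talk. The data set (1,000 records with 10 failures), the two sets of predictions, and the cost values are all hypothetical, chosen only to show how a model with higher accuracy can still carry a higher cost.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), where label 1 means failure and 0 means healthy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Fraction of all records that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fn):
    """Fraction of actual failures that the model caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def misclassification_cost(fp, fn, cost_fp, cost_fn):
    """Total cost when each false alarm and each missed failure has a fixed price."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical data: 10 failures followed by 990 healthy records.
y_true = [1] * 10 + [0] * 990

# Model A never predicts a failure.
always_healthy = [0] * 1000

# Model B catches 8 of the 10 failures but raises 50 false alarms.
flagging = [1] * 8 + [0] * 2 + [1] * 50 + [0] * 940

# Hypothetical costs: a missed failure is 100 times as expensive as a false alarm.
COST_FP, COST_FN = 1, 100

results = {}
for name, y_pred in [("always_healthy", always_healthy), ("flagging", flagging)]:
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    results[name] = {
        "accuracy": accuracy(tp, fp, fn, tn),
        "recall": recall(tp, fn),
        "cost": misclassification_cost(fp, fn, COST_FP, COST_FN),
    }
    print(name, results[name])
```

Under these assumed numbers, the model that never predicts a failure scores 99% accuracy but misses every failure, while the model that raises false alarms scores lower accuracy at a quarter of the total cost. Reversing the cost ratio, for a business where false alarms are the expensive error, would change which model is preferable.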
Danielle Dean is the technical director of machine learning at iRobot. Previously, she was a principal data science lead at Microsoft. She holds a PhD in quantitative psychology from the University of North Carolina at Chapel Hill.
Shaheen Gauher is a data scientist in information management and machine learning at Microsoft, where she develops end-to-end, data-driven advanced analytics solutions for external customers. She is passionate about data and science and uses machine learning to come up with key insights that generate value for better decisions and better business performance. A climate scientist by training, Shaheen received her PhD in earth, ocean, and atmospheric sciences with a focus on satellite retrievals.