Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Machine learning to tackle industrial data fusion

Alexandra Gunderson (Arundo Analytics)
1:50pm–2:30pm Wednesday, March 7, 2018
Secondary topics:  Graphs and Time-series

Who is this presentation for?

  • Data scientists, engineering leaders, and architects

Prerequisite knowledge

  • Familiarity with machine learning
  • Experience working in heavy industry (useful but not required)

What you'll learn

  • Understand best practices for machine learning unique to heavy industry


Asset-heavy industries, such as oil and gas and maritime, generate tremendous volumes of data in the form of sensor readings, failure logs, and maintenance records. However, because the data infrastructure is siloed, industrial leaders struggle to use all of this data and are therefore unable to capitalize on the insights it contains.

Data may be stored in a number of different places and formats (historians, databases, local files on laptops, and even systems onboard the rig or ship), depending on what it has traditionally been used for. This complicates machine learning at scale and forces each analysis to become an independent, case-specific exercise. For example, to develop a predictive model that identifies leakage on a compressor, an engineer would need to sort through process diagrams and sensor lists to find all sensors relevant to that compressor (and to the upstream and downstream equipment). They would then need to review thousands of text entries to find when leakages occurred on this compressor and when each was fixed. On a single oil rig, there can be tens of thousands of sensors streaming data, with failures and work orders being logged regularly, so this manual selection process is tedious, error-prone, and does not scale.

Alexandra Gunderson details the methodology behind an industry-tested approach that incorporates machine learning to structure and link data from different sources. The working pipeline shortens the time needed to go from independent data sources to one coherent dataset, using a combination of unsupervised and semi-supervised methods. Alexandra explains how this pipeline has been used in real-world applications to map tens of thousands of sensors onto an equipment hierarchy, map free-text descriptions of events on a ship or oil rig onto that hierarchy, and label these free-text events according to a specific failure mode or action taken. Alexandra also explores the insights that can be gained once the different data sources have been joined.

Topics include:

  • PDF mining: Mining process and instrumentation diagrams to find how equipment interrelates and build meaningful information models (e.g., this heat exchanger is upstream of the compressor and should thus be considered when modeling compressor failures)
  • Mapping: Using text mining, clustering, and topic mining to automatically structure equipment, sensors, and events to a hierarchy
  • Event labeling: Using text mining, clustering, and topic mining to automatically pull keywords from event data and build labeled datasets that can be combined with the sensor data for supervised learning
  • Label prediction: Using previous labeling and mapping data to limit the need for human intervention, so that the process can run with limited oversight
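
The mapping topic above can be illustrated with a minimal sketch. It is not the speaker's pipeline: it swaps in plain character n-gram cosine similarity as a simple stand-in for the text mining and clustering steps, and the equipment names, sensor description, and work-order text are all hypothetical, invented purely for illustration.

```python
import math
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Return a bag of lowercase character n-grams for a piece of text."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    padded = f" {cleaned} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_to_hierarchy(description, hierarchy_nodes):
    """Return the (node, score) pair whose name best matches the description."""
    query = char_ngrams(description)
    scored = [(node, cosine(query, char_ngrams(node))) for node in hierarchy_nodes]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical equipment hierarchy and records (assumptions, not real data).
HIERARCHY = [
    "gas compressor A",
    "heat exchanger upstream of compressor A",
    "seawater lift pump",
]
RECORDS = [
    "COMP-A discharge pressure sensor",
    "Seal leak found on sea water lift pump, replaced gasket",
]

for record in RECORDS:
    node, score = map_to_hierarchy(record, HIERARCHY)
    print(f"{record!r} -> {node!r} (similarity {score:.2f})")
```

Character n-grams are used here because sensor tags and work-order text tend to abbreviate and misspell equipment names, which defeats exact word matching. A real deployment would likely need a score threshold and human review of low-confidence matches.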

Alexandra Gunderson

Arundo Analytics

Alexandra Gunderson is a data scientist at Arundo Analytics. Her background is in mechanical engineering and applied numerical methods.