Flying with Elephants: How CAASD Uses Hadoop to Mine Aviation Data

DDBD Ballroom CD

The MITRE Corporation’s Center for Advanced Aviation System Development (CAASD) supports the Federal Aviation Administration (FAA) in advancing the safety, security, and efficiency of civil aviation. To support the development of new operational concepts, CAASD analyzes a diverse set of aviation data including surveillance, weather, terrain, and infrastructure information. This analysis traditionally consisted of sub-selecting from each data set on a case-by-case basis. Recently, CAASD has begun using Hadoop to consolidate our data research, which has allowed us to widen the scope of our analyses.

Historically, each analysis has required extensive preprocessing because aviation data is an unstructured mixture. The data arrives in a variety of formats with different organizational structures. Specifically, surveillance data sources are keyed to each radar hit; weather data is blocked by hour; terrain data is provided as a geo-rasterized map; and infrastructure data is provided as vectorized position data. Our goal is to unify all of this information within the context of a single flight.

One of the major challenges in reaching this goal is the absence of any unique identifier for a flight to serve as a key. The radar data is a hybrid collection of position reports from a range of surveillance systems across many radar facilities, and each time/position report carries a set of textual, non-unique flight-identifying information. We therefore had to develop novel processes to join these reports into flights based on statistical inference over textual, temporal, and geospatial data. The large volume of data being processed and the computational complexity of the fusing algorithms led us to the Hadoop ecosystem of tools for implementing these fusing processes.
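To make the joining problem concrete, here is a minimal Python sketch of grouping position reports into flights when the only identifier is a non-unique callsign. It is not CAASD's method: a simple threshold heuristic (maximum time gap and maximum plausible ground speed) stands in for the statistical inference described above. The report layout `(callsign, t_seconds, lat, lon)`, the function names, and both thresholds are assumptions made for illustration.

```python
import math
from collections import defaultdict

# Assumed thresholds, chosen only for illustration.
MAX_GAP_S = 600        # longest silence allowed inside one flight, in seconds
MAX_SPEED_KTS = 700    # fastest plausible ground speed, in knots


def haversine_nm(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in nautical miles."""
    earth_radius_nm = 3440.065
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_nm * math.asin(math.sqrt(a))


def stitch_flights(reports):
    """Group (callsign, t_seconds, lat, lon) reports into candidate flights.

    Reports sharing a callsign are sorted by time; a new flight starts
    whenever the time gap or the implied speed is implausible.
    """
    by_callsign = defaultdict(list)
    for rep in reports:
        by_callsign[rep[0]].append(rep)

    tracks = []
    for reps in by_callsign.values():
        reps.sort(key=lambda r: r[1])
        current = [reps[0]]
        for prev, cur in zip(reps, reps[1:]):
            dt = cur[1] - prev[1]
            dist = haversine_nm(prev[2], prev[3], cur[2], cur[3])
            # Allow at least one second of travel so simultaneous reports
            # from two radars are not split by tiny position differences.
            reachable = MAX_SPEED_KTS * max(dt, 1) / 3600.0
            if dt <= MAX_GAP_S and dist <= reachable:
                current.append(cur)
            else:
                tracks.append(current)
                current = [cur]
        tracks.append(current)
    return tracks
```

In this sketch the same callsign seen a day later becomes a separate flight, which is the basic reason a callsign alone cannot serve as a key.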

The processing was developed by a diverse team using a variety of languages and tools, each chosen to best support the development needs of a specific sub-component. We were able to leverage many of the tools in the Hadoop stack for the task. For example, in the fusing processes, we used the Java MapReduce API, Apache Pig scripts, and Python programs executed via Hadoop Streaming. The individual sub-processes are all tied together through Oozie workflows, and Oozie’s scheduling capabilities have greatly simplified operations.
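For readers unfamiliar with Hadoop Streaming, the sketch below shows the general shape of a Python program run that way: the mapper reads lines on standard input and writes tab-separated key/value pairs, and the reducer receives those pairs sorted by key. The input layout (`callsign,epoch_seconds,lat,lon`), the `callsign|hour` key, and the function names are hypothetical and are not taken from CAASD's pipeline; the job simply counts reports per callsign per hour.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Emit 'callsign|hour<TAB>1' for each well-formed input line."""
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 4:
            continue  # skip malformed records
        callsign, t = parts[0], parts[1]
        try:
            hour = int(float(t)) // 3600  # hour block since the epoch
        except ValueError:
            continue
        yield f"{callsign}|{hour}\t1"


def reduce_lines(sorted_lines):
    """Sum counts per key; input must already be sorted by key."""
    pairs = (
        line.rstrip("\n").split("\t", 1)
        for line in sorted_lines
        if "\t" in line
    )
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Run as 'script.py map' or 'script.py reduce' under Hadoop Streaming.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

Keeping the map and reduce logic in plain generator functions means the same code can be exercised locally on a list of lines before it is submitted to the cluster.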

One area where we had integration challenges was leveraging the extensive pre-existing MATLAB® codebase within CAASD. We therefore developed an internal tool suite to run MATLAB programs in MapReduce jobs. The tools exposed an API similar to the Apache Avro tethering protocol and provided helper utilities to deploy compiled MATLAB applications on the cluster. This enabled analysts and developers to continue their local testing and development in MATLAB, while easily porting their algorithms into the final Hadoop-based workflows.

While transferring these preprocessing steps to Hadoop helps to unify our analysis, this transition has also enabled us to dramatically enhance our analytical capabilities. We are now positioned to mine data on a much larger scale and study global trends over a larger historical record. This provides substantially more confidence in the data, which in turn enables more informed decision making to advance civil aviation.

Examples of the types of analyses being performed in Hadoop include:

  • Simulating terrain avoidance and aircraft conflict detection alerting systems to look for safety hot-spots
  • Calculating aircraft procedure conformance metrics to study system efficiency
  • Modeling traffic flows and sector densities
  • Detecting and classifying certain types of flight operations, such as missed approaches or rejected takeoffs

In this talk, we will describe how CAASD is using Hadoop to perform these multi-dataset fusion tasks, and how we are using these fused datasets to study ways in which we can help the FAA improve the safety and efficiency of civil aviation.

Marcio Silva


Marcio Silva is a Lead Data Mining Engineer at the MITRE Corporation’s Center for Advanced Aviation System Development (CAASD). Marcio holds a B.S. in Computer Science from George Mason University, and his areas of focus include data intensive computing, visualization, and web application development.

Prior to joining MITRE, Marcio held positions at Blackboard, Celera Genomics, and Applied Biosystems, where he helped develop a wide range of data-heavy software products, including bioinformatic client applications, learning management systems, and life-science research portals.

