Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Data integration and governance for big data with Apache Avro; or, How to solve the GIGO problem

Barbara Eckman (Comcast)
10:00am–10:30am Tuesday, March 14, 2017
DCS, Strata Business Summit
Location: LL20 A
Level: Intermediate
Average rating: 3.80 (5 ratings)

Big data solutions like NoSQL databases and data lakes make it easy for anyone to contribute data to the enterprise data store. Integrating previously siloed data streams with other sources across the enterprise makes it possible to uncover otherwise hidden trends, anomalies, and powerful predictors of business successes and failures.

But integrating data across silos in a large enterprise is fraught with peril. There are typically few standards on naming conventions and data representation, and documentation is spotty at best. For example, each silo may have its own identifiers for central concepts like customer accounts and devices, but they all may simply be named “accountId” or “deviceId.” The old rule of thumb often applies: 70% of an analyst’s time goes into gathering, understanding, cleaning, and integrating the data, while only 30% goes toward the actual analyses and simulations. Joining data by simply matching on attribute names may yield “frankendata,” a monstrous data chimera whose parts do not belong together. Analysis of frankendata will inevitably lead to misleading results and may even cause decreases rather than increases in performance and customer satisfaction. In other words: garbage in, garbage out (GIGO).
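
As a rough illustration of how a name-based join can produce frankendata, the Python sketch below joins two invented silos on a shared attribute name. The silo names, records, and the naive_join helper are all hypothetical and are not taken from Comcast's systems.

```python
# Hypothetical records from two silos; every value here is invented.
billing = [
    {"accountId": "1001", "plan": "basic"},
    {"accountId": "1002", "plan": "premium"},
]

# The support silo also names its key "accountId", but in this made-up
# scenario it holds a ticketing-system identifier, not a billing account.
support = [
    {"accountId": "1001", "open_tickets": 7},
    {"accountId": "2002", "open_tickets": 0},
]


def naive_join(left, right, key):
    """Inner join on a shared attribute name, with no check on its meaning."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


joined = naive_join(billing, support, "accountId")
print(joined)
# [{'accountId': '1001', 'plan': 'basic', 'open_tickets': 7}]
```

The join succeeds without any error, yet the combined row pairs a billing customer with an unrelated support record. Nothing in the data itself signals the mismatch, which is why the semantics of each attribute need to be documented somewhere.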

Comcast is developing Aeolus, a new internal system that acts as a single point of ingest for real-time data, and a set of data transformation and storage solutions that are designed to avoid these data integration problems. The secret? Comcast uses Apache Avro schemas for data governance across the entire architecture.

Avro schemas document and enforce the types and structures of data, and also document the meaning of each attribute. A library of core subschemas enables reuse of standard naming conventions and formats for commonly referenced data such as device and account. When core subschemas are used and the data producer refers to “deviceId,” the semantics of that field are well known and documented.
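
To make the idea of a documented, reusable core subschema concrete, here is a minimal Python sketch that builds Avro schema JSON using only the standard library. The namespaces, record names, field descriptions, and the event_schema and undocumented_fields helpers are assumptions made for illustration; they are not Comcast's actual schemas or tooling.

```python
import json

# Hypothetical core subschema for a device. Avro's "doc" attribute carries
# the human-readable meaning of the record and of each field.
CORE_DEVICE = {
    "type": "record",
    "name": "Device",
    "namespace": "example.core",
    "doc": "Core subschema for a device (illustrative only).",
    "fields": [
        {"name": "deviceId", "type": "string",
         "doc": "Identifier issued by a hypothetical central device registry."},
        {"name": "deviceType", "type": ["null", "string"], "default": None,
         "doc": "Optional device category, for example a set-top box."},
    ],
}


def event_schema(name, doc, extra_fields):
    """Build a producer schema that embeds the core Device subschema."""
    device_field = {"name": "device", "type": CORE_DEVICE,
                    "doc": "Device that emitted the event."}
    return {"type": "record", "name": name, "namespace": "example.events",
            "doc": doc, "fields": [device_field] + extra_fields}


def undocumented_fields(schema, path=""):
    """Return dotted paths of record fields that lack a 'doc' attribute."""
    missing = []
    for field in schema.get("fields", []):
        field_path = path + field["name"]
        if not field.get("doc"):
            missing.append(field_path)
        nested = field["type"]
        if isinstance(nested, dict) and nested.get("type") == "record":
            missing.extend(undocumented_fields(nested, field_path + "."))
    return missing


reboot_schema = event_schema(
    "RebootEvent",
    "Emitted when a device reboots (hypothetical event).",
    [{"name": "timestamp", "type": "long",
      "doc": "Event time in milliseconds since the Unix epoch."}],
)

schema_json = json.dumps(reboot_schema, indent=2)
print(schema_json)
print(undocumented_fields(reboot_schema))
```

Because every producer schema embeds the same Device record, "deviceId" carries one documented meaning wherever it appears. A check like undocumented_fields could be run before a schema is accepted, so that fields without a description are flagged early.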

Avro schemas follow the data from the upstream to the downstream end of the architecture, from initial ingest via Apache Kafka through intermediate processing and enrichment in flight until the data is finally at rest in one or more data storage options (big data lake, key-value store, time series database, etc.). Even batch data ingested through ETL must be stored in Avro format. This enables Comcast to understand and integrate data at any point in its journey.

Barbara Eckman explains how Comcast is using Apache Avro for enterprise data governance, the challenges faced, and methods to address these challenges.

Barbara Eckman


Barbara Eckman is a principal data architect at Comcast, where she leads data governance for an innovative, division-wide initiative comprising near-real-time ingesting, streaming, transforming, storing, and analyzing big data. Barbara is a technical innovator and strategist with internationally recognized expertise in scientific data architecture and integration. Her experience includes technical leadership positions at a Human Genome Project center, Merck, GlaxoSmithKline, and IBM. She served on the IBM Academy of Technology, an internal peer-elected organization akin to the National Academy of Sciences.