Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

End-to-end data discovery and lineage in a heterogeneous big data environment with Apache Atlas and Avro

Barbara Eckman (Comcast)
4:35pm–5:15pm Wednesday, September 27, 2017
Data engineering, Data Engineering & Architecture
Location: 1A 23/24 Level: Advanced
Secondary topics:  Architecture, Media, Platform

Who is this presentation for?

  • Architects, data analysts, and engineers

Prerequisite knowledge

  • Basic knowledge of big data and streaming data architectures

What you'll learn

  • Understand how Comcast extends Apache Atlas for data sources and transformations outside the Hadoop ecosystem, performs end-to-end data governance using Apache Avro, and combines Avro and Atlas for end-to-end data discovery and lineage

Description

Barbara Eckman offers an overview of Comcast’s streaming data platform, which comprises a variety of ingest, transformation, and storage services. The platform uses Apache Avro schemas to support end-to-end data governance, Apache Atlas for data discovery and lineage, and custom asynchronous messaging libraries to notify Atlas of new data and schema entities and lineage links as they are created. It provides access to an integrated view of a wide variety of high-quality, near-real-time data as well as aggregated and enriched data in long-term storage.

Such integration can enable data scientists to uncover otherwise hidden trends, anomalies, and powerful predictors of business successes and failures. But integrating data across silos in a large enterprise is fraught with peril. There are typically few standards for naming conventions and data representation, and documentation is spotty at best. It’s also well known that analysts spend 70% of their time wrangling data and only 30% doing actual analyses and simulations.

Apache Avro provides a lingua franca for data representation, data integration, and schema evolution. All data published, enriched, aggregated, and stored for community consumption must have an associated, peer-reviewed Avro schema. But encoding data in well-documented, standardized, language-independent schemas is only half the battle. Data must also be discoverable by those who wish to use it, and its lineage through the pipeline must be maintained. While metadata repositories for data discovery and lineage abound, none of them have built-in connectors and listeners for the entire complement of data sources that Comcast and many other large enterprises use to support their business needs.
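As a rough illustration of what a documented schema and one schema-evolution check might look like, the sketch below writes two versions of an Avro record schema as Python dictionaries (Avro schemas are JSON) and tests one of Avro's resolution rules: a field added in a newer schema needs a default value if that schema is to read data written with the older one. The record name, field names, and the helper `added_fields_without_default` are hypothetical, and the check covers only this single rule, not full Avro schema resolution.

```python
import copy
import json

# Hypothetical version 1 of a schema; every name here is illustrative.
SCHEMA_V1 = {
    "type": "record",
    "name": "DeviceEvent",
    "namespace": "example.streaming",
    "doc": "One event emitted by a device.",
    "fields": [
        {"name": "device_id", "type": "string", "doc": "Unique device identifier."},
        {"name": "event_time", "type": "long", "doc": "Event time in epoch milliseconds."},
    ],
}

# Version 2 adds a nullable field WITH a default, so a reader using V2
# can still decode records that were written with V1.
SCHEMA_V2 = copy.deepcopy(SCHEMA_V1)
SCHEMA_V2["fields"].append(
    {"name": "region", "type": ["null", "string"], "default": None,
     "doc": "Optional region code."}
)

# A problematic evolution: the new field has no default.
SCHEMA_V3_BAD = copy.deepcopy(SCHEMA_V1)
SCHEMA_V3_BAD["fields"].append(
    {"name": "firmware", "type": "string", "doc": "Firmware version."}
)


def added_fields_without_default(old_schema, new_schema):
    """Return names of fields that are new in new_schema and lack a default.

    An empty list means this one backward-compatibility rule is satisfied.
    """
    old_names = {field["name"] for field in old_schema["fields"]}
    return [
        field["name"]
        for field in new_schema["fields"]
        if field["name"] not in old_names and "default" not in field
    ]


if __name__ == "__main__":
    print(json.dumps(SCHEMA_V2, indent=2))
    print(added_fields_without_default(SCHEMA_V1, SCHEMA_V2))
    print(added_fields_without_default(SCHEMA_V1, SCHEMA_V3_BAD))
```

A check along these lines could be one small part of a peer-review step for schemas; a real review would also need to cover type promotions, renamed fields, and removed fields.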

Apache Atlas natively supports the data structures of the Hadoop ecosystem, including Apache Sqoop, Hive, Kafka, and Storm, and provides automatic lineage updates on data in these applications. More importantly, Atlas is extensible with respect to new data types and processes, and offers an asynchronous method of notifying Atlas of lineage changes on any data source in the system. This extensibility has allowed Comcast to add or update various entity types (e.g., Avro schemas, Kafka topics, object store pseudodirectories, and time series objects) and lineage types (e.g., storing streaming data in object storage, embellishing and republishing streaming data, performing aggregations and other transformations on data at rest, and evolution of schemas with compatibility flags).
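To make the extensibility point concrete, here is a minimal sketch of how custom entity and lineage types could be described for Atlas. It builds a JSON payload in the shape accepted by the Atlas v2 type-definition REST endpoint (`POST /api/atlas/v2/types/typedefs`), extending the built-in `DataSet` and `Process` types. The type names (`example_avro_schema`, `example_stream_to_object_store`) and their attributes are assumptions made for illustration; the abstract does not describe Comcast's actual type model. The code only constructs the payload and does not contact an Atlas server.

```python
import json


def attribute(name, type_name, optional=True, unique=False):
    """Build one Atlas attribute definition with common settings."""
    return {
        "name": name,
        "typeName": type_name,
        "isOptional": optional,
        "cardinality": "SINGLE",
        "isUnique": unique,
        "isIndexable": True,
    }


def build_typedefs():
    """Return a typedefs payload with one data type and one lineage type."""
    # Hypothetical entity type for an Avro schema. It inherits name,
    # qualifiedName, description, and owner from the built-in DataSet type.
    avro_schema_def = {
        "category": "ENTITY",
        "name": "example_avro_schema",
        "superTypes": ["DataSet"],
        "typeVersion": "1.0",
        "attributeDefs": [
            attribute("namespace", "string"),
            attribute("schema_version", "int"),
            attribute("compatibility", "string"),  # e.g. a compatibility flag
        ],
    }
    # Hypothetical lineage type. It inherits the inputs and outputs
    # attributes from the built-in Process type.
    store_process_def = {
        "category": "ENTITY",
        "name": "example_stream_to_object_store",
        "superTypes": ["Process"],
        "typeVersion": "1.0",
        "attributeDefs": [attribute("service_name", "string")],
    }
    return {"entityDefs": [avro_schema_def, store_process_def]}


if __name__ == "__main__":
    print(json.dumps(build_typedefs(), indent=2))
```

Deriving lineage types from `Process` is what lets the Atlas lineage graph connect input and output entities without extra configuration.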

A set of data storage and data transformation services notifies Atlas of lineage links via custom asynchronous messaging. By making these services available to a variety of users, Comcast ensures that data lineage information is maintained no matter who stores the data. By limiting storage access to these services, the company ensures that its data lake does not contain zombie files of unknown lineage and unknown semantics. Once populated with metadata, Atlas provides self-service data discovery and lineage browsing and querying via full-text search, the Atlas DSL, or the Gremlin graph query language.
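The following sketch shows, under stated assumptions, how a storage service might announce a lineage link asynchronously and how a user might later look the data up. An in-memory `queue.Queue` stands in for the message bus, since the abstract does not describe Comcast's messaging libraries. The message body mirrors the shape of an Atlas v2 entity (`typeName`, `attributes`, and object references by `qualifiedName`), but the envelope, the helper names, the process and output type names, and the qualified names are all hypothetical. The last helper only builds a URL for the Atlas v2 DSL search endpoint and sends no request.

```python
import json
import queue
from urllib.parse import urlencode


def object_ref(type_name, qualified_name):
    """Reference an existing Atlas entity by its unique qualifiedName."""
    return {"typeName": type_name,
            "uniqueAttributes": {"qualifiedName": qualified_name}}


def lineage_message(process_type, qualified_name, inputs, outputs):
    """Serialize a process entity that links input data to output data."""
    entity = {
        "typeName": process_type,
        "attributes": {
            "qualifiedName": qualified_name,
            "name": qualified_name,
            "inputs": inputs,
            "outputs": outputs,
        },
    }
    return json.dumps({"entity": entity})


def publish(bus, message):
    """Hand the message to the bus without waiting for Atlas to process it."""
    bus.put_nowait(message)


def dsl_search_url(base_url, dsl_query):
    """Build a URL for the Atlas v2 DSL search endpoint."""
    return base_url.rstrip("/") + "/api/atlas/v2/search/dsl?" + urlencode(
        {"query": dsl_query})


if __name__ == "__main__":
    bus = queue.Queue()  # stand-in for a real message bus
    message = lineage_message(
        "example_stream_to_object_store",            # hypothetical type
        "store:example-topic->example-bucket/events@example",
        inputs=[object_ref("kafka_topic", "example-topic@example")],
        outputs=[object_ref("example_object_store_dir",
                            "example-bucket/events@example")],
    )
    publish(bus, message)
    print(bus.get_nowait())
    print(dsl_search_url("http://localhost:21000",
                         "kafka_topic where name = 'example-topic'"))
```

Because the service only enqueues the message, storing data is never blocked on Atlas being available; a separate consumer would be responsible for delivering the entity to Atlas.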

Barbara Eckman

Comcast

Barbara Eckman is a principal data architect at Comcast, where she leads data governance for an innovative, division-wide initiative comprising near-real-time ingesting, streaming, transforming, storing, and analyzing big data. Barbara is a technical innovator and strategist with internationally recognized expertise in scientific data architecture and integration. Her experience includes technical leadership positions at a Human Genome Project center, Merck, GlaxoSmithKline, and IBM. She served on the IBM Academy of Technology, an internal peer-elected organization akin to the National Academy of Sciences.