Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Data discovery and lineage: Integrating streaming data in the public cloud with on-prem, classic data stores, and heterogeneous schema types

Barbara Eckman (Comcast)
11:20am–12:00pm Thursday, 09/13/2018
Secondary topics: Data Integration and Data Pipelines; Data preparation, governance and privacy; Media, Marketing, Advertising
Average rating: 4.33 (6 ratings)

Who is this presentation for?

  • Architects, CxOs, developers, and data scientists

Prerequisite knowledge

  • Familiarity with Apache Atlas or another data discovery/lineage system
  • A basic understanding of the value and challenges of data discovery and lineage in the big data world

What you'll learn

  • Learn how Comcast has extended Apache Atlas in the public cloud with on-prem data sources and transformations to provide end-to-end data discovery
  • Understand how to capture end-to-end lineage that traces data’s journey from the original Oracle databases to its final resting place in Amazon object storage
  • Understand how to extend Apache Avro-based data governance to embrace relational and JSON schema types

Description

Comcast’s streaming data platform comprises a variety of ingest, transformation, and storage services in the public cloud. Peer-reviewed Apache Avro schemas support end-to-end data governance. At last year’s Strata New York, speakers from Comcast explained how the company extended Apache Atlas with custom entity and process types for discovery and lineage in the AWS public cloud. Custom lambda functions notify Atlas of the creation of new entities and new lineage links via asynchronous Kafka messaging.
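
As a rough illustration of the notification step described above, the following Python sketch builds a message announcing one new lineage link and hands it to a publisher. It is not Comcast's implementation. The type names (example_stream_process, example_kinesis_stream, example_s3_object), the event fields, and the handle_object_created function are hypothetical. The envelope follows the general shape of Atlas hook notifications (an ENTITY_CREATE_V2 message on the ATLAS_HOOK topic), but the exact fields should be checked against the Atlas version in use. The sketch also assumes the referenced input and output entities already exist in Atlas.

```python
import json

# Atlas's default inbound notification topic.
ATLAS_HOOK_TOPIC = "ATLAS_HOOK"


def object_ref(type_name, qualified_name):
    """Reference an existing Atlas entity by its unique qualifiedName."""
    return {
        "typeName": type_name,
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }


def build_lineage_message(user, process_qn, input_ref, output_ref):
    """Build a hook-style message creating one process entity.

    In Atlas, lineage is expressed by a process entity whose `inputs`
    and `outputs` attributes point at dataset entities.
    """
    process = {
        "typeName": "example_stream_process",  # hypothetical custom process type
        "attributes": {
            "qualifiedName": process_qn,
            "name": process_qn,
            "inputs": [input_ref],
            "outputs": [output_ref],
        },
    }
    return {
        "version": {"version": "1.0.0"},  # assumed envelope version
        "message": {
            "type": "ENTITY_CREATE_V2",
            "user": user,
            "entities": {"entities": [process]},
        },
    }


def handle_object_created(event, publish):
    """Hypothetical handler: turn a storage event into an Atlas notification.

    `publish` is any callable taking (topic, payload_bytes), so the Kafka
    client is supplied by the caller rather than assumed here.
    """
    source = object_ref("example_kinesis_stream", event["source_stream"])
    target = object_ref("example_s3_object", event["object_uri"])
    message = build_lineage_message("lineage-lambda", event["job_name"], source, target)
    publish(ATLAS_HOOK_TOPIC, json.dumps(message).encode("utf-8"))
    return message
```

Passing the publisher in as a callable keeps the message-building logic testable without a running broker; in a deployed function it could wrap a Kafka producer's send call.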

Comcast recently integrated on-prem data sources, including Hadoop-based traditional data warehouses and RDBMSs, which required its data governance strategy to include relational and JSON schemas in addition to Apache Avro. Barbara Eckman details how Comcast met that challenge, offering an overview of the federated architecture, in which Atlas provides SQL-like, free-text, and graph search across select metadata from a wide variety of on-prem and public cloud data. Lightweight custom connectors and bridges identify metadata and lineage changes in the underlying sources and publish them to Atlas via the asynchronous API. A portal layer provides Atlas query access and a federation of UIs. Once data of interest is identified via Atlas queries, interfaces specific to the underlying sources may be used for special-purpose metadata mining. Comcast provides end-to-end lineage for both batch and streaming processes, identifying, for example, the on-prem relational and Hive precursors of objects in the public cloud data lake.
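
The idea of identifying the on-prem precursors of a cloud object can be pictured as a walk backward over lineage links. The Python sketch below is only an illustration of that idea, not Atlas's query API: the LINEAGE mapping, the dataset URIs, and the upstream_sources function are all invented for the example.

```python
from collections import deque

# Hypothetical lineage links: each dataset maps to the datasets it was
# derived from. All names here are made up for illustration.
LINEAGE = {
    "s3://lake/accounts.parquet": ["kinesis://accounts-stream"],
    "kinesis://accounts-stream": ["hive://warehouse.accounts"],
    "hive://warehouse.accounts": ["oracle://crm.ACCOUNTS"],
}


def upstream_sources(lineage, start):
    """Return every precursor of `start`, nearest first (breadth-first)."""
    seen = {start}  # guards against cycles in the lineage graph
    ordered = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in lineage.get(node, []):
            if parent not in seen:
                seen.add(parent)
                ordered.append(parent)
                queue.append(parent)
    return ordered


def on_prem_precursors(lineage, start, prefixes=("oracle://", "hive://")):
    """Keep only precursors whose URI scheme marks them as on-prem (assumed convention)."""
    return [s for s in upstream_sources(lineage, start) if s.startswith(prefixes)]
```

Calling on_prem_precursors(LINEAGE, "s3://lake/accounts.parquet") walks from the data lake object back through the stream to the Hive table and the Oracle table behind it.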

Barbara outlines how Comcast extends its data governance practices to include not only Avro but also relational and JSON schemas. A data maturity model represents the schema type, the richness of its documentation, and the level of operational support that the data source has. A heterogeneous schema registry still provides Avro schemas for SerDe but extends such features as schema evolution to other schema types. Comcast then captures semantic mappings of heterogeneous schemas to a set of common, company-wide data models.
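
One way to picture schema evolution applied across schema types is a registry that stores every schema as a list of typed fields and runs the same compatibility check regardless of whether the schema came from Avro, a relational table, or JSON. The Python sketch below is a hypothetical illustration under that assumption; the Field, Schema, and SchemaRegistry classes are invented here, and the rule is a simplified version of Avro-style backward compatibility (it ignores type promotion, nested records, and unions).

```python
from dataclasses import dataclass
from typing import Any, Dict, List

_NO_DEFAULT = object()  # sentinel: distinguishes "no default" from a default of None


@dataclass(frozen=True)
class Field:
    name: str
    type: str
    default: Any = _NO_DEFAULT

    @property
    def has_default(self) -> bool:
        return self.default is not _NO_DEFAULT


@dataclass
class Schema:
    kind: str  # e.g. "avro", "relational", "json"
    fields: List[Field]


def compatibility_problems(old: Schema, new: Schema) -> List[str]:
    """List reasons a reader using `new` could not read data written with `old`."""
    problems = []
    if old.kind != new.kind:
        problems.append(f"schema kind changed from {old.kind} to {new.kind}")
    old_fields = {f.name: f for f in old.fields}
    for f in new.fields:
        previous = old_fields.get(f.name)
        if previous is None:
            # A field absent from old data needs a default to fall back on.
            if not f.has_default:
                problems.append(f"new field '{f.name}' has no default")
        elif previous.type != f.type:
            problems.append(
                f"field '{f.name}' changed type from {previous.type} to {f.type}"
            )
    return problems


class SchemaRegistry:
    """In-memory registry keeping an ordered version history per subject."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Schema]] = {}

    def register(self, subject: str, schema: Schema) -> int:
        """Store a new version and return its 1-based version number."""
        history = self._versions.setdefault(subject, [])
        if history:
            problems = compatibility_problems(history[-1], schema)
            if problems:
                raise ValueError("; ".join(problems))
        history.append(schema)
        return len(history)
```

Under this rule, adding a column with a default to a relational schema registers as a new version, while changing an existing column's type is rejected.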

Barbara Eckman

Comcast

Barbara Eckman is a principal data architect at Comcast, where she leads data governance for an innovative, division-wide initiative comprising near-real-time ingestion, streaming, transformation, storage, and analysis of big data. Barbara is a technical innovator and strategist with internationally recognized expertise in scientific data architecture and integration. Her experience includes technical leadership positions at a Human Genome Project center, Merck, GlaxoSmithKline, and IBM. She served on the IBM Academy of Technology, an internal peer-elected organization akin to the National Academy of Sciences.