Engineering the Future of Software
Feb 3–4, 2019: Training
Feb 4–6, 2019: Tutorials & Conference
New York, NY

Data governance and discovery in an end-to-end, heterogeneous data infrastructure

Barbara Eckman (Comcast)
4:50pm–5:40pm Wednesday, February 6, 2019
Data
Location: Trianon Ballroom
Secondary topics: Best Practice, Case Study

Who is this presentation for?

  • Architects and managers

Level

Intermediate

Prerequisite knowledge

  • Familiarity with cloud computing and the big data ecosystem

What you'll learn

  • Learn why data governance and discovery are important for the most effective use of data across silos for predictions of business success and failure
  • Discover where data governance and discovery fit in an enterprise architecture
  • Explore challenges for data governance and discovery in a heterogeneous environment (multiple Hadoop providers, on-premises, and in the public cloud)
  • Learn considerations for choosing data governance and discovery software and best practices for data governance and discovery

Description

Comcast is evolving a cloud-based infrastructure for ingesting, enriching, storing, and accessing data to support classic analytic use cases, real-time operational analysis, and modern machine learning techniques. What all these use cases have in common is the need to find high-quality data of interest, understand its semantics, and trace its route from streaming ingestion to final storage in a data lake.

Data governance comprises a wide variety of concerns:

  • Ensuring metadata is well documented and uses common structures wherever possible
  • Representing data quality and maturity levels
  • Identifying and protecting PII
  • Capturing data lineage at every stage of the pipeline, so that consumers of data lake objects know where the data originated, who produced it, and how it has been enriched, aggregated, or manipulated on its journey through the pipeline, and so that data producers know what happens to their data after they publish it (who touches it and how they change it)
  • Providing the ability to search metadata across a wide variety of sources to find data that meets specific use cases (for example, EDWs, RDBMSs, Kafka topics, on-premises Hadoop objects, S3 data lake objects in the public cloud, and even ML models and feature sets)
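
The lineage concern above (where data originated, who produced it, and how it was changed) can be illustrated with a minimal Python sketch. This is not Comcast's implementation or the data model of any particular governance product; the `LineageNode` class, its fields, and the dataset names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LineageNode:
    """One dataset (topic, table, or file) in a pipeline. All fields are hypothetical."""
    name: str
    producer: str                       # who produced this dataset
    transformation: str = "ingested"    # how it was derived from its upstream inputs
    upstream: List["LineageNode"] = field(default_factory=list)


def trace_origins(node: LineageNode) -> List[Tuple[str, str, str]]:
    """Return (name, producer, transformation) tuples, ordered from source to `node`."""
    path: List[Tuple[str, str, str]] = []
    for parent in node.upstream:
        path.extend(trace_origins(parent))
    path.append((node.name, node.producer, node.transformation))
    return path


# Hypothetical three-stage pipeline: raw topic -> enriched topic -> data lake object.
raw = LineageNode("kafka://events.raw", producer="device-team")
enriched = LineageNode(
    "kafka://events.enriched",
    producer="enrichment-job",
    transformation="joined with account data",
    upstream=[raw],
)
lake = LineageNode(
    "s3://data-lake/events/",
    producer="sink-job",
    transformation="aggregated hourly",
    upstream=[enriched],
)

for step in trace_origins(lake):
    print(step)
```

A consumer of the data lake object walks the upstream links to see its origin. A producer's view (what happens to the data after publishing) would need the reverse, downstream links, which this sketch omits.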

Capturing this metadata, data quality metrics, and lineage requires data governance tooling to interleave with the enterprise architecture at many touchpoints. Metadata is loaded in batch from on-premises data stores like EDWs or RDBMSs. In on-premises Hadoop environments and public cloud S3 data lakes, metadata and lineage capture is event driven, triggered as individual buckets, files, and tables are created or updated. Additional lineage is captured when transformation jobs or modeling pipelines are defined and registered in the self-service portal. Avro schemas for Kafka topics are captured once they have been reviewed and approved by our governance process. Serverless technologies like AWS Lambda functions enable us to capture metadata and lineage in an event-driven manner with a small footprint.
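
As a rough illustration of event-driven capture with a serverless function, the following Python sketch shows a Lambda-style handler that reads an S3 event notification and builds one metadata entry per object. It assumes the standard S3 notification layout (`Records[].s3.bucket.name`, `Records[].s3.object.key`). The `publish_to_catalog` function is a hypothetical placeholder for whatever governance catalog is in use, and the entry fields are illustrative, not the schema described in the talk.

```python
from urllib.parse import unquote_plus


def extract_metadata(record):
    """Build a catalog entry from one S3 event notification record."""
    s3 = record["s3"]
    return {
        "bucket": s3["bucket"]["name"],
        # Object keys arrive URL-encoded in S3 notifications.
        "key": unquote_plus(s3["object"]["key"]),
        # Delete events carry no size, so fall back to None.
        "size_bytes": s3["object"].get("size"),
        "event": record["eventName"],
        "event_time": record["eventTime"],
    }


def publish_to_catalog(entry):
    """Hypothetical stand-in for a call to a metadata catalog; here it only prints."""
    print(entry)


def handler(event, context):
    """Lambda entry point: capture metadata for every object in the event."""
    entries = [extract_metadata(r) for r in event.get("Records", [])]
    for entry in entries:
        publish_to_catalog(entry)
    return {"captured": len(entries)}


# Local usage with a hand-written sample event (bucket and key are made up).
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "eventTime": "2019-02-06T16:50:00.000Z",
            "s3": {
                "bucket": {"name": "example-data-lake"},
                "object": {"key": "events/2019/part+0.avro", "size": 1024},
            },
        }
    ]
}
result = handler(sample_event, None)
```

Because the function runs only when an object is created or updated, there is no long-lived service to operate, which is the small footprint the paragraph refers to.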

Barbara Eckman discusses factors to consider in choosing data governance and metadata management software, shares the choices made at Comcast, and outlines some of the challenges faced along the way.

Barbara Eckman

Comcast

Barbara Eckman is a principal data architect at Comcast, where she leads data governance for an innovative, division-wide initiative comprising near-real-time ingesting, streaming, transforming, storing, and analyzing big data. Barbara is a technical innovator and strategist with internationally recognized expertise in scientific data architecture and integration. Her experience includes technical leadership positions at a Human Genome Project center, Merck, GlaxoSmithKline, and IBM. She served on the IBM Academy of Technology, an internal peer-elected organization akin to the National Academy of Sciences.