Comcast is evolving a cloud-based infrastructure for ingesting, enriching, storing, and accessing data to support classic analytic use cases, real-time operational analysis, and modern machine learning techniques. What all these use cases have in common is the need to find high-quality data of interest, understand its semantics, and trace its route from streaming ingestion to final storage in a data lake.
Data governance comprises a wide variety of concerns: ensuring that metadata is well documented and uses common structures wherever possible; representing data quality and maturity levels; identifying and protecting PII; and capturing data lineage at every stage of the pipeline. Lineage lets consumers of data lake objects know where the data originated, who produced it, and how it has been enriched, aggregated, or manipulated on its journey through the pipeline, and it lets data producers know what happens to their data after they publish it (who touches it and how they change it). Governance also means providing the ability to search a wide variety of metadata to find data that meets specific use cases, across sources such as EDWs, RDBMSs, Kafka topics, on-premises Hadoop objects, S3 data lake objects in the public cloud, and even ML models and feature sets.
Capturing this metadata, data quality metrics, and lineage requires data governance tooling to integrate with the enterprise architecture at many touchpoints. Metadata is loaded in batch from on-premises data stores like EDWs or RDBMSs. In on-premises Hadoop environments and public cloud S3 data lakes, metadata and lineage capture is event driven, triggered as individual buckets, files, and tables are created or updated. Additional lineage is captured when transformation jobs or modeling pipelines are defined and registered in the self-service portal. Avro schemas for Kafka topics are captured once they have been reviewed and approved by our governance process. Serverless technologies like AWS Lambda functions enable us to capture metadata and lineage in an event-driven manner with a small footprint.
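As a rough illustration of the event-driven capture described above, the following is a minimal sketch of a Python AWS Lambda handler that reacts to S3 event notifications and builds one metadata entry per created or updated object. It is not Comcast's implementation: the function names, the shape of the metadata entry, and the idea of forwarding entries to a catalog are all assumptions made for illustration. Only the field names read from the incoming event follow the standard S3 event notification format.

```python
import json
from urllib.parse import unquote_plus


def extract_metadata(record):
    """Turn one S3 event notification record into a flat metadata entry.

    The output field names are hypothetical; a real catalog would define
    its own schema.
    """
    s3 = record["s3"]
    obj = s3["object"]
    return {
        "event": record["eventName"],
        "event_time": record["eventTime"],
        "bucket": s3["bucket"]["name"],
        # Object keys arrive URL-encoded in S3 event notifications.
        "key": unquote_plus(obj["key"]),
        "size_bytes": obj.get("size"),
        "etag": obj.get("eTag"),
    }


def handler(event, context):
    """Lambda entry point: build one metadata entry per S3 record."""
    entries = [
        extract_metadata(record)
        for record in event.get("Records", [])
        if "s3" in record
    ]
    # Assumption: a real deployment would send `entries` to a metadata
    # catalog at this point. This sketch only logs and returns them.
    print(json.dumps(entries))
    return entries
```

Because the handler only reads the event payload, it can be exercised locally by passing a dictionary shaped like an S3 notification, with no AWS credentials required. Keeping the function this small is one way to get the light footprint that serverless capture is meant to provide.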
Barbara Eckman discusses factors to consider in choosing data governance and metadata management software, shares the choices made at Comcast, and outlines some of the challenges faced along the way.
Barbara Eckman is a principal data architect at Comcast, where she leads data governance for an innovative, division-wide initiative comprising near-real-time ingesting, streaming, transforming, storing, and analyzing big data. Barbara is a technical innovator and strategist with internationally recognized expertise in scientific data architecture and integration. Her experience includes technical leadership positions at a Human Genome Project center, Merck, GlaxoSmithKline, and IBM. She served on the IBM Academy of Technology, an internal peer-elected organization akin to the National Academy of Sciences.
©2019, O'Reilly Media, Inc.