Comcast’s streaming data platform comprises a variety of ingest, transformation, and storage services in the public cloud. Peer-reviewed Apache Avro schemas support end-to-end data governance. At last year’s Strata New York, speakers from Comcast explained how the company extended Apache Atlas with custom entity and process types for discovery and lineage in the AWS public cloud. Custom lambda functions notify Atlas of new entities and new lineage links via asynchronous Kafka messaging.
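The notification step can be illustrated with a minimal Python sketch. It is not Comcast's implementation: the event keys (`bucket`, `key`), the custom type name `custom_s3_object`, and the injectable `send` callable are all hypothetical, and the message envelope is modelled on Atlas's V2 hook notifications, so it should be checked against the Atlas version in use. The sketch only builds and hands off the JSON payload; a real function would pass `send` a Kafka producer call.

```python
import json
import time

# Atlas consumes inbound hook notifications from this topic by default.
ATLAS_HOOK_TOPIC = "ATLAS_HOOK"


def build_entity_create_message(type_name, qualified_name, user="lambda-notifier"):
    """Build a notification asking Atlas to create one entity.

    The envelope layout is an assumption modelled on Atlas V2 hook messages.
    """
    entity = {
        "typeName": type_name,
        "attributes": {"qualifiedName": qualified_name, "name": qualified_name},
    }
    return {
        "version": {"version": "1.0.0"},
        "msgCreationTime": int(time.time() * 1000),
        "message": {
            "type": "ENTITY_CREATE_V2",
            "user": user,
            "entities": {"entities": [entity]},
        },
    }


def handler(event, context=None, send=print):
    """Lambda-style entry point: turn an object-created event into a message.

    `send(topic, payload_bytes)` is a hypothetical hook; in production it
    would wrap a Kafka producer so the call stays asynchronous.
    """
    qualified_name = "s3://{}/{}".format(event["bucket"], event["key"])
    message = build_entity_create_message("custom_s3_object", qualified_name)
    send(ATLAS_HOOK_TOPIC, json.dumps(message).encode("utf-8"))
    return qualified_name
```

Keeping `send` injectable means the payload logic can be exercised without a broker, and the messaging client can be swapped without touching the message construction.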
Comcast recently integrated on-prem data sources, including Hadoop-based traditional data warehouses and RDBMSs, which required its data governance strategy to include relational and JSON schemas in addition to Apache Avro. Barbara Eckman details how Comcast met that challenge, offering an overview of the federated architecture, in which Atlas provides SQL-like, free-text, and graph search across select metadata from a wide variety of on-prem and public cloud data. Lightweight, custom connectors and bridges identify metadata and lineage changes in underlying sources and publish them to Atlas via the asynchronous API. A portal layer provides Atlas query access and a federation of UIs. Once data of interest is identified via Atlas queries, interfaces specific to the underlying sources may be used for special-purpose metadata mining. Comcast provides end-to-end lineage for both batch and streaming processes, identifying, for example, the on-prem relational and Hive precursors of objects in the public cloud data lake.
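To make the idea of finding an object's precursors concrete, the following sketch treats lineage as a set of process records, each linking input datasets to output datasets, and walks the links backwards. The dataset names and the `(inputs, outputs)` record shape are illustrative assumptions, not Atlas's data model or Comcast's code.

```python
from collections import deque


def upstream_precursors(target, processes):
    """Return every dataset that feeds `target`, directly or indirectly.

    `processes` is a list of (inputs, outputs) pairs, one per lineage
    process; this shape is a simplification chosen for the example.
    """
    # Index: dataset -> set of datasets that some process read to produce it.
    producers = {}
    for inputs, outputs in processes:
        for out in outputs:
            producers.setdefault(out, set()).update(inputs)

    # Breadth-first walk upstream, guarding against cycles with `seen`.
    seen = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for parent in producers.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    seen.discard(target)  # a cycle must not report the target as its own precursor
    return seen


# Hypothetical lineage: relational table -> Hive -> stream -> data lake object.
processes = [
    (["oracle.orders"], ["hive.orders_raw"]),
    (["hive.orders_raw", "hive.customers"], ["kafka.orders_enriched"]),
    (["kafka.orders_enriched"], ["s3.datalake/orders"]),
]
precursors = upstream_precursors("s3.datalake/orders", processes)
```

Here `precursors` contains the stream, both Hive tables, and the relational source, mirroring the batch-plus-streaming lineage described above.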
Barbara outlines how Comcast extends its data governance practices to include not only Avro but also relational and JSON schemas. A data maturity model represents the schema type, the richness of its documentation, and the level of operational support that the data source offers. A heterogeneous schema registry still provides Avro schemas for SerDe but extends features such as schema evolution to other schema types. Comcast then captures semantic mappings of heterogeneous schemas to a set of common, company-wide data models.
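One way to picture schema evolution applied across schema types is to reduce Avro, relational, and JSON schemas to a common list of fields and run the same compatibility check on each. The sketch below does that under stated assumptions: the `Field` descriptor and the rules are hypothetical, and the check is a simplified form of Avro-style backward compatibility (an added field needs a default, a type change is rejected, and type promotions are ignored).

```python
from dataclasses import dataclass
from typing import Any, List

# Sentinel so that None can still be used as a legitimate default value.
_NO_DEFAULT = object()


@dataclass(frozen=True)
class Field:
    """Hypothetical schema-type-neutral field descriptor."""
    name: str
    type: str
    default: Any = _NO_DEFAULT


def backward_incompatibilities(old: List[Field], new: List[Field]) -> List[str]:
    """List reasons why `new` could not read data written with `old`.

    An empty list means the change is backward compatible under the
    simplified rules of this sketch. Removed fields are allowed.
    """
    old_by_name = {f.name: f for f in old}
    problems = []
    for f in new:
        prev = old_by_name.get(f.name)
        if prev is None:
            # Old data has no value for this field, so a default is required.
            if f.default is _NO_DEFAULT:
                problems.append("added field '{}' has no default".format(f.name))
        elif prev.type != f.type:
            problems.append(
                "field '{}' changed type {} -> {}".format(f.name, prev.type, f.type)
            )
    return problems


# The same check applies whether the fields came from Avro, DDL, or JSON Schema.
v1 = [Field("id", "long"), Field("email", "string")]
v2 = [Field("id", "long"), Field("email", "string"), Field("region", "string", None)]
v3 = [Field("id", "string"), Field("signup_ts", "long")]
```

With these inputs, `backward_incompatibilities(v1, v2)` is empty, while comparing `v1` to `v3` reports both the type change on `id` and the missing default on `signup_ts`.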
Barbara Eckman is a principal data architect at Comcast, where she leads data governance for an innovative, division-wide initiative comprising near-real-time ingesting, streaming, transforming, storing, and analyzing big data. Barbara is a technical innovator and strategist with internationally recognized expertise in scientific data architecture and integration. Her experience includes technical leadership positions at a Human Genome Project center, Merck, GlaxoSmithKline, and IBM. She served on the IBM Academy of Technology, an internal peer-elected organization akin to the National Academy of Sciences.