Barbara Eckman offers an overview of Comcast’s streaming data platform, which comprises a variety of ingest, transformation, and storage services. The platform uses Apache Avro schemas to support end-to-end data governance, Apache Atlas for data discovery and lineage, and custom asynchronous messaging libraries to notify Atlas of new data and schema entities and lineage links as they are created. It provides access to an integrated view of a wide variety of high-quality, near-real-time data as well as aggregated and enriched data in long-term storage.
Such integration can enable data scientists to uncover otherwise hidden trends, anomalies, and powerful predictors of business successes and failures. But integrating data across silos in a large enterprise is fraught with peril. There are typically few standards for naming conventions and data representation, and documentation is spotty at best. A commonly cited estimate is that analysts spend 70% of their time wrangling data and only 30% doing actual analyses and simulations.
Apache Avro provides a lingua franca for data representation, data integration, and schema evolution. All data published, enriched, aggregated, and stored for community consumption must have an associated, peer-reviewed Avro schema. But encoding data in well-documented, standardized, language-independent schemas is only half the battle. Data must also be discoverable by those who wish to use it, and its lineage through the pipeline must be maintained. While metadata repositories for data discovery and lineage abound, none of them have built-in connectors and listeners for the entire complement of data sources that Comcast and many other large enterprises use to support their business needs.
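To make the Avro schema requirement above concrete, here is a minimal sketch in Python using only the standard library. The record name, namespace, and field names are hypothetical and are not taken from Comcast's platform. The two checks illustrate the kind of rule a peer review might enforce: every field carries documentation, and any field added in a new schema version has a default, which is what Avro requires for a reader using the new schema to decode data written with the old one.

```python
# Hypothetical schemas for illustration only; names are assumptions.
SCHEMA_V1 = {
    "type": "record",
    "name": "DeviceEvent",
    "namespace": "com.example.streaming",
    "doc": "One event emitted by a device.",
    "fields": [
        {"name": "device_id", "type": "string",
         "doc": "Unique device identifier."},
        {"name": "event_time", "type": "long",
         "doc": "Event time in epoch milliseconds."},
    ],
}

# Version 2 adds an optional field with a default value.
SCHEMA_V2 = {
    **SCHEMA_V1,
    "fields": SCHEMA_V1["fields"] + [
        {"name": "region", "type": ["null", "string"], "default": None,
         "doc": "Region code, if known."},
    ],
}


def undocumented_fields(schema):
    """Return the names of fields that lack a non-empty 'doc' attribute."""
    return [f["name"] for f in schema["fields"] if not f.get("doc")]


def added_fields_without_default(old, new):
    """Return fields present only in `new` that declare no default.

    Such fields would prevent a reader using `new` from decoding
    records written with `old`.
    """
    old_names = {f["name"] for f in old["fields"]}
    return [
        f["name"]
        for f in new["fields"]
        if f["name"] not in old_names and "default" not in f
    ]
```

A schema that passes both checks returns two empty lists. This covers only one narrow aspect of compatibility; a real review would also consider removed fields and type changes.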
Apache Atlas natively supports the data structures of the Hadoop ecosystem, including Apache Sqoop, Hive, Kafka, and Storm, and provides automatic lineage updates on data in these applications. More importantly, Atlas is extensible with respect to new data types and processes, and offers an asynchronous method of notifying Atlas of lineage changes on any data source in the system. This extensibility has allowed Comcast to add or update various entity types (e.g., Avro schemas, Kafka topics, object store pseudodirectories, and time series objects) and lineage types (e.g., storing streaming data in object storage, embellishing and republishing streaming data, performing aggregations and other transformations on data at rest, and evolving schemas with compatibility flags).
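The extensibility described above can be sketched as plain JSON-style payloads built in Python. The shapes here are loosely modeled on Atlas's v2 type and entity JSON (types that extend `DataSet` or `Process`, and process entities with `inputs` and `outputs`), but the exact fields should be checked against the Atlas version in use. The type names prefixed with `example_` and all qualified names are hypothetical, not Comcast's actual model.

```python
import json


def entity_type_def(name, super_type, attributes):
    """Describe a new type that extends a generic Atlas type."""
    return {
        "name": name,
        "superTypes": [super_type],
        "attributeDefs": [
            {"name": attr, "typeName": type_name,
             "isOptional": True, "cardinality": "SINGLE"}
            for attr, type_name in attributes.items()
        ],
    }


def ref(type_name, qualified_name):
    """Refer to an existing entity by its unique qualified name."""
    return {"typeName": type_name,
            "uniqueAttributes": {"qualifiedName": qualified_name}}


def lineage_process(type_name, qualified_name, inputs, outputs):
    """Describe a lineage link as a process with inputs and outputs."""
    return {
        "typeName": type_name,
        "attributes": {
            "qualifiedName": qualified_name,
            "name": qualified_name,
            "inputs": inputs,
            "outputs": outputs,
        },
    }


# A hypothetical entity type and a hypothetical lineage (process) type.
typedefs = {"entityDefs": [
    entity_type_def("example_object_store_dir", "DataSet", {"path": "string"}),
    entity_type_def("example_store_stream", "Process", {}),
]}

# One lineage link: streaming data from a topic stored in object storage.
link = lineage_process(
    "example_store_stream",
    "store:device-events@lake",
    inputs=[ref("kafka_topic", "device-events@cluster")],
    outputs=[ref("example_object_store_dir", "lake/device-events/")],
)

print(json.dumps(link, indent=2))
```

Modeling each lineage type as a subtype of a generic process keeps lineage browsable with the same tooling regardless of which service created the link.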
A set of data storage and data transformation services notifies Atlas of lineage links via custom asynchronous messaging. By making these services available to a variety of users, Comcast ensures that data lineage information is maintained no matter who stores the data. By limiting storage access to these services, the company ensures that its data lake does not contain zombie files of unknown lineage and unknown semantics. Once populated with metadata, Atlas provides self-service data discovery and lineage browsing and querying via full-text search, the Atlas DSL query language, or the Gremlin graph query language.
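The pattern described above, in which a storage service writes the data and then notifies the metadata catalog asynchronously, can be sketched as follows. This is an illustration under stated assumptions, not Comcast's implementation: an in-process `queue.Queue` stands in for the messaging system, a dictionary stands in for the object store, a list stands in for Atlas, and the message fields and URIs are hypothetical.

```python
import queue
import threading


def store_with_lineage(source, target, payload, object_store, outbox):
    """Store the payload, then enqueue a lineage message without waiting."""
    object_store[target] = payload
    outbox.put({"event": "lineage_link", "input": source, "output": target})


def run_catalog_listener(inbox, lineage):
    """Consume lineage messages until a None sentinel arrives."""
    while True:
        msg = inbox.get()
        if msg is None:
            break
        lineage.append((msg["input"], msg["output"]))


def demo():
    outbox = queue.Queue()   # stand-in for the messaging layer
    object_store = {}        # stand-in for the data lake
    lineage = []             # stand-in for the metadata catalog

    listener = threading.Thread(
        target=run_catalog_listener, args=(outbox, lineage)
    )
    listener.start()

    store_with_lineage(
        "kafka://example-cluster/device-events",
        "objectstore://example-lake/device-events/part-0",
        b"example bytes",
        object_store,
        outbox,
    )

    outbox.put(None)         # tell the listener to stop
    listener.join()
    return object_store, lineage
```

Because every write goes through `store_with_lineage`, no object reaches the store without a corresponding lineage message, which is the property that keeps files of unknown lineage out of the data lake.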
Barbara Eckman is a principal data architect at Comcast, where she leads data governance for an innovative, division-wide initiative comprising near-real-time ingesting, streaming, transforming, storing, and analyzing big data. Barbara is a technical innovator and strategist with internationally recognized expertise in scientific data architecture and integration. Her experience includes technical leadership positions at a Human Genome Project center, Merck, GlaxoSmithKline, and IBM. She served on the IBM Academy of Technology, an internal peer-elected organization akin to the National Academy of Sciences.
©2017, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.