Managing a Rapidly Evolving Analytics Pipeline

Feng Peng (LinkTime Cloud)
Hadoop in Action Sutton Center - Sutton South

At Twitter we have seen wide adoption of our Hadoop-centric data analytics pipeline, which creates non-trivial job dependency graphs and increases system complexity. To provide a reliable and flexible Big Data platform, it becomes extremely important to manage issues such as visibility into dependencies, ownership of data sources and the programs that produce them, data provenance, schema management, and storage format migrations. Left unmanaged, such complexity can cause product breakages, delays in troubleshooting, lost productivity, and critical knowledge loss.

We built the “Data Access Layer” (DAL) service to address such issues. DAL provides schema management (via Apache HCatalog), storage abstraction, data discovery, data provenance, and usage auditing services for data sources produced and consumed by both Hadoop and RDBMS-based processes. We will discuss the challenges we encountered while growing the Twitter analytics pipeline and illustrate how DAL helps us manage the pipeline efficiently and systematically.
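
To make the ideas of storage abstraction, provenance, and usage auditing concrete, here is a minimal in-memory sketch. It is not DAL's actual API or design, which the abstract does not describe; the `Dataset` and `Registry` classes, their methods, and all dataset names, owners, paths, and formats are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A logical dataset: consumers refer to the name, not the storage details."""
    name: str
    owner: str
    location: str        # physical path (hypothetical values below)
    storage_format: str  # illustrative labels such as "sequencefile" or "parquet"
    inputs: list = field(default_factory=list)  # names of upstream datasets


class Registry:
    """Toy in-memory catalog; a real service would persist this metadata."""

    def __init__(self):
        self._datasets = {}
        self.audit_log = []  # (user, dataset name) pairs, one per lookup

    def register(self, dataset):
        self._datasets[dataset.name] = dataset

    def resolve(self, name, user):
        """Look up a dataset by logical name and record who asked for it."""
        self.audit_log.append((user, name))
        return self._datasets[name]

    def migrate(self, name, location, storage_format):
        """Change the physical storage without changing the logical name."""
        dataset = self._datasets[name]
        dataset.location = location
        dataset.storage_format = storage_format

    def provenance(self, name):
        """Return all upstream dataset names, direct or transitive, sorted."""
        seen = set()
        stack = list(self._datasets[name].inputs)
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current in self._datasets:
                stack.extend(self._datasets[current].inputs)
        return sorted(seen)


# Hypothetical three-stage pipeline: raw_events -> sessions -> daily_report.
registry = Registry()
registry.register(Dataset("raw_events", "team-a", "/logs/raw_events", "sequencefile"))
registry.register(Dataset("sessions", "team-b", "/warehouse/sessions", "sequencefile",
                          inputs=["raw_events"]))
registry.register(Dataset("daily_report", "team-c", "/warehouse/daily_report",
                          "sequencefile", inputs=["sessions"]))

# A storage format migration leaves the logical name used by consumers unchanged.
registry.migrate("sessions", "/warehouse/v2/sessions", "parquet")
```

In this sketch, consumers call `resolve` with a logical name, so a format migration only touches the registry entry, and `provenance("daily_report")` walks the recorded inputs to list every upstream dataset.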

Feng Peng

LinkTime Cloud

Feng Peng is the tech lead of Analytics Data Pipeline at Twitter. His current work focuses on ETL/workflow tools and data pipeline management. Prior to Twitter, he was a Principal Software Engineer and Director of Analytics at, where he led the analytics team to build the Hadoop analytics infrastructure and successfully migrated the legacy analytics applications to the new platform. Feng has a Ph.D. in Computer Science from the University of Maryland, College Park.

