Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY

The glue: Building the connectors and tools to manage big data warehouses

Siwei Zhu (Scribd), Kevin Perko (Scribd)
2:05pm–2:45pm Wednesday, 09/30/2015
Production Ready Hadoop
Location: 3D 05/08 Level: Intermediate
Average rating: 3.17 (12 ratings)

For most companies, data analysis means collecting the data, building a data pipeline to clean and transform the data into a usable form, and only then looking for insights. Without good tools to automate the data pipeline, data flow management can become a tedious and brittle process.

In this talk we highlight some useful tools that we built in-house:

  • Scheduling of nightly jobs and anomaly detection to catch errors in the data or in the data transformation code
  • A backfill tool to retroactively update historical data to reflect new changes in the code
  • A dependency management tool to capture dependency restrictions in the data pipeline and schedule jobs to run in the optimal order
  • A data versioning tool that “remembers” which query generated the data. Without it, data generated by two versions of the code with very different logic cannot be told apart, which can lead to faulty conclusions.
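
As a rough illustration of the last two ideas in the list above (dependency-aware scheduling and query-based data versioning), here is a minimal Python sketch. It is not the tooling described in the talk: the job names, the `run_order` and `query_version` helpers, and the choice of a SHA-256 fingerprint are all assumptions made for the example.

```python
import hashlib
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job graph: each job maps to the set of jobs it depends on.
JOB_DEPENDENCIES = {
    "clean_events": set(),
    "sessionize": {"clean_events"},
    "daily_engagement": {"sessionize"},
    "search_metrics": {"clean_events"},
}


def run_order(dependencies):
    """Return a job order in which every job runs after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def query_version(query_text):
    """Return a short, stable fingerprint of the query that produced a dataset.

    Whitespace is collapsed first so that reformatting a query does not
    change its version; any change to the logic does.
    """
    normalized = " ".join(query_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


if __name__ == "__main__":
    print(run_order(JOB_DEPENDENCIES))
    print(query_version("SELECT user_id, COUNT(*) FROM events GROUP BY user_id"))
```

Storing the fingerprint next to each output table (for example as a column or a table property) is one way to tell apart rows produced by different versions of a query, and a backfill could then target only the rows whose fingerprint is out of date.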

Siwei Zhu


Siwei Zhu is a data scientist at Scribd focused on understanding how users engage with the product. Previously, he worked as a data scientist at Facebook.

Kevin Perko


Kevin Perko is the Data Team Lead at Scribd, the leading subscription reading service. He focuses on evaluating search engine performance, building data pipelines, and democratizing access to data through various initiatives including Reddit-style AMAs, emails, and individual outreach. With nearly a decade of analytics experience, Kevin has worked for a multitude of Bay Area startups including Eventbrite and GREE. He has a background in finance from Santa Clara University and has volunteered with the University of Cape Town to teach computer skills in the townships of South Africa.

Comments on this page are now closed.


Michael Keane
09/29/2015 8:54am EDT

Any insight on scheduler tools such as Falcon, Luigi, Azkaban, Oozie/Hue, and Airflow?