Using Hadoop to do Agile Iterative ETL

Hadoop: Tools & Technology, Grand East (NY Hilton)

Traditional ETL assumes you know the target schema and organization of the data. That used to be a realistic assumption, but in a big-data world, data is much bigger and lower in density, and new sources arrive and evolve much more quickly. The implication is that you are storing data before you know how you are going to use it.

A naive answer to this is schema-on-read: just write data into Hadoop, and figure out what you have and how you want to assemble it when you need it. But this means that advanced developers and lots of domain knowledge are needed any time anyone wants to pull anything from Hadoop. This sets the bar too high, and leads to complex and inflexible custom-coded integrations and jobs.
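
To make the schema-on-read burden concrete, here is a minimal Python sketch. It is illustrative only: the sample records, the field names (`amount`, `amt`) and the helper functions are hypothetical, and a plain list stands in for files stored in Hadoop. The point is that knowledge about the data's quirks ends up inside every reader.

```python
import json

# Raw events as they might land in storage: one JSON document per line,
# with no schema enforced at write time (hypothetical sample data).
RAW_LINES = [
    '{"user": "a", "amount": "12.50", "ts": "2012-10-01"}',
    '{"user": "b", "amt": 7, "ts": "2012-10-02"}',
    '{"user": "a", "amount": 3.25}',
]


def read_with_schema(lines):
    """Apply a schema at read time; the reader must know every field quirk."""
    for line in lines:
        record = json.loads(line)
        # Domain knowledge baked into code: in this made-up data set the
        # amount field was renamed at some point and is sometimes a string.
        amount = record.get("amount", record.get("amt", 0))
        yield {
            "user": record["user"],
            "amount": float(amount),
            "ts": record.get("ts"),  # may be missing in older records
        }


def total_by_user(lines):
    """A simple query; it only works because of the reader logic above."""
    totals = {}
    for row in read_with_schema(lines):
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals


print(total_by_user(RAW_LINES))
```

Every new question asked of the raw data needs someone to write or extend a reader like `read_with_schema`, which is the high bar described above.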

A new approach that we propose is ‘agile iterative ETL’. Hadoop makes this possible, since the data lands in its raw form and can be processed a first time and then revisited when additional detail or refinement is needed.

In other words:

  1. land raw data in Hadoop,
  2. lazily add metadata, and
  3. iteratively construct and refine marts/cubes based on the metadata from step 2.

The big difference is that, once steps #1 and #2 are completed, a relatively unsophisticated user could drive #3.
This approach can be used as a recipe for Hadoop developers looking to build a much more agile pipeline, and is heavily utilized in Platfora’s architecture.
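
The three steps can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a description of Platfora's implementation: an in-memory list stands in for raw files in Hadoop, and the `metadata` dictionary, `describe` and `build_mart` are hypothetical names invented for this example.

```python
import json
from collections import defaultdict

# Step 1: land raw data untouched (a list stands in for files in Hadoop).
raw_store = [
    '{"user": "a", "amt": "12.50", "region": "east"}',
    '{"user": "b", "amt": "7", "region": "west"}',
    '{"user": "a", "amt": "3.25", "region": "east"}',
]

# Step 2: metadata is added lazily. Only fields someone has needed so far
# are described; everything else stays raw until it is asked for.
metadata = {
    "user": {"source": "user", "type": str},
}


def describe(field, source, type_):
    """Register (or refine) the description of one field."""
    metadata[field] = {"source": source, "type": type_}


def build_mart(dimension, measure):
    """Step 3: build an aggregate (a tiny 'mart') driven only by metadata."""
    dim, mea = metadata[dimension], metadata[measure]
    mart = defaultdict(float)
    for line in raw_store:
        record = json.loads(line)
        key = dim["type"](record[dim["source"]])
        mart[key] += mea["type"](record[mea["source"]])
    return dict(mart)


# First iteration: describe the measure, then aggregate by user.
describe("amount", "amt", float)
by_user = build_mart("user", "amount")

# Later iteration: a new question needs one more metadata entry, and the
# raw data is revisited without being reloaded or reshaped.
describe("region", "region", str)
by_region = build_mart("region", "amount")

print(by_user)
print(by_region)
```

In this sketch the person calling `build_mart` only names described fields, which mirrors the claim that step 3 does not require knowledge of the raw layout once steps 1 and 2 are done.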

Ben Werther


Ben Werther is the Founder and Executive Chairman of Platfora. Ben launched Platfora, and was the founding CEO for four years, with the goal of transforming how ‘citizen data scientists’ in every company make sense of big data and drive action through direct and effortless use of it. Before founding Platfora, Ben was vice president of products for DataStax, where he shaped the company’s enterprise and Hadoop strategy, and was also head of products at Greenplum through its acquisition by EMC. Ben has a B.S. in Computer Science from Monash University (Australia) and an M.S. in Computer Science from Stanford University.

Kevin Beyer


Kevin Beyer is the Principal Architect at Platfora with 20 years of experience in building database systems. As a Research Staff Member at IBM, he created Jaql, a scripting language for large-scale, semi-structured data processing on Hadoop. Prior to the Jaql project, he added XML indexing support to IBM DB2. His Ph.D. dissertation at the University of Wisconsin focused on analytical query processing.

Comments on this page are now closed.


Charles LaCour
10/26/2012 5:27pm EDT

Is there a possibility of getting a copy of the presentation slides from this session?

