Large Scale ETL with Hadoop

Hadoop: Tools & Technology, Grand East (NY Hilton)
Presentation: external link
Average rating: 4.25 (8 ratings)

Hadoop is commonly used for processing large swaths of data in batch. While many of the necessary building blocks for data processing exist within the Hadoop ecosystem – HDFS, MapReduce, HBase, Hive, Pig, Oozie, and so on – it can be a challenge to assemble and operationalize them as a production ETL platform. This presentation covers one approach to data ingest, organization, format selection, process orchestration, and external system integration, based on collective experience acquired across many production Hadoop deployments.

Eric Sammer


Eric Sammer is currently a Principal Solution Architect at Cloudera, where he helps customers plan, deploy, develop for, and use Hadoop and related projects at scale. His background is in the development and operation of distributed, highly concurrent data ingest and processing systems. He’s been involved in the open source community and has contributed to a large number of projects over the last decade.


Sophia DeMartini
11/02/2012 2:59pm EDT

The slides for this talk are now available on SlideShare, and can be accessed via the “external link” above.

Sophia DeMartini
11/01/2012 10:46am EDT

Hi Bharath,

We’ll be posting the slides if the speaker provides them to us.

Thank you, Sophia

mundlapudi mundlapudi
11/01/2012 10:38am EDT

Will someone be posting the slides for this presentation?

