Ephemeral Hadoop Clusters in the Cloud

Data: Hadoop
Location: C121/122
Average rating: 3.75 (4 ratings)

Amazon’s Elastic MapReduce (EMR) APIs provide a rich interface for running Hadoop jobs on top of AWS’s S3 and EC2 infrastructure. In addition to the fault tolerance and scalability of Hadoop, EMR adds the ability to quickly create, use, and shut down independent Hadoop clusters made up of EC2 instances.

This talk discusses how this unique Hadoop environment has helped Etsy quickly build data-driven products such as the gift recommender, suggested shops, and the taste test. We’ll start with a cost-based analysis of the benefits of being able to create custom-fit, short-lived Hadoop clusters for specific jobs. We will discuss our in-house toolchain called Barnum and Bailey, which allows us to easily create, deploy, schedule, and monitor these ad-hoc clusters on EMR. Finally, we’ll explain the benefits this approach brings to the test-debug cycle for creating and maintaining jobs.

Greg Fodor


Greg Fodor is currently an engineer on Etsy’s “data wranglers” team, which is responsible for building products around ‘big data’.


Sheeri K. Cabral
09/05/2011 9:38pm PDT

A video for this presentation is online at www.youtube.com/watch?v=NF6...