Tools like Pig, Hive, and Cascading ease the burden of writing MapReduce pipelines by defining Tuple-oriented data models and providing support for filtering, joining and aggregating those records. However, there are many data sets that do not naturally fit into the Tuple model, such as images, time series, audio files and seismograms. To process data in these binary formats, developers often go back to writing MapReduce jobs using the low-level Java APIs.
In this session, Cloudera Data Scientist Josh Wills will share insights and “how to” tricks about Crunch, a Java library that aims to make it easy, efficient and even fun to write, test and run MapReduce pipelines over any type of data. Crunch’s design is modeled after Google’s FlumeJava library and focuses on a small set of simple primitive operations and lightweight user-defined functions that can be combined to create complex, multi-stage pipelines. At runtime, Crunch compiles the pipeline into a sequence of MapReduce jobs and manages their execution on the Hadoop cluster.
Josh Wills is the director of data science at Cloudera. Wills is one of the main contributors to Cloudera’s most recent open source project, Crunch, a Java library that aims to make writing, testing, and running MapReduce pipelines easy, efficient, and even fun.
Prior to joining Cloudera, Wills was a software engineer at Google. He holds an M.S.E. in operations research from the University of Texas and a B.S. in mathematics from Duke University.