Getting Started with Scalding, Twitter's High-level Scala API for Hadoop MapReduce

Avi Bryant (Stripe)
Start on low heat with a base of Hadoop; map, then reduce. Flavor, to taste, with Scala’s concise, functional syntax and collections library. Simmer with some Pig bones: a tuple model and high-level join and aggregation operators. Mix in Cascading to hold everything together and boil until it’s very, very hot, and you get Scalding, an API for MapReduce out of Twitter.

Scalding is an open source Scala framework for concisely describing Hadoop MapReduce jobs. I started the project at Twitter as a way for ad server engineers to run simple queries on the ad logs, without needing to learn a specialized language like Pig, or dive too deeply into the guts of Hadoop. Since then, it’s been adopted by teams at Etsy, LinkedIn, eBay, SoundCloud, LivePerson, Stripe, and others, and has been extended with convenient APIs for everything from large-scale sparse matrix multiplication to locality-sensitive hashing.

This tutorial will walk you through getting started with Scalding, from writing the simplest word-count job up to using probabilistic data structures for distributed machine learning. No specific background in Scala, Hadoop, distributed computing or machine learning is required, though an interest in any or all of these might help.
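
For a rough feel of what a word-count job computes, here is a minimal sketch that uses only Scala's standard collections on in-memory data. It is not the Scalding API and does not run on Hadoop; the names `WordCountSketch`, `tokenize`, and `wordCount` are hypothetical and chosen only for illustration. The shape is the familiar one: a map-like step that splits lines into words, then a reduce-like step that groups and counts them.

```scala
// Illustrative only: plain Scala collections, not the Scalding API.
object WordCountSketch {
  // "Map" step: split one line into lowercase words, dropping empty tokens.
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // "Reduce" step: group identical words together and count each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(tokenize)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample input standing in for lines of a text file.
    val lines = Seq("map then reduce", "reduce then simmer")
    // Print one "word<TAB>count" pair per line, sorted by word.
    wordCount(lines).toSeq.sortBy(_._1).foreach { case (word, n) =>
      println(s"$word\t$n")
    }
  }
}
```

A Scalding job expresses the same split, group, and count pipeline, but over data stored in Hadoop rather than an in-memory `Seq`.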

Bring a laptop.


* No specific knowledge needed. Some familiarity with either Scala or Hadoop would be helpful but is not at all required.
* A laptop with a working JDK installation.

Avi has led product, engineering, and data science teams at Etsy, Twitter and Dabble DB (which he co-founded and Twitter acquired). He’s known for his open source work on projects such as Seaside, Scalding, and Algebird. Avi currently works at Stripe.