
How to Build a Hadoop Data Application

Tom White (Cloudera), Eric Sammer (Rocana), Joey Echeverria (Rocana)
Hadoop in Action Sutton Center - Sutton South
Tutorial. Please note: to attend, your registration must include Tutorials on Monday.
Average rating: 3.71 (14 ratings)

Prerequisites for attendees

This is a hands-on tutorial, so you will need to bring a laptop with a 64-bit OS. In order to participate in the hands-on exercises, you MUST do the following in advance:

  1. Install a VM player: VirtualBox (recommended), VMware Player for Windows or Linux, or VMware Fusion for Mac. (Again, a 64-bit host OS is required.)
  2. Install the VM image for the lab: Download here (username: cloudera, password: cloudera). (Note: be sure to install the correct image for whatever player you have, and be sure to unpack the file before using.)
  3. If using a PC, confirm that your laptop is configured to support virtualization. (Enter BIOS, find the “Virtualization” settings [usually under “Security”] and enable all the virtualization options.)

For common troubleshooting tips during installation, read this.

Tutorial Description

With such a large number of components in the Hadoop ecosystem, writing Hadoop applications can be a challenge for users who are new to the platform. The Cloudera Development Kit (CDK) is an open source project with the goal of simplifying Hadoop application development. It codifies best practices for writing Hadoop applications by providing documentation, examples, tools, and APIs for Java developers.

We will discuss the architecture of a common data pipeline, from ingesting data from an application through to report generation. Hadoop concepts and components (including HDFS, Avro, Flume, Crunch, HCatalog, Hive, Impala, and Oozie) will be introduced along the way, and each will be explained in the context of solving a concrete problem for the application. The goal is to build a simple end-to-end Hadoop data application that you can take away and adapt to your own use cases.
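As a flavor of what the CDK's data APIs look like, the sketch below creates an Avro-backed dataset in HDFS and writes a record to it. The repository URI, the `Event` schema, and the field names are assumptions made for this example, and the method names reflect the CDK 0.x data module as documented at the time; check the CDK documentation for the exact API in your version.

```java
// Minimal sketch of creating and writing to a CDK dataset.
// The repository URI and schema here are hypothetical, for illustration only.
import com.cloudera.cdk.data.Dataset;
import com.cloudera.cdk.data.DatasetDescriptor;
import com.cloudera.cdk.data.DatasetRepositories;
import com.cloudera.cdk.data.DatasetRepository;
import com.cloudera.cdk.data.DatasetWriter;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class CreateEventsDataset {
  public static void main(String[] args) {
    // Open a dataset repository rooted in HDFS (URI is an assumption).
    DatasetRepository repo =
        DatasetRepositories.open("repo:hdfs:/data/events");

    // Describe the dataset with an Avro schema (inlined here for brevity;
    // in practice the schema usually lives in a .avsc file).
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
        + "{\"name\":\"user\",\"type\":\"string\"},"
        + "{\"name\":\"timestamp\",\"type\":\"long\"}]}");
    DatasetDescriptor descriptor =
        new DatasetDescriptor.Builder().schema(schema).get();

    // Create the dataset and write a single record.
    Dataset events = repo.create("events", descriptor);
    DatasetWriter<GenericRecord> writer = events.newWriter();
    writer.open();
    GenericRecord event = new GenericData.Record(schema);
    event.put("user", "alice");
    event.put("timestamp", System.currentTimeMillis());
    writer.write(event);
    writer.close();
  }
}
```

A pipeline like the one in the tutorial would typically have Flume deliver events into such a dataset, with Crunch or Hive/Impala reading from it downstream.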

Attendees should be familiar with Java and common enterprise APIs such as Servlets. No prior experience with Hadoop is necessary, although an awareness of what the components in the Hadoop stack do is a plus.


Tom White


Tom White is one of the foremost experts on Hadoop. He has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation. His book Hadoop: The Definitive Guide (O’Reilly) is recognized as the leading reference on the subject. In 2011, Whirr, the project he founded to run Hadoop and other distributed systems in the cloud, became a top-level Apache project.

Tom is a software engineer at Cloudera, where he has worked since its founding on the core Hadoop distributions from Cloudera and Apache. Previously he was an independent Hadoop consultant, working with companies to set up, use, and extend Hadoop. He has written numerous articles for O’Reilly and IBM’s developerWorks, and has spoken at several conferences, most recently at ApacheCon and OSCON in 2011. Tom has a Bachelor’s degree in Mathematics from the University of Cambridge and a Master’s in Philosophy of Science from the University of Leeds, UK.


Eric Sammer


Eric Sammer is currently a Principal Solution Architect at Cloudera, where he helps customers plan, deploy, develop for, and use Hadoop and related projects at scale. His background is in the development and operations of distributed, highly concurrent data ingest and processing systems. He has been involved in the open source community and has contributed to a large number of projects over the last decade.


Joey Echeverria


Joey Echeverria is a Senior Solutions Architect at Cloudera where he works directly with customers to deploy production Hadoop clusters and solve a diverse range of business and technical problems. Joey joined Cloudera from the NSA where he worked on data mining, network security, and clustered data processing using Hadoop. Prior to working full time for NSA, Joey attended Carnegie Mellon University where he attained an M.S. and a B.S. in Electrical and Computer Engineering.

Comments on this page are now closed.


louis Vainqueurs
11/13/2013 12:32am EST

Could you share the slides? Thanks

Robert Fielding
10/25/2013 6:24pm EDT

The main trick to getting the Guest Additions to work (a significantly faster system, full-screen resolution, copy/paste from VM to host, etc.) is that the Cloudera VM is missing the exact source code that its kernel was compiled with. Install this package with yum install first (in the Cloudera guest VM): , then choose “Install Guest Additions” from the host. If it worked, after a reboot of the VM you should be able to get full-screen resolution and copy/paste with your host OS. If you don’t do this, your system may be so slow that you will have trouble using Eclipse, and your fonts and resolution may be hard to read.

I also updated to VirtualBox 4.3 and its extension packs first, but that might not be necessary.

Robert Fielding
10/25/2013 1:45pm EDT

VirtualBox: Without the VirtualBox Guest Additions, the system may be too slow to actually use Eclipse. Also, you may need to dial down the amount of RAM allocated to the VM for it to run well on a laptop (I have 6 GB).

I tried to get the VirtualBox additions to compile (needed to install kernel headers, etc.), but I’m still working on it… upgrading VirtualBox to the latest version.


Sponsorship Opportunities

For exhibition and sponsorship opportunities, contact Susan Stewart at

Media Partner Opportunities

For information on trade opportunities with O'Reilly conferences email mediapartners

Press & Media

For media-related inquiries, contact Maureen Jennings at

Contact Us

View a complete list of Strata + Hadoop World 2013 contacts