Presented By O'Reilly and Cloudera
Make Data Work
5–7 May, 2015 • London, UK

Architectural considerations for Hadoop applications

Gwen Shapira (Confluent), Mark Grover (Lyft), Ted Malaska (Capital One), Jonathan Seidman (Cloudera)
9:00–12:30 Tuesday, 5/05/2015
Hadoop Platform
Location: King's Suite - Balmoral
Average rating: 4.20 (20 ratings)

Prerequisite Knowledge

Basic knowledge of Java and Hadoop components

Materials or downloads needed in advance

The tutorial will cover best practices and considerations for architecting applications on Hadoop, along with a live demo of building a clickstream analytics engine using those best practices. The tutorial is not hands-on: during the presentation we will not walk you through building the application on your own Hadoop installation, though you are welcome to try it later on your own. Code for the demo with associated instructions, and the slides (subject to change), are available online. We will be using Cloudera's QuickStart VM during our presentation, so we expect best results from running the demo on the same VM.


Implementing solutions with Apache Hadoop requires understanding not just Hadoop, but also a broad range of related projects in the Hadoop ecosystem such as Hive, Pig, Oozie, Sqoop, and Flume. The good news is that there’s an abundance of materials – books, websites, conferences, etc. – for gaining a deep understanding of Hadoop and these related projects. The bad news is there’s still a scarcity of information on how to integrate these components to implement complete solutions.

In this tutorial we’ll walk through an end-to-end case study of a clickstream analytics engine to provide a concrete example of how to architect and implement a complete solution with Hadoop. We’ll use this example to illustrate important topics such as:

- Modeling data in Hadoop
- Selecting optimal storage formats for data stored in Hadoop
- Moving data between Hadoop and external data management systems such as relational databases
- Moving event-based data such as logs and machine-generated data into Hadoop
- Accessing and processing data in Hadoop
- Orchestrating and scheduling workflows on Hadoop
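To make the data-movement topics above concrete, here is a minimal sketch of the kind of Sqoop command typically used to pull a relational table into Hive. The host, database, table, and user names are placeholders invented for illustration, not part of the tutorial materials; this is an example of the general technique, not the demo's actual commands.

```shell
# Hypothetical example: import an "orders" table from a MySQL database
# into a Hive table, using Sqoop 1. All connection details are placeholders.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --hive-import \
  --hive-table sales.orders \
  --num-mappers 4   # parallelism: number of map tasks doing the import
```

A command like this would be one building block in a larger workflow; orchestration tools such as Oozie can then schedule it alongside the Flume ingestion and processing steps listed above.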

Throughout the example, best practices and considerations for architecting applications on Hadoop will be covered. This tutorial will be valuable for developers, architects, or project leads who are already knowledgeable about Hadoop and are now looking for more insight into how it can be leveraged to implement real-world applications.


Gwen Shapira


Gwen Shapira is a solutions architect at Cloudera and leader of the IOUG Big Data SIG. Gwen studied computer science, statistics, and operations research at the University of Tel Aviv, and then went on to spend the next 15 years in various technical positions in the IT industry. She specializes in scalable and resilient solutions, and helps her customers build high-performance large-scale data architectures using Hadoop. Gwen is a frequent presenter at conferences and regularly publishes articles in technical magazines and her blog.


Mark Grover


Mark Grover is a committer on Apache Bigtop, a committer and PMC member on Apache Sentry (incubating), and a contributor to Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is currently co-authoring O’Reilly’s “Hadoop Application Architectures” title and is a section author of O’Reilly’s book on Apache Hive – Programming Hive. He has written a few guest blog posts and presented at many conferences about technologies in the Hadoop ecosystem.


Ted Malaska

Capital One

Ted has worked on close to 60 clusters for two to three dozen clients, covering hundreds of use cases. He has 18 years of professional experience working for startups, the US government, a number of the world's largest banks, commercial firms, bio firms, retail firms, hardware appliance firms, and the largest non-profit financial regulator in the US. He has architecture experience across topics such as Hadoop, Web 2.0, mobile, SOA (ESB, BPM), and big data. Ted is a regular committer to Flume, Avro, Pig, and YARN.


Jonathan Seidman


Jonathan is a solutions architect on the Partner Engineering team at Cloudera. Before joining Cloudera, he was a lead engineer on the Big Data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is also a co-founder of the Chicago Hadoop User Group and the Chicago Big Data meetup, and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is co-authoring a book on architecting applications with Apache Hadoop for O’Reilly Media.
