Make Data Work
Oct 15–17, 2014 • New York, NY

Schedule: Hadoop Platform sessions

A deep dive into the dominant big data stack, with practical lessons, integration tricks, and a glimpse of the road ahead.

Track Hosts

Gwen Shapira (Cloudera)

Kathleen Ting (Cloudera)

Ted Malaska (Cloudera)

Wednesday, October 15

1:30pm–5:00pm Wednesday, 10/15/2014
SOLD OUT
Location: 1 E10/1 E11
Stephen O'Sullivan (Silicon Valley Data Science), John Akred (Silicon Valley Data Science), Richard Williamson (Silicon Valley Data Science)
Average rating: 3.09 (23 ratings)
What are the essential components of a data platform? This tutorial will explain how the various parts of the Hadoop and big data ecosystem fit together in production to create a data platform supporting batch, interactive and realtime analytical workloads.

Thursday, October 16

11:00am–11:40am Thursday, 10/16/2014
Location: Hall A 23/24
Marcel Kornacker (Cloudera), Lenni Kuff (Cloudera)
Average rating: 2.50 (26 ratings)
Find out how to run real-time analytics over raw data without requiring a manual ETL process targeted at an RDBMS. This talk describes Impala’s approach to on-the-fly data transformation and its support for nested data; examples demonstrate how this can be used to query raw data feeds in formats such as text, JSON and XML, at a performance level commonly associated with specialized engines.
11:50am–12:30pm Thursday, 10/16/2014
Location: Hall A 23/24
Julian Hyde (Hortonworks)
Average rating: 3.25 (8 ratings)
Hyde shows how to quickly build a SQL interface to a NoSQL system using Optiq. He shows how to add rules and operators to Optiq to push down processing to the source system, and how to automatically build materialized data sets in memory for blazing-fast interactive analysis.
1:45pm–2:25pm Thursday, 10/16/2014
Location: Hall A 23/24
Guy Harrison (Dell Software), David Robson (Dell Software), Kathleen Ting (Cloudera)
Average rating: 3.71 (7 ratings)
When people think of big data processing, they think of Apache Hadoop, but that doesn't mean traditional databases don't play a role. In most cases, users will still draw on data stored in relational databases. Apache Sqoop can be used to unlock that data and transfer it to Hadoop, enabling users with information stored in existing SQL tables to use new analytic tools.
2:35pm–3:15pm Thursday, 10/16/2014
Location: Hall A 23/24
Mithun Radhakrishnan (Yahoo! Inc.)
Average rating: 4.67 (6 ratings)
The past year has seen the advent of various "low latency" solutions for querying big data, such as Shark, Impala, and Presto. The Hive team at Yahoo has spent the past several months benchmarking several versions of Hive (and Tez), with several permutations of file formats, compression, and query engine features, at various data sizes. In this talk, we present our tests, the results, and findings.
4:15pm–4:55pm Thursday, 10/16/2014
Location: Hall A 23/24
P. Taylor Goetz (Hortonworks)
Average rating: 4.33 (6 ratings)
We will discuss the basics of scaling, common mistakes and misconceptions, how different technology decisions affect performance, and how to identify and scale around the bottlenecks in a Storm deployment.
5:05pm–5:45pm Thursday, 10/16/2014
Location: Hall A 23/24
Martin Kleppmann (Independent)
Average rating: 4.71 (14 ratings)
Apache Samza is a framework for processing high-volume real-time event streams. In this session we will walk through our experiences of putting Samza into production at LinkedIn, discuss how it compares to other stream processing tools, and share the lessons we learnt about dealing with real-time data at scale.

Friday, October 17

11:00am–11:40am Friday, 10/17/2014
Location: Hall A 23/24
Nick Dimiduk (Hortonworks, Inc), Nicolas Liochon (Scaled Risk)
Average rating: 4.40 (5 ratings)
This talk examines sources of latency in HBase, detailing steps along the read and write paths. We'll examine the entire request lifecycle, from client to server and back again. We'll also look at the different factors that impact latency, including GC, cache misses, and system failures. Finally, the talk will highlight some of the work done in 0.96+ to improve the reliability of HBase.
11:50am–12:30pm Friday, 10/17/2014
Location: Hall A 23/24
Jonathan Hsieh (Cloudera, Inc), Lars George (Cloudera)
Average rating: 3.40 (5 ratings)
Today, there are hundreds of production Apache HBase clusters running either entity-centric or event-based applications. Drawing on known clusters and on a survey conducted by Cloudera's development, product, and services teams, based on their experience with the nearly 20,000 HBase nodes under management, this talk categorizes the gamut of use cases into a compact set of application archetypes.
1:45pm–2:25pm Friday, 10/17/2014
Location: Hall A 23/24
Chris Nauroth (Hortonworks), Suresh Srinivas (Hortonworks)
Average rating: 4.25 (4 ratings)
Are you taking advantage of all of Hadoop’s features to operate a stable and effective cluster? Inspired by real-world support cases, this talk discusses best practices and new features to help improve incident response and daily operations. Chances are that you’ll walk away from this talk with some new ideas to implement in your own clusters.
2:35pm–3:15pm Friday, 10/17/2014
Location: Hall A 23/24
Anubhav Dhoot (Cloudera)
Average rating: 4.88 (8 ratings)
This talk will cover resource management using YARN, the new resource management platform introduced in Hadoop 2.0. It will cover how YARN achieves effective cluster utilization and fair sharing of resources, and how it allows different types of applications to use the cluster. We will go over the architecture, recent improvements, and things coming down the pipeline.
4:15pm–4:55pm Friday, 10/17/2014
Location: Hall A 23/24
Jean-Daniel Cryans (Cloudera)
Average rating: 4.80 (5 ratings)
This presentation will show you how to get your Big Data into Apache HBase as fast as possible. These 40 minutes will save you hours of debugging and tuning, with the added bonus of a better understanding of how HBase works. You will learn about the write path, bulk loading, HFiles, and more.
5:05pm–5:45pm Friday, 10/17/2014
Location: Hall A 23/24
Uri Laserson (Cloudera)
Average rating: 5.00 (1 rating)
Impala provides the ability to easily analyze large, distributed data sets. This talk will cover the impyla package, which aims to make data science easier with Impala by integrating with Python. The impyla package currently supports programmatically interacting with Impala, running distributed machine learning in Impala, and compiling Python UDFs into assembly instructions via LLVM.
4:15pm–4:55pm Friday, 10/17/2014
Location: 1 E20/1 E21
Greg Rahn (Cloudera)
Average rating: 4.80 (5 ratings)
In the last two years we've seen the introduction of several open-source SQL engines for Hadoop. There have been numerous marketing claims around SQL-on-Hadoop performance, but what should you believe? How do these different engines compare on functionality? This talk will compare and contrast Hive, Impala, and Presto, all from a non-vendor, unsponsored, independent point of view.