Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY

Simplifying Hadoop: RecordService, a secure and unified data access path for compute frameworks

Lenni Kuff (Facebook), Nong Li (Cloudera), Stephen Romanoff (Capital One)
1:15pm–1:55pm Wednesday, 09/30/2015
Hadoop Internals & Development
Location: 1 E16 / 1 E17
Level: Intermediate
Average rating: 4.05 (21 ratings)

One of the key strengths of the Hadoop ecosystem is its flexibility. A myriad of components make up this ecosystem, allowing Hadoop to tackle otherwise intractable problems. However, having so many components imposes a significant integration, implementation, and usability burden. Features that ought to work in all the components often require sizable per-component effort to ensure correctness across the stack.

In this talk, we introduce RecordService, a new solution to this problem. The service provides an API that reads data from Hadoop storage managers and returns it as canonical records. This eliminates the need for each component to support individual file formats, handle security, perform auditing, and implement sophisticated IO scheduling and the other common processing that sits at the bottom of any computation.

We discuss the architecture of the service and the integration work done for MapReduce and Spark. Many existing applications on those frameworks can take advantage of the service with little to no modification. We demonstrate how the service provides fine-grained (column-level and row-level) security through Sentry integration and improves performance for existing MapReduce and Spark applications by up to 5×. We conclude by discussing how this architecture can enable significant future improvements to the Hadoop ecosystem.

Lenni Kuff

Facebook

Lenni Kuff is an engineering manager at Facebook within Core Systems Infrastructure. Before joining Facebook, he worked at Cloudera for five years on Impala, Hive, and Sentry. Prior to Cloudera, Lenni was a software engineer at Microsoft on a number of projects, including the SQL Server storage engine, SQL Azure, and Hadoop on Azure. Lenni graduated from the University of Wisconsin-Madison with degrees in computer science and computer engineering.

Nong Li

Cloudera

Nong Li is a software engineer at Cloudera working on the RecordService and Impala projects. Before joining Cloudera, he worked at Microsoft developing new APIs for the Windows graphics system (DirectX). Nong holds an Sc.B. in computer science from Brown University.

Stephen Romanoff

Capital One

Stephen Romanoff is a director in Capital One’s Technology organization. He leads teams in developing data management solutions for Capital One’s big data initiatives. Before joining Capital One, he was a consultant specializing in big data capabilities—development, architecture, and strategy—for numerous federal government agencies. He has degrees from Emory University and the University of Virginia.