One of the key values of the Hadoop ecosystem is its flexibility. A myriad of components make up this ecosystem, allowing Hadoop to tackle otherwise intractable problems. However, having so many components imposes a significant integration, implementation, and usability burden: features that ought to work in all components often require sizable per-component effort to ensure correctness across the stack.
In this talk, we introduce RecordService, a new solution to this problem. The service provides an API that reads data from Hadoop storage managers and returns it as canonical records. This eliminates the need for each component to support individual file formats, handle security, perform auditing, and implement sophisticated I/O scheduling and the other common processing that underlies any computation.
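To make the idea concrete, here is a minimal sketch of what a canonical-record API could look like. All names here (`Record`, `InMemoryRecordReader`, `scan`) are invented for illustration and are not the actual RecordService API; a real reader would sit in front of Hadoop storage managers and enforce security, auditing, and I/O scheduling behind this interface.

```python
class Record:
    """A canonical record: an ordered mapping of column names to values."""
    def __init__(self, schema, values):
        self.schema = schema
        self.values = values

    def __getitem__(self, col):
        return self.values[self.schema.index(col)]


class InMemoryRecordReader:
    """Stand-in for a storage-manager-backed reader. In a real service,
    file-format handling, security checks, auditing, and I/O scheduling
    would all live here, so compute frameworks only ever see records."""
    def __init__(self, schema, rows):
        self.schema = schema
        self.rows = rows

    def scan(self, columns=None, predicate=None):
        cols = columns or self.schema
        for row in self.rows:
            rec = Record(self.schema, row)
            if predicate is None or predicate(rec):
                # Column-level projection: only requested columns are returned,
                # which is how column-level security could be enforced.
                yield Record(cols, [rec[c] for c in cols])


reader = InMemoryRecordReader(
    schema=["user", "country", "clicks"],
    rows=[["alice", "US", 3], ["bob", "DE", 7]],
)
# Row- and column-level filtering, as a policy layer might apply it:
us_clicks = [r["clicks"] for r in reader.scan(
    columns=["clicks"], predicate=lambda r: r["country"] == "US")]
```

Because every framework consumes the same record shape, integrations like the MapReduce and Spark ones discussed in the talk need only adapt this one interface rather than each storage format.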
We discuss the architecture of the service and the integration work done for MapReduce and Spark. Many existing applications on those frameworks can take advantage of the service with little to no modification. We demonstrate how this provides fine-grained (column-level and row-level) security through Sentry integration and improves performance for existing MapReduce and Spark applications by up to 5×. We conclude by discussing how this architecture can enable significant future improvements to the Hadoop ecosystem.
Lenni Kuff is an engineering manager at Facebook within Core Systems Infrastructure. Before joining Facebook, he worked at Cloudera for five years on Impala, Hive, and Sentry. Prior to Cloudera, Lenni was a software engineer at Microsoft on a number of projects, including the SQL Server storage engine, SQL Azure, and Hadoop on Azure. Lenni graduated from the University of Wisconsin-Madison with degrees in computer science and computer engineering.
Nong Li is a software engineer at Cloudera working on the RecordService and Impala projects. Before joining Cloudera, he worked at Microsoft developing new APIs for the Windows graphics system (DirectX). Nong holds a Sc.B. in computer science from Brown University.
Stephen Romanoff is a director in Capital One’s Technology organization. He leads teams in developing data management solutions for Capital One’s big data initiatives. Before joining Capital One, he was a consultant specializing in big data capabilities—development, architecture, and strategy—for numerous federal government agencies. He has degrees from Emory University and the University of Virginia.
©2015, O'Reilly Media, Inc. All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with, and does not endorse or review, the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.