One of the key strengths of the Hadoop ecosystem is its flexibility. A myriad of components make up this ecosystem, allowing Hadoop to tackle otherwise intractable problems. However, having so many components imposes a significant integration, implementation, and usability burden. Features that ought to work in every component often require sizable per-component effort to ensure correctness across the stack.
Chao Sun and Alex Leblang explore RecordService, a new solution that provides an API to read data from Hadoop storage managers and return it as canonical records. This eliminates the need for each component to support individual file formats, handle security, perform auditing, and implement sophisticated IO scheduling and the other common processing that underlies any computation.
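The idea of returning canonical records regardless of the underlying file format can be illustrated with a minimal sketch. This is not the RecordService API: `read_records`, `FORMAT_READERS`, and the two toy formats (CSV and JSON lines) are hypothetical names chosen only to show how a single read path might hide format details from the caller.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterator, List

# Hypothetical illustration only; none of these names come from RecordService.
Record = Dict[str, str]


def _read_csv(raw: str) -> Iterator[Record]:
    """Parse CSV text with a header row into records."""
    yield from csv.DictReader(io.StringIO(raw))


def _read_json_lines(raw: str) -> Iterator[Record]:
    """Parse one JSON object per line, coercing values to strings."""
    for line in raw.splitlines():
        if line.strip():
            yield {key: str(value) for key, value in json.loads(line).items()}


# One registry of format readers, owned by the service rather than by each client.
FORMAT_READERS: Dict[str, Callable[[str], Iterator[Record]]] = {
    "csv": _read_csv,
    "jsonl": _read_json_lines,
}


def read_records(raw: str, fmt: str) -> List[Record]:
    """Return the same canonical record shape whatever the storage format."""
    return [dict(record) for record in FORMAT_READERS[fmt](raw)]


csv_data = "id,name\n1,ada\n2,bob\n"
jsonl_data = '{"id": 1, "name": "ada"}\n{"id": 2, "name": "bob"}\n'
csv_records = read_records(csv_data, "csv")
jsonl_records = read_records(jsonl_data, "jsonl")
```

In this sketch a caller never touches the format-specific parsers, so adding a new format means changing one registry instead of every consuming framework.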
Chao and Alex discuss the architecture of the service and the integration work done for MapReduce and Spark. Many existing applications on those frameworks can take advantage of the service with little to no modification. Chao and Alex demonstrate how this provides fine-grained (column-level and row-level) security through Sentry integration and improves performance for existing MapReduce and Spark applications by up to 5×. They conclude by explaining how this architecture can enable significant future improvements to the Hadoop ecosystem.
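Column-level and row-level security can be pictured as a filter applied inside the read path, before records reach the application. The sketch below is an assumption-laden illustration, not Sentry's policy model or RecordService's implementation: `AccessPolicy` and `secure_scan` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set

Row = Dict[str, object]


@dataclass(frozen=True)
class AccessPolicy:
    """Hypothetical policy: which columns a user may see, and which rows."""
    allowed_columns: Set[str]
    row_filter: Callable[[Row], bool]


def secure_scan(rows: Iterable[Row], policy: AccessPolicy) -> List[Row]:
    """Drop rows the policy rejects, then project away disallowed columns."""
    return [
        {col: val for col, val in row.items() if col in policy.allowed_columns}
        for row in rows
        if policy.row_filter(row)
    ]


employees = [
    {"name": "ada", "region": "EU", "salary": 100},
    {"name": "bob", "region": "US", "salary": 90},
]

# An analyst restricted to EU rows who may not read the salary column.
eu_analyst = AccessPolicy(
    allowed_columns={"name", "region"},
    row_filter=lambda row: row["region"] == "EU",
)
visible = secure_scan(employees, eu_analyst)
```

Because the filter runs in the shared read path, the row predicate can still inspect columns (such as `region`) that are needed for the decision, and the application only ever receives the projected result.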
Chao Sun is currently a software engineer at Cloudera working on the RecordService project. Before that, Chao worked on the Hive on Spark project. He holds a PhD in computer science from the University of Wisconsin-Milwaukee, where he focused on type systems and programming languages.
Alex Leblang is an engineer at Cloudera on the RecordService team. Previously, Alex was an Apache Impala (incubating) engineer and interned at Vertica. He holds a bachelor’s degree from Brown University with concentrations in computer science and Latin American studies.
©2016, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.