Complex types (structs, arrays, and maps) and the resulting nested schemas initially gained prominence with XML as a niche solution for document-based data. However, over the past few years they have become mainstream in Hadoop-based data modeling and storage: virtually all modern serialization and storage formats (JSON, Protocol Buffers, Avro, Thrift, Parquet, ORC) now support complex types, and most Hadoop-based analytic frameworks allow the user to interact with nested schemas.
In this talk, we will explain how data scientists use nested data structures to increase their analytic productivity. We will use two well-known relational schemas – TPC-H and Twitter – to demonstrate how nested schemas simplify data science workloads. As part of that, we will outline best practices for converting flat relational schemas into nested schemas, and give examples of data science-style analysis using Impala’s recently added SQL support for complex types. Beyond their expressive power, nested schemas also improve productivity through performance gains when combined with modern columnar formats such as Parquet. We will demonstrate those gains by comparing the nested and flat relational versions of the TPC-H and Twitter schemas.
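As a rough illustration of the flat-to-nested conversion mentioned above, the following Python sketch folds child rows into an array-of-structs field on their parent row and then aggregates without a join. The rows are made-up sample data loosely modeled on the TPC-H orders and lineitem tables; the field name `o_lineitems` and the helper `nest_orders` are hypothetical and are not taken from the talk or from Impala.

```python
from collections import defaultdict

# Made-up flat rows, loosely modeled on TPC-H orders and lineitem.
orders = [
    {"o_orderkey": 1, "o_custkey": 10},
    {"o_orderkey": 2, "o_custkey": 11},
]
lineitems = [
    {"l_orderkey": 1, "l_quantity": 5, "l_extendedprice": 100.0},
    {"l_orderkey": 1, "l_quantity": 2, "l_extendedprice": 40.0},
    {"l_orderkey": 2, "l_quantity": 1, "l_extendedprice": 15.0},
]


def nest_orders(orders, lineitems):
    """Fold child rows into an array-of-structs field on each parent row."""
    items_by_order = defaultdict(list)
    for item in lineitems:
        # The foreign key becomes redundant once the child lives inside its parent.
        child = {k: v for k, v in item.items() if k != "l_orderkey"}
        items_by_order[item["l_orderkey"]].append(child)
    # "o_lineitems" is a hypothetical name for the nested collection.
    return [
        dict(order, o_lineitems=items_by_order.get(order["o_orderkey"], []))
        for order in orders
    ]


nested = nest_orders(orders, lineitems)

# Per-order revenue is now computed within each record, with no join.
revenue = {
    order["o_orderkey"]: sum(i["l_extendedprice"] for i in order["o_lineitems"])
    for order in nested
}
print(revenue)  # {1: 140.0, 2: 15.0}
```

The same idea carries over to a columnar format: once the line items are stored inside their order, a per-order aggregate reads one nested collection instead of matching two tables on a key.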
Marcel Kornacker is a tech lead at Cloudera and the architect of Apache Impala (incubating). Marcel has held engineering jobs at a few database-related startup companies and at Google, where he worked on several ad-serving and storage infrastructure projects. His last engagement was as the tech lead for the distributed query engine component of Google’s F1 project. Marcel holds a PhD in databases from UC Berkeley.
Josh Wills is director of data science at Cloudera, where he works with customers and engineers to develop Hadoop-based solutions across a wide range of industries. Prior to joining Cloudera, Josh was at Google, where he worked on the ad auction system and then led the development of the analytics infrastructure used in Google+. He earned his bachelor’s degree in mathematics from Duke University and his master’s in operations research from the University of Texas at Austin.
Alex Behm is a software engineer at Cloudera, working on the Impala team. He holds a PhD in computer science from UC Irvine.
©2015, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.