Complex types (structs, arrays, and maps) and the nested schemas they produce first gained prominence with XML, as a niche solution for document-based data. Over the past few years they have become mainstream in Hadoop-based data modeling and storage: virtually all modern serialization and storage formats (JSON, Protocol Buffers, Avro, Thrift, Parquet, ORC) now support complex types, and most Hadoop-based analytic frameworks let the user interact with nested schemas.
In this talk, we will explain how data scientists use nested data structures to increase their analytic productivity. We will use two well-known relational schemas – TPC-H and Twitter – to demonstrate how nested schemas simplify data science workloads. As part of that, we will outline best practices for converting flat relational schemas into nested schemas, and give examples of data science-style analysis using Impala’s recently added support for complex types in SQL. Beyond their expressive power, nested schemas, when combined with modern columnar formats such as Parquet, also improve productivity through performance gains, which we will demonstrate by comparing the nested and flat relational versions of the TPC-H and Twitter schemas.
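As a rough illustration of what converting a flat schema into a nested one can look like, the Python sketch below folds a child table into an array-of-structs column on its parent. It is not taken from the talk and does not use Impala: the column names follow TPC-H naming conventions, but the rows, the `nest_orders` helper, and the `c_orders` column name are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical flat, TPC-H-style rows (values are made up for illustration).
customers = [
    {"c_custkey": 1, "c_name": "Customer#1"},
    {"c_custkey": 2, "c_name": "Customer#2"},
]
orders = [
    {"o_orderkey": 10, "o_custkey": 1, "o_totalprice": 100.0},
    {"o_orderkey": 11, "o_custkey": 1, "o_totalprice": 50.0},
    {"o_orderkey": 12, "o_custkey": 2, "o_totalprice": 75.0},
]


def nest_orders(customers, orders):
    """Fold the child table into an array-of-structs column on the parent."""
    by_customer = defaultdict(list)
    for o in orders:
        # Drop the foreign key: the nesting itself now encodes the relationship.
        child = {k: v for k, v in o.items() if k != "o_custkey"}
        by_customer[o["o_custkey"]].append(child)
    # "c_orders" is an assumed name for the new nested column.
    return [dict(c, c_orders=by_customer.get(c["c_custkey"], [])) for c in customers]


nested = nest_orders(customers, orders)

# Per-customer revenue without a join: iterate the nested array directly.
revenue = {c["c_name"]: sum(o["o_totalprice"] for o in c["c_orders"]) for c in nested}
```

In the nested form, the parent-child relationship is stored in the record itself, so a per-customer aggregate becomes a loop over one record's array rather than a join between two tables.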
Marcel Kornacker is a tech lead at Cloudera and the architect of Apache Impala (incubating). Marcel has held engineering jobs at a few database-related startup companies and at Google, where he worked on several ad-serving and storage infrastructure projects. His last engagement was as the tech lead for the distributed query engine component of Google’s F1 project. Marcel holds a PhD in databases from UC Berkeley.
Skye Wanderman-Milne is an engineer on the Impala team at Cloudera.
©2015, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.