Complex types (structs, arrays, and maps) and the resulting nested schemas initially gained prominence with XML as a niche solution for document-based data. However, over the past few years, they have become mainstream in Hadoop-based data modeling and storage: virtually all modern serialization and storage formats (JSON, Protocol Buffers, Avro, Thrift, Parquet, ORC, etc.) now support complex types, and most Hadoop-based analytic frameworks allow the user to interact with nested schemas.
Marcel Kornacker explains how nested data structures can increase analytic productivity, using the well-known TPC-H schema to demonstrate how to simplify analytic workloads with nested schemas. Marcel also covers best practices for converting flat relational schemas into nested schemas and explores examples of data science-style analysis utilizing Apache Impala’s (incubating) support for complex types in SQL.
Marcel Kornacker is a tech lead at Cloudera and the architect of Apache Impala (incubating). Marcel has held engineering jobs at a few database-related startup companies and at Google, where he worked on several ad-serving and storage infrastructure projects. His last engagement was as the tech lead for the distributed query engine component of Google’s F1 project. Marcel holds a PhD in databases from UC Berkeley.
©2016, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.