Presented By O'Reilly and Cloudera
Make Data Work
31 May–1 June 2016: Training
1 June–3 June 2016: Conference
London, UK

Data modeling for data science: Simplify your workload with complex types

Marcel Kornacker (Cloudera)
14:05–14:45 Friday, 3/06/2016
Data science & advanced analytics
Location: Capital Suite 8/9 Level: Intermediate
Average rating: 4.33 (6 ratings)

Prerequisite knowledge

Attendees should have a basic understanding of SQL and analytics.

Description

Complex types (structs, arrays, and maps) and the resulting nested schemas initially gained prominence with XML as a niche solution for document-based data. However, over the past few years, they have become mainstream in Hadoop-based data modeling and storage: virtually all modern serialization and storage formats (JSON, Protocol Buffers, Avro, Thrift, Parquet, ORC, etc.) now support complex types, and most Hadoop-based analytic frameworks allow the user to interact with nested schemas.

Marcel Kornacker explains how nested data structures can increase analytic productivity, using the well-known TPC-H schema to demonstrate how nested schemas simplify analytic workloads. Marcel also covers best practices for converting flat relational schemas into nested schemas and explores examples of data science-style analysis that use the SQL support for complex types in Apache Impala (incubating).
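To make the idea of converting a flat schema into a nested one concrete, here is a minimal sketch in plain Python (not Impala SQL, and not taken from the talk). It assumes a heavily simplified subset of the TPC-H customer, orders, and lineitem tables; the column names follow TPC-H naming, while the nested field names `c_orders` and `o_lineitems` and the sample values are hypothetical.

```python
from collections import defaultdict

# Hypothetical, heavily simplified TPC-H-style rows.
# The real TPC-H tables carry many more columns.
customers = [
    {"c_custkey": 1, "c_name": "Customer#1"},
    {"c_custkey": 2, "c_name": "Customer#2"},
]
orders = [
    {"o_orderkey": 10, "o_custkey": 1},
    {"o_orderkey": 11, "o_custkey": 1},
    {"o_orderkey": 12, "o_custkey": 2},
]
lineitems = [
    {"l_orderkey": 10, "l_extendedprice": 100.0},
    {"l_orderkey": 10, "l_extendedprice": 50.0},
    {"l_orderkey": 11, "l_extendedprice": 25.0},
    {"l_orderkey": 12, "l_extendedprice": 10.0},
]


def nest(customers, orders, lineitems):
    """Fold the three flat tables into one list of nested customer records."""
    # Group line items under their order key (the foreign key disappears).
    items_by_order = defaultdict(list)
    for li in lineitems:
        items_by_order[li["l_orderkey"]].append(
            {"l_extendedprice": li["l_extendedprice"]}
        )

    # Group orders under their customer key, embedding each order's line items.
    orders_by_customer = defaultdict(list)
    for o in orders:
        orders_by_customer[o["o_custkey"]].append(
            {
                "o_orderkey": o["o_orderkey"],
                "o_lineitems": items_by_order.get(o["o_orderkey"], []),
            }
        )

    # Each customer becomes a struct holding an array of order structs.
    return [
        {**c, "c_orders": orders_by_customer.get(c["c_custkey"], [])}
        for c in customers
    ]


def revenue_per_customer(nested):
    """Sum line-item prices per customer by walking the nested arrays."""
    return {
        c["c_name"]: sum(
            li["l_extendedprice"]
            for o in c["c_orders"]
            for li in o["o_lineitems"]
        )
        for c in nested
    }


nested = nest(customers, orders, lineitems)
print(revenue_per_customer(nested))
```

Once the data is nested, the per-customer aggregate needs no join across three tables: the parent-child relationships are already materialized in the record, which is the kind of simplification the abstract describes.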

Marcel Kornacker

Cloudera

Marcel Kornacker is a tech lead at Cloudera and the architect of Apache Impala (incubating). Marcel has held engineering jobs at a few database-related startup companies and at Google, where he worked on several ad-serving and storage infrastructure projects. His last engagement was as the tech lead for the distributed query engine component of Google’s F1 project. Marcel holds a PhD in databases from UC Berkeley.