Presented By O'Reilly and Cloudera
Make Data Work
December 1–3, 2015 • Singapore

Data modeling for data science: Simplify your workload with complex types

Marcel Kornacker (Cloudera), Skye Wanderman-Milne (Cloudera)
1:30pm–2:10pm Wednesday, 12/02/2015
Data Science and Advanced Analytics
Location: 321-322
Level: Intermediate
Average rating: 3.90 (10 ratings)

Prerequisite Knowledge

Basic knowledge of data modeling and analytics

Description

Complex types (structs, arrays, and maps) and the resulting nested schemas initially gained prominence with XML as a niche solution for document-based data, but over the past few years have become mainstream in Hadoop-based data modeling and storage: virtually all modern serialization and storage formats (JSON, Protocol Buffers, Avro, Thrift, Parquet, ORC) now support complex types, and most Hadoop-based analytic frameworks allow the user to interact with nested schemas.

In this talk, we will explain how data scientists use nested data structures in order to increase their analytic productivity. We will use two well-known relational schemas, TPC-H and Twitter, to demonstrate how to simplify data science workloads with nested schemas. As part of that, we will outline best practices for converting flat relational schemas into nested schemas, and give examples of data science-style analysis utilizing Impala’s recently added support for complex types in SQL. Aside from their expressive power, nested schemas, when married to modern columnar formats such as Parquet, also enhance productivity through performance gains, which we will again demonstrate by comparing and contrasting the nested and flat relational versions of the TPC-H and Twitter schemas.
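
As a rough illustration of what converting a flat relational schema into a nested one can look like, here is a minimal Python sketch. It is not taken from the talk: the rows are invented, the column names are only loosely modeled on TPC-H's orders and lineitem tables, and `nest_lineitems` and `o_lineitems` are hypothetical names chosen for this example.

```python
from collections import defaultdict

# Hypothetical flat rows, loosely modeled on TPC-H's orders and lineitem tables.
orders = [
    {"o_orderkey": 1, "o_custkey": 10},
    {"o_orderkey": 2, "o_custkey": 11},
]
lineitems = [
    {"l_orderkey": 1, "l_partkey": 100, "l_quantity": 5},
    {"l_orderkey": 1, "l_partkey": 101, "l_quantity": 2},
    {"l_orderkey": 2, "l_partkey": 100, "l_quantity": 7},
]


def nest_lineitems(orders, lineitems):
    """Fold child rows into an array of structs attached to each parent row."""
    by_order = defaultdict(list)
    for item in lineitems:
        # The foreign key is dropped: it is implied by the nesting.
        by_order[item["l_orderkey"]].append(
            {"l_partkey": item["l_partkey"], "l_quantity": item["l_quantity"]}
        )
    return [
        dict(order, o_lineitems=by_order.get(order["o_orderkey"], []))
        for order in orders
    ]


nested_orders = nest_lineitems(orders, lineitems)

# Total quantity per order, computed from the nested array with no join.
totals = {
    o["o_orderkey"]: sum(li["l_quantity"] for li in o["o_lineitems"])
    for o in nested_orders
}
```

In the nested form, each order carries its line items as an array of structs, so a per-order aggregate reads a single record instead of joining two tables on the order key.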

Marcel Kornacker

Cloudera

Marcel Kornacker is a tech lead at Cloudera and the architect of Apache Impala (incubating). Marcel has held engineering jobs at a few database-related startup companies and at Google, where he worked on several ad-serving and storage infrastructure projects. His last engagement was as the tech lead for the distributed query engine component of Google’s F1 project. Marcel holds a PhD in databases from UC Berkeley.

Skye Wanderman-Milne

Cloudera

Skye Wanderman-Milne is an engineer on the Impala team at Cloudera.