Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY

Data modeling for data science: Simplify your workload with complex types

Marcel Kornacker (Cloudera), Josh Wills (Cloudera), Alexander Behm (Cloudera)
4:35pm–5:15pm Wednesday, 09/30/2015
Data Science & Advanced Analytics
Location: 1 E8 / 1 E9
Level: Intermediate
Average rating: 3.25 (12 ratings)

Complex types (structs, arrays, and maps) and the resulting nested schemas initially gained prominence with XML as a niche solution for document-based data. However, over the past few years they have become mainstream in Hadoop-based data modeling and storage: virtually all modern serialization and storage formats (JSON, Protocol Buffers, Avro, Thrift, Parquet, ORC) now support complex types, and most Hadoop-based analytic frameworks allow the user to interact with nested schemas.

In this talk, we will explain how data scientists use nested data structures in order to increase their analytic productivity. We will use two well-known relational schemas, TPC-H and Twitter, to demonstrate how to simplify data science workloads with nested schemas. As part of that, we will outline best practices for converting flat relational schemas into nested schemas, and give examples of data science-style analysis utilizing Impala’s recently added support for complex types in SQL. Aside from their expressive power, nested schemas, when married to modern columnar formats such as Parquet, also enhance productivity through performance gains, which we will again demonstrate by comparing and contrasting the nested and flat relational versions of the TPC-H and Twitter schemas.
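As a rough illustration of what converting a flat relational schema into a nested one can look like, the sketch below folds a child table into its parent as an array of structs. It is not taken from the talk: the sample rows are invented, the column names are borrowed from the TPC-H ORDERS and LINEITEM tables, and the `nest_orders` helper and the `o_lineitems` field are hypothetical names chosen for this example.

```python
from collections import defaultdict

# Invented sample rows; column names follow TPC-H ORDERS and LINEITEM.
orders = [
    {"o_orderkey": 1, "o_custkey": 10},
    {"o_orderkey": 2, "o_custkey": 11},
]
lineitems = [
    {"l_orderkey": 1, "l_partkey": 100, "l_quantity": 5},
    {"l_orderkey": 1, "l_partkey": 101, "l_quantity": 2},
    {"l_orderkey": 2, "l_partkey": 100, "l_quantity": 7},
]


def nest_orders(orders, lineitems):
    """Embed each order's line items as an array of structs (hypothetical helper)."""
    items_by_order = defaultdict(list)
    for item in lineitems:
        # Drop the foreign key: the nesting itself now encodes the relationship.
        child = {k: v for k, v in item.items() if k != "l_orderkey"}
        items_by_order[item["l_orderkey"]].append(child)
    return [
        # "o_lineitems" is an assumed field name, not part of TPC-H.
        {**order, "o_lineitems": items_by_order.get(order["o_orderkey"], [])}
        for order in orders
    ]


nested = nest_orders(orders, lineitems)

# Total quantity per order, computed from the nested records without a join.
totals = {
    o["o_orderkey"]: sum(i["l_quantity"] for i in o["o_lineitems"])
    for o in nested
}
```

With the child rows stored inside the parent record, a per-order aggregate only has to walk one record's array rather than join two tables, which is the kind of simplification the abstract describes.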

Marcel Kornacker

Cloudera

Marcel Kornacker is a tech lead at Cloudera and the architect of Apache Impala (incubating). Marcel has held engineering jobs at a few database-related startup companies and at Google, where he worked on several ad-serving and storage infrastructure projects. His last engagement was as the tech lead for the distributed query engine component of Google’s F1 project. Marcel holds a PhD in databases from UC Berkeley.

Josh Wills

Cloudera

Josh Wills is director of data science at Cloudera, where he works with customers and engineers to develop Hadoop-based solutions across a wide range of industries. Prior to joining Cloudera, Josh was at Google, where he worked on the ad auction system and then led the development of the analytics infrastructure used in Google+. He earned his bachelor’s degree in mathematics from Duke University and his master’s in operations research from the University of Texas-Austin.

Alexander Behm

Cloudera

Alex Behm is a software engineer at Cloudera, working on the Impala team. He holds a PhD in computer science from UC Irvine.

Comments

Alexander Behm
10/05/2015 8:05am EDT

Yuxi, the slides are attached to this page now.

Yuxi He
10/02/2015 7:42am EDT

Thanks for the nice talk! Where can I find your slides?