Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

A brave new world in mutable big data: Relational storage

Todd Lipcon (Cloudera)
2:05pm–2:45pm Wednesday, September 27, 2017
Data Engineering & Architecture, Real-time applications
Location: 1A 23/24 Level: Intermediate
Secondary topics: Streaming
Average rating: 5.00 (3 ratings)

Who is this presentation for?

  • IoT practitioners, DBAs, sysadmins, and IT architects

Prerequisite knowledge

  • Basic knowledge of relational data modeling (SQL tables and basic queries) and big data ecosystem components (e.g., common use cases for Spark Streaming, HDFS, and NoSQL)

What you'll learn

  • Understand the different storage engine characteristics necessary for big data applications and how to evaluate trade-offs between different options available in today’s market
  • Learn how relational storage and the flexibility to choose between NoSQL and SQL access can simplify architectures
  • Explore Google Cloud Spanner and Apache Kudu and gain insight into how to pick a storage engine for a new project

Description

The ever-increasing interest in running fast analytic scans on constantly updating data is stretching the capabilities of HDFS and NoSQL storage. Users want the fast online updates and real-time data serving that NoSQL offers, as well as the fast scans, analytics, and processing that HDFS offers. Additionally, users are demanding that big data storage systems integrate natively with their existing BI and analytic technology investments, which typically use SQL as their standard query language. This demand has led big data back to a familiar friend: relationally structured data storage systems.

Todd Lipcon explores the advantages of relational storage and reviews new developments, including Google Cloud Spanner and Apache Kudu, which provide a scalable relational solution for users who have too much data for a legacy high-performance analytic system. Todd explains how to address use cases that fall between HDFS and NoSQL with technologies like Apache Kudu or Google Cloud Spanner and how the combination of relational data models, SQL query support, and native API-based access enables the next generation of big data applications. Along the way, he also covers suggested architectures, the performance characteristics of Kudu and Spanner, and the deployment flexibility each option provides.

Todd Lipcon

Cloudera

Todd Lipcon is an engineer at Cloudera, where he primarily contributes to open source distributed systems in the Apache Hadoop ecosystem. Previously, he focused on Apache HBase, HDFS, and MapReduce, where he designed and implemented redundant metadata storage for the NameNode (QuorumJournalManager), ZooKeeper-based automatic failover, and numerous performance, durability, and stability improvements. In 2012, Todd founded the Apache Kudu project and has spent the last three years leading the Kudu team. Todd is a committer and PMC member on Apache HBase, Hadoop, Thrift, and Kudu, as well as a member of the Apache Software Foundation. Prior to Cloudera, Todd worked on web infrastructure at several startups and researched novel machine learning methods for collaborative filtering. Todd holds a bachelor’s degree with honors from Brown University.