Presented By O’Reilly and Cloudera

San Francisco • London • New York

Make Data Work

September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Hudi: Unifying storage and serving for batch and near-real-time analytics

Nishith Agarwal (Uber), Balaji Varadarajan (Uber), Vinoth Chandar (Apache Hudi)

5:25pm–6:05pm Wednesday, 09/12/2018

Data engineering and architecture, Streaming systems & real-time applications
Location: 1E 07/08 Level: Beginner

Secondary topics: Data Integration and Data Pipelines

Who is this presentation for?

Data engineers and architects

Prerequisite knowledge

A basic understanding of the big data ecosystem (Hadoop, Spark, etc.)

What you'll learn

Explore Hudi, an open source analytical storage system created at Uber to manage petabytes of data on HDFS-like distributed storage
Understand the trade-offs in managing analytical storage across different use cases

Description

Hudi (formerly Hoodie) is an open source analytical storage system created at Uber to manage petabytes of data on HDFS-like distributed storage. Hudi enables near-real-time ingestion and provides different views of the data—a read-optimized view for batch analytics, a real-time view for driving dashboards, and an incremental view for powering data pipelines. Hudi also effectively manages files on underlying storage to maximize operational health and reliability.

Nishith Agarwal, Balaji Varadarajan, and Vinoth Chandar outline the design and architecture of merge-on-read storage and explain how it lowers data latency across the board while achieving orders-of-magnitude efficiency gains over traditional batch ingestion. They make the case for near-real-time dashboarding on top of Hudi datasets, which can be cheaper than pure streaming architectures, and detail how Uber leverages Hudi for use cases around ingestion, incremental ETL, and GDPR compliance.
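
To make the idea of merge-on-read storage and the three views more concrete, here is a minimal, purely illustrative sketch in Python. It is a toy in-memory model, not Hudi's API or implementation: the class name ToyMergeOnReadTable, its methods, and the sample trip records are all hypothetical. It assumes that writes land in an append-only delta log, that a separate compaction step folds the log into base data, and that each view differs only in which of the two it reads.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMergeOnReadTable:
    """Toy in-memory model of merge-on-read; hypothetical, not Hudi's API."""

    # key -> (commit, value); stands in for compacted base files
    base: dict = field(default_factory=dict)
    # [(commit, key, value)]; stands in for append-only delta logs
    log: list = field(default_factory=list)
    commit: int = 0

    def upsert(self, records: dict) -> int:
        """Append records to the delta log under a new commit number."""
        self.commit += 1
        for key, value in records.items():
            self.log.append((self.commit, key, value))
        return self.commit

    def compact(self) -> None:
        """Fold the logged deltas into the base data and clear the log."""
        for commit, key, value in self.log:
            self.base[key] = (commit, value)
        self.log.clear()

    def read_optimized_view(self) -> dict:
        """Read only compacted base data: cheap to scan, but may lag."""
        return {key: value for key, (_, value) in self.base.items()}

    def real_time_view(self) -> dict:
        """Merge base data with pending deltas at read time: fresher, costlier."""
        merged = self.read_optimized_view()
        for _, key, value in self.log:
            merged[key] = value
        return merged

    def incremental_view(self, since_commit: int) -> dict:
        """Return the latest value of every record changed after since_commit."""
        changed = {
            key: value
            for key, (commit, value) in self.base.items()
            if commit > since_commit
        }
        for commit, key, value in self.log:
            if commit > since_commit:
                changed[key] = value
        return changed


table = ToyMergeOnReadTable()
first_commit = table.upsert({"trip-1": "requested", "trip-2": "requested"})
table.compact()
table.upsert({"trip-1": "completed"})  # not yet compacted

print(table.read_optimized_view())
print(table.real_time_view())
print(table.incremental_view(first_commit))
```

In this sketch the read-optimized view still shows trip-1 as requested until the next compaction, the real-time view already shows it as completed, and the incremental view returns only the record that changed after the first commit, which is the kind of input a downstream incremental ETL job would consume.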

Nishith Agarwal

Uber

Nishith Agarwal is a senior software engineer at Uber, where he works on the Hudi project and the Hadoop platform at large. His interests lie in large-scale distributed and data systems.

Balaji Varadarajan

Uber

Balaji Varadarajan is a senior software engineer at Uber, where he works on the Hudi project and oversees data engineering broadly across the network performance monitoring domain. Previously, he was one of the lead engineers on LinkedIn’s databus change capture system as well as the Espresso NoSQL store. Balaji’s interests lie in distributed data systems.

Vinoth Chandar

Apache Hudi

Vinoth Chandar is the cocreator of the Hudi project at Uber and is also a PMC member and lead of Apache Hudi (Incubating). Previously, he was a senior staff engineer at Uber, where he led projects across technology areas such as data infrastructure, data architecture, and mobile and network performance; was the LinkedIn lead on Voldemort; and worked on Oracle Server’s replication engine, HPC, and stream processing. Vinoth has a keen interest in unified architectures for data analytics and processing.

Comments on this page are now closed.

Comments

dhiraj nimbalkar | 10/10/2018 2:43pm EDT

How can I get access to Hudi: Unifying storage and serving for batch and near-real-time analytics webinar?
