Bringing stream processing to batch data using Apache Hudi (incubating)
Who is this presentation for?
Data engineers, data architects, developers
Data processing is typically viewed through two distinct lenses: batch and stream processing. Over time, separate technologies and communities have evolved for each, with little cross-pollination of efficient techniques and tools. Batch processing remains the robust, scalable workhorse, while stream processing has unlocked massive value by producing results quickly. Because it is stateless, batch processing can make even complex ETL jobs simple, but at the cost of inefficient reprocessing, even for simple ETLs and pipelines. Stream processing, on the other hand, has made the simple pipeline dead simple and is evolving toward supporting more complex processing needs.
Apache Hudi (incubating) provides stream processing APIs on large volumes of batch data stored in cloud stores or the Hadoop Distributed File System (HDFS). Balaji Varadarajan shows you how Hudi makes batch ETLs and pipelines, simple or complex, faster and more efficient by adopting the same techniques employed by state-of-the-art stream processing systems, in a manner that’s relatable to data analysts and data engineers authoring batch pipelines. You’ll explore from the ground up the current de facto architecture and how it can be improved across three key classes of batch processing pipelines: BI and roll-up aggregation, feature stores for machine learning, and data warehousing.
For developers challenged by complex stream processing pipelines, Balaji outlines how Hudi overcomes some typical challenges in stream processing (e.g., correctness of stream-stream joins with late-arriving data) by trading latency for more complete results. Last but not least, he relates these techniques to real-world experience from supporting such use cases at Uber.
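As a rough illustration of the incremental idea described above, the following plain-Python sketch contrasts a full batch recompute with applying only the records changed since the last run. It is not Hudi's API: `ToyTable`, `changes_since`, `full_rollup`, and `apply_changes` are hypothetical names invented for this example, and the trip records are made-up data.

```python
class ToyTable:
    """Hypothetical keyed table that logs every write under a commit number."""

    def __init__(self):
        self.rows = {}    # key -> current record
        self.log = []     # (commit, old_record_or_None, new_record)
        self.commit = 0

    def upsert(self, records):
        """Insert new keys or overwrite existing ones under a new commit."""
        self.commit += 1
        for key, new in records.items():
            old = self.rows.get(key)
            self.rows[key] = new
            self.log.append((self.commit, old, new))
        return self.commit

    def changes_since(self, commit):
        """Return (old, new) pairs for writes made after the given commit."""
        return [(old, new) for c, old, new in self.log if c > commit]


def full_rollup(table):
    """Batch style: rescan every record to rebuild fare totals per city."""
    totals = {}
    for rec in table.rows.values():
        totals[rec["city"]] = totals.get(rec["city"], 0) + rec["fare"]
    return totals


def apply_changes(totals, changes):
    """Incremental style: adjust existing totals using only changed records."""
    for old, new in changes:
        if old is not None:
            # Back out the previous value of an updated record.
            totals[old["city"]] -= old["fare"]
        totals[new["city"]] = totals.get(new["city"], 0) + new["fare"]
    return totals


table = ToyTable()
c1 = table.upsert({
    "t1": {"city": "SF", "fare": 10},
    "t2": {"city": "NYC", "fare": 20},
})
totals = full_rollup(table)

# A late correction to t1 arrives together with a new trip t3.
table.upsert({
    "t1": {"city": "SF", "fare": 12},
    "t3": {"city": "SF", "fare": 5},
})
totals = apply_changes(totals, table.changes_since(c1))
```

In this toy run the incremental path touches only the two changed records yet ends with the same totals a full rescan would produce. The late correction to `t1` is absorbed on the next run rather than immediately, which is the latency-for-completeness trade-off in miniature.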
Prerequisite knowledge
- A basic understanding of frameworks like Spark and Flink
- Experience authoring 1–2 data pipelines (useful but not required)
What you'll learn
- Learn how Hudi makes batch data pipelines a lot faster and lighter
- Understand the trade-offs between latency and correctness of streaming pipelines
Balaji Varadarajan is a senior software engineer at Uber, where he works on the Hudi project and oversees data engineering broadly across the network performance monitoring domain. Previously, he was one of the lead engineers on LinkedIn’s Databus change capture system as well as the Espresso NoSQL store. Balaji’s interests lie in distributed data systems.
Vinoth Chandar is the cocreator of the Hudi project at Uber and PMC/Lead of Apache Hudi (incubating). Previously, he was a senior staff engineer at Uber, where he led projects across technology areas such as data infrastructure, data architecture, and mobile and network performance. Before that, he was the LinkedIn lead on Voldemort and worked on Oracle Server’s replication engine, HPC, and stream processing. Vinoth has a keen interest in unified architectures for data analytics and processing.