Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

The evolution of massive-scale data processing

Tyler Akidau (Google)
11:00am–11:40am Thursday, March 16, 2017
Stream processing and analytics
Location: LL20 C | Level: Beginner
Secondary topics:  Streaming
Average rating: 3.00 (2 ratings)

Who is this presentation for?

  • Anyone interested in the history and evolution of big data, especially in the streaming realm

Prerequisite knowledge

  • Familiarity with basic data processing concepts, both batch and streaming
  • A basic understanding of the high-level concepts presented in Tyler's "Streaming 101" and "Streaming 102" articles

What you'll learn

  • Understand the building blocks of massive-scale data processing systems in general
  • Learn how to choose the right system for your needs
  • Gain a set of insights to apply when engineering your own data processing applications, as well as a list of interesting articles and papers to read

Description

Join Tyler Akidau for a whirlwind tour of the conceptual building blocks of massive-scale data processing systems over the last decade, as Tyler compares and contrasts systems at Google with popular open source systems in use today.

Tyler explores the evolution of massive-scale data processing at Google, from the original MapReduce paradigm to the high-level pipelines of Flume to the streaming approach of MillWheel to the unified streaming/batch model of Cloud Dataflow. Along the way, Tyler examines in detail the basic architectural concepts that underlie the four models—highlighting their similarities, contrasting their differences (particularly regarding traditional batch versus streaming), and providing insight into the use cases that drove the progression of the designs to what exists today—and discusses the similarities and differences with related open source systems, such as Hadoop, Spark, Storm, and Flink.


Tyler Akidau

Google

Tyler Akidau is a senior staff software engineer at Google Seattle. He leads technical infrastructure's internal data processing teams in Seattle (MillWheel and Flume), is a founding member of the Apache Beam PMC, and has spent the last seven years working on massive-scale data processing systems. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer in batch and streaming as two sides of the same coin, with the real endgame for data processing systems being the seamless merging of the two. He is the author of the 2015 "Dataflow Model" paper and the "Streaming 101" and "Streaming 102" articles. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.


Comments

Rahul Perhar | SR BI ANALYST
03/24/2017 4:48am PDT

Hey, this was a great presentation. Where can I find the slides to this? Would love to go back and revisit!
Cheers!