Presented By O'Reilly and Cloudera
Make Data Work
31 May–1 June 2016: Training
1 June–3 June 2016: Conference
London, UK

The evolution of massive-scale data processing

Tyler Akidau (Google)
11:15–11:55 Friday, 3/06/2016
Data innovations
Location: Capital Suite 14
Level: Intermediate
Tags: real-time, iot
Average rating: 4.53 (17 ratings)

Prerequisite knowledge

Attendees should be familiar with basic data processing concepts (both batch and streaming). In the streaming realm, please be sure you’re at least minimally acquainted with the high-level topics presented in Tyler Akidau's O'Reilly Radar post "The World beyond Batch: Streaming 101" before attending this talk, as Tyler will discuss concepts that rely on those ideas with far less context than would otherwise be appropriate.


Tyler Akidau explores the evolution of massive-scale data processing at Google, from the original MapReduce paradigm to the high-level pipelines of Flume, the streaming approach of MillWheel, and the unified streaming/batch model of Cloud Dataflow. Tyler examines in detail the basic architectural concepts that underlie the four models, highlighting their similarities, contrasting their differences (particularly regarding traditional batch vs. streaming), and providing insight into the use cases that drove the progression of the designs to what exists today. Along the way, Tyler also highlights similarities and differences with related open source systems such as Hadoop, Spark, Storm, and Flink.

Expect to come out of this talk with a stronger overall understanding of the building blocks of massive-scale data processing systems in general, an improved ability to choose the right system for your needs, and an increased set of insights to apply when engineering your own data processing applications. Plus, you’ll get to hear a few interesting anecdotes about data processing at Google that simply aren’t available anywhere else.

Tyler Akidau


Tyler Akidau is a senior staff software engineer at Google Seattle, where he leads the internal data processing teams for MillWheel and Flume within technical infrastructure. Tyler is a founding member of the Apache Beam PMC and has spent the last seven years working on massive-scale data processing systems. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer that batch and streaming are two sides of the same coin and that the real endgame for data processing systems is the seamless merging of the two. He is the author of the 2015 “Dataflow Model” paper and the “Streaming 101” and “Streaming 102” blog posts. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.