Presented By O'Reilly and Cloudera
Make Data Work
December 1–3, 2015 • Singapore

The evolution of massive-scale data processing

Tyler Akidau (Google)
4:50pm–5:30pm Wednesday, 12/02/2015
Hadoop & Beyond
Location: 328-329 Level: Intermediate
Average rating: 3.91 (11 ratings)

Prerequisite Knowledge

Familiarity with basic data processing concepts, both batch and streaming, is recommended. In the streaming realm, please be sure you’re at least minimally acquainted with the high-level topics presented in my O'Reilly Radar posts, “The World Beyond Batch: Streaming 101” and “The World Beyond Batch: The Dataflow Model” (coming in October), before attending this talk, as I’ll be discussing concepts that rely on those ideas with far less context than would otherwise be appropriate.


Come explore the evolution of massive-scale data processing over the last decade. The backbone of the talk will follow the progression of systems in use at Google, from the classic MapReduce paradigm, to the high-level pipelines of Flume, to the streaming approach of MillWheel, to the unified streaming/batch model of Cloud Dataflow. I’ll look in detail at the basic architectural concepts that underlie the four models, highlight their similarities, contrast their differences (particularly regarding traditional batch vs. streaming), and provide insight into the use cases that drove the refinement of the designs into what exists today. As we go, I’ll also discuss the common patterns and differentiating characteristics found in contemporary open source systems such as Hadoop, Spark, Storm, and Flink.

Tyler Akidau


Tyler Akidau is a senior staff software engineer at Google Seattle, where he leads technical infrastructure’s internal data processing teams for MillWheel and Flume. Tyler is a founding member of the Apache Beam PMC and has spent the last seven years working on massive-scale data processing systems. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer that batch and streaming are two sides of the same coin and that the real endgame for data processing systems is the seamless merging of the two. He is the author of the 2015 “Dataflow Model” paper and the “Streaming 101” and “Streaming 102” blog posts. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.