Tyler Akidau explores the evolution of massive-scale data processing at Google, from the original MapReduce paradigm to the high-level pipelines of Flume, the streaming approach of MillWheel, and the unified streaming/batch model of Cloud Dataflow. Tyler examines in detail the basic architectural concepts that underlie the four models, highlighting their similarities, contrasting their differences (particularly regarding traditional batch vs. streaming), and providing insight into the use cases that drove the progression of the designs to what exists today. Along the way, Tyler also highlights similarities and differences with related open source systems such as Hadoop, Spark, Storm, and Flink.
Expect to come out of this talk with a stronger overall understanding of the building blocks of massive-scale data processing systems in general, an improved ability to choose the right system for your needs, and an increased set of insights to apply when engineering your own data processing applications. Plus, you’ll get to hear a few interesting anecdotes about data processing at Google that simply aren’t available anywhere else.
Tyler Akidau is a senior staff software engineer at Google Seattle, where he leads technical infrastructure's internal data processing teams for MillWheel and Flume. Tyler is a founding member of the Apache Beam PMC and has spent the last seven years working on massive-scale data processing systems. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer that batch and streaming are two sides of the same coin and that the real endgame for data processing systems is the seamless merging of the two. He is the author of the 2015 “Dataflow Model” paper and the “Streaming 101” and “Streaming 102” blog posts. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.