Presented By O'Reilly and Cloudera
December 5-6, 2016: Training
December 6–8, 2016: Tutorials & Conference

Apache Beam: A unified model for batch and streaming data processing

Dan Halperin (Google)
1:45pm–2:25pm Wednesday, December 7, 2016
Production-ready Hadoop
Location: 308/309 Level: Beginner
Average rating: 3.75 (4 ratings)

Prerequisite Knowledge

  • A basic understanding of the challenges of data processing
  • Experience with time-series data, Lambda architectures, and big data systems such as Apache Hadoop and other Apache projects (useful but not required)

What you'll learn

  • Understand why Apache Beam is a great programming model for writing batch and streaming pipelines and porting them to any system—an easy way to support many new users across many platforms
  • Explore how Apache Beam lowers the bar for users to try out and switch to your platform and reduces "lock-in" that would prevent them from switching


Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business, and consumers of these datasets have detailed requirements for latency, cost, and completeness.

Apache Beam (incubating) defines a new data processing programming model that evolved from more than a decade of experience building big data infrastructure within Google, including MapReduce, FlumeJava, MillWheel, and Cloud Dataflow. Beam handles both batch and streaming use cases and neatly separates properties of the data from runtime characteristics, allowing pipelines to be portable across multiple runtime environments, both open source (e.g., Apache Flink and Apache Spark) and proprietary (e.g., Google Cloud Dataflow).

Dan Halperin covers the basics of Apache Beam—its evolution, the main concepts in the programming model, and how it compares to similar systems—as he takes you from a simple scenario to a relatively complex data processing pipeline, before finally demonstrating the execution of that pipeline on multiple runtime environments.

Dan Halperin


Dan Halperin is a PPMC member and committer on Apache Beam (incubating). He has worked on Beam and Google Cloud Dataflow for 18 months. Previously, he was the director of research for scalable data analytics at the University of Washington eScience Institute, where he worked on scientific big data problems in oceanography, astronomy, medical informatics, and the life sciences. Dan holds a PhD in computer science and engineering from the University of Washington.