Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Unified, portable, efficient: Batch and stream processing with Apache Beam (incubating)

Kenneth Knowles (Google)
11:00am11:40am Wednesday, March 15, 2017
Stream processing and analytics
Location: LL20 D Level: Advanced
Secondary topics:  Streaming
Average rating: ****.
(4.80, 5 ratings)

Who is this presentation for?

  • Data engineers, data scientists, and technical product managers

Prerequisite knowledge

  • Familiarity with the basic considerations for stream processing (useful but not required—Tyler Akidau’s articles "Streaming 101" and "Streaming 102" are ideal preparation.)
  • Experience with existing big data batch processing concepts and tools, such as Hadoop and Spark, and streaming processing concepts and tools, such as Flink, Samza + Kafka, Spark Streaming, and Storm (useful but not required)

What you'll learn

  • Understand the considerations in processing unbounded, massive throughput, out-of-order data
  • Explore the Apache Beam programming model to address those considerations
  • Gain familiarity with Apache Beam as a project and its vision for the future

Description

The rise of unbounded, out-of-order, global-scale data requires increasingly sophisticated programming models to make stream processing feasible. When computing over an unbounded stream of data, each use case entails its own balance between three factors: completeness (confidence that you have all the data), latency (waiting to learn from the data), and cost (adding compute power to lower latency).

Kenneth Knowles shows how Apache Beam gives you control over this balance in a unified programming model that is portable to any Beam runner. Beam gives you this power by identifying and separating four concerns common to all streaming computations:

  • What are you computing? (Sums? Averages? ML models?)
  • Where is your data distributed in event time? (Fixed windows? User sessions?)
  • When do you want to produce results? (When you think you have all the data? After every new datum?)
  • How do refinements to prior outputs relate? (Are they deltas? Replacements?)

Regardless of backend, these questions must be answered. With Beam, you can answer these questions independently with loosely coupled APIs corresponding to each question: what—reading, transformation, aggregation, and writing; where—event time windowing; when—watermarks and triggers; and how—accumulation modes. With these, you can build a readable and portable pipeline focused on your problem rather than the quirks of your backend, which you can then execute on your runner of choice, including Apache Flink, Apache Spark, Apache Gearpump (also incubating), Apache Apex, or Google Cloud Dataflow.

Photo of Kenneth Knowles

Kenneth Knowles

Google

Kenn Knowles is a founding committer of Apache Beam (incubating). Kenn has been working on Google Cloud Dataflow—Google’s Beam backend—since 2014. Prior to that, he built backends for startups such as Cityspan, Inkling, and Dimagi. Kenn holds a PhD in programming languages from the University of California, Santa Cruz.

Comments on this page are now closed.

Comments

Picture of Kenneth Knowles
Kenneth Knowles | SOFTWARE ENGINEER
03/16/2017 4:05am PDT

Rahul, I’m glad you enjoyed it! The slides can be viewed at https://goo.gl/sRxNxF

Rahul Shrivastava | PRINCIPAL ENGINEER
03/15/2017 11:00am PDT

I attended the talk today. I was pretty good. Can you upload the slides