Presented By O'Reilly and Cloudera
Make Data Work
31 May–1 June 2016: Training
1 June–3 June 2016: Conference
London, UK

Triggers in Apache Beam (incubating): User-controlled balance of completeness, latency, and cost in streaming big data pipelines

Kenneth Knowles (Google)
12:05–12:45 Friday, 3/06/2016
Data innovations
Location: Capital Suite 12
Level: Advanced
Tags: real-time
Average rating: 4.67 (3 ratings)

Prerequisite knowledge

Attendees must be familiar with the basics of streaming computation. Tyler Akidau’s article "The World beyond Batch: Streaming 101" is the ideal preparation. Familiarity with existing big data batch processing concepts and tools (Hadoop, Spark, etc.) and stream processing concepts and tools (Flink, Samza plus Kafka, Spark Streaming, Storm, etc.) will be useful, as will a cursory familiarity with Cloud Dataflow. (This will not be a general overview talk; watch Frances Perry’s @Scale presentation if you’re unfamiliar with Cloud Dataflow.)

Description

In a streaming data processing system, where data is generally unbounded, triggers specify when each stage of computation should emit output. With a small language of primitive conditions and ways of combining them, triggers provide the flexibility to tailor a streaming pipeline to a variety of use cases and data sources, enabling a practitioner to achieve an appropriate balance between accuracy, latency, and cost. Here are some conditions under which one may choose to “fire,” that is, trigger output (see the sketch after this list):

  • Fire after the system believes all data for the current window is processed
  • Fire after at least 1,000 elements have arrived for processing
  • Fire when the first of trigger A and trigger B fires
  • Fire according to trigger A until trigger B fires
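
To make these conditions concrete, here is a minimal sketch of how each one might be expressed with the trigger combinators of the Beam Java SDK. The specific element counts and delays are illustrative assumptions, not values from the talk, and the class name is hypothetical.

  import org.apache.beam.sdk.transforms.windowing.AfterFirst;
  import org.apache.beam.sdk.transforms.windowing.AfterPane;
  import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
  import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
  import org.apache.beam.sdk.transforms.windowing.Repeatedly;
  import org.apache.beam.sdk.transforms.windowing.Trigger;
  import org.joda.time.Duration;

  public class TriggerConditions {
    // Fire after the system believes all data for the current window is processed,
    // i.e., the watermark has passed the end of the window.
    static final Trigger AFTER_ALL_DATA = AfterWatermark.pastEndOfWindow();

    // Fire after at least 1,000 elements have arrived for processing.
    static final Trigger AFTER_COUNT = AfterPane.elementCountAtLeast(1000);

    // Fire when the first of trigger A and trigger B fires: here, 1,000 elements
    // or one minute of processing time since the first element, whichever comes first.
    static final Trigger A_OR_B = AfterFirst.of(
        AfterPane.elementCountAtLeast(1000),
        AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(1)));

    // Fire according to trigger A until trigger B fires: here, repeatedly after
    // every 1,000 elements until the watermark passes the end of the window.
    static final Trigger A_UNTIL_B = Repeatedly.forever(AfterPane.elementCountAtLeast(1000))
        .orFinally(AfterWatermark.pastEndOfWindow());
  }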

To support the variety of streaming systems in existence today and yet to come, as well as the variability built into each one, a foundational semantics for triggers must be based on fundamental aspects of stream processing. To maintain the unified batch/streaming programming model, trigger semantics must remain consistent across a number of dimensions (one such combination is sketched after this list), including:

  • Reordering and/or delay of data
  • Small bundles of data where an operation may buffer data until a trigger fires
  • Large bundles of data where an operation processes it all before firing the result to the next stage
  • Arbitrarily good (or bad) approximations of event time
  • Retrying a computation, where processing time and event time may both have progressed, more data may have arrived, and we’d like to process it all together in large bundles for performance
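
As one illustration of how these dimensions surface in a pipeline, the sketch below (assuming the Beam Java SDK and a hypothetical per-user event collection) combines event-time windowing with early firings, late firings, allowed lateness, and accumulating panes, so that reordered, late, or retried data refines earlier results rather than contradicting them. The window sizes, counts, and names are illustrative assumptions.

  import org.apache.beam.sdk.transforms.Count;
  import org.apache.beam.sdk.transforms.windowing.AfterPane;
  import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
  import org.apache.beam.sdk.transforms.windowing.FixedWindows;
  import org.apache.beam.sdk.transforms.windowing.Window;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;
  import org.joda.time.Duration;

  public class LateDataExample {
    /**
     * Counts events per user in one-minute event-time windows. Emits a speculative
     * result every 1,000 elements, an on-time result when the watermark passes the
     * end of the window, and a refinement for each element that arrives up to ten
     * minutes late. Accumulating panes mean each firing covers all data seen so far
     * for the window, so late or reordered data refines earlier output.
     */
    static PCollection<KV<String, Long>> countPerUser(PCollection<String> userIds) {
      return userIds
          .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
              .triggering(AfterWatermark.pastEndOfWindow()
                  .withEarlyFirings(AfterPane.elementCountAtLeast(1000))
                  .withLateFirings(AfterPane.elementCountAtLeast(1)))
              .withAllowedLateness(Duration.standardMinutes(10))
              .accumulatingFiredPanes())
          .apply(Count.<String>perElement());
    }
  }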

Drawing on important real-world use cases, Kenneth Knowles delves into the details of the language- and runner-independent semantics developed for triggers in Apache Beam, demonstrating how the semantics support the use cases as well as all of the above variability in streaming systems. Kenneth then describes some of the particular implementations of those semantics in Google Cloud Dataflow.


Kenneth Knowles

Google

Kenn Knowles is a founding committer of Apache Beam (incubating). Kenn has been working on Google Cloud Dataflow—Google’s Beam backend—since 2014. Prior to that, he built backends for startups such as Cityspan, Inkling, and Dimagi. Kenn holds a PhD in programming languages from the University of California, Santa Cruz.