Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Triggers in Apache Beam (incubating)

Kenneth Knowles (Google)
2:55pm–3:35pm Wednesday, 09/28/2016
IoT & real-time
Location: 1 E 12/1 E 13 Level: Advanced
Average rating: *****
(5.00, 2 ratings)

Prerequisite knowledge

  • Familiarity with windowing and watermarks in streaming computation (Tyler Akidau’s article "The World Beyond Batch: Streaming 101" is ideal preparation.)
  • Experience with the Beam model, big data batch processing concepts and tools (Hadoop, Spark, Dataflow, etc.), and big data stream processing concepts and tools (Flink, Samza + Kafka, Spark Streaming, Storm, Dataflow, etc.) (useful but not required)
  • What you'll learn

  • Understand how Apache Beam supports streaming computation in a general manner applicable to many or most systems (including batch mode systems)
  • Learn how triggers are implemented by various backends
  • Description

    In a streaming data processing system, where data is generally unbounded, triggers specify when each stage of computation should emit output. With a small language of primitive conditions and multiple ways of combining them, triggers provide the flexibility to tailor a streaming pipeline to a variety of use cases and data sources, enabling a practitioner to achieve an appropriate balance between accuracy, latency, and cost. (Some conditions under which one may choose to “fire”—aka trigger output—include after the system believes all data for the current window is processed, after at least 1,000 elements have arrived for processing, when the first of trigger A and trigger B fires, or according to trigger A until trigger B fires.)

    To support the variety of streaming systems in existence today and yet to come, as well as the variability built into each one, a foundational semantics for triggers must be based on fundamental aspects of stream processing. Since we also aim to maintain the unified batch/streaming programming model, trigger semantics must remain consistent across a number of dimensions, including reordering and/or delay of data, small bundles of data where an operation may buffer data until a trigger fires, large bundles of data where an operation processes it all before firing the result to the next stage, arbitrarily good (or bad) approximations of event time, and retrying a computation (for example, when processing time and event time may both have progressed, and more data may have arrived, and we’d like to process it all together in large bundles for performance).

    Drawing on important real-world use cases, Kenneth Knowles delves into the details of language- and runner-independent semantics for triggers in Apache Beam and explores real-world implementations in Google Cloud Dataflow.

    Photo of Kenneth Knowles

    Kenneth Knowles

    Google

    Kenn Knowles is a founding committer of Apache Beam (incubating). Kenn has been working on Google Cloud Dataflow—Google’s Beam backend—since 2014. Prior to that, he built backends for startups such as Cityspan, Inkling, and Dimagi. Kenn holds a PhD in programming languages from the University of California, Santa Cruz.

    Comments on this page are now closed.

    Comments

    Picture of Kenneth Knowles
    Kenneth Knowles
    09/29/2016 8:47am EDT

    Hi Giuseppe. Yes! I just added the slides to the session. They are at https://goo.gl/hgIEsz.

    09/29/2016 7:21am EDT

    are slides available? thanks