In a streaming data processing system, where data is generally unbounded, triggers specify when each stage of computation should emit output. With a small language of primitive conditions and multiple ways of combining them, triggers provide the flexibility to tailor a streaming pipeline to a variety of use cases and data sources, enabling a practitioner to achieve an appropriate balance between accuracy, latency, and cost. Conditions under which one may choose to “fire” (that is, trigger output) include: after the system believes all data for the current window has been processed; after at least 1,000 elements have arrived for processing; when the first of trigger A and trigger B fires; or according to trigger A until trigger B fires.
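The four example conditions above can be modeled as a small language of primitives and combinators. The following is only a toy sketch in Python, not the Apache Beam API: the names `WindowState`, `AfterWatermark`, `AfterCount`, `AfterFirst`, and `OrUntil` are hypothetical, and the sketch assumes that "A until B fires" emits one last time when B fires and then stops.

```python
from dataclasses import dataclass


@dataclass
class WindowState:
    """What a trigger can observe about one window (toy model)."""
    element_count: int   # elements that have arrived so far
    watermark: float     # the system's estimate of event-time progress
    window_end: float    # event-time end of the window


class AfterWatermark:
    """Fires once the system believes all data for the window has arrived."""
    def should_fire(self, state: WindowState) -> bool:
        return state.watermark >= state.window_end


class AfterCount:
    """Fires once at least `n` elements have arrived."""
    def __init__(self, n: int):
        self.n = n

    def should_fire(self, state: WindowState) -> bool:
        return state.element_count >= self.n


class AfterFirst:
    """Fires when the first of trigger A and trigger B fires."""
    def __init__(self, a, b):
        self.a, self.b = a, b

    def should_fire(self, state: WindowState) -> bool:
        return self.a.should_fire(state) or self.b.should_fire(state)


class OrUntil:
    """Fires according to `main` until `until` fires.

    Assumption of this sketch: when `until` fires, emit one final
    time and never fire again afterwards.
    """
    def __init__(self, main, until):
        self.main, self.until = main, until
        self.finished = False

    def should_fire(self, state: WindowState) -> bool:
        if self.finished:
            return False
        if self.until.should_fire(state):
            self.finished = True
            return True
        return self.main.should_fire(state)


# Usage: fire on 1,000 elements or on watermark, whichever comes first.
early_or_complete = AfterFirst(AfterCount(1000), AfterWatermark())
```

Because each combinator exposes the same `should_fire` method as the primitives, combinators nest freely, which is what makes a small set of conditions expressive enough for many pipelines.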
To support the variety of streaming systems in existence today and yet to come, as well as the variability built into each one, a foundational semantics for triggers must be based on fundamental aspects of stream processing. Since we also aim to maintain the unified batch/streaming programming model, trigger semantics must remain consistent across a number of dimensions: reordering and/or delay of data; small bundles of data, where an operation may buffer data until a trigger fires; large bundles of data, where an operation processes everything before firing the result to the next stage; arbitrarily good (or bad) approximations of event time; and retrying a computation (for example, when processing time and event time may both have progressed, more data may have arrived, and we’d like to process it all together in large bundles for performance).
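One way to picture the bundling dimension is to run the same "at least N elements" condition over different bundle sizes. The sketch below is a toy model, not Beam or Dataflow code: `run_with_bundles` is a hypothetical helper, and it assumes that the trigger is only checked at bundle boundaries, that each element is emitted in exactly one output, and that leftover elements are flushed at the end of the window.

```python
from typing import List


def run_with_bundles(elements: List[int], bundle_size: int,
                     threshold: int) -> List[List[int]]:
    """Feed `elements` in bundles and emit whenever at least
    `threshold` elements are buffered at a bundle boundary."""
    outputs: List[List[int]] = []
    buffer: List[int] = []
    for start in range(0, len(elements), bundle_size):
        buffer.extend(elements[start:start + bundle_size])
        # The trigger is only consulted after a whole bundle is processed.
        if len(buffer) >= threshold:
            outputs.append(buffer)
            buffer = []
    if buffer:
        # Assumed final flush, e.g. when the window closes.
        outputs.append(buffer)
    return outputs


data = list(range(10))
small = run_with_bundles(data, bundle_size=1, threshold=3)   # streaming-like
large = run_with_bundles(data, bundle_size=10, threshold=3)  # batch-like
```

With small bundles the condition fires several times on small outputs; with one large bundle it fires once on everything. Both runs satisfy "at least 3 elements" and emit the same elements overall, which is the kind of consistency a bundle-independent semantics has to allow for.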
Drawing on important real-world use cases, Kenneth Knowles delves into the details of language- and runner-independent semantics for triggers in Apache Beam and explores how they are implemented in Google Cloud Dataflow.
Kenn Knowles is a founding committer of Apache Beam (incubating). Kenn has been working on Google Cloud Dataflow—Google’s Beam backend—since 2014. Prior to that, he built backends for startups such as Cityspan, Inkling, and Dimagi. Kenn holds a PhD in programming languages from the University of California, Santa Cruz.