Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Watermarks: Time and progress in Apache Beam (incubating) and beyond

Slava Chernyak (Google)
2:05pm–2:45pm Wednesday, 09/28/2016
IoT & real-time
Location: 1 E 12/1 E 13
Level: Intermediate
Average rating: 4.60 (10 ratings)

Prerequisite knowledge

  • Familiarity with existing big data stream processing concepts and tools (Samza + Kafka, Spark Streaming, Storm, etc.)
  • Experience with MillWheel, Google Cloud Dataflow, or Apache Beam (This will not be a general overview talk; ideally attend the Beam or Dataflow overview talk first.)
What you'll learn

  • Understand the challenges in delivering on the promise of correct low-latency results in a streaming system
  • Explore a practical set of tools for understanding watermarks and time in out-of-order stream processing pipelines
Description

    Moving from batch to streaming involves changing how we think about time. Streaming data is neither bounded nor typically well ordered in time. However, to make streaming systems useful and deliver on the promise of low-latency results, we often want to know when we have all the data relevant to emitting a correct aggregation. Watermarks provide the foundation for making such decisions, enabling streaming systems to emit timely, correct results when processing out-of-order data.
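
    As a rough illustration of the idea above, the following minimal sketch sums values per fixed event-time window and emits a window only once a watermark has passed its end. It is not the Beam or MillWheel implementation: the heuristic watermark (latest event time seen minus an assumed out-of-order bound), the names `process`, `WINDOW_SIZE`, and `MAX_OUT_OF_ORDER`, and the policy of dropping late events are all assumptions made for the example.

```python
from collections import defaultdict

WINDOW_SIZE = 10        # seconds per fixed event-time window (assumed)
MAX_OUT_OF_ORDER = 5    # assumed bound on how far out of order events arrive


def process(events):
    """Sum values per fixed event-time window.

    `events` is an iterable of (event_time, value) pairs in arrival order,
    which may differ from event-time order. A window is emitted once the
    watermark passes its end; events for an already emitted window are
    treated as late and dropped (one possible policy, chosen for brevity).
    """
    open_windows = defaultdict(int)   # window start -> running sum
    watermark = float("-inf")
    emitted, dropped = [], []

    for event_time, value in events:
        start = (event_time // WINDOW_SIZE) * WINDOW_SIZE
        if start + WINDOW_SIZE <= watermark:
            # The window was already emitted, so this event is late.
            dropped.append((event_time, value))
        else:
            open_windows[start] += value

        # Heuristic watermark: we assume no event older than this will arrive.
        watermark = max(watermark, event_time - MAX_OUT_OF_ORDER)

        # Emit every open window whose end the watermark has passed.
        ready = sorted(w for w in open_windows if w + WINDOW_SIZE <= watermark)
        for s in ready:
            emitted.append((s, open_windows.pop(s)))

    return emitted, dropped, dict(open_windows)


if __name__ == "__main__":
    arrivals = [(1, 1), (12, 1), (8, 1), (16, 1), (3, 1), (27, 1)]
    print(process(arrivals))
```

    In this sketch the event at time 8 arrives after the event at time 12 but is still counted, because the watermark has not yet passed the end of its window; the event at time 3 arrives after that window was emitted and is dropped. A larger `MAX_OUT_OF_ORDER` drops fewer events at the cost of higher latency.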

    Given the trend toward out-of-order processing in existing streaming systems, understanding watermarks is an increasingly important skill when designing pipelines. This methodology, first discussed in the MillWheel paper and further explored in the Dataflow model paper, is now referred to as the Beam model. It is not limited to Google’s stream processing efforts; rather, it solves a general problem that any system must address if it is to provide timely out-of-order distributed stream processing. Others have since pursued it, including Flink and Qubit, which built a watermark tracking system on top of Spark Streaming for its own internal use.

    Based on his experience developing and using watermarks at Google, Slava Chernyak discusses details of how watermarks are applied, as well as their strengths and limitations, and explores real-world use cases, providing a practical set of tools for understanding watermarks and time in out-of-order stream processing pipelines. Along the way, Slava also outlines some of the implementation challenges for computing watermarks with low latency in a highly distributed system.

    Slava Chernyak


    Slava Chernyak is a senior software engineer at Google. Slava spent over five years working on Google’s internal massive-scale streaming data processing systems and has since become involved with designing and building Google Cloud Dataflow Streaming from the ground up. Slava is passionate about making massive-scale stream processing available and useful to a broader audience. When he is not working on streaming systems, Slava is out enjoying the natural beauty of the Pacific Northwest.

    Comments on this page are now closed.


    André Morrow
    10/04/2016 1:28pm EDT

    All Strata + Hadoop World 2016 slide presentations have now been posted if they were made available to us.

    09/29/2016 4:10pm EDT

    the slides are here:

    09/29/2016 7:19am EDT

    are slides available? thanks

    Ben Hsu
    09/29/2016 7:12am EDT

    I learned a lot. Please let us know when the slides are available