Presented By O'Reilly and Cloudera
Make Data Work
31 May–1 June 2016: Training
1 June–3 June 2016: Conference
London, UK

Watermarks: Time and progress in streaming dataflow and beyond

Slava Chernyak (Google)
16:35–17:15 Thursday, 2/06/2016
IoT & real-time
Location: Capital Suite 14
Level: Intermediate
Tags: real-time, iot
Average rating: 4.44 (9 ratings)

Prerequisite knowledge

Attendees must be familiar with existing big data stream processing concepts and tools (e.g., Samza plus Kafka, Spark Streaming, or Storm). Familiarity with MillWheel or Google Cloud Dataflow will be useful but is not required.

Description

Moving from batch to streaming involves changing how we think about time. Streaming data is neither bounded nor typically well ordered in time. However, to make streaming systems useful and deliver on the promise of low-latency results, we often want to know when we have all the data relevant to emitting a correct aggregation. Watermarks provide the foundation for making such decisions, enabling streaming systems to emit timely, correct results when processing out-of-order data.
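As a rough illustration of the idea, the sketch below counts events per fixed event-time window and emits a window only once a watermark has passed the end of that window. It is a minimal sketch under stated assumptions, not how MillWheel or Dataflow compute watermarks: the watermark here is a simple heuristic (the largest timestamp seen minus an assumed maximum delay), and the names `WINDOW`, `MAX_DELAY`, and `windowed_counts` are hypothetical.

```python
from collections import defaultdict

WINDOW = 10     # assumed tumbling-window length, in seconds
MAX_DELAY = 5   # assumed bound on how far out of order events may arrive


def windowed_counts(events):
    """Count events per event-time window, emitting a window once the
    heuristic watermark says no more data for it is expected.

    `events` is an iterable of event timestamps (seconds) in arrival order.
    Returns (results, late): results is a list of (window_start, count) in
    emission order; late holds timestamps that arrived after their window
    had already been passed by the watermark.
    """
    open_windows = defaultdict(int)  # window_start -> count so far
    results, late = [], []
    max_seen = float("-inf")

    for ts in events:
        start = ts - ts % WINDOW
        # Heuristic watermark before this event: we assume no event older
        # than (max_seen - MAX_DELAY) will still arrive.
        if start + WINDOW <= max_seen - MAX_DELAY:
            late.append(ts)  # the assumption was violated for this event
            continue

        open_windows[start] += 1
        max_seen = max(max_seen, ts)
        watermark = max_seen - MAX_DELAY

        # Emit every open window whose end the watermark has reached.
        for w in sorted(open_windows):
            if w + WINDOW <= watermark:
                results.append((w, open_windows.pop(w)))

    return results, late


if __name__ == "__main__":
    results, late = windowed_counts([1, 3, 12, 7, 16, 2, 25])
    print(results)  # windows emitted so far
    print(late)     # events that arrived behind the watermark
```

The trade-off is visible in `MAX_DELAY`: a larger value delays results but drops fewer late events, while a smaller value lowers latency at the cost of correctness when data arrives further out of order than assumed.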

Given the trend toward out-of-order processing in existing streaming systems, understanding watermarks is an increasingly important skill when designing pipelines. Watermarks were first discussed in the MillWheel paper and further explored in the Dataflow Model paper, but the approach is not limited to Google’s stream processing efforts. Rather, tracking progress in event time is a general problem that must be addressed by any system that wishes to provide timely out-of-order distributed stream processing; solutions have since been pursued by others, including Flink and Qubit (which built a watermark tracking system on top of Spark Streaming for its own internal use).

Drawing on his experience developing and using watermarks at Google, Slava Chernyak discusses the details of how watermarks are applied, explains their strengths and limitations, and explores real-world use cases. Slava also hints at some of the implementation challenges of computing watermarks with low latency in a highly distributed system. Attendees should leave with a practical set of tools for understanding watermarks and time in out-of-order stream processing pipelines.

Slava Chernyak

Google

Slava Chernyak is a senior software engineer at Google. Slava spent over five years working on Google’s internal massive-scale streaming data processing systems and has since become involved with designing and building Google Cloud Dataflow Streaming from the ground up. Slava is passionate about making massive-scale stream processing available and useful to a broader audience. When he is not working on streaming systems, Slava is out enjoying the natural beauty of the Pacific Northwest.