Presented By O'Reilly and Cloudera
Make Data Work
5–7 May, 2015 • London, UK

Say goodbye to batch

Tyler Akidau (Google)
16:15–16:55 Thursday, 7/05/2015
Hadoop & Beyond
Location: Buckingham Room - Palace Suite
Average rating: 4.14 (7 ratings)
Slides: external link

Prerequisite Knowledge

Basic familiarity with existing Big Data processing concepts and tools (Hadoop, Spark, etc.) is necessary. Familiarity with streaming concepts and tools (Samza+Kafka, Spark Streaming, Storm, etc.) is helpful. Familiarity with the Lambda Architecture is also useful.

Description

History has shown the limitations of existing streaming systems with respect to reliability, flexibility, and ease of use. The industry has responded in turn with the Lambda Architecture, a clever confederation of batch and streaming systems that provides low-latency, eventually-correct results, while maintaining the ability to respond to changes in upstream data. Lambda proponents have long argued that it’s not possible to have all these things at once within a single streaming system. We respectfully disagree. :-)

We believe it is possible to build a streaming system you can rely on, making the Lambda Architecture unnecessary. In this talk, I’ll cover:

  • The fundamental differences between batch and streaming, and how the Lambda Architecture combines them to great effect.
  • A survey of the ways streaming systems can be used to process data, including uses currently filled by the Lambda Architecture.
  • A detailed look at the problems of correctness and changes in upstream data when relying solely on a streaming system.
  • The APIs and semantics we provide in Google Cloud Dataflow that make it tractable to solve those problems within a single streaming system, along with best-practice examples for dealing with real-world use cases.

This talk is both high-level and quite technical. Opinions vary about what streaming is, so the talk first surveys the existing approaches. It then covers in detail the streaming use case that no other general streaming system has yet conquered: providing low-latency, correct results with the flexibility to adjust to changes in source data, all at massive scale. We hope to give the audience an understanding of the issues they might face in building standalone streaming pipelines, regardless of the architecture used, with an eye toward the features of Google Cloud Dataflow that make it particularly well suited to that problem domain.

Tyler Akidau

Google

Tyler Akidau is a staff software engineer at Google. The current tech lead for internal streaming data processing systems (e.g. MillWheel), he’s spent five years working on massive-scale streaming data processing systems. He passionately believes in streaming data processing as the more general model of large-scale computation. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.