Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

Unified stateful big data processing in Apache Beam (incubating)

Aljoscha Krettek (Ververica)
14:05–14:45 Thursday, 25 May 2017
Stream processing and analytics
Location: Capital Suite 8/9
Level: Advanced
Average rating: 3.50 (2 ratings)

Who is this presentation for?

  • Big data engineers and stream processing engineers

Prerequisite knowledge

  • Familiarity with the basic considerations for stream processing (useful but not required; Tyler Akidau’s articles "Streaming 101" and "Streaming 102" are ideal preparation)
  • Experience with existing big data batch processing concepts and tools, such as Hadoop and Spark, and stream processing concepts and tools, such as Flink, Samza + Kafka, Spark Streaming, and Storm (useful but not required)

What you'll learn

  • Explore Beam's portable model and library for state and timers
  • Learn how to write backend-agnostic per-key workflows and other fine-grained processing, particularly for streaming applications

Description

Apache Beam lets you process unbounded, out-of-order, global-scale data with portable high-level pipelines, but not all use cases are pipelines of simple “map” and “combine” operations. Aljoscha Krettek introduces Beam’s new State API, which brings scalability and consistency to fine-grained stateful processing. The API interoperates with Beam’s other features, such as consistent event-time windowing and windowed side inputs, and remains portable to any Beam runner, including Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Aljoscha covers the new state and timer features in Beam and shows how to use them to express common real-world use cases in a backend-agnostic manner.

Examples of new use cases unlocked by Beam’s new mutable state and timers include:

  • Microservice-like streaming applications such as new user account verification and digital ordering
  • Complex aggregations that cannot easily be expressed as an efficient associative combiner
  • Output based on customized conditions, such as limiting to only “significant” changes in a learned model (resulting in potentially large cost savings in subsequent processing)
  • Fine control over retrieval and storage of intermediate values during aggregation
  • Reading from and writing to external systems with detailed management of the nature and size of requests
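
As a rough illustration of the “significant changes” use case listed above, the following is a minimal sketch in plain Python. It does not use the Beam API: the function name, the threshold parameter, and the sample events are hypothetical, and the dictionary is only a stand-in for the per-key state that a Beam runner would manage on the pipeline's behalf.

```python
from typing import Dict, Hashable, Iterable, Iterator, Tuple


def significant_changes(
    events: Iterable[Tuple[Hashable, float]],
    threshold: float,
) -> Iterator[Tuple[Hashable, float]]:
    """Yield (key, value) only when the value has moved by at least
    `threshold` from the last value emitted for that key."""
    # Stand-in for per-key mutable state (hypothetical; not a Beam state cell).
    last_emitted: Dict[Hashable, float] = {}
    for key, value in events:
        previous = last_emitted.get(key)
        # The first value for a key is always emitted; later values are
        # emitted only if they differ enough from the last emitted one.
        if previous is None or abs(value - previous) >= threshold:
            last_emitted[key] = value
            yield key, value


# Hypothetical input: (key, model score) pairs arriving in order.
events = [("a", 1.0), ("a", 1.05), ("b", 3.0), ("a", 1.6), ("b", 3.1), ("b", 2.0)]
emitted = list(significant_changes(events, threshold=0.5))
```

Here only four of the six events are passed downstream, which is the kind of reduction in subsequent processing that the use case describes. A condition like this depends on what was previously emitted for each key, so it needs mutable per-key state rather than an associative combiner.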

Aljoscha Krettek

Ververica

Aljoscha Krettek is a cofounder and software engineer at Ververica. Previously, he worked at IBM Germany and at the IBM Almaden Research Center in San Jose. Aljoscha is a PMC member at Apache Beam and Apache Flink, where he mainly works on the streaming API and designed and implemented the most recent additions to the windowing and state APIs. He studied computer science at TU Berlin.