Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Stream scaling in Pravega

Flavio Junqueira (Dell EMC)
16:3517:15 Thursday, 24 May 2018

Who is this presentation for?

  • Data engineers, architects, and decision makers in the area of data infrastructure and data-driven applications

Prerequisite knowledge

  • A basic understanding of distributed system and messaging system concepts

What you'll learn

  • Learn how to build an elastic end-to-end stream pipeline with Pravega
  • Understand the guarantees Pravega provides in the presence of scaling and how applications need to code against it and how Pravega implements stream scaling

Description

Stream processing consists of ingesting and processing continuously generated data, often from end users in web applications or from more challenging settings where devices such as servers and sensors generate events at a high rate. Such scenarios often demand the use of a software stack that is able to scale and accommodate changes to the characteristics of the application.

One of the major challenges with processing data streams is adapting to workload variations (e.g., due to daily cycles or the growth of the population of sources). Systems to ingest stream data typically parallelize it by sharding the incoming messages and events according to a routing key. Having the ability to parallelize ingestion is very effective, but future changes to the workload (which are very often unknown beforehand) might make the initial choice for the degree of parallelism inadequate for even short-term spikes. Consequently, the ability to scale by adapting parallelism according to workload while preserving important API properties, such as per-key order, is highly desirable to handle mission-critical workloads.

Flavio Junqueira explains how to accommodate changes to workloads in and with Pravega, an open source stream store built to ingest and serve stream data. Pravega primarily manipulates and stores segments (append-only byte sequences), forming streams by creating and composing segments, which it uses to enable the scaling of streams. Stream scaling in Pravega is automatic and transparent to the application, but such a change to the ingestion volume might also require the application to follow and scale its resources downstream (e.g., the operators of an Apache Flink job) to accommodate the new ingestion volume. Pravega signals such changes to the application so that it can react accordingly. The cooperation between Pravega and the downstream application is crucial for building an effective stream data pipeline.

Photo of Flavio Junqueira

Flavio Junqueira

Dell EMC

Flavio Junqueira is senior director of software engineering at Dell EMC, where he leads the Pravega team. He is interested in various aspects of distributed systems, including distributed algorithms, concurrency, and scalability. Previously, Flavio held an engineering position with Confluent and research positions with Yahoo Research and Microsoft Research. He is an active contributor to Apache projects, including Apache ZooKeeper (as PMC and committer), Apache BookKeeper (as PMC and committer), and Apache Kafka. Flavio coauthored the O’Reilly ZooKeeper book. He holds a PhD in computer science from the University of California, San Diego.