Stream processing is becoming more relevant as many applications provide low-latency response time and new application domains emerge that naturally demand data to be processed in motion. One particularly attractive characteristic of the stream processing paradigm is that it conceptually unifies batch processing (bounded/static historic data) and continuous near-real-time data processing (unbounded streaming event data).
However, in practice, implementing a unified batch and streaming data architecture is not seamless: near-real-time event data and bulk historic data use different storage systems (message queues or logs versus filesystems or object stores). Consequently, running the same analysis now and at some arbitrary time in the future (e.g., months, possibly years ahead) means dealing with different data sources and APIs. Few systems are capable of handling both near-real-time streaming workloads and large batch workloads at the same time. Moreover, streaming workloads tend to be inherently dynamic, requiring both storage and compute to adjust continuously for maximum resource efficiency.
Flavio Junqueira and Fabian Hueske detail an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams) that offers an unprecedented way of handling “everything as a stream”: unbounded streaming storage, a unified batch and streaming abstraction, and a novel mechanism for dynamically accommodating workload variations.
Pravega enables the ingestion capacity of a stream to grow and shrink according to workload and sends signals downstream to enable Flink to scale accordingly. It also offers permanent stream storage, exposing an API that enables applications to access data in a uniform fashion, either in near real time or at any arbitrary time in the future. Apache Flink’s SQL and streaming APIs provide a common interface for processing continuous near-real-time data, a set of historic data, or combinations of both. A deep integration between these two systems provides end-to-end exactly-once semantics for pipelines of streams and stream processing and lets both systems jointly scale and adjust automatically to changing data rates.
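The idea of reading historic and near-real-time data through one interface can be illustrated with a small, self-contained sketch. It is a toy in-memory model, not the Pravega or Flink API: the names ToyStream, read_from, and running_sum are hypothetical and exist only for this illustration. The same analysis function is applied once to data read from the start of the stream and once to data read from a later position.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class ToyStream:
    """Hypothetical append-only in-memory log standing in for stream storage."""
    events: List[int] = field(default_factory=list)

    def append(self, event: int) -> None:
        self.events.append(event)

    def read_from(self, offset: int = 0) -> Iterator[int]:
        """Yield events at or after `offset`, including ones appended while reading.

        Unlike a real stream reader, this toy stops once it has caught up
        instead of waiting for new events.
        """
        while offset < len(self.events):
            yield self.events[offset]
            offset += 1


def running_sum(events: Iterable[int]) -> Iterator[int]:
    """One analysis, written once, usable on any slice of the stream."""
    total = 0
    for event in events:
        total += event
        yield total


stream = ToyStream()
for value in (1, 2, 3):
    stream.append(value)

# "Batch" use: replay everything stored so far from the beginning.
historic = list(running_sum(stream.read_from(0)))

# "Streaming" use: start at the current tail and pick up later appends.
reader = stream.read_from(3)
stream.append(4)
stream.append(5)
recent = list(running_sum(reader))
```

In this sketch the only difference between the two uses is the starting offset; the data source and the analysis code stay the same, which is the property the abstract attributes to a unified stream abstraction.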
Fabian Hueske is a committer and PMC member of the Apache Flink project. He was one of the three original authors of the Stratosphere research system, from which Apache Flink was forked in 2014. Fabian is a cofounder of data Artisans, a Berlin-based startup devoted to fostering Flink, where he works as a software engineer and contributes to Apache Flink. He holds a PhD in computer science from TU Berlin and is currently spending a lot of his time writing a book, Stream Processing with Apache Flink.
Flavio Junqueira is senior director of software engineering at Dell EMC, where he leads the Pravega team. He is interested in various aspects of distributed systems, including distributed algorithms, concurrency, and scalability. Previously, Flavio held an engineering position with Confluent and research positions with Yahoo Research and Microsoft Research. He is an active contributor to Apache projects, including Apache ZooKeeper (as PMC and committer), Apache BookKeeper (as PMC and committer), and Apache Kafka. Flavio coauthored the O’Reilly ZooKeeper book. He holds a PhD in computer science from the University of California, San Diego.
©2018, O'Reilly Media, Inc.