Exactly-once delivery is the holy grail of streaming analytics. Having duplicates of events processed in a streaming job is inconvenient and often undesirable depending on the nature of the application. For example, if billing applications miss an event or process an event twice, they could lose revenue or overcharge customers. Guaranteeing that such scenarios never happen is difficult; any project seeking such a property will need to make some choices with respect to availability and consistency. One main difficulty stems from the fact that a streaming pipeline might have multiple stages, and exactly-once delivery needs to happen at each stage. Another difficulty is that intermediate computations could potentially affect the final computation. And once results are exposed, retracting them causes problems.
These observations lead to the conclusion that providing a general solution is very difficult, if not impossible. For some classes of applications, however, it is possible to obtain exactly-once behavior. If senders are willing to retry an unbounded number of times and the delivery is idempotent, then we can create a behavior equivalent to exactly-once semantics. One way to achieve the goal of making delivery idempotent is to rely on a distributed consensus primitive to have agreement on what has been delivered. For example, if the output has a key and a value, we can make sure that the key has been produced already. Assuming we rely on persistent queues to propagate events, we can use the queue position to also disambiguate updates using this consensus primitive.
Flavio Junqueira explores what’s reasonable for streaming analytics to achieve when targeting exactly-once semantics, using examples based on systems that provide commit-log abstractions (like Apache Kafka and Apache BookKeeper) to demonstrate when it is possible to guarantee strong delivery semantics and how. Flavio shows that for a simple, but important, class of applications, it’s possible to provide strong delivery guarantees along with efficient implementations.
Flavio Junqueira is senior director of software engineering at Dell EMC, where he leads the Pravega team. He is interested in various aspects of distributed systems, including distributed algorithms, concurrency, and scalability. Previously, Flavio held an engineering position with Confluent and research positions with Yahoo Research and Microsoft Research. He is an active contributor to Apache projects, including Apache ZooKeeper (as PMC and committer), Apache BookKeeper (as PMC and committer), and Apache Kafka. Flavio coauthored the O’Reilly ZooKeeper book. He holds a PhD in computer science from the University of California, San Diego.
©2016, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.