The world is going real-time. MapReduce, SQL-on-Hadoop and similar batch processing tools are fine for analyzing and processing data after the fact — but sometimes you need to process data continuously as it comes in, and react to it within a few seconds or less. How do you do that at Hadoop scale?
Apache Samza is an open source stream processing framework designed to solve these kinds of problems. It is built upon YARN/Hadoop 2.0 and Apache Kafka. You can think of Samza as a real-time, continuously running version of MapReduce.
Samza has some unique features that make it powerful. It provides high performance for stateful processing jobs, including aggregation and joins between many input streams. It is designed to support an ecosystem of many different jobs written by different teams, and it isolates them from each other, so that one badly behaved job can’t affect the others.
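To make "stateful processing" concrete, here is a minimal sketch in plain Python of a task that joins two input streams and keeps a running aggregate. It does not use Samza's API. The class `PageViewCounterTask`, the stream names `profile_updates` and `page_views`, and the message fields are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


class PageViewCounterTask:
    """Hypothetical stateful task (not Samza's API): state persists across messages."""

    def __init__(self):
        # Latest known country per user, built up from the profile stream.
        self.profiles = {}
        # Running count of page views per user (the aggregation).
        self.view_counts = defaultdict(int)

    def process(self, stream, message):
        """Handle one message; return an output record or None."""
        if stream == "profile_updates":
            # Remember the profile so later page views can be joined against it.
            self.profiles[message["user_id"]] = message["country"]
            return None
        if stream == "page_views":
            user_id = message["user_id"]
            self.view_counts[user_id] += 1
            # Join: enrich the page view with the stored profile, if any.
            return {
                "user_id": user_id,
                "country": self.profiles.get(user_id, "unknown"),
                "views_so_far": self.view_counts[user_id],
            }
        raise ValueError(f"unexpected stream: {stream}")


def run(task, messages):
    """Feed (stream, message) pairs to the task one at a time, as they arrive."""
    outputs = []
    for stream, message in messages:
        result = task.process(stream, message)
        if result is not None:
            outputs.append(result)
    return outputs
```

The point of the sketch is that the task holds its own state between messages rather than recomputing from a full dataset, which is what distinguishes this style from a batch job.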
At LinkedIn, we have been using Samza in production for some time, both for internal analytics purposes and for data products that are served on the live site. In this talk, we’ll discuss our experience of working with Samza. You’ll learn about:
Martin is a committer on Apache Samza (a distributed stream processing framework), a software engineer at LinkedIn, and an author at O’Reilly (currently writing a book on designing data-intensive applications). Previously he co-founded and sold two startups, Rapportive and Go Test It. His technical blog is at martin.kleppmann.com, and he’s @martinkl on Twitter.