For a long time, a substantial portion of the data processing that companies did ran as big batch jobs—CSV files dumped out of databases, log files collected at the end of the day, etc. But businesses operate in real time, and the software they run is catching up. Rather than processing data only at the end of the day, why not react to it continuously as the data arrives? This is the emerging world of stream processing.
But stream processing only becomes possible when the underlying data capture is itself done in a streaming fashion; after all, you can’t process a daily batch of CSV dumps as a stream. This shift toward stream processing has driven the popularity of Apache Kafka. Making all of an organization’s data available centrally as free-flowing streams lets business logic be expressed as stream processing operations. In this world, applications are essentially stream processors.
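To make the idea of business logic as a stream processing operation concrete, here is a minimal sketch in plain Python. It does not use Kafka or any Kafka API; the event shape, `order_stream`, and `running_totals` are hypothetical names chosen for illustration, and the generator simply stands in for an unbounded source such as a topic.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape (an assumption for this sketch): (customer_id, order_amount).
Event = Tuple[str, float]


def running_totals(events: Iterable[Event]) -> Iterator[Tuple[str, float]]:
    """Emit an updated per-customer total as each event arrives,
    instead of waiting for an end-of-day batch."""
    totals: Dict[str, float] = defaultdict(float)
    for customer, amount in events:
        totals[customer] += amount
        # One output per input event: downstream consumers see results immediately.
        yield customer, totals[customer]


def order_stream() -> Iterator[Event]:
    """Stand-in for an unbounded source; a real system would read from a
    continuously updated log rather than a fixed list of events."""
    yield ("alice", 20.0)
    yield ("bob", 5.0)
    yield ("alice", 12.5)


if __name__ == "__main__":
    for customer, total in running_totals(order_stream()):
        print(customer, total)
```

The same function works unchanged whether the input is three events or an endless feed, which is the sense in which the application itself becomes a stream processor.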
Neha Narkhede explains how Apache Kafka serves as a foundation for streaming data applications that consume and process real-time data streams. She introduces Kafka Connect, a system for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library, and describes the lessons companies like LinkedIn learned building massive streaming data architectures.
Neha Narkhede is the cofounder and head of engineering at Confluent, a company backing the popular Apache Kafka messaging system. Prior to founding Confluent, Neha led streams infrastructure at LinkedIn, where she was responsible for LinkedIn’s petabyte-scale streaming infrastructure built on top of Apache Kafka and Apache Samza. Neha specializes in building and scaling large distributed systems and is one of the initial authors of Apache Kafka. A distributed systems engineer by training, Neha works with data scientists, analysts, and business professionals to move the needle on results.