Apache Kafka is a popular open source message broker for high-throughput real-time event data, such as user activity logs or IoT sensor data. It originated at LinkedIn, where it reliably handles around a trillion messages per day.
What is less widely known: Kafka is also well suited for extracting data from existing databases and making it available for analysis or for building data products. Unlike slow batch-oriented ETL, Kafka can make database data available to consumers in real time, while also allowing efficient archiving to HDFS for use in Spark, Hadoop, or data warehouses.
When data science and product teams can process operational data in real time and combine it with user activity logs or sensor data, the result is a potent mixture. Having all the data centrally available in a stream data platform is an exciting enabler for data-driven innovation.
In this talk, we will discuss what a Kafka-based stream data platform looks like, and how it is useful.
Martin Kleppmann is a researcher in distributed systems at the University of Cambridge. Previously, he cofounded and sold two startups and worked on large-scale data infrastructure at internet companies including LinkedIn. Martin is the author of Designing Data-Intensive Applications from O’Reilly.