Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY

Building a real-time analytics stack with Kafka, Samza, and Druid

Fangjin Yang (Imply), Gian Merlino (Imply)
2:55pm–3:35pm Thursday, 10/01/2015
IoT & Real-time
Location: 3D 02/11
Level: Intermediate
Average rating: 3.75 (8 ratings)

The maturation and development of open source technologies have made it easier than ever for companies to derive insights from vast quantities of data. In this session, we will cover how to build a real-time analytics stack using Kafka, Samza, and Druid.

Analytics pipelines running purely on Hadoop can suffer from hours of data lag. Initial attempts to solve this problem often lead either to inflexible solutions, where the queries must be known ahead of time, or to fragile solutions, where the integrity of the data cannot be assured. Combining Hadoop with Kafka, Samza, and Druid can guarantee system availability, maintain data integrity, and support fast and flexible queries.

In the described system, Kafka provides a fast message bus and is the delivery point for machine-generated event streams. Samza and Hadoop work together to load data into Druid. Samza handles near-real-time data, and Hadoop handles historical data and data corrections. Druid provides flexible, highly available, low-latency queries.
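The division of labor described above can be pictured with a small in-memory sketch. This is only a toy model under stated assumptions: it does not use the Kafka, Samza, or Druid APIs, and every name in it (ToyStore, rollup, hour_of, the event fields) is hypothetical. It also assumes that batch results replace real-time results one hour at a time, which is one plausible way to apply corrections and not necessarily the approach the speakers present.

```python
from collections import defaultdict

# Toy model only: none of these names come from Kafka, Samza, or Druid.


def hour_of(ts):
    """Truncate an epoch-seconds timestamp to the start of its hour."""
    return ts - ts % 3600


def rollup(events):
    """Aggregate raw events into {(hour, page): count}."""
    counts = defaultdict(int)
    for event in events:
        counts[(hour_of(event["ts"]), event["page"])] += event["count"]
    return dict(counts)


class ToyStore:
    """Stands in for the query layer (assumption: batch hours override real-time hours)."""

    def __init__(self):
        self.realtime = {}
        self.batch = {}

    def load_realtime(self, events):
        # Near-real-time path: fold new events into running aggregates.
        for key, n in rollup(events).items():
            self.realtime[key] = self.realtime.get(key, 0) + n

    def load_batch(self, events):
        # Batch path: rebuild aggregates from the corrected historical log.
        self.batch = rollup(events)

    def query(self, page):
        # Prefer batch results; fall back to real-time for hours batch has not covered.
        batch_hours = {hour for (hour, _) in self.batch}
        total = sum(n for (hour, p), n in self.batch.items() if p == page)
        total += sum(
            n
            for (hour, p), n in self.realtime.items()
            if p == page and hour not in batch_hours
        )
        return total


# The stream delivers a duplicate event in the first hour.
stream = [
    {"ts": 0, "page": "home", "count": 1},
    {"ts": 10, "page": "home", "count": 1},  # duplicate delivery
    {"ts": 3600, "page": "home", "count": 1},
]
corrected_first_hour = [{"ts": 0, "page": "home", "count": 1}]

store = ToyStore()
store.load_realtime(stream)
print(store.query("home"))  # 3: the real-time view still includes the duplicate

store.load_batch(corrected_first_hour)
print(store.query("home"))  # 2: the corrected first hour replaces the real-time one
```

In this sketch, fresh data is queryable as soon as it arrives, while the batch path later replaces any hour it has reprocessed, which is how a correction reaches the query results.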

Fangjin Yang

Imply

Fangjin Yang is a coauthor of the open source Druid project and a cofounder of Imply, a data analytics startup based in San Francisco. Previously, Fangjin held senior engineering positions at Metamarkets and Cisco Systems. Fangjin has a BASc in electrical engineering and an MASc in computer engineering from the University of Waterloo, Canada.

Gian Merlino

Imply

Gian Merlino is CTO and cofounder of Imply and is one of the original committers of the Druid project. Previously, he worked at Metamarkets and Yahoo. Gian holds a BS in computer science from the California Institute of Technology.