Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Building data pipelines with Apache Kafka (Half Day)

Joseph Adler (Facebook), Ewen Cheslack-Postava (Confluent), Ian Wrigley (StreamSets)
1:30pm–5:00pm Tuesday, 03/29/2016
IoT and Real-time

Location: 210 A/E
Tags: real-time
Average rating: 3.06 (32 ratings)

Prerequisite knowledge

Participants should be familiar with databases, ETL, and other data pipelines, and should be comfortable working in SQL, coding in Java or Python, and using a Linux VM.

Materials or downloads needed in advance

We'll provide a VM image (for VirtualBox and VMware) containing all the software and data needed for the tutorial. Attendees will need to download the VM before the tutorial; we will provide details on hardware requirements before the session.

Description

Top companies like LinkedIn, Uber, and Airbnb have built production data pipelines with Apache Kafka. Joseph Adler, Ewen Cheslack-Postava, and Ian Wrigley guide participants through building a secure, scalable, reliable, high-speed ETL pipeline that moves data from production systems to analytical and reporting databases.

Topics include:

  • Copycat (released as Kafka Connect in Kafka 0.9): a Kafka feature that lets you build data pipelines that continually move data between systems (or Kafka clusters)
  • Kafka Streams: a Kafka feature that lets you write scalable and efficient streaming applications (a minimal sketch follows this list)
  • Kafka security: as of Kafka 0.9, Kafka includes features for enterprise-grade authentication, access control, and encryption
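
As a taste of the Kafka Streams topic, the sketch below shows a minimal streaming transformation written against the current Kafka Streams Java API (the tutorial used an early preview of the API, so details may differ). The topic names raw-events and clean-events are placeholders rather than part of the session materials.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class PipelineSketch {
        public static void main(String[] args) {
            // Basic configuration: application ID and the broker(s) to connect to.
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "etl-pipeline-sketch");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            // Read from a source topic, drop empty records, normalize the values,
            // and write the result to a destination topic.
            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> events = builder.stream("raw-events"); // placeholder topic
            events.filter((key, value) -> value != null && !value.isEmpty())
                  .mapValues(value -> value.trim().toLowerCase())
                  .to("clean-events");                                     // placeholder topic

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            // Close the application cleanly on shutdown.
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }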

Joseph Adler

Facebook

Joseph Adler has many years of experience in data mining and data analysis at companies including DoubleClick, Verisign, and LinkedIn. Currently, he is director of product management and data science at Confluent. He holds several patents in computer security and cryptography and is the author of Baseball Hacks and R in a Nutshell. He graduated from MIT with a BSc and MEng in computer science and electrical engineering.


Ewen Cheslack-Postava

Confluent

Ewen Cheslack-Postava is an engineer at Confluent building a stream data platform based on Apache Kafka to help organizations reliably and robustly capture and leverage all their real-time data. Ewen received his PhD from Stanford University, where he developed Sirikata, an open source system for massive virtual environments. His dissertation defined a novel type of spatial query giving significantly improved visual fidelity and described a system for efficiently processing these queries at scale.


Ian Wrigley

StreamSets

Ian Wrigley is a Technical Director at StreamSets, the company behind the industry’s first data operations platform. Over his 25-year career, Ian has taught tens of thousands of students subjects ranging from C programming to Hadoop development and administration.

Comments on this page are now closed.

Comments

04/07/2016 1:46am PDT

I attended this tutorial session. It was very informative. Can I get a copy of the slides presented during the tutorial? Thanks

sanjeev taran
03/29/2016 7:12am PDT

Please post the URL to the presentation slides

Joseph Adler
03/25/2016 1:59am PDT

Some links to help out:

You can get VirtualBox here: https://www.virtualbox.org/wiki/Downloads

You will need a Java 1.7+ JDK.

On a Mac, we recommend installing Homebrew (http://brew.sh) to help install other components.

Ian Wrigley
03/24/2016 9:09am PDT

You’ll need a machine with 4GB or more of RAM and a few GB of free hard disk space. You’ll also need VirtualBox installed (it’s free). You should get an email in the next couple of days with a link to download the VM, but if by any chance you don’t, we’ll have a few USB sticks with the VM on them at the event.

03/22/2016 5:27am PDT

What is the minimum configuration for a MacBook needed for this tutorial? Does any software need to be pre-installed?