Presented By O'Reilly and Cloudera
Make Data Work
5–7 May, 2015 • London, UK

Systems that enable data agility: Lessons from LinkedIn

Martin Kleppmann (University of Cambridge)
13:45–14:25 Wednesday, 6 May 2015
Hadoop & Beyond
Location: Buckingham Room - Palace Suite
Average rating: 4.71 (14 ratings)
Slides: PDF

Prerequisite Knowledge

A general understanding of large-scale data processing tools, such as MapReduce or Pig, is useful. Prior experience with stream processing frameworks is not required.

Description

Congratulations, you’ve got a lot of data! Now what? How do you enable your organisation to create value from that data? What tools do your data scientists need in order to create data-driven products? How do you empower your teams to experiment, to innovate, and to be agile in response to customer needs?

In this session we will discuss LinkedIn’s approach to solving these problems, and the open source tools that were created at LinkedIn to support data agility in a large organisation. The approach boils down to a few simple ideas:

  1. Make all data available centrally, in real time. If it’s too difficult to access data across silos, you can’t derive value from it. For this purpose, LinkedIn created Apache Kafka, which can be the data exchange backbone of your organisation.
  2. Make it easy to analyse and process that data. You’ve hired smart people, now empower them to easily try out new ideas for data-driven products, and rapidly get them into production if they are good. To support this, LinkedIn created Apache Samza, a stream processing framework that provides powerful, reliable tools for working with data in Kafka.
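The first idea rests on Kafka's core abstraction: a shared, append-only log from which every team reads independently, each tracking its own position. The toy sketch below illustrates that model in Python; it is not Kafka's actual API, and the `Log` and `Consumer` classes are made up for illustration.

```python
class Log:
    """A toy append-only log, like a single Kafka partition."""

    def __init__(self):
        self.messages = []

    def append(self, message):
        self.messages.append(message)
        return len(self.messages) - 1  # offset of the new message

    def read(self, offset):
        """Return all messages from `offset` onwards; the log itself is never mutated."""
        return self.messages[offset:]


class Consumer:
    """Each consumer keeps its own offset, so teams read the same data
    independently and at their own pace."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log.read(self.offset)
        self.offset += len(batch)
        return batch


log = Log()
log.append({"event": "page_view", "user": "alice"})
log.append({"event": "page_view", "user": "bob"})

analytics = Consumer(log)       # e.g. a data science team
search_index = Consumer(log)    # e.g. a product team, reading the same log

print(len(analytics.poll()))     # 2: analytics sees both events
log.append({"event": "click", "user": "alice"})
print(len(analytics.poll()))     # 1: only the event appended since last poll
print(len(search_index.poll()))  # 3: independent offset, sees all three
```

Because consumers never modify the log, adding a new team's consumer has no effect on anyone else — which is what makes a central log workable as an organisation-wide data backbone.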

Since Kafka and Samza are open source, you can apply these lessons and start implementing your own agile data pipeline today. In this talk you’ll learn about:

  • How Kafka and Samza reliably scale to millions of messages per second
  • What kinds of real-time data problems you can solve with Samza
  • How Samza compares to other stream processing frameworks
  • How data streams support collaboration between different data science, product, and engineering teams within an organisation
  • Lessons learned on how to move fast without breaking things

Martin Kleppmann

University of Cambridge

Martin is a software engineer and entrepreneur, specialising in the data infrastructure of Internet companies. His last startup, Rapportive, was acquired by LinkedIn in 2012. He is a committer for Apache Samza and Apache Avro, and author of the O’Reilly book Designing Data-Intensive Applications. His technical blog is at martin.kleppmann.com.