Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Processing millions of events per second without breaking the bank

Kartik Paramasivam (LinkedIn)
2:40pm–3:20pm Wednesday, March 15, 2017
Data engineering and architecture, Real-time applications
Location: LL20 A
Level: Intermediate
Secondary topics:  Architecture, Media, Streaming
Average rating: *****
(5.00, 2 ratings)

Who is this presentation for?

  • Software engineers, architects, and CTOs

Prerequisite knowledge

  • Basic knowledge of messaging systems (Kafka, Kinesis, etc.) and of data and stream processing systems
  • General experience with big data applications

What you'll learn

  • Understand best practices for running large-scale stream processing applications and the trade-offs between various architectural patterns
  • Gain insights into running Kafka at scale

Description

LinkedIn ingests more than a trillion messages per day into Apache Kafka. In addition, billions of update/change events per day are captured for further processing. LinkedIn uses Apache Samza to process this deluge of events. As you can imagine, keeping the hardware cost of ingesting all of this data into Kafka and processing it in real time under control is of the utmost importance to LinkedIn.

As always, it comes down to how efficiently resources are used. Kafka has always excelled at optimizing network usage by compressing data at source and pushing the network to its limit. When it comes to disks and CPUs, it is not that simple. How much data you need to store and can pack into every disk will typically decide how big your cluster is going to be. Depending on your hardware specifications, CPU utilization can also be a consideration for your Kafka clusters.
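
As a rough illustration of the disk-driven sizing argument above, here is a back-of-envelope sketch that estimates broker count from retained data volume alone. The function name, the headroom factor, and every input number are hypothetical assumptions chosen for illustration; they are not LinkedIn figures, and the estimate ignores CPU and network limits.

```python
import math


def brokers_needed(ingest_bytes_per_day, retention_days, replication_factor,
                   usable_bytes_per_broker, headroom=0.7):
    """Estimate broker count from storage alone (hypothetical helper)."""
    # Every retained byte is stored once per replica.
    total_bytes = ingest_bytes_per_day * retention_days * replication_factor
    # Assumption: disks are only filled up to the headroom fraction.
    effective_capacity = usable_bytes_per_broker * headroom
    return math.ceil(total_bytes / effective_capacity)


TB = 10**12
# Hypothetical inputs: 100 TB/day after compression, 4 days of retention,
# 3 replicas, 40 TB of usable disk per broker.
print(brokers_needed(100 * TB, 4, 3, 40 * TB))  # prints 43
```

Under these assumptions, storage rather than network throughput sets the cluster size: halving retention or packing more data onto each disk shrinks the broker count roughly in proportion.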

Kartik Paramasivam discusses some of the key improvements to Apache Kafka that are critical to keeping costs under control, drawing on his experience running hundreds of stream processing applications at LinkedIn. When it comes to processing millions of events per second, how efficiently your stream processor accesses state and data heavily influences the amount of resources you need to run your application. Kartik shares performance data showing how accessing state that is local to (embedded in) your stream processor can have a huge benefit over the more common pattern of accessing state directly from databases. (However, although local state is fantastic for performance, it is very hard to make it reliable in a 24/7 production environment.) Kartik also explores the challenges LinkedIn has faced and outlines how the company uses Samza in conjunction with its change capture systems and Kafka to achieve top performance without compromising on reliability and stability, and without breaking the bank.
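
The two state-access patterns contrasted above can be sketched in a few lines. This is a simplified illustration, not Samza's API: `RemoteDatabase`, both `process_with_*` functions, and the sample data are hypothetical, and the sketch applies the whole change stream before processing events, whereas a real job would interleave the two.

```python
class RemoteDatabase:
    """Stand-in for a remote store; counts round trips instead of doing I/O."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1  # each lookup would cost a network hop
        return self._rows.get(key)


def process_with_remote_state(events, db):
    # Pattern 1: query the database once per incoming event.
    return [(member_id, action, db.get(member_id))
            for member_id, action in events]


def process_with_local_state(events, changelog):
    # Pattern 2: build an embedded store from a change-capture stream,
    # then serve every lookup from local memory.
    local_store = {}
    for key, value in changelog:
        local_store[key] = value  # later changes overwrite earlier ones
    return [(member_id, action, local_store.get(member_id))
            for member_id, action in events]


profiles = {1: "engineer", 2: "recruiter"}
events = [(1, "view"), (2, "click"), (1, "click")]

db = RemoteDatabase(profiles)
remote_result = process_with_remote_state(events, db)
local_result = process_with_local_state(events, profiles.items())

print(remote_result == local_result)  # prints True
print(db.round_trips)                 # prints 3
```

Both patterns produce the same enriched events, but the remote pattern pays one round trip per event while the local pattern pays none. The cost of the local pattern is that the embedded store must be kept current and recoverable, which is the reliability challenge the talk addresses.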

Kartik Paramasivam

LinkedIn

Kartik Paramasivam is a senior software engineering leader at LinkedIn. Kartik specializes in cloud computing, distributed systems, enterprise and cloud messaging, stream processing, the internet of things, web services, middleware platforms, application hosting, and enterprise application integration (EAI). He has authored a number of patents. Kartik holds a bachelor of engineering from the Maharaja Sayajirao University of Baroda and an MS in computer science from Clemson University.

Comments

Aditya Verma | SENIOR SOLUTIONS DELIVERY MANAGER
03/28/2017 12:37am PDT

Hello Kartik,

Thanks for the great session. Would you be able to share the material used for the presentation?

Senthilnathan Doraiswamy | SR APPS PROGRAMMER
03/27/2017 4:37am PDT

Is it possible for you to upload the slide decks from your session?