LinkedIn ingests more than a trillion messages per day into Apache Kafka. In addition, billions of update/change events per day are captured for further processing. LinkedIn uses Apache Samza to process this deluge of events. As you can imagine, controlling the hardware cost of ingesting all of this data into Kafka and processing it in real time is of the utmost importance to LinkedIn.
As always, it comes down to how efficiently resources are used. Kafka has always excelled at optimizing network usage by compressing data at source and pushing the network to its limit. When it comes to disks and CPUs, it is not that simple. How much data you need to store and can pack into every disk will typically decide how big your cluster is going to be. Depending on your hardware specifications, CPU utilization can also be a consideration for your Kafka clusters.
Kartik Paramasivam discusses some of the key improvements to Apache Kafka that are critical to keeping costs under control, drawing on his experience running hundreds of stream processing applications at LinkedIn. When it comes to processing millions of events per second, how efficiently your stream processor accesses state and data heavily influences the amount of resources you need to run your application. Kartik shares performance data showing how accessing state that is local to (embedded in) your stream processor can have a huge benefit over the more common pattern of accessing state directly from databases. (However, although local state is fantastic for performance, it is very hard to make it reliable in a 24/7 production environment.) Kartik also explores the challenges LinkedIn has faced and outlines how the company uses Samza in conjunction with its change capture systems and Kafka to achieve top performance without compromising on reliability and stability, and without breaking the bank.
Kartik Paramasivam is a senior software engineering leader at LinkedIn. Kartik specializes in cloud computing, distributed systems, enterprise and cloud messaging, stream processing, the internet of things, web services, middleware platforms, application hosting, and enterprise application integration (EAI). He has authored a number of patents. Kartik holds a bachelor of engineering from the Maharaja Sayajirao University of Baroda and an MS in computer science from Clemson University.
©2017, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with, and does not endorse or review, the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.