Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

Speeding up Twitter Heron streaming by 5x

Sanjeev Kulkarni (Streamlio), Maosong Fu (Twitter)
17:2518:05 Wednesday, 24 May 2017
Stream processing and analytics
Location: Capital Suite 8/9
Level: Beginner

Who is this presentation for?

  • Data engineers, data scientists, and technology leaders

Prerequisite knowledge

  • A basic understanding of streaming analytics (useful but not required)

What you'll learn

  • Explore optimization opportunities for streaming infrastructure and learn how they are implemented in Heron
  • Understand how to use these optimizations to achieve low latency and high throughput and obtain a balance of both

Description

Twitter is all about real-time data at scale. Twitter’s data centers continuously process billions of events per day the instant the data is generated. To achieve real-time performance, Twitter has developed and deployed Heron, a next-generation cloud streaming engine. Heron provides unparalleled performance at large scale and has been successfully meeting Twitter’s strict performance requirements for various streaming applications.

Heron, in production at Twitter for more than two and a half years, has proven to be scalable and reliable and is now an open source project with contributors from various institutions. However, until now, Heron has not been optimized from a performance perspective. (The performance numbers reported in the Twitter Heron paper, published in SIGMOD 2015, were without any optimizations.)

Sanjeev Kulkarni and Maosong Fu share several optimizations implemented in Heron to improve throughput by 5x and reduce latency by 50–60%, describing in detail how they identified optimization opportunities with detailed profiling that indicated several issues, including multiple serializations/deserialization, eager serialization/deserialization, and immutable design. Based on these observations, Sanjeev and Maosong came up with several techniques to mitigate these costs. Along the way, Sanjeev and Maosong show how certain parameters, such as max spout pending and cache drain frequency, affect throughput and latency, and how a careful choice of these parameters can achieve latencies as low as 12 ms.

Photo of Sanjeev Kulkarni

Sanjeev Kulkarni

Streamlio

Sanjeev Kulkarni is the cofounder of Streamlio, a company focused on building a next-generation real-time stack. Previously, he was the technical lead for real-time analytics at Twitter, where he cocreated Twitter Heron; worked at Locomatix handling the company’s engineering stack; and led several initiatives for the AdSense team at Google. Sanjeev holds an MS in computer science from the University of Wisconsin-Madison.

Photo of Maosong Fu

Maosong Fu

Twitter

Maosong Fu is the technical lead for ​Heron and ​real-time analytics at Twitter and the author of ​few publications in the distributed area​. Maosong holds a master’s degree from Carnegie Mellon University and bachelor’s from Huazhong University of Science and Technology.