Twitter is all about real time at scale. Twitter’s data centers continuously process billions of events per day the instant the data is generated. To achieve real-time performance, Twitter has developed and deployed Heron, a next-generation cloud streaming engine that provides unparalleled performance at large scale. Heron has been successfully meeting Twitter’s strict performance requirements for various streaming applications and is now an open source project with contributors from various institutions.
Bill Graham, Avrilia Floratau, and Ashvin Agrawal explore how Twitter and Microsoft have collaborated to transform Heron into a truly elastic system. The amount of data that needs to be processed in Twitter’s data centers changes significantly due to expected and unexpected global events. For example, during the Super Bowl, there are spikes of tweets that all need to be processed in real time. Similarly, unexpected events such as natural disasters can generate very large volumes of data. Twitter’s infrastructure must be robust enough to deliver real-time performance when such events take place. Heron solves this problem by supporting dynamic scaling of Heron applications, achieved by increasing or decreasing the number of Heron instances that form a topology.
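To make the scaling idea concrete, here is a minimal, purely illustrative sketch of the kind of decision involved: picking how many parallel instances a component needs to absorb the current event rate. The function name, parameters, and thresholds are assumptions for illustration; this is not Heron’s actual API.

```python
import math

def target_instances(events_per_sec: float,
                     capacity_per_instance: float,
                     min_instances: int = 1,
                     max_instances: int = 100) -> int:
    """Return the instance count needed to absorb the current load,
    clamped to a configured range (all names here are illustrative)."""
    needed = math.ceil(events_per_sec / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))

# A Super Bowl-scale spike forces a scale-out; off-peak traffic scales back in.
print(target_instances(50_000, 2_000))  # spike: 25 instances
print(target_instances(1_500, 2_000))   # off-peak: 1 instance
```

In Heron, applying such a decision means updating the parallelism of the running topology, which is where the challenges described below come in.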
Providing truly elastic scaling in the context of Heron poses several challenges. First, scaling should not disrupt event processing: Heron must be able to resize running topologies with minimal interruption. Second, Heron has a modular and extensible architecture that allows it to integrate with various schedulers and incorporate various resource management policies, so its scaling framework must be generic enough to operate efficiently on top of diverse components. Finally, Heron must allocate resources efficiently during scaling so that it makes optimal use of available cluster resources. Bill, Avrilia, and Ashvin explain how Heron addresses these challenges to gracefully handle highly varying loads without sacrificing real-time performance and conclude by sharing future plans for scaling that current and future Heron contributors could tackle.
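The resource-allocation challenge can be pictured with a small sketch: packing instances (by their RAM needs) into fixed-size containers so that few containers are wasted. This first-fit-decreasing example is a stand-in assumption for illustration; Heron’s actual packing algorithms are pluggable and more sophisticated.

```python
def pack_instances(ram_needs, container_ram):
    """Assign each instance's RAM requirement (in GB) to a container,
    opening a new container only when no existing one has room.
    Instances are placed largest-first (first-fit decreasing)."""
    containers = []  # remaining free RAM per container
    placement = []   # (instance index, container index) pairs
    for idx, need in sorted(enumerate(ram_needs), key=lambda p: -p[1]):
        for c, free in enumerate(containers):
            if free >= need:
                containers[c] -= need
                placement.append((idx, c))
                break
        else:
            containers.append(container_ram - need)
            placement.append((idx, len(containers) - 1))
    return len(containers), dict(placement)

# 12 GB of total demand fits exactly into three 4 GB containers.
n_containers, placement = pack_instances([2, 3, 1, 4, 2], container_ram=4)
print(n_containers)  # 3
```

A scaling framework that sits above pluggable schedulers must make this kind of decision without assuming any one scheduler’s container model, which is part of why a generic design matters.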
Bill Graham is a staff engineer on the Real Time Compute team at Twitter. Bill’s primary areas of focus are data processing applications and analytics infrastructure. Previously, he was a principal engineer at CBS Interactive and CNET Networks, where he worked on ad targeting and content publishing infrastructure, and a senior engineer at Logitech focusing on webcam streaming and messaging applications. Bill contributes to a number of open source projects, including HBase, Hive, and Presto, and he’s a Heron and Pig committer.
Avrilia Floratau is a senior scientist at Microsoft’s Cloud and Information Services Lab, where her research is focused on scalable real-time stream processing systems. She is also an active contributor to Heron, collaborating with Twitter. Previously, Avrilia was a research scientist at IBM Research working on SQL-on-Hadoop systems. She holds a PhD in data management from the University of Wisconsin-Madison.
Ashvin Agrawal is a senior research engineer at Microsoft, where he works on streaming systems and contributes to the Twitter Heron project. Ashvin is a software engineer with more than 10 years of experience, specializing in developing large-scale distributed systems. Previously, he worked at VMware, Yahoo, and Mojo Networks. Ashvin holds an MTech in computer science from IIT Kanpur, India.