Many industry segments have been grappling with fast data (high-volume, high-velocity data). The enterprises in these industry segments need to process this fast data just in time to derive insights and act upon it quickly. Such tasks include enriching data with additional contextual information, filtering and reducing noise in the data, leveraging machine learning and deep learning models to provide continuous insights on business operations, and sharing these insights with customers. To realize these results, an enterprise needs to build an end-to-end data processing system that spans data acquisition, data ingestion, data processing, model building, and serving and sharing the results. This is a significant challenge because an enterprise must choose among multiple messaging frameworks, streaming compute frameworks, and storage frameworks for real-time data.
Arun Kejariwal and Karthik Ramasamy walk you through the state-of-the-art systems for each stage of an end-to-end data processing pipeline for real-time data (messaging, compute, and storage), as well as algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams. More importantly, they present a framework to guide the selection of a system for each stage of the pipeline. Along the way, they share concrete case studies from the IoT, gaming, and healthcare industries as well as their experience operating these systems at internet scale at Twitter and Yahoo.
Adding an intelligence layer on top of an end-to-end data processing pipeline is paramount from a business perspective. To this end, Arun and Karthik outline the spectrum of different classes of data sketches for different use cases and highlight the trade-offs, such as approximate versus exact, speed versus accuracy, and generalizability versus interpretability.
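To make the approximate-versus-exact trade-off concrete, here is a minimal sketch of one well-known heavy-hitters summary, the Misra-Gries algorithm, set beside an exact count. It is an illustration only, not necessarily one of the sketches covered in the session; the function names and the `events` stream are hypothetical.

```python
from collections import Counter


def misra_gries(stream, k):
    """Approximate heavy hitters using at most k - 1 counters (Misra-Gries)."""
    if k < 2:
        raise ValueError("k must be at least 2")
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def exact_counts(stream):
    """Exact frequencies; memory grows with the number of distinct items."""
    return Counter(stream)


# Hypothetical event stream, made up for illustration.
events = ["login"] * 50 + ["click"] * 30 + ["purchase", "logout", "share", "search"] * 5

sketch = misra_gries(events, k=3)   # bounded memory, underestimates counts
exact = exact_counts(events)        # unbounded memory, exact counts

for item, estimate in sketch.items():
    print(item, "estimate:", estimate, "exact:", exact[item])
```

The summary never overestimates a count and undercounts by at most n/k for a stream of n items, so any item that occurs more than n/k times is guaranteed to survive in the summary. A larger `k` tightens the error at the cost of more memory.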
They conclude by offering perspectives on how advances in hardware technology and the emergence of new applications will impact the evolution of messaging systems, streaming compute systems, and storage systems for the streaming data of tomorrow. These systems will need to power fast processing and analysis of a large set of data streams (potentially on the order of hundreds of millions).
Arun Kejariwal is an independent lead engineer. Previously, he was a statistical learning principal at Machine Zone (MZ), where he led a team of top-tier researchers and worked on research and development of novel techniques for install-and-click fraud detection, for assessing the efficacy of TV campaigns, and for optimizing marketing campaigns. His team built novel methods for bot detection, intrusion detection, and real-time anomaly detection. He also developed and open-sourced techniques for anomaly detection and breakout detection at Twitter. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.
Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); worked briefly on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper. He’s the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin–Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.
©2019, O'Reilly Media, Inc.