Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Model serving via Pulsar functions

Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)
14:55–15:35 Wednesday, 1 May 2019
Average rating: 3.00 (1 rating)

Level

Intermediate

What you'll learn

  • Learn how to approach model serving in production

Description

Machine learning (ML) and AI are invading every enterprise. While a lot of progress has been made on model training over the years, model serving has not received much attention. Today, model serving is carried out in an ad hoc fashion, with the model often encoded in the application code. When a new model is available, the application must be changed, debugged, and redeployed. This lengthens the time it takes to get new models into production and to get the feedback needed to improve them. The problem is compounded by the fact that responsibility for model training and model serving often lies with different teams.

Apache Pulsar provides native support for serverless functions, which process data in a streaming fashion as soon as it arrives and offer flexible deployment options (thread, process, or container). Arun Kejariwal and Karthik Ramasamy walk you through an architecture, built on Apache Pulsar, in which models are served in real time and updated without restarting the application at hand. You’ll learn how these functions make data engineering easier, especially for common tasks such as data transformation, data extraction, content routing, and content filtering. Arun and Karthik then describe how to apply Pulsar functions to two example use cases, sampling and filtering, and explore a concrete case study of these use cases.
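As a rough illustration of the ideas in this paragraph, the sketch below models filtering, sampling, and a model that is swapped without a restart as plain Python callables. It does not use the Pulsar client or reproduce the speakers' architecture; `make_filter`, `make_sampler`, and `ModelServer` are hypothetical names introduced here, and returning `None` to mean "emit nothing" is a convention of this sketch.

```python
import random


def make_filter(predicate):
    """Build a function that passes an event through or drops it (None)."""
    def process(event):
        return event if predicate(event) else None
    return process


def make_sampler(rate, seed=None):
    """Build a function that keeps roughly `rate` of the events it sees."""
    rng = random.Random(seed)

    def process(event):
        # rng.random() is in [0.0, 1.0), so rate=1.0 keeps everything
        # and rate=0.0 drops everything.
        return event if rng.random() < rate else None
    return process


class ModelServer:
    """Hypothetical holder for the current model (any callable).

    swap() replaces the model in place, so callers of process() pick up
    the new model on the next event without being restarted.
    """

    def __init__(self, model):
        self._model = model
        self.version = 1

    def swap(self, new_model):
        self._model = new_model
        self.version += 1

    def process(self, event):
        return self._model(event)


# Small demo: filter events, score them, then swap the model and score again.
keep_large = make_filter(lambda x: x >= 10)
server = ModelServer(lambda x: x * 2)
events = [3, 12, 25]

scored_v1 = [server.process(e) for e in events if keep_large(e) is not None]
server.swap(lambda x: x + 1)  # the new model takes effect immediately
scored_v2 = [server.process(e) for e in events if keep_large(e) is not None]
```

In a streaming deployment, the call to `swap` would presumably be triggered by a message on a separate model-update channel, which is an assumption of this sketch and not a detail given in the abstract.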

As serverless applications grow more complex, function composition (the ability for functions to call each other) becomes increasingly important. Event processing guarantees, typically known in the data processing space as at-least-once, at-most-once, and exactly-once, also become more interesting and challenging in function composition and workflows. Arun and Karthik conclude by addressing the challenge of composing serverless functions in a fault-tolerant way.
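To make the processing guarantees concrete, here is a minimal simulation, not the approach presented in the talk, of a composed pipeline under at-least-once delivery. Every message's first acknowledgement is deliberately lost, so each message is processed twice; a non-idempotent counter double-counts, while a sink that deduplicates by message ID does not. All names (`compose`, `DedupSink`, `deliver_at_least_once`, `LostAck`) are hypothetical and introduced only for illustration.

```python
class LostAck(Exception):
    """Simulates a failure after the work was done but before the ack."""


def compose(*stages):
    """Chain functions; a stage returning None drops the event."""
    def pipeline(event):
        for stage in stages:
            event = stage(event)
            if event is None:
                return None
        return event
    return pipeline


class DedupSink:
    """Idempotent sink: a message ID is applied at most once."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def write(self, msg_id, value):
        if msg_id in self.seen:
            return False  # duplicate delivery, ignore
        self.seen.add(msg_id)
        self.total += value
        return True


def deliver_at_least_once(handler, msg_id, payload, max_attempts=3):
    """Redeliver until the handler returns without raising LostAck."""
    for _ in range(max_attempts):
        try:
            handler(msg_id, payload)
            return
        except LostAck:
            continue
    raise RuntimeError("gave up on message %r" % (msg_id,))


# Pipeline: drop negative values, then double what remains.
pipeline = compose(lambda x: x if x >= 0 else None, lambda x: x * 2)
sink = DedupSink()
naive = {"total": 0}   # non-idempotent side effect, for comparison
failed_once = set()


def handler(msg_id, payload):
    result = pipeline(payload)
    if result is not None:
        naive["total"] += result    # counted on every delivery
        sink.write(msg_id, result)  # counted once per message ID
    if msg_id not in failed_once:   # lose the first ack of every message
        failed_once.add(msg_id)
        raise LostAck(msg_id)


for msg_id, payload in enumerate([1, -5, 4]):
    deliver_at_least_once(handler, msg_id, payload)
```

After the loop, `naive["total"]` reflects every redelivery, whereas `sink.total` reflects each message once. Pairing at-least-once delivery with idempotent or deduplicated writes is one common way to approximate exactly-once results.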

Arun Kejariwal

Independent

Arun Kejariwal is an independent lead engineer. Previously, he was a statistical learning principal at Machine Zone (MZ), where he led a team of top-tier researchers working on novel techniques for install-and-click fraud detection, assessing the efficacy of TV campaigns, and optimizing marketing campaigns; his team also built novel methods for bot detection, intrusion detection, and real-time anomaly detection. At Twitter, he developed and open-sourced techniques for anomaly detection and breakout detection. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Karthik Ramasamy

Streamlio

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); worked briefly on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper. He’s the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin–Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.