Streaming engines like Apache Flink are redefining ETL and data processing. Data can be extracted, transformed, filtered, and written out in real time with an ease matching that of batch processing. However, the real challenge of matching the prowess of batch ETL remains in doing joins, maintaining state, and dynamically pausing or resting the data.
At Netflix, microservices serve and record many different kinds of user interactions with the product. Some of these live services generate millions of events per second, all carrying meaningful but often partial information. Things start to get exciting when the company wants to combine the events coming from one high-traffic microservice to another. Joining these raw events generates rich datasets that are used to train the machine learning models that serve Netflix recommendations.
Historically, Netflix has done this joining of large volume datasets in batch. Recently, the company asked, If the data is being generated in real time, why can’t it be processed downstream in real time? Why wait a full day to get information from an event that was generated a few minutes ago?
Sonali Sharma and Shriya Arora explain how they solved a complex join of two high-volume event streams at Netflix using Flink. You’ll learn about exploring keyed state for maintaining large state, fault tolerance of a stateful application, and strategies for failure recovery.
Sonali Sharma a data engineer on the data personalization team at Netflix, which, among other things, delivers recommendations made for each user. The team is responsible for the data that goes into training and scoring of the various machine learning models that power the Netflix home page. They have been working on moving some of the company’s core datasets from being processed in a once-a-day daily batch ETL to being processed in near real time using Apache Flink. A UC Berkeley graduate, Sonali has worked on a variety of problems involving big data. Previously, she worked on the mail monetization and data insights engineering team at Yahoo, where she focused on building great data-driven products to do large-scale unstructured data extractions, recommendation systems, and audience insights for targeting using technologies like Spark, the Hadoop ecosystem (Pig, Hive, MapReduce), Solr, Druid, and Elasticsearch.
Shirya Arora works on the data engineering team for personalization at Netflix, which, among other things, delivers recommendations made for each user. The team is responsible for the data that goes into training and scoring of the various machine learning models that power the Netflix home page. They have been working on moving some of the company’s core datasets from being processed in a once-a-day daily batch ETL to being processed in near real time using Apache Flink. Previously, she helped build and architect the new generation of item setup at Walmart Labs, moving from batch processing to stream. They used Storm and Kafka to enable a microservices architecture that allows products to be updated near real time as opposed to once a day on the legacy framework.
©2019, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org