Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Going real time: Creating online datasets for personalization

Christopher Colburn (Netflix), Monal Daxini (Netflix)
11:50am–12:30pm Wednesday, March 15, 2017
Data engineering and architecture
Location: LL20 A
Level: Intermediate
Secondary topics: Architecture, Data Platform, Media
Average rating: 4.00 (3 ratings)

Who is this presentation for?

  • Data engineers, data scientists, and engineering managers

Prerequisite knowledge

  • Basic knowledge of ETL and online/offline architecture

What you'll learn

  • Explore an example of stream processing large personalization datasets at scale at Netflix
  • Understand how to make the switch from batch to stream processing and the costs and requirements for making the transition successfully
  • Gain exposure to some of the technical challenges you should expect along the way


In the past, real-time data processing was typically reserved for answering operational questions and very basic analytical questions, but with better processing frameworks and more capable hardware, the streaming context can now enable personalization applications. However, the decision to switch from processing data in an offline/batch ETL application to an online/streaming ETL application requires that you consider, at the very least, the architecture of your company, the scope and requirements of your business problem, the costs and benefits of competing solutions, and the technical composition of your team.

Christopher Colburn and Monal Daxini explore the challenges faced when building a streaming application at scale at Netflix. Christopher and Monal share their experience with stream processing unbounded datasets in the personalization space. These datasets are both massive and product facing—they directly affect the customer’s personalized experience—which means that the impact is high and tolerance for failure is low. Christopher and Monal outline the experiments they did to compare Spark and Flink, the impact that their work had on Netflix’s customers, and, most importantly, the places they failed.

More specifically, Christopher and Monal discuss two event-based, unbounded datasets they created at Netflix. The first contains the set of playback events that are used as feedback for all personalization algorithms. These are plays that exhibit specific, interesting behaviors that are particularly predictive of a customer’s “enjoyment” of their service. To build this stream, they made a variety of REST calls to other online services, which introduce interesting complications in the Netflix environment. The second stream joins this enriched playback dataset with data logged from a variety of other backend servers. Ultimately, it’s this second stream that is consumed by the algorithms at Netflix to improve personalization.
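The two-stage shape described above (enrich a playback stream through an external lookup, then join the result with logs from other servers) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the speakers' implementation: every name here (PlaybackEvent, ServerLog, lookup_runtime, play_id, and so on) is hypothetical, the REST call is replaced by an ordinary function argument, and the join is a simple one-to-one buffered join with no windowing or state expiry, which a production streaming job on unbounded data would need.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, Optional, Tuple, Union


@dataclass(frozen=True)
class PlaybackEvent:          # hypothetical raw playback event
    play_id: str
    title_id: str
    seconds_watched: int


@dataclass(frozen=True)
class EnrichedPlay:           # hypothetical event after enrichment
    play_id: str
    title_id: str
    seconds_watched: int
    title_runtime: int        # value fetched from the external lookup


@dataclass(frozen=True)
class ServerLog:              # hypothetical record logged by another backend server
    play_id: str
    row_position: int


def enrich(
    events: Iterable[PlaybackEvent],
    lookup_runtime: Callable[[str], Optional[int]],
) -> Iterator[EnrichedPlay]:
    """Stage 1: attach externally looked-up data to each event.

    lookup_runtime stands in for a call to another online service.
    Events whose lookup returns None are dropped in this sketch.
    """
    for event in events:
        runtime = lookup_runtime(event.title_id)
        if runtime is None:
            continue
        yield EnrichedPlay(
            event.play_id, event.title_id, event.seconds_watched, runtime
        )


def join_by_play_id(
    records: Iterable[Union[EnrichedPlay, ServerLog]],
) -> Iterator[Tuple[EnrichedPlay, ServerLog]]:
    """Stage 2: join two interleaved streams on play_id.

    Each record is buffered until its partner arrives, then the pair is
    emitted. Unmatched records stay buffered forever here (no expiry).
    """
    pending_plays: Dict[str, EnrichedPlay] = {}
    pending_logs: Dict[str, ServerLog] = {}
    for record in records:
        if isinstance(record, EnrichedPlay):
            log = pending_logs.pop(record.play_id, None)
            if log is not None:
                yield (record, log)
            else:
                pending_plays[record.play_id] = record
        else:
            play = pending_plays.pop(record.play_id, None)
            if play is not None:
                yield (play, record)
            else:
                pending_logs[record.play_id] = record
```

Passing the lookup in as a function keeps the enrichment stage testable without a network; in a real deployment that call is where latency, retries, and failures of the remote service would have to be handled.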


Christopher Colburn


Christopher Colburn is just another data scientist at Netflix.


Monal Daxini


Monal Daxini is an engineering manager at Netflix, where he is building a scalable and multitenant event processing pipeline and leads the infrastructure for stream processing as a service. He has worked on Netflix’s Cassandra and Dynomite infrastructure and was instrumental in developing the encoding compute infrastructure for all Netflix content. Monal has 15 years of experience building distributed systems at organizations like Netflix and Cisco.

Comments on this page are now closed.


03/28/2017 12:46am PDT

Hello Chris,

Thanks for the great session. Would you be able to share the material used for presentation?