In the past, real-time data processing was typically reserved for answering operational questions and very basic analytical questions, but with better processing frameworks and more capable hardware, streaming can now power personalization applications. However, the decision to switch from processing data in an offline/batch ETL application to an online/streaming ETL application requires that you consider, at the very least, the architecture of your company, the scope and requirements of your business problem, the costs and benefits of competing solutions, and the technical composition of your team.
Christopher Colburn and Monal Daxini explore the challenges faced when building a streaming application at scale at Netflix. Christopher and Monal share their experience with stream processing of unbounded datasets in the personalization space. These datasets are both massive and product-facing: they directly affect the customer’s personalized experience, so the impact is high and the tolerance for failure is low. Christopher and Monal outline the experiments they ran to compare Spark and Flink, the impact that their work had on Netflix’s customers, and, most importantly, the places where they failed.
More specifically, Christopher and Monal discuss two event-based, unbounded datasets they created at Netflix. The first contains the set of playback events that are used as feedback for all personalization algorithms. These are plays that exhibit specific, interesting behaviors that are particularly predictive of a customer’s “enjoyment” of the service. To build this stream, they made a variety of REST calls to other online services, which introduced interesting complications in the Netflix environment. The second stream joins this enriched playback dataset with data logged from a variety of other backend servers. Ultimately, it’s this second stream that is consumed by the algorithms at Netflix to improve personalization.
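As a rough illustration of the two-stage shape described above (enrich playback events through a lookup, then join them with backend logs), here is a minimal Python sketch. It is not the speakers' implementation: the field names (`session_id`, `title_id`), the `lookup` callable standing in for a REST call, and the in-memory join over a finite batch are all assumptions made for illustration. A production pipeline on Spark or Flink would join unbounded streams using windows and state rather than a dictionary.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator


def enrich(events: Iterable[Dict], lookup: Callable[[str], Dict]) -> Iterator[Dict]:
    """Attach extra attributes to each playback event.

    `lookup` is a hypothetical stand-in for a REST call to another service.
    """
    for event in events:
        yield {**event, **lookup(event["title_id"])}


def join_by_key(enriched: Iterable[Dict], logs: Iterable[Dict],
                key: str = "session_id") -> Iterator[Dict]:
    """Inner-join enriched events with backend log records on `key`.

    The logs are indexed in memory first, so this only works for a finite
    batch; it sketches the join logic, not a streaming engine.
    """
    logs_by_key = defaultdict(list)
    for log in logs:
        logs_by_key[log[key]].append(log)
    for event in enriched:
        for log in logs_by_key.get(event[key], []):
            # Event fields win when both records share a field name.
            yield {**log, **event}


if __name__ == "__main__":
    # Hypothetical sample data, invented for this sketch.
    catalog = {"t1": {"genre": "drama"}, "t2": {"genre": "comedy"}}
    plays = [{"session_id": "s1", "title_id": "t1"},
             {"session_id": "s2", "title_id": "t2"}]
    backend_logs = [{"session_id": "s1", "row": 3}]
    for record in join_by_key(enrich(plays, catalog.__getitem__), backend_logs):
        print(record)
```

Because the join is an inner join, a playback event with no matching log record (session `s2` in the sample data) is dropped from the output; whether to keep such events is a design choice the abstract does not specify.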
Christopher Colburn is just another data scientist at Netflix.
Monal Daxini is an engineering manager at Netflix, where he is building a scalable, multitenant event processing pipeline and leads the infrastructure for stream processing as a service. He has worked on Netflix’s Cassandra and Dynomite infrastructure and was instrumental in developing the encoding compute infrastructure for all Netflix content. Monal has 15 years of experience building distributed systems at organizations like Netflix, NFL.com, and Cisco.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.