Mar 15–18, 2020

Cost-effective, real-time operational insights into production systems at Netflix

Jeff Chao (Netflix)
11:50am–12:30pm Wednesday, March 18, 2020
Location: LL21B
Secondary topics: Streaming and IoT

Who is this presentation for?

Data engineers, data architects, developers

Netflix has experienced an unprecedented global increase in membership over the last several years. Not only does it have more members globally, but those members are also consuming more Netflix. As a result, production outages today have a far greater impact in much less time than they did years before.

In order to continue providing great experiences for its members, Netflix has to make sure the sophistication of its systems outpaces the growth and engagement of its members. Concretely, its mean time to detect (MTTD) and mean time to resolve (MTTR) need to decrease much faster than Netflix membership and consumption increase. Netflix accomplishes this by having access to highly granular, real-time operational insights into its streaming and studio systems.

However, while this level of visibility into its production systems is valuable, it can quickly become cost prohibitive. It’s equally important that these systems don’t end up costing more than the actual streaming and studio systems. To meet all of these needs, Netflix has built and open-sourced Mantis, a platform that makes it easy for developers to build real-time, cost-effective, operations-focused applications.

Jeff Chao shares technical details about Mantis and provides examples of how Netflix uses Mantis to operate its production systems more effectively.

Mantis has been live in production for several years and has delivered tremendous value to the company in operating tier-1 critical systems. It processes trillions of events and petabytes of data every day, enabling Netflix to derive meaningful operational insights from its streaming and studio systems and ultimately reducing production impact on its members.

With Mantis, Netflix is able to economically ask and answer new questions in real time about its systems without having to add new instrumentation. The company can answer questions like “Which members are seeing playback issues for Stranger Things, season 3, episode 1 on iPhone in Canada?” without incurring heavy costs to its infrastructure bill.
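The abstract does not describe Mantis's own API. As a rough illustration of why a question like this needs no new instrumentation, here is a minimal sketch. It assumes the relevant playback events are already being emitted and answers the question after the fact by applying a predicate to the event stream. This is plain Python over an in-memory list, not Mantis code. The `PlaybackEvent` fields, the `"CA"` country code, and the helper names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class PlaybackEvent:
    # Hypothetical event shape; field names are illustrative, not Mantis's schema.
    member_id: str
    title: str
    season: int
    episode: int
    device: str
    country: str  # assumed ISO 3166-1 alpha-2 code, e.g. "CA"
    error: bool   # True if this playback attempt hit an issue


Predicate = Callable[[PlaybackEvent], bool]


def matching(events: Iterable[PlaybackEvent], predicate: Predicate) -> Iterator[PlaybackEvent]:
    """Lazily yield only the events that satisfy an ad hoc predicate."""
    for event in events:
        if predicate(event):
            yield event


def affected_members(events: Iterable[PlaybackEvent], predicate: Predicate) -> set:
    """Collect the distinct member IDs among the matching events."""
    return {event.member_id for event in matching(events, predicate)}


def stranger_things_s3e1_iphone_canada(event: PlaybackEvent) -> bool:
    """The example question from the abstract, expressed as a predicate."""
    return (
        event.error
        and event.title == "Stranger Things"
        and event.season == 3
        and event.episode == 1
        and event.device == "iPhone"
        and event.country == "CA"
    )


if __name__ == "__main__":
    stream = [
        PlaybackEvent("m1", "Stranger Things", 3, 1, "iPhone", "CA", True),
        PlaybackEvent("m2", "Stranger Things", 3, 1, "iPhone", "US", True),
        PlaybackEvent("m3", "Stranger Things", 3, 1, "iPhone", "CA", False),
        PlaybackEvent("m4", "Stranger Things", 3, 1, "iPhone", "CA", True),
    ]
    print(sorted(affected_members(stream, stranger_things_s3e1_iphone_canada)))  # ['m1', 'm4']
```

In a real streaming system, an unbounded event source would take the place of the list, and the predicate would be applied as events arrive. Because a new question only needs a new predicate, it can be asked without changing what the services emit.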

Prerequisite knowledge

  • A basic understanding of approaches to observability: aggregated logging, metrics, and traces (useful but not required)
  • Familiarity with stream processing (useful but not required)

What you'll learn

  • Understand how you can reduce MTTD and MTTR in your production systems by making real-time, granular operational insights accessible without letting them become cost prohibitive

Jeff Chao


Jeff Chao is a senior software engineer at Netflix, where he works on stream processing engines and observability platforms. Jeff builds and maintains Mantis, an open source platform that makes it easy for developers to build cost-effective, real-time, operations-focused applications. Previously, he was at Heroku, offering a fully managed Apache Kafka service.
