Mar 15–18, 2020

Building a self-service platform for continuous, real-time feature generation.

Sherin Thomas (Lyft)
12:00pm–12:30pm Monday, March 16, 2020
Location: LL20A

Who is this presentation for?

Data engineers, data architects, developers




At Lyft, all of our systems, including client applications, generate many millions of events per second. These events are ingested by the event ingestion pipeline, streamed through Kinesis and Kafka, and also made available in persistent stores such as Hive for offline consumption.

This data can be used to generate features for ML models as well as for any other form of real-time decision making. Our research scientists and data scientists come up with algorithms to derive features from data. The challenge lies in doing this quickly, correctly, effectively, and reliably at scale. To address it, we have built a self-service platform using Flink, Beam, and Kubernetes that can be used to write, prototype, and deploy stateful computations on high-throughput streaming data.

With this platform, we have tried to abstract away the challenges of provisioning, data discovery, bootstrapping, skew, late-arriving and unordered events, downtime, and so on, so that our experts can focus on what they do best without having to worry about managing and scaling a distributed system.
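
To make the "late-arriving and unordered events" problem concrete, here is a minimal sketch in plain Python of an event-time tumbling-window count with a watermark. It is an illustration only, not Lyft's platform code and not the Flink or Beam API; the class name `TumblingWindowCounter`, its parameters, and the sample `rider_1` events are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TumblingWindowCounter:
    """Counts events per key in fixed event-time windows (hypothetical sketch)."""
    window_size: int        # window length, in seconds of event time
    allowed_lateness: int   # how far the watermark trails the newest event
    watermark: int = 0      # event time up to which results are considered final
    open_windows: dict = field(default_factory=lambda: defaultdict(int))
    dropped: int = 0        # events that arrived after their window closed

    def process(self, key, event_time):
        """Ingest one event and return (key, window_start, count) for closed windows."""
        window_start = event_time - event_time % self.window_size
        if window_start + self.window_size <= self.watermark:
            # The window was already emitted, so this event is too late to count.
            self.dropped += 1
        else:
            self.open_windows[(key, window_start)] += 1
        # The watermark only moves forward, even when events arrive out of order.
        self.watermark = max(self.watermark, event_time - self.allowed_lateness)
        return self._emit_closed()

    def _emit_closed(self):
        closed = sorted(
            k for k in self.open_windows
            if k[1] + self.window_size <= self.watermark
        )
        return [(key, start, self.open_windows.pop((key, start)))
                for key, start in closed]


# Event times arrive out of order: 4 is late but tolerated, 3 is too late.
counter = TumblingWindowCounter(window_size=10, allowed_lateness=5)
emitted = []
for event_time in [1, 7, 12, 4, 16, 3]:
    emitted.extend(counter.process("rider_1", event_time))
```

The event at time 4 arrives after the event at time 12 but is still counted, because the watermark has not yet passed the end of its window. The event at time 3 arrives after that window has been emitted and is dropped. A production stream processor handles the same trade-off between completeness and latency with far more machinery, such as checkpointed state and configurable triggers.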

Computations can be expressed in SQL and Python and prototyped in an interactive interface, making it easy for even someone with no programming background to hit the ground running on day one.

In this talk, I will cover the challenges of building such a system, common pitfalls, and lessons learned, as well as wins!

Prerequisite knowledge

Rudimentary knowledge of Machine Learning and its applications

What you'll learn

1. Stream processing
2. Building scalable solutions for doing stateful computations on real-time as well as offline data
3. Dealing with entropy in data systems, such as skew and unordered events

Sherin Thomas


Sherin is a Software Engineer at Lyft. In her career spanning 8 years, she has worked on most parts of the tech stack, but enjoys the challenges in Data Science and Machine Learning the most. Most recently she has been focussed on building products that would facilitate advances in Artificial Intelligence and Machine Learning through Streaming.

She is passionate about getting more people, especially women, interested in this field and has been trying her best to share her work with the community through tech talks and panel discussions. Most recently she gave a talk about Machine Learning Infra and Streaming at Beam Summit as well as Flink Forward in Berlin.

In her free time she loves to read and paint. She is also the president of the Russian Hill book club based in San Francisco and loves to organize events for her local library.
