Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Managing Uber's data workflows at scale

Alex Kira (Uber)
4:20pm-5:00pm Wednesday, March 27, 2019
Average rating: 4.00 (13 ratings)

Who is this presentation for?

  • Data engineers, data infrastructure engineers, distributed systems engineers, product managers, platform software engineers, and data leaders

Level

Intermediate

Prerequisite knowledge

  • General knowledge of the data ecosystem and distributed systems

What you'll learn

  • Understand the considerations when choosing a data workflow system
  • Learn how to apply distributed systems concepts to solve data challenges
  • Explore Uber’s data workflow system

Description

Uber operates at scale, with thousands of microservices serving millions of rides a day and producing 100+ PB of data. This data powers multiple business use cases, such as machine learning, model training, data preparation, traditional business intelligence, visualization, and reporting, but it first needs to be ingested, transformed, and dispersed before it can provide value to the business.

To democratize data pipelines, Uber needed a central tool for authoring, managing, scheduling, and deploying data workflows at scale. Alex Kira details Uber’s journey toward a unified and scalable data workflow system for managing this data. He shares the challenges the team faced and explains how the company rearchitected several components of the system, such as scheduling and serialization, to make them highly available and more scalable. Alex also outlines future plans for making the workflow platform more streamlined and easier to use.

Topics include:

  • How Uber converged on a single data workflow system and how it leveraged existing open source tools to achieve this goal
  • Why Uber chose a centrally deployed infrastructure model and what the trade-offs were
  • How choosing a system that provides easy pipeline authoring, programmatic pipeline generation, visualization, and built-in logging allowed Uber to successfully democratize data pipeline creation
  • The importance of system isolation from user code and how Uber achieved this through metadata serialization
  • How Uber rearchitected its scheduler using ZooKeeper to provide high availability and horizontal scalability
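
To make the metadata serialization topic above more concrete, here is a minimal sketch of the general idea: workflow definitions written in user code are converted to plain metadata, and the scheduler works only from that metadata, so it never has to import or execute user code. This illustrates the concept and is not Uber's implementation. The names Task, serialize_workflow, and runnable_order, the workflow name, and the shell commands are all hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Task:
    """One workflow step (hypothetical); `command` is opaque to the scheduler."""
    task_id: str
    command: str
    upstream: list = field(default_factory=list)


def serialize_workflow(name, tasks):
    """Authoring side: turn user-defined task objects into plain JSON metadata."""
    return json.dumps({"name": name, "tasks": [asdict(t) for t in tasks]},
                      sort_keys=True)


def runnable_order(metadata_json):
    """Scheduler side: derive an execution order from the metadata alone.

    No user module is imported here; only the JSON document is read.
    """
    spec = json.loads(metadata_json)
    pending = {t["task_id"]: set(t["upstream"]) for t in spec["tasks"]}
    order = []
    while pending:
        # Tasks whose upstream dependencies have all been scheduled.
        ready = sorted(tid for tid, deps in pending.items() if not deps)
        if not ready:
            raise ValueError("cycle or unknown dependency in workflow metadata")
        for tid in ready:
            order.append(tid)
            del pending[tid]
        for deps in pending.values():
            deps.difference_update(ready)
    return order


# Hypothetical pipeline mirroring the ingest/transform/disperse stages.
tasks = [
    Task("ingest", "ingest.sh"),
    Task("transform", "transform.sh", upstream=["ingest"]),
    Task("disperse", "disperse.sh", upstream=["transform"]),
]
metadata = serialize_workflow("daily_rides", tasks)
print(runnable_order(metadata))
```

Because the scheduler only parses JSON, an error in a user's pipeline definition surfaces at serialization time on the authoring side rather than inside the scheduler process.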

Alex Kira

Uber

Alex Kira is an engineering tech lead at Uber, where he works on the data workflow management team. His team provides a data infrastructure platform for thousands of engineers, data scientists, and city ops, empowering them to own and manage their data pipelines. During his 19-year career, he’s gained experience across several software disciplines, including distributed systems, data infrastructure, and full-stack development, giving him a holistic systems view of his projects. He holds an undergraduate degree in computer science from the University of Miami and a master’s degree from the Georgia Institute of Technology. In his free time, Alex enjoys hiking around the Bay Area, rock climbing, and traveling internationally.