Sep 23–26, 2019

Creating an extensible 100+ PB real-time big data platform by unifying storage and serving

Reza Shiftehfar (Uber Technologies)
2:05pm2:45pm Thursday, September 26, 2019
Location: 1A 23

Who is this presentation for?

All executives, managers, infra/data architects, data engineers, software developers working at companies with expanding datasets who needs to build an infrastructure that should last them in future with extremely large data size, real-time latency and ML applications




This talk will reflect on the design and architecture of an effective big data platform that can ingest, store, and serve 100+ PB of data with minute level latency while efficiently utilizing the hardware. We’ll dive into the technical aspects of how the ingestion platform can be re-architected to bring in 10+ trillion events/day at minute-level latency, how the storage platform can be scaled, and how the processing platform can be redesigned to efficiently serve millions of queries and jobs/day. The audience will leave the talk with greater insight into how things work in an extensible modern Big Data platform and will be inspired to re-envision their own data platform to make it more generic and flexible for future new requirements.

The motivation for this talk is Uber’s business needs for real-time Big data. Uber’s mission is to ignite opportunities by setting the world in motion. To fulfill this mission, Uber relies heavily on making data-driven decisions in every product area and we need to store and process an ever-increasing amount of data. To this end, we had redesigned traditional Big Data platform solutions to provide faster, more reliable, and more-performant access by adding a few critical technologies that overcome their limitations. In this talk, we will provide a behind-the-scenes look at the current Big data technology landscape, including various existing open-source technologies (e.g. Hadoop, Spark, Hive, Presto, Kafka, Avro) as well as what we had to build at Uber and open-source to fill the gaps and push the boundaries such as Hudi and Marmaray.

Hudi is an open-source analytical storage system created at Uber to manage petabytes of data on HDFS-like distributed storage. Hudi provides near real-time ingestion and provides different views of the data – read optimized view for batch analytics, real-time view for driving dashboards, incremental view for powering data pipelines. Hudi also effectively manages files on underlying storage to maximize operational health & reliability. In this talk, We’ll dive into the technical aspect of how Hudi lowers data latency across the board while simultaneously achieving orders of magnitude of efficiency over traditional batch ingestion. We will make the case for near real-time dashboards built on top of Hudi datasets, that can be cheaper than pure streaming architectures.

Marmaray is an open-source plug-in based pipeline platform connecting any arbitrary data source to any data sink. It allows unified and efficient ingestion of raw data from a variety of sources to Hadoop as well as the dispersal of the derived analysis result out of Hadoop to any online data store. This talk will also look into how we built and designed a common set of abstractions to handle both the ingestion and dispersal use cases, the challenges and lessons learned both from developing the core library as well as setting up an on-demand self-service workflow, and how we scaled the platform to move around billions of record per day.

Prerequisite knowledge

High-level familiarity with Big Data ecosystem and challenges when data grows beyond a few PetaBytes

What you'll learn

- The audience will learn how to build a modern Big Data platform that expands beyond 100+ PetaBytes of data while providing real-time access - The audience will understand the internal design and architectural limitations of many popular existing open-source Big Data solutions (i.e. Hadoop, Spark, Hive, Presto, Kafka, Avro, Parquet) and how to overcome them to scale their data platform - The audience will learn about the internals of some of the open-sourced technologies from Uber (i.e. Hudi and Marmaray) and how they fit in the existing open-source Big Data ecosystem to help push the boundaries on speed and scale of traditional Big Data platforms.
Photo of Reza Shiftehfar

Reza Shiftehfar

Uber Technologies

Reza Shiftehfar currently leads Uber’s Hadoop Platform teams. His teams help build and grow Uber’s reliable and scalable Big Data platform that serves petabytes of data utilizing technologies such as Apache Hadoop, Apache Hive, Apache Kafka, Apache Spark, and Presto. Reza is one of the founding engineers of Uber’s Data team and helped scale Uber’s data platform from a few terabytes to over 100 petabytes while reducing the data latency from 24+ hours to minutes. Reza holds a Ph.D. in Computer Science from the University of Illinois, Urbana-Champaign with focus on building Mobile Hybrid Cloud applications.

Leave a Comment or Question

Help us make this conference the best it can be for you. Have questions you'd like this speaker to address? Suggestions for issues that deserve extra attention? Feedback that you'd like to share with the speaker and other attendees?

Join the conversation here (requires login)

Contact us

For conference registration information and customer service

For more information on community discounts and trade opportunities with O’Reilly conferences

For information on exhibiting or sponsoring a conference

Contact list

View a complete list of Strata Data Conference contacts