Sep 23–26, 2019

Creating an extensible 100+ PB real-time big data platform by unifying storage and serving

2:05pm–2:45pm Thursday, September 26, 2019
Location: 1A 23/24
Average rating: *****
(5.00, 3 ratings)

Who is this presentation for?

  • Executives, managers, infra/data architects, data engineers, and software developers

Level

Intermediate

Description

Uber relies heavily on making data-driven decisions in every product area and needs to store and process an ever-increasing amount of data. But building a reliable big data platform is extremely challenging when it has to store and serve hundreds of petabytes of data in real time. The company redesigned traditional big data platform solutions to provide faster, more reliable, and more performant access by adding a few critical technologies that overcome their limitations.

Reza Shiftehfar reflects on the challenges faced and proposes architectural solutions to scale a big data platform to ingest, store, and serve 100+ PB of data with minute-level latency while efficiently utilizing the hardware and meeting security needs. You’ll get a behind-the-scenes look at the current big data technology landscape, including various existing open source technologies (e.g., Hadoop, Spark, Hive, Presto, Kafka, and Avro) as well as Uber’s own tools such as Hudi and Marmaray.

Hudi is an open source analytical storage system created at Uber to manage petabytes of data on HDFS-like distributed storage. Hudi provides near-real-time ingestion and three views of the data: a read-optimized view for batch analytics, a real-time view for driving dashboards, and an incremental view for powering data pipelines. Hudi also manages files on the underlying storage to maximize operational health and reliability. Reza details how Hudi lowers data latency across the board while simultaneously achieving orders-of-magnitude efficiency gains over traditional batch ingestion. He then makes the case for near-real-time dashboards built on top of Hudi datasets, which can be cheaper than pure streaming architectures.
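The three views can be illustrated with a toy in-memory model. This is a minimal sketch, assuming a design in which new writes land in a log and are later compacted into base files; the `ToyTable` class and its method names are hypothetical and are not Hudi's API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical stand-in for a dataset with three views (not Hudi's API)."""
    base: dict = field(default_factory=dict)  # key -> (commit_time, value), compacted data
    log: list = field(default_factory=list)   # (commit_time, key, value), in commit order

    def upsert(self, commit_time, key, value):
        # New writes are appended to the log, which keeps ingestion cheap.
        self.log.append((commit_time, key, value))

    def compact(self):
        # Fold the pending log entries into the base data.
        for commit_time, key, value in self.log:
            self.base[key] = (commit_time, value)
        self.log.clear()

    def _latest(self):
        # Latest version of every record: base data overlaid with the log.
        latest = dict(self.base)
        for commit_time, key, value in self.log:
            latest[key] = (commit_time, value)
        return latest

    def read_optimized(self):
        # Only compacted data: fast to scan, but may lag behind recent writes.
        return {k: v for k, (_, v) in self.base.items()}

    def real_time(self):
        # Compacted data merged with pending log entries: freshest result.
        return {k: v for k, (_, v) in self._latest().items()}

    def incremental(self, since):
        # Only records whose latest change was committed after `since`.
        return {k: v for k, (t, v) in self._latest().items() if t > since}


table = ToyTable()
table.upsert(1, "trip-1", "requested")
table.compact()
table.upsert(2, "trip-1", "completed")
table.upsert(3, "trip-2", "requested")

print(table.read_optimized())      # only the compacted state of trip-1
print(table.real_time())           # latest state of both trips
print(table.incremental(since=2))  # only trip-2, changed after commit 2
```

In this sketch a downstream pipeline would call `incremental` with the last commit time it processed, so it touches only changed records instead of rescanning the whole table.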

Marmaray is an open source plug-in-based pipeline platform that connects any data source to any data sink. It allows unified and efficient ingestion of raw data from a variety of sources into Hadoop as well as the dispersal of derived analysis results out of Hadoop to any online data store. Reza explains how Uber designed and built a common set of abstractions to handle both the ingestion and dispersal use cases, along with the challenges and lessons learned from developing the core library and setting up an on-demand self-service workflow. Along the way, you’ll see how Uber scaled the platform to move billions of records per day.
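The idea of one set of abstractions covering both ingestion and dispersal can be sketched as follows. This is an illustrative sketch only: the `Source`, `Sink`, `ListSource`, `ListSink`, and `run_pipeline` names are assumptions made for this example and do not reflect Marmaray's actual interfaces.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterable, Iterator


class Source(ABC):
    """Hypothetical plug-in interface for anything records can be read from."""

    @abstractmethod
    def read(self) -> Iterator[dict]: ...


class Sink(ABC):
    """Hypothetical plug-in interface for anything records can be written to."""

    @abstractmethod
    def write(self, records: Iterable[dict]) -> int: ...


class ListSource(Source):
    # In-memory source standing in for a message queue, database, or file system.
    def __init__(self, rows: Iterable[dict]):
        self.rows = list(rows)

    def read(self) -> Iterator[dict]:
        return iter(self.rows)


class ListSink(Sink):
    # In-memory sink standing in for Hadoop or an online data store.
    def __init__(self):
        self.rows = []

    def write(self, records: Iterable[dict]) -> int:
        before = len(self.rows)
        self.rows.extend(records)
        return len(self.rows) - before  # number of records written


def run_pipeline(source: Source, sink: Sink,
                 transform: Callable[[dict], dict] = lambda r: r) -> int:
    # The pipeline only knows the two interfaces, so ingestion and dispersal
    # differ only in which plug-ins are passed in.
    return sink.write(transform(record) for record in source.read())


raw_events = ListSource([{"id": 1, "fare": 12.5}, {"id": 2, "fare": 8.0}])
lake = ListSink()
written = run_pipeline(raw_events, lake, lambda r: {**r, "ingested": True})
print(written, lake.rows)
```

Because `run_pipeline` depends only on the two interfaces, supporting a new system in this sketch means writing one new plug-in class rather than a new pipeline.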

You’ll also dive into the technical aspects of how to rearchitect the ingestion platform to bring in 10+ trillion events per day at minute-level latency, how to scale the storage platform, and how to redesign the processing platform to efficiently serve millions of queries and jobs per day. You’ll leave with greater insight into how things work in an extensible modern big data platform and inspired to reenvision your own data platform to make it more generic and flexible for future new requirements.

Prerequisite knowledge

  • A high-level familiarity with the big data ecosystem
  • Familiarity with the challenges when data grows beyond a few petabytes

What you'll learn

  • Learn how to build a modern big data platform that scales beyond 100 petabytes of data while providing real-time access
  • Explore the internal design and architectural limitations of many popular existing open source big data solutions and how to overcome them to scale your data platform
  • Discover Uber's open-sourced technologies Hudi and Marmaray and how they help push the boundaries on speed and scale of traditional big data platforms

Reza Shiftehfar

Uber

Reza Shiftehfar leads the Hadoop platform teams at Uber, which help build and grow Uber’s reliable and scalable big data platform that serves petabytes of data utilizing technologies such as Apache Hadoop, Apache Hive, Apache Kafka, Apache Spark, and Presto. Reza is one of the founding engineers of Uber’s data team and helped scale Uber’s data platform from a few terabytes to over 100 petabytes while reducing the data latency from 24+ hours to minutes. Reza holds a PhD in computer science from the University of Illinois, Urbana-Champaign, where he focused on building mobile hybrid cloud applications.

