Creating an extensible 100+ PB real-time big data platform by unifying storage and serving
Who is this presentation for?All executives, managers, infra/data architects, data engineers, software developers working at companies with expanding datasets who needs to build an infrastructure that should last them in future with extremely large data size, real-time latency and ML applications
This talk will reflect on the design and architecture of an effective big data platform that can ingest, store, and serve 100+ PB of data with minute level latency while efficiently utilizing the hardware. We’ll dive into the technical aspects of how the ingestion platform can be re-architected to bring in 10+ trillion events/day at minute-level latency, how the storage platform can be scaled, and how the processing platform can be redesigned to efficiently serve millions of queries and jobs/day. The audience will leave the talk with greater insight into how things work in an extensible modern Big Data platform and will be inspired to re-envision their own data platform to make it more generic and flexible for future new requirements.
The motivation for this talk is Uber’s business needs for real-time Big data. Uber’s mission is to ignite opportunities by setting the world in motion. To fulfill this mission, Uber relies heavily on making data-driven decisions in every product area and we need to store and process an ever-increasing amount of data. To this end, we had redesigned traditional Big Data platform solutions to provide faster, more reliable, and more-performant access by adding a few critical technologies that overcome their limitations. In this talk, we will provide a behind-the-scenes look at the current Big data technology landscape, including various existing open-source technologies (e.g. Hadoop, Spark, Hive, Presto, Kafka, Avro) as well as what we had to build at Uber and open-source to fill the gaps and push the boundaries such as Hudi and Marmaray.
Hudi is an open-source analytical storage system created at Uber to manage petabytes of data on HDFS-like distributed storage. Hudi provides near real-time ingestion and provides different views of the data – read optimized view for batch analytics, real-time view for driving dashboards, incremental view for powering data pipelines. Hudi also effectively manages files on underlying storage to maximize operational health & reliability. In this talk, We’ll dive into the technical aspect of how Hudi lowers data latency across the board while simultaneously achieving orders of magnitude of efficiency over traditional batch ingestion. We will make the case for near real-time dashboards built on top of Hudi datasets, that can be cheaper than pure streaming architectures.
Marmaray is an open-source plug-in based pipeline platform connecting any arbitrary data source to any data sink. It allows unified and efficient ingestion of raw data from a variety of sources to Hadoop as well as the dispersal of the derived analysis result out of Hadoop to any online data store. This talk will also look into how we built and designed a common set of abstractions to handle both the ingestion and dispersal use cases, the challenges and lessons learned both from developing the core library as well as setting up an on-demand self-service workflow, and how we scaled the platform to move around billions of record per day.
Prerequisite knowledgeHigh-level familiarity with Big Data ecosystem and challenges when data grows beyond a few PetaBytes
What you'll learn
Reza Shiftehfar currently leads Uber’s Hadoop Platform teams. His teams help build and grow Uber’s reliable and scalable Big Data platform that serves petabytes of data utilizing technologies such as Apache Hadoop, Apache Hive, Apache Kafka, Apache Spark, and Presto. Reza is one of the founding engineers of Uber’s Data team and helped scale Uber’s data platform from a few terabytes to over 100 petabytes while reducing the data latency from 24+ hours to minutes. Reza holds a Ph.D. in Computer Science from the University of Illinois, Urbana-Champaign with focus on building Mobile Hybrid Cloud applications.
Leave a Comment or Question
Help us make this conference the best it can be for you. Have questions you'd like this speaker to address? Suggestions for issues that deserve extra attention? Feedback that you'd like to share with the speaker and other attendees?
Join the conversation here (requires login)
For conference registration information and customer service
For more information on community discounts and trade opportunities with O’Reilly conferences
For information on exhibiting or sponsoring a conference
View a complete list of Strata Data Conference contacts