Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Marmaray: A generic, scalable, and pluggable Hadoop data ingestion and dispersal framework

Danny Chen (Uber Technologies), Omkar Joshi (Uber), Eric Sayle (Uber Technologies)
2:05pm–2:45pm Wednesday, 09/12/2018
Data engineering and architecture
Location: 1A 23/24
Level: Intermediate
Secondary topics: Data Integration and Data Pipelines
Average rating: 3.80 (5 ratings)

Who is this presentation for?

  • Data engineers, data scientists, and software engineers

Prerequisite knowledge

  • A basic understanding of Apache Spark
  • Knowledge of Hive and storage systems like Cassandra and MySQL (useful but not required)

What you'll learn

  • Explore Marmaray, a generic Hadoop ingestion and dispersal framework recently released to production at Uber
  • Understand why a generic any-source-to-any-sink data pipeline matters: it keeps data where it makes the most business sense at scale, makes it possible to disperse raw data from a low-latency online store, and lets a single library ingest data from multiple sources


Danny Chen, Omkar Joshi, and Eric Sayle offer an overview of Marmaray, a plug-in-based platform and library designed and built from the ground up by Uber. Using Apache Spark, it will eventually support ingesting data from any source and dispersing it to any sink. “Marmaray” refers to a tunnel in Turkey that connects Europe and Asia by rail. In the same way, Marmaray was envisioned within Uber as a pipeline connecting raw data from a variety of sources to Hadoop/Hive and connecting both raw and derived datasets from Hive to a variety of sinks, depending on SLA, latency, and other customer requirements. The team also added a framework around the core library to support fully self-serve onboarding, lowering the barrier to entry onto the platform, as well as automated integration with Uber’s workflow management system, which orchestrates and executes ingestion and dispersal jobs on a specified cadence.

Many data users (e.g., Uber Eats and Uber’s machine learning platform, Michelangelo) use Hadoop in concert with other tools to build and train machine learning models. These models ultimately produce derived datasets of immense additional value that drive Uber’s business toward greater efficiency and profitability. To maximize the usefulness of these derived datasets, the need arose to disperse this data to online datastores, often with much lower latency semantics than what existed in the Hadoop ecosystem, in order to serve live traffic. Marmaray was envisioned and designed to fulfill this need and to complete the Hadoop ecosystem by providing the means to transfer Hadoop data out to any online datastore.

Along the same lines, Uber’s business required ingesting raw data from a variety of data sources into its Hadoop data lake, which meant running and maintaining multiple data pipelines in production. This proved cumbersome over time, as the size of the data increased proportionally with Uber’s business growth. The Hadoop platform team at Uber envisioned and designed Marmaray to define a common set of abstractions and provide a framework that unifies the ingestion pipelines into one that will prove much more maintainable and resource efficient as Uber’s business continues to mature.
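To make the idea of a common set of abstractions concrete, here is a minimal sketch in plain Python of how an any-source-to-any-sink job might be shaped. It is an illustration only and does not reflect Marmaray's actual API: the names Record, Source, Sink, InMemorySource, InMemorySink, and run_job are all hypothetical, and the in-memory classes stand in for real systems such as Cassandra or Hive.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Protocol

# Hypothetical common record format shared by every source and sink.
Record = Dict[str, object]


class Source(Protocol):
    """Anything that can produce records (hypothetical interface)."""

    def read(self) -> Iterable[Record]: ...


class Sink(Protocol):
    """Anything that can accept records (hypothetical interface)."""

    def write(self, records: Iterable[Record]) -> int: ...


@dataclass
class InMemorySource:
    """Stand-in for a real source; serves rows held in memory."""

    rows: List[Record]

    def read(self) -> Iterable[Record]:
        return iter(self.rows)


@dataclass
class InMemorySink:
    """Stand-in for a real sink; collects rows in a list."""

    stored: List[Record] = field(default_factory=list)

    def write(self, records: Iterable[Record]) -> int:
        before = len(self.stored)
        self.stored.extend(records)
        return len(self.stored) - before


def run_job(source: Source, sink: Sink) -> int:
    """Move every record from source to sink and return the count written."""
    return sink.write(source.read())
```

Because run_job depends only on the two interfaces, the same function would serve both ingestion (online store to data lake) and dispersal (data lake to online store); supporting a new system would mean adding one class rather than a new pipeline.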

You’ll learn how the Marmaray team designed and built a common set of abstractions to handle both the ingestion and dispersal use cases, the challenges faced and lessons learned from both developing the core library and setting up an on-demand self-service workflow, and how the team leveraged Apache Spark to ensure the platform can scale to handle Uber’s growing data needs. Danny, Omkar, and Eric also explain how the common ingestion framework helped Uber meet GDPR requirements.

Uber plans to open-source the framework in 2018.

Danny Chen

Uber Technologies

Danny Chen is a software engineer on the Hadoop platform team at Uber, where he works on large-scale data ingestion and dispersal pipelines and libraries leveraging Apache Spark. Previously, he was the tech lead at Uber Maps, building data pipelines that produce metrics to help analyze the quality of mapping data. Before joining Uber, Danny was at Twitter, where he was an original member of the core team building Manhattan, a key-value store powering Twitter’s use cases. Danny holds a BS in computer science from UCLA and an MS in computer science from USC.

Omkar Joshi

Uber

Omkar Joshi is a senior software engineer on Uber’s Hadoop platform team, where he’s architecting Marmaray. Omkar has a keen interest in solving large-scale distributed systems problems. Previously, he led object store and NFS solutions at Hedvig and was an initial contributor to Hadoop’s YARN scheduler.

Eric Sayle

Uber Technologies

Eric Sayle is a senior software engineer at Uber, where he works with the large volume of geospatial data that helps people move in countries around the world. Eric has worked in the data space for the past 10 years, starting with call center performance analytics at Merced Systems.
