Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Marmaray: A generic, scalable, and pluggable Hadoop data ingestion and dispersal framework

Danny Chen (Uber Technologies), Omkar Joshi (Uber), Eric Sayle (Uber Technologies)
2:05pm–2:45pm Wednesday, 09/12/2018
Data engineering and architecture
Location: 1A 23/24 Level: Intermediate
Secondary topics: Data Integration and Data Pipelines
Average rating: 3.80 (5 ratings)

Who is this presentation for?

  • Data engineers, data scientists, and software engineers

Prerequisite knowledge

  • A basic understanding of Apache Spark
  • Knowledge of Hive and storage systems like Cassandra and MySQL (useful but not required)

What you'll learn

  • Explore Marmaray, a generic Hadoop ingestion and dispersal framework recently released to production at Uber
  • Understand the importance of a generic any-source-to-any-sink data pipeline for ensuring that data resides where it makes the most business sense at scale, of being able to disperse raw data to a low-latency online store, and of building a single library to ingest data from multiple sources

Description

Danny Chen, Omkar Joshi, and Eric Sayle offer an overview of Marmaray—a plug-in-based platform and library built and designed from the ground up at Uber, which will eventually support ingesting data from any source and dispersing it to any sink, leveraging Apache Spark. “Marmaray” refers to a tunnel in Turkey that connects Europe and Asia by rail. In the same way, Marmaray was envisioned within Uber as a pipeline connecting raw data from a variety of sources to Hadoop/Hive and connecting both raw and derived datasets from Hive to a variety of sinks depending on SLA, latency, and other customer requirements. The team also added a framework around the core library to support fully self-serve onboarding, lowering the barrier to entry onto the platform, along with automated integration with Uber’s workflow management system, which orchestrates and executes ingestion and dispersal jobs on a regular cadence.

Many data users (e.g., Uber Eats and Uber’s machine learning platform, Michelangelo) use Hadoop in concert with other tools to build and train their machine learning models, ultimately producing derived datasets of immense additional value that drive Uber’s business toward greater efficiency and profitability. To maximize the usefulness of these derived datasets, the need arose to disperse this data to online datastores, often with much lower latency semantics than what existed in the Hadoop ecosystem, in order to serve live traffic. Marmaray was envisioned and designed to fulfill this need, completing the Hadoop ecosystem by providing the means to transfer Hadoop data out to any online data store.

Along the same lines, Uber’s business needs necessitated the ingestion of raw data from a variety of data sources into its Hadoop data lake, which required running and maintaining multiple data pipelines in production. This proved to be cumbersome over time, as the size of the data increased proportionally with Uber’s business growth. The Hadoop platform team at Uber envisioned and designed Marmaray to define a common set of abstractions and provide a framework that unifies the ingestion pipelines into one that will prove much more maintainable and resource efficient as Uber’s business continues to mature.
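A common set of abstractions like the one described can be pictured as a minimal source/converter/sink contract. The sketch below is purely illustrative—every name and shape here is an assumption for exposition, not Marmaray's actual API (which is built in Java on Apache Spark):

```python
# Illustrative any-source-to-any-sink pipeline abstraction.
# All names are hypothetical; Marmaray's real interfaces differ.
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class IngestionJob:
    source: Callable[[], Iterable[dict]]    # e.g., read records from Kafka or Hive
    converter: Callable[[dict], dict]       # schema/format conversion between systems
    sink: Callable[[List[dict]], None]      # e.g., write records to Hive or Cassandra

    def run(self) -> None:
        # Pull from the source, convert each record, and push to the sink.
        self.sink([self.converter(r) for r in self.source()])

# Toy usage: in-memory stand-ins replace real connectors.
out: List[dict] = []
IngestionJob(
    source=lambda: [{"id": 1, "city": "NYC"}],
    converter=lambda r: {**r, "ingested": True},
    sink=out.extend,
).run()
```

The point of such a contract is that adding a new source or sink means implementing one interface, rather than building and operating a whole new pipeline.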

You’ll learn how the Marmaray team built and designed a common set of abstractions to handle both the ingestion and dispersal use cases, the challenges and lessons learned from developing the core library and setting up an on-demand self-service workflow, and how the team leveraged Apache Spark to ensure the platform can scale to handle Uber’s growing data needs. Danny, Omkar, and Eric also explain how Marmaray’s common ingestion framework helped Uber meet GDPR requirements.

Uber plans to open-source the framework in 2018.


Danny Chen

Uber Technologies

Danny Chen is a software engineer on the Hadoop platform team at Uber, where he works on large-scale data ingestion and dispersal pipelines and libraries leveraging Apache Spark. Previously, he was the tech lead at Uber Maps building data pipelines to produce metrics to help analyze the quality of mapping data. Before joining Uber, Danny was at Twitter and an original member of the core team building Manhattan, a key-value store powering Twitter’s use cases. Danny holds a BS in computer science from UCLA and an MS in computer science from USC.


Omkar Joshi

Uber

Omkar Joshi is a senior software engineer on Uber’s Hadoop platform team, where he’s architecting Marmaray. Omkar has a keen interest in solving large-scale distributed systems problems. Previously, he led object store and NFS solutions at Hedvig and was an initial contributor to Hadoop’s YARN scheduler.


Eric Sayle

Uber Technologies

Eric Sayle is a senior software engineer at Uber, where he works with the large volume of geospatial data helping people move in countries around the world. Eric has worked in the data space for the past 10 years, starting with call center performance analytics at Merced Systems.
