Danny Chen, Omkar Joshi, and Eric Sayle offer an overview of Marmaray, a plug-in-based platform and library designed and built from the ground up by Uber, which will eventually support ingesting data from any source and dispersing it to any sink using Apache Spark. “Marmaray” refers to a tunnel in Turkey that connects Europe and Asia by rail. In the same way, Marmaray was envisioned within Uber as a pipeline connecting raw data from a variety of sources to Hadoop/Hive and connecting both raw and derived datasets from Hive to a variety of sinks, depending on SLA, latency, and other customer requirements. The team also added a framework around the core library to support fully self-serve onboarding, lowering the barrier to entry onto the platform, as well as automated integration with Uber’s workflow management system, which orchestrates and executes ingestion and dispersal jobs on a specified cadence.
Many data users (e.g., Uber Eats and Uber’s machine learning platform, Michelangelo) use Hadoop in concert with other tools to build and train machine learning models, which in turn produce derived datasets that help drive Uber’s business toward greater efficiency and profitability. To maximize the usefulness of these derived datasets, the data needed to be dispersed to online datastores that serve live traffic, often at much lower latency than the Hadoop ecosystem could offer. Marmaray was envisioned and designed to fulfill this need and to complete the Hadoop ecosystem by providing the means to transfer Hadoop data out to any online datastore.
Along the same lines, Uber’s business needs necessitated the ingestion of raw data from a variety of data sources into its Hadoop data lake, which required running and maintaining multiple data pipelines in production. This proved cumbersome over time, as the size of the data grew in proportion to Uber’s business. The Hadoop platform team at Uber envisioned and designed Marmaray to define a common set of abstractions and provide a framework that unifies the ingestion pipelines into one, which should prove much more maintainable and resource efficient as Uber’s business continues to mature.
You’ll learn how the Marmaray team designed and built a common set of abstractions to handle both the ingestion and dispersal use cases, the challenges and lessons learned from both developing the core library and setting up an on-demand self-service workflow, and how the team leveraged Apache Spark to ensure the platform can scale to handle Uber’s growing data needs. Danny, Omkar, and Eric also explain how Marmaray’s common ingestion framework helped Uber meet GDPR requirements.
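To make the idea of a common set of abstractions for ingestion and dispersal concrete, here is a minimal sketch in plain Python. It is not Marmaray's actual API, and it does not use Apache Spark: the names Source, Sink, InMemorySource, InMemorySink, and run_job are all hypothetical, chosen only to illustrate how one job runner could move records from any pluggable source to any pluggable sink.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical common record format shared by every plug-in.
Record = Dict[str, object]


class Source(ABC):
    """Hypothetical plug-in interface: reads records from some system."""

    @abstractmethod
    def read(self) -> Iterator[Record]:
        ...


class Sink(ABC):
    """Hypothetical plug-in interface: writes records to some system."""

    @abstractmethod
    def write(self, records: Iterable[Record]) -> int:
        ...


class InMemorySource(Source):
    """Toy source standing in for a real upstream system."""

    def __init__(self, rows: List[Record]) -> None:
        self._rows = rows

    def read(self) -> Iterator[Record]:
        return iter(self._rows)


class InMemorySink(Sink):
    """Toy sink standing in for a data lake table or an online datastore."""

    def __init__(self) -> None:
        self.rows: List[Record] = []

    def write(self, records: Iterable[Record]) -> int:
        before = len(self.rows)
        self.rows.extend(records)
        # Report how many records this call added.
        return len(self.rows) - before


def run_job(
    source: Source,
    sink: Sink,
    transform: Callable[[Record], Record] = lambda record: record,
) -> int:
    """Run one job: read from the source, transform each record, write to the sink.

    The same function covers both directions. Ingestion and dispersal
    differ only in which Source and Sink plug-ins are passed in.
    """
    return sink.write(transform(record) for record in source.read())
```

In this sketch, adding support for a new system means writing one new Source or Sink class, while run_job and every existing plug-in stay unchanged. That is the maintainability argument the abstract makes for unifying pipelines behind shared abstractions.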
Uber plans to open-source the framework in 2018.
Danny Chen is a software engineer on the Hadoop platform team at Uber, where he works on large-scale data ingestion and dispersal pipelines and libraries leveraging Apache Spark. Previously, he was the tech lead at Uber Maps, where he built data pipelines that produce metrics for analyzing the quality of mapping data. Before joining Uber, Danny was at Twitter, where he was an original member of the core team building Manhattan, a key-value store powering Twitter’s use cases. Danny holds a BS in computer science from UCLA and an MS in computer science from USC.
Omkar Joshi is a senior software engineer on Uber’s Hadoop platform team, where he’s architecting Marmaray. Omkar has a keen interest in solving large-scale distributed systems problems. Previously, he led object store and NFS solutions at Hedvig and was an initial contributor to Hadoop’s YARN scheduler.
Eric Sayle is a senior software engineer at Uber, where he works with the large volume of geospatial data helping people move in countries around the world. Eric has worked in the data space for the past 10 years, starting with call center performance analytics at Merced Systems.
View a complete list of Strata Data Conference contacts
©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com