Sep 23–26, 2019

How to performance tune Spark applications in large clusters?

Omkar Joshi (Uber Technologies), Bo Yang (uber inc)
1:15pm1:55pm Thursday, September 26, 2019
Location: 1A 23/24
Secondary topics:  Deep dive into specific tools, platforms, or frameworks, Transportation and Logistics

Who is this presentation for?

Software developer and managers

Level

Intermediate

Prerequisite knowledge

Basic understanding of running spark application.

What you'll learn

How to understand bottlenecks in your spark applications Cache or not to cache your spark dag to avoid re-reading your input data. How to effectively use Accumulators to avoid unnecessary spark actions How to inspect your heap & non-heap memory usage across 100s of executors. How you can change layout of your data to save long term storage cost How to effectively use serializers & compression to save network & disk traffic. How to reduce amortize cost of your application by multiplexing your jobs.

Description

Omkar Joshi and Bo Yang offer an overview on how performance challenges were addressed at Uber while rolling out its newly built flagship ingestion system, Marmaray (open sourced) for data ingestion from various sources like kafka, mysql, cassandra to hadoop. This system is rolled out in production and is running for over a year now with more ingestion systems being onboarded on top of it. For our analysis we heavily used jvm-profiler to give us valuable insights.
This new system is built using spark framework for doing data ingestion. It is designed to ingest billions of kafka messages per topic from thousands of topics every 30 minutes. The amount of data handled by the pipeline is of the order 100s of TBs. Therefore at this scale savings of every byte and milliseconds count. We will talk about how methodically we can tackle such problems and will also share insights into the optimizations already done in production for this. Some key highlights in it are
How to understand your bottlenecks in spark applications
Cache or not to cache your spark dag to avoid re-reading your input data.
How to effectively use Accumulators to avoid unnecessary spark actions
How to inspect your heap & non-heap memory usage across 100s of executors.
How you can change layout of your data to save long term storage cost
How to effectively use serializers & compression to save network & disk traffic.
How to reduce amortize cost of your application by multiplexing your jobs.
We used different techniques for reducing memory footprint, runtime and on disk usage for our running applications. In terms of savings we were able to significantly (~10-40%) reduce memory footprint, runtime and disk usage.

Photo of Omkar Joshi

Omkar Joshi

Uber Technologies

Omkar Joshi is a software engineer on Uber’s Hadoop platform team, where he is currently architecting Marmaray. Omkar has a keen interest in solving large-scale distributed systems problems. Previously, he led object store and NFS solutions at Hedvig Inc. and was an initial contributor to Hadoop’s YARN scheduler.

Photo of Bo Yang

Bo Yang

uber inc

Software engineer at Uber

Leave a Comment or Question

Help us make this conference the best it can be for you. Have questions you'd like this speaker to address? Suggestions for issues that deserve extra attention? Feedback that you'd like to share with the speaker and other attendees?

Join the conversation here (requires login)

Contact us

confreg@oreilly.com

For conference registration information and customer service

partners@oreilly.com

For more information on community discounts and trade opportunities with O’Reilly conferences

strataconf@oreilly.com

For information on exhibiting or sponsoring a conference

Contact list

View a complete list of Strata Data Conference contacts