Sep 23–26, 2019
Please log in

How to performance-tune Spark applications in large clusters

Omkar Joshi (Uber), Bo Yang (Uber)
1:15pm1:55pm Thursday, September 26, 2019
Location: 1A 23/24
Average rating: *****
(5.00, 2 ratings)

Who is this presentation for?

  • Software developers and managers

Level

Intermediate

Description

Omkar Joshi and Bo Yang offer an overview on how performance challenges were addressed at Uber while rolling out its newly built flagship ingestion system, Marmaray (open-sourced) for data ingestion from various sources like Kafka, MySQL, Cassandra, and Hadoop. This system is rolled out in production and has been running for over a year now, with more ingestion systems onboarded on top of it. Omar and Bo heavily used jvm-profiler during their analysis to give them valuable insights.

This new system is built using the Spark framework for data ingestion. It’s designed to ingest billions of Kafka messages per topic from thousands of topics every 30 minutes. The amount of data handled by the pipeline is of the order hundreds of TBs. At this scale, every byte and millisecond saved counts. Omar and Bo detail how to tackle such problems and insights into the optimizations already done in production. Some key highlights are how to understand your bottlenecks in Spark applications, to cache or not to cache your Spark DAG to avoid rereading your input data, how to effectively use accumulators to avoid unnecessary Spark actions, how to inspect your heap and nonheap memory usage across hundreds of executors, how you can change the layout of your data to save long-term storage cost, how to effectively use serializers and compression to save network and disk traffic, and how to reduce amortize the cost of your application by multiplexing your jobs.

They used different techniques for reducing memory footprint, runtime, and on-disk usage for the running applications. In terms of savings, CGI was able to significantly (~10%–40%) reduce memory footprint, runtime, and disk usage.

Prerequisite knowledge

  • A basic understanding of running Spark applications

What you'll learn

  • Learn how to understand bottlenecks in your Spark applications, to cache or not to cache your Spark DAG to avoid rereading your input data, how to effectively use accumulators to avoid unnecessary Spark actions, how to inspect your heap and nonheap memory usage across hundreds of executors, how you can change the layout of your data to save long-term storage cost, how to effectively use serializers and compression to save network and disk traffic, and how to reduce amortize the cost of your application by multiplexing your jobs
Photo of Omkar Joshi

Omkar Joshi

Uber

Omkar Joshi is a senior software engineer on Uber’s Hadoop platform team, where he’s architecting Marmaray. Omkar has a keen interest in solving large-scale distributed systems problems. Previously, he led object store and NFS solutions at Hedvig and was an initial contributor to Hadoop’s YARN scheduler.

Photo of Bo Yang

Bo Yang

Uber

Bo Yang is a software engineer at Uber.

  • Cloudera
  • O'Reilly
  • Google Cloud
  • IBM
  • Cisco
  • Dataiku
  • Intel
  • Io-Tahoe
  • MemSQL
  • Microsoft Azure
  • Oracle Cloud Infrastructure
  • SAS
  • Arcadia Data
  • BMC Software
  • Hazelcast
  • SAP
  • Amazon Web Services
  • Anaconda
  • Esri
  • Infoworks.io, Inc.
  • Kyligence
  • Pitney Bowes
  • Talend
  • Google Cloud
  • Confluent
  • DataStax
  • Dremio
  • Immuta
  • Impetus Technologies Inc.
  • Keyence
  • Kyvos Insights
  • StreamSets
  • Striim
  • Syncsort
  • SK holdings C&C

    Contact us

    confreg@oreilly.com

    For conference registration information and customer service

    partners@oreilly.com

    For more information on community discounts and trade opportunities with O’Reilly conferences

    strataconf@oreilly.com

    For information on exhibiting or sponsoring a conference

    pr@oreilly.com

    For media/analyst press inquires