Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Metrics-driven tuning of Apache Spark at scale

Edwina Lu (LinkedIn), Ye Zhou (LinkedIn), Min Shen (LinkedIn)
11:50am–12:30pm Wednesday, March 7, 2018
Average rating: 4.00 (4 ratings)

Who is this presentation for?

  • Spark developers, data scientists, software engineers, and cluster administrators

Prerequisite knowledge

  • A basic familiarity with Spark

What you'll learn

  • Explore a fast, reliable, and automated process used at LinkedIn for tuning Spark applications

Description

Tuning Spark can be complex and difficult: there are many configuration parameters to adjust and many metrics to interpret. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems.
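
For example, a small set of configuration parameters already accounts for much of an application’s resource footprint. As a point of reference, here is a minimal sketch of setting them when building a session (the property names are standard Spark settings; the values are purely illustrative, not recommendations from the talk):

    import org.apache.spark.sql.SparkSession

    // Illustrative values only; the right settings depend on the workload.
    val spark = SparkSession.builder()
      .appName("tuning-example")
      .config("spark.executor.memory", "4g")         // executor heap size
      .config("spark.executor.cores", "2")           // concurrent tasks per executor
      .config("spark.memory.fraction", "0.6")        // heap share for execution/storage
      .config("spark.sql.shuffle.partitions", "400") // shuffle parallelism
      .getOrCreate()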

As the Spark applications running on LinkedIn’s clusters become more diverse and numerous, it’s no longer feasible for a small Spark team to help individual users debug and tune their applications. Users need to be able to get advice quickly and iterate on their development, and any problems need to be caught promptly to keep the cluster healthy. LinkedIn leverages the Spark History Server (SHS) to gather application metrics, but as the number of Spark applications and the size of individual applications have grown, the SHS has not been able to keep up; it can fall hours behind during peak usage. Edwina, Ye, and Min discuss changes to the SHS that improve its efficiency, performance, and stability, enabling it to analyze a large volume of logs.
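
The SHS works from event logs that each application writes to a shared directory and that the server then replays to reconstruct metrics. For context, a typical setup looks like the following (the property names are standard Spark settings; the paths and polling interval are placeholders):

    # spark-defaults.conf (application side): write out event logs
    spark.eventLog.enabled            true
    spark.eventLog.dir                hdfs:///shared/spark-events
    # History server side: where to read logs and how often to poll for new ones
    spark.history.fs.logDirectory     hdfs:///shared/spark-events
    spark.history.fs.update.interval  10s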

Another challenge is the lack of proper metrics on Spark application performance. Edwina, Ye, and Min share new metrics added to Spark that precisely report resource usage at runtime and explain how these metrics feed heuristics that identify problems. Based on this analysis, custom recommendations are provided to help users tune their applications. They conclude by detailing the impact of these tuning recommendations, including improvements in both individual application performance and overall cluster utilization.
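
As an illustration of such a heuristic, one can compare an executor’s peak measured memory against its allocation and flag over-provisioned applications. A minimal sketch of the idea, assuming a per-application summary has already been extracted from the logs (the class, field names, threshold, and message text are hypothetical, not LinkedIn’s actual implementation):

    // Hypothetical summary derived from per-executor memory metrics.
    case class ExecutorUsage(peakJvmUsedMemoryBytes: Long, allocatedMemoryBytes: Long)

    // If peak usage sits well below the allocation, suggest a smaller executor heap.
    def memoryRecommendation(u: ExecutorUsage): Option[String] = {
      val utilization = u.peakJvmUsedMemoryBytes.toDouble / u.allocatedMemoryBytes
      if (utilization < 0.5) // illustrative threshold
        Some(f"Executors peaked at ${utilization * 100}%.0f%% of allocated memory; " +
          "consider lowering spark.executor.memory.")
      else None
    }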

Edwina Lu

LinkedIn

Edwina Lu is a software engineer on LinkedIn’s Hadoop infrastructure development team, currently focused on supporting Spark on the company’s clusters. Previously, she worked at Oracle on database replication.

Ye Zhou

LinkedIn

Ye Zhou is a software engineer on LinkedIn’s Hadoop infrastructure development team, focusing mostly on Hadoop YARN and Spark-related projects. Ye holds a master’s degree in computer science from Carnegie Mellon University.

Min Shen

LinkedIn

Min Shen is a tech lead at LinkedIn, where his team builds and scales LinkedIn’s general-purpose batch compute engine based on Apache Spark, powering use cases ranging from data exploration and data engineering to ML model training. Previously, Min worked mainly on Apache YARN. He holds a PhD in computer science from the University of Illinois at Chicago.