Mar 15–18, 2020

Get a CLUE: Optimizing big data compute efficiency

Zhe Zhang (LinkedIn), Huangming Xie (LinkedIn)
4:15pm–4:55pm Wednesday, March 18, 2020
Location: LL20A

Who is this presentation for?

Data engineers, data architects, developers

Level

Intermediate

Description

Compute efficiency optimization is of critical importance in the big data era, as data science and ML algorithms become increasingly complex and data size increases exponentially over time. Opportunities exist throughout the resource use funnel, which Zhe Zhang and Huangming Xie characterize using the framework CLUE: capacity of resources (all resources available) → loaded resources (resources that applications requested from Hadoop) → used resources → effective resources (resources spent on effective or useful work).
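To make the funnel concrete, here is a minimal sketch of how the four CLUE stages could be turned into stage-by-stage ratios. The class name, field names, and sample numbers are illustrative assumptions, not LinkedIn's tooling or data.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ClueSnapshot:
    """Hypothetical snapshot of the CLUE funnel, all in the same unit."""
    capacity: float   # C: all resources available
    loaded: float     # L: resources applications requested
    used: float       # U: resources actually used
    effective: float  # E: resources spent on useful work

    def __post_init__(self) -> None:
        # Each stage must be positive so the ratios below are defined.
        for name in ("capacity", "loaded", "used", "effective"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")

    def stage_ratios(self) -> Dict[str, float]:
        """Fraction of resources that survives each funnel step."""
        return {
            "C->L": self.loaded / self.capacity,
            "L->U": self.used / self.loaded,
            "U->E": self.effective / self.used,
        }

    def end_to_end(self) -> float:
        """Share of total capacity that ends up as effective work."""
        return self.effective / self.capacity


# Made-up numbers, for illustration only.
snapshot = ClueSnapshot(capacity=1000.0, loaded=800.0, used=400.0, effective=300.0)
ratios = snapshot.stage_ratios()
weakest_step = min(ratios, key=ratios.get)
```

In this invented example the L → U step has the lowest ratio, so it would be the first place to look for savings. The end-to-end figure is the product of the three stage ratios.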

Zhe and Huangming highlight initiatives from the past year and share the lessons they learned, including:

  • C → L optimization with smart scheduling: they applied machine learning to choose the best start time for scheduled flows while maintaining business SLAs, decreasing peak-hour capacity use by more than 20% and reducing latency for ad hoc jobs.
  • L → U optimization with YARN overcommit: they analyzed CPU and memory usage for more than 7,000 nodes with different types of SKUs to evaluate the opportunity for YARN to reclaim requested but unused memory resources from applications (aka overcommit).
  • U → E optimization with Spark SQL optimizations: they developed efficient algorithms to join large datasets in Spark SQL, which is a common pattern in processing LinkedIn member graph data and generating features for ML algorithms. You’ll get to view their investigation and experience with adaptive execution, cost-based optimization, and other SQL execution optimizations.
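The idea behind the L → U step, reclaiming memory that was requested but never used, can be sketched as a simple headroom estimate. This is not YARN's overcommit mechanism or the speakers' analysis; the function name, the safety margin, and the node figures are all hypothetical.

```python
from typing import List, Tuple


def reclaimable_memory_gb(requested_gb: float,
                          peak_used_gb: float,
                          safety_margin: float = 0.25) -> float:
    """Requested-but-unused memory, minus a buffer kept back for spikes.

    safety_margin is an assumed fraction of the requested memory that is
    never offered for overcommit, whatever the observed usage.
    """
    buffer_gb = safety_margin * requested_gb
    return max(0.0, requested_gb - peak_used_gb - buffer_gb)


# (node name, memory requested by applications, peak memory used), in GB.
# Made-up values, for illustration only.
nodes: List[Tuple[str, float, float]] = [
    ("node-a", 128.0, 64.0),   # large gap between requested and used
    ("node-b", 128.0, 120.0),  # almost fully used, nothing to reclaim
]

per_node = {name: reclaimable_memory_gb(req, used) for name, req, used in nodes}
total_reclaimable_gb = sum(per_node.values())
```

A real evaluation would work from usage distributions over time per SKU rather than a single peak value; the sketch only shows the arithmetic of the gap between loaded and used resources.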

These initiatives allow LinkedIn to improve compute efficiency, save hundreds of millions of dollars, and boost developers’ productivity. Other companies can apply LinkedIn’s framework, strategy, and lessons learned to improve their own resource intelligence strategy.

Prerequisite knowledge

  • General knowledge of big data, Spark, and Hadoop

What you'll learn

  • Learn how LinkedIn improved compute efficiency, saved hundreds of millions of dollars, and boosted developers’ productivity
  • Discover how to leverage LinkedIn's framework, strategy, and the lessons they learned to improve your resource intelligence strategy

Zhe Zhang

LinkedIn

Zhe Zhang is a senior manager of core big data infrastructure at LinkedIn, where he leads an excellent engineering team to provide big data services (Hadoop distributed file system (HDFS), YARN, Spark, TensorFlow, and beyond) to power LinkedIn’s business intelligence and relevance applications. Zhe’s an Apache Hadoop PMC member; he led the design and development of HDFS Erasure Coding (HDFS-EC).


Huangming Xie

LinkedIn

Huangming Xie is a senior manager of data science at LinkedIn, where he leads the infrastructure data science team to drive resource intelligence, optimize compute and storage efficiency, and automate capacity forecasting for better scalability, as well as improve site availability for a pleasant member and customer experience. Huangming is an expert at converting data into actionable recommendations that impact strategy and generate direct business impact. Previously, he led initiatives to enable data-driven product decisions at scale and build a great product for more than 600 million LinkedIn members worldwide.
