Presented By
O’Reilly + Cloudera

Make Data Work

March 25-28, 2019
San Francisco, CA

Scaling Apache Spark on Kubernetes at Lyft

Li Gao (Lyft), Bill Graham (Lyft)

3:50pm–4:30pm Thursday, March 28, 2019

Data Engineering & Architecture
Location: 2001

Secondary topics: Data Integration and Data Pipelines, Data Platforms

Average rating: 4.00 (2 ratings)

Who is this presentation for?

Data engineers and software developers

Level

Intermediate

Prerequisite knowledge

A basic understanding of Kubernetes, Spark, and distributed computing

What you'll learn

Learn how Lyft built a production-grade multicluster Kubernetes environment to run and scale Apache Spark
Explore solutions developed at Lyft to handle the challenges of scaling Spark on a native Kubernetes environment, including multicluster dispatching, monitoring, resource isolation, and high-availability (HA) support

Description

Lyft is on a mission to improve people’s lives with the world’s best transportation. As part of this mission, Lyft invests heavily in open source infrastructure and tooling. Kubernetes has emerged as the next generation of cloud-native infrastructure to support a wide variety of distributed workloads, while Apache Spark supports both machine learning and large-scale ETL workloads. By combining the flexibility of Kubernetes with the data-processing power of Apache Spark, Lyft is able to take its ETL data processing to a new level.

Li Gao and Bill Graham discuss the challenges the Lyft team faced and the solutions they developed to support Apache Spark on Kubernetes in production and at scale.

Topics include:

Key traits of Apache Spark on Kubernetes
A deep dive into Lyft’s multicluster setup and how it is operated to handle petabytes of production data
How Lyft extends and enhances Apache Spark to support capabilities such as Spark pod lifecycle metrics and state management, resource prioritization, and queuing and throttling
Dynamic job scale estimation and runtime dynamic job configuration
How Lyft powers internal data scientists, business analysts, and data engineers via a multicluster setup

Li Gao

Lyft

Li Gao is the tech lead for the Cloud Native Spark Compute Initiative at Lyft. Previously, Li held technical leadership positions focusing on cloud native and hybrid cloud data platforms at scale at Salesforce, Fitbit, Marin Software, and a few startups. Besides Spark, Li has scaled and productionized open source projects including Presto, Apache HBase, Apache Phoenix, Apache Kafka, Apache Airflow, and Apache Hive.

Bill Graham

Lyft

Bill Graham is an architect on the data platform team at Lyft. Bill’s primary focus is data processing applications and analytics infrastructure. Previously, he was a staff engineer on the data platform team at Twitter, where he built streaming compute, interactive query, batch query, ETL, and data management systems; a principal engineer at CBS Interactive and CNET Networks, where he developed ad targeting and content publishing infrastructure; and a senior engineer at Logitech focusing on webcam streaming and messaging applications. He’s contributed to a number of open source projects, including Apache HBase, Apache Hive, and Presto, and is an Apache Pig and Apache Heron (incubating) PMC member.
