San FranciscoLondon New York

Presented By
O’Reilly + Cloudera

Make Data Work

March 25-28, 2019
San Francisco, CA

Please log in

Add to Your Schedule

Adaptive ETL to optimize query performance at Lyft

James Taylor (Lyft)

2:40pm–3:20pm Wednesday, March 27, 2019

Data Engineering & Architecture
Location: 2001

Secondary topics: Data Integration and Data Pipelines, Transportation and Logistics

Average rating:

(3.56, 9 ratings)

Who is this presentation for?

Data infrastructure engineers and data architects

Level

Advanced

Prerequisite knowledge

Familiarity with big data ETL, querying, and optimization

What you'll learn

Learn how Lyft optimizes big data queries

Description

In today’s big data world, it’s difficult to manually adjust the structure of your data to achieve optimal query performance. In a company-wide ad hoc query environment, many different types of users access the same data in ever-changing ways, gleaning business insights and asking new questions—leading to new query patterns. With a typical skew of 20% of the queries accounting for 80% of the load, it quickly becomes infeasible to analyze and optimize this manually as a one-time task.

To help with this issue, Lyft developed an automated feedback loop to adapt its ETL based on monitoring the cost of queries run on the system by identifying the most common columns being filtered on in the most expensive queries against the highest cardinality values across all tables; feeding this information into a recommendation engine to determine the optimal column sorting to use at ingest time to achieve the maximum partition pruning by Presto’s Parquet reader; and driving column sorting and distribution during ingest using these recommendations.

James Taylor explores the system and the impact it has had on cluster load over time and discusses future work to enhance the system through the use of materialized views to reduce the number of ad hoc joins and sorting performed by the most expensive queries by transparently rewriting queries when possible.

James Taylor

Lyft

James Taylor is a software engineer in the Data Infrastructure Group at Lyft, where he works on big data systems. Previously, he was an architect at Salesforce, where he founded the Apache Phoenix project and led its development, and worked on federated query processing systems and event-driven programming platforms at BEA Systems.

Presented by

Strategic Sponsors

Zettabyte Sponsor

Contributing Sponsors

Exabyte Sponsors

Impact Sponsors

Supporting Sponsor

Sponsorship Opportunities

For exhibition and sponsorship opportunities, email strataconf@oreilly.com

Partner Opportunities

For information on trade opportunities with O'Reilly conferences, email partners@oreilly.com

Contact Us

View a complete list of Strata Data Conference contacts

©2019, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com