Sep 23–26, 2019

Scaling Apache Spark at Facebook

Sameer Agarwal (Facebook Inc.)
1:15pm–1:55pm Thursday, September 26, 2019
Location: 1A 06/07

Who is this presentation for?

Software Engineers, Data Engineers, Data Scientists

Level

Intermediate

Description

Spark started at Facebook as an experiment when the project was still in its early phases. Spark’s appeal stemmed from its ease of use and its integrated environment for running SQL, MLlib, and custom applications. At that time, the system was used by a handful of people to process small amounts of data. However, we’ve come a long way since then. Currently, Spark is one of the primary SQL engines at Facebook, in addition to being the primary system for writing custom batch applications. This talk will cover the story of how we optimized, tuned, and scaled Apache Spark at Facebook to run on clusters of tens of thousands of machines, process hundreds of petabytes of data, and serve thousands of data scientists, engineers, and product analysts every day. Specifically, we’ll focus on three areas:

1. Scaling Compute: How Facebook runs Spark efficiently and reliably on tens of thousands of heterogeneous machines in disaggregated (shared-storage) clusters.
2. Optimizing Core Engine: How we continuously tune, optimize and add features to the core engine in order to maximize the useful work done per second.
3. Scaling Users: How we make Spark easier to use and faster to debug, so that new users can be onboarded seamlessly.

Prerequisite knowledge

SQL, Spark, Databases

What you'll learn

How Facebook optimized, tuned, and scaled Apache Spark to run on clusters of tens of thousands of machines, process hundreds of petabytes of data, and serve thousands of data scientists, engineers, and product analysts every day.

Sameer Agarwal

Facebook Inc.

Sameer Agarwal is an Apache Spark Committer and a Software Engineer at Facebook, where he works as part of the Data Warehouse team on building distributed systems and databases that scale across clusters of tens of thousands of machines. He received his PhD in Databases from the UC Berkeley AMPLab, where he worked on BlinkDB, an approximate query engine for Spark.

