Sep 23–26, 2019

Scaling Apache Spark at Facebook

Sameer Agarwal (Facebook), Ankit Agarwal (Facebook Inc.)
1:15pm–1:55pm Thursday, September 26, 2019
Location: 1A 06/07
Average rating: 4.50 (2 ratings)

Who is this presentation for?

  • Software engineers, data engineers, and data scientists




Spark started at Facebook as an experiment when the project was still in its early phases. Spark’s appeal stemmed from its ease of use and an integrated environment to run SQL, MLlib, and custom applications. At that time, the system was used by a handful of people to process small amounts of data.

However, Facebook has come a long way since then. Spark is now one of Facebook’s primary SQL engines as well as the primary system for writing custom batch applications. Sameer Agarwal dives into the story of how Facebook optimized, tuned, and scaled Apache Spark to run on clusters of tens of thousands of machines, process hundreds of petabytes of data, and serve thousands of data scientists, engineers, and product analysts every day. You’ll hear specifically about three areas:

  • Scaling compute: how Facebook runs Spark efficiently and reliably on tens of thousands of heterogeneous machines in disaggregated (shared storage) clusters
  • Optimizing the core engine: how Facebook continuously tunes, optimizes, and adds features to the core engine to maximize the useful work done per second
  • Scaling users: how Facebook makes Spark easy to use and faster to debug so that new users can onboard seamlessly

Prerequisite knowledge

  • Familiarity with SQL, Spark, and databases

What you'll learn

  • Discover how Facebook optimized, tuned, and scaled Apache Spark to run on clusters of tens of thousands of machines, process hundreds of petabytes of data, and serve thousands of data scientists, engineers, and product analysts every day

Sameer Agarwal


Sameer Agarwal is an Apache Spark committer and a software engineer at Facebook, where he works on the data warehouse team building distributed systems and databases that scale across clusters of tens of thousands of machines. He received his PhD in databases from UC Berkeley’s AMPLab, where he worked on BlinkDB, an approximate query engine for Spark.


Ankit Agarwal

Facebook Inc.

- Production Engineering Manager at Facebook (Data Warehouse Team)
- Data Infrastructure Team at Facebook since 2012
- Previously worked on the search team at Yahoo!

Comments on this page are now closed.


Anushka Jadhav | sr software engineer
10/09/2019 4:42pm EDT

Hi, can you please post the slides for this talk?
