Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

A deep dive into Spark SQL's Catalyst optimizer

11:15–11:55 Thursday, 25 May 2017
Level: Intermediate

Who is this presentation for?

  • Spark SQL (power) users

Prerequisite knowledge

  • A basic understanding of Spark SQL

What you'll learn

  • Explore the inner workings of Spark SQL
  • Understand query optimization, query planning, and tree manipulation


Catalyst is becoming one of the most important components of Apache Spark, as it underpins all the major new APIs in Spark 2.0, from DataFrames and Datasets to Streaming. At its core, Catalyst is a general library for manipulating trees.
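
To make "manipulating trees" concrete, here is a minimal sketch of a rule-based tree rewrite in Python. It is a toy illustration only, not Catalyst's actual Scala API: the node classes (`Literal`, `Attribute`, `Add`), the `transform_up` helper, and the `fold_constants` rule are all hypothetical names chosen for this example. The sketch assumes an optimizer can be modelled as rules that pattern-match on a node and return a replacement.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class Expr:
    """Base class for nodes in a toy expression tree (hypothetical, not Catalyst)."""


@dataclass(frozen=True)
class Literal(Expr):
    value: int


@dataclass(frozen=True)
class Attribute(Expr):
    name: str


@dataclass(frozen=True)
class Add(Expr):
    left: Expr
    right: Expr


# A rule returns a replacement node, or None when it does not match.
Rule = Callable[[Expr], Optional[Expr]]


def transform_up(node: Expr, rule: Rule) -> Expr:
    """Rewrite the children first, then offer the rebuilt node to the rule."""
    if isinstance(node, Add):
        node = Add(transform_up(node.left, rule), transform_up(node.right, rule))
    rewritten = rule(node)
    return node if rewritten is None else rewritten


def fold_constants(node: Expr) -> Optional[Expr]:
    """Collapse Add(Literal, Literal) into a single Literal."""
    if (
        isinstance(node, Add)
        and isinstance(node.left, Literal)
        and isinstance(node.right, Literal)
    ):
        return Literal(node.left.value + node.right.value)
    return None


# The expression x + (1 + 2), written as a tree.
tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
optimized = transform_up(tree, fold_constants)
print(optimized)
```

The nodes are immutable, so a rule never edits a tree in place; it returns a new node and the traversal rebuilds the path above it. Here the inner `1 + 2` is folded into a single literal while `x` is left untouched, because its value is unknown until execution.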

Herman van Hövell tot Westerflier explores a modular compiler frontend for Spark, built on this library, that comprises a query analyzer, an optimizer, and an execution planner. Herman offers a deep dive into Spark SQL’s Catalyst optimizer, introducing its core concepts and demonstrating how new and upcoming features such as cost-based optimization (CBO) are implemented with it. You’ll leave with a deeper understanding of how Spark analyzes, optimizes, and plans a user’s query.

Herman van Hövell tot Westerflier


Herman van Hövell tot Westerflier is a Spark committer working on Spark SQL at Databricks. Previously, Herman was a consultant working for clients in banking, manufacturing, and logistics. His interests include database systems, optimization, and simulation.