Catalyst is becoming one of the most important components of Apache Spark, as it underpins all the major new APIs in Spark 2.0, from DataFrames and Datasets to Streaming. At its core, Catalyst is a general library for manipulating trees.
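To make "a general library for manipulating trees" concrete, here is a minimal sketch of a rule that rewrites an expression tree bottom-up. It is a toy written in Python for illustration, not Catalyst's Scala API: the classes `Expr`, `Literal`, `Attribute`, and `Add`, the method `transform_up`, and the rule `fold_constants` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


class Expr:
    """Base class for nodes in a toy expression tree (hypothetical, not Catalyst)."""

    def children(self):
        return ()

    def with_children(self, new_children):
        return self

    def transform_up(self, rule):
        # Rewrite the children first, then offer the rebuilt node to the rule.
        new_children = tuple(c.transform_up(rule) for c in self.children())
        node = self.with_children(new_children) if new_children else self
        rewritten = rule(node)
        # A rule returns None when it does not apply to this node.
        return node if rewritten is None else rewritten


@dataclass(frozen=True)
class Literal(Expr):
    value: int


@dataclass(frozen=True)
class Attribute(Expr):
    name: str


@dataclass(frozen=True)
class Add(Expr):
    left: Expr
    right: Expr

    def children(self):
        return (self.left, self.right)

    def with_children(self, new_children):
        return Add(*new_children)


def fold_constants(node):
    """Replace Add(Literal, Literal) with a single Literal holding the sum."""
    if (
        isinstance(node, Add)
        and isinstance(node.left, Literal)
        and isinstance(node.right, Literal)
    ):
        return Literal(node.left.value + node.right.value)
    return None


# x + (1 + 2) becomes x + 3
tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
optimized = tree.transform_up(fold_constants)
print(optimized)
```

The tree nodes are immutable, so a rule never edits a node in place; it returns a replacement, and the traversal rebuilds the parents around it. Keeping rules as small, independent functions is what lets many of them be composed into a larger optimizer.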
Herman van Hövell tot Westerflier explores a modular compiler frontend for Spark built on this library, comprising a query analyzer, an optimizer, and an execution planner. Herman offers a deep dive into Spark SQL’s Catalyst optimizer, introducing its core concepts and demonstrating how new and upcoming features such as cost-based optimization (CBO) are implemented on top of it. You’ll leave with a deeper understanding of how Spark analyzes, optimizes, and plans a user’s query.
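The three phases named above (analysis, optimization, planning) can be sketched as a small pipeline in which a size estimate drives the final choice of join operator, which is the basic idea behind a cost-based decision. This is a toy under stated assumptions, not Spark's implementation: the functions `analyze`, `optimize`, and `plan_physical`, the `Relation` and `Join` classes, the statistics, and the `BROADCAST_THRESHOLD` value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    name: str
    size_in_bytes: int  # estimated statistic, made up for this example


@dataclass(frozen=True)
class Join:
    left: Relation
    right: Relation


# Hypothetical threshold for this sketch; not a value taken from the talk.
BROADCAST_THRESHOLD = 10 * 1024 * 1024


def analyze(table_names, catalog):
    """Analysis: resolve two table names against a catalog (KeyError if unknown)."""
    left, right = (catalog[name] for name in table_names)
    return Join(left, right)


def optimize(plan):
    """Logical optimization: move the smaller relation to the right-hand side."""
    if plan.left.size_in_bytes < plan.right.size_in_bytes:
        return Join(plan.right, plan.left)
    return plan


def plan_physical(plan):
    """Planning: pick a physical join operator from the size estimate."""
    if plan.right.size_in_bytes <= BROADCAST_THRESHOLD:
        return f"BroadcastHashJoin({plan.left.name}, {plan.right.name})"
    return f"SortMergeJoin({plan.left.name}, {plan.right.name})"


catalog = {
    "orders": Relation("orders", 50 * 1024**3),
    "countries": Relation("countries", 20 * 1024),
}
physical = plan_physical(optimize(analyze(["countries", "orders"], catalog)))
print(physical)
```

Each phase takes a plan and returns a new one, so the phases stay independent of one another: better statistics change what the planner decides without touching the analyzer or the optimizer.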
Herman van Hövell tot Westerflier is a Spark committer working on Spark SQL at Databricks. Previously, Herman was a consultant working for clients in banking, manufacturing, and logistics. His interests include database systems, optimization, and simulation.
©2017, O’Reilly UK Ltd • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.