In this talk, Michael will describe Spark SQL, the newest component of the Apache Spark stack, which adds native support for querying structured data using SQL. Unlike previous efforts to execute queries on Spark, Spark SQL is included in the distribution and is closely coupled with the rest of the ecosystem. It also has a brand-new query optimizer that is specialized for the RDD computation model.
The tight coupling of Spark SQL’s interfaces (available in Scala, Java, and Python) lets developers natively query data stored both in existing RDDs and in external sources. A key feature of Spark SQL is its ability to blur the lines between relational tables and RDDs, making it easy for developers to intermix SQL commands that query structured data with complex analytics written in imperative or functional languages. This makes it possible to run SQL commands and pipe the results directly into powerful libraries such as MLlib or NumPy, all in a single program.
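The pattern described above, running a SQL query and handing its rows straight to ordinary code in the same program, can be sketched without a cluster. The following is only an illustration of that pattern: it uses Python's built-in sqlite3 module as a stand-in for a SQL engine and is not Spark SQL's API. The table, column names, and data are hypothetical.

```python
import sqlite3
import statistics

# Hypothetical in-memory table standing in for a structured data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("alice", 34), ("bob", 17), ("carol", 52), ("dave", 29)],
)

# Relational step: declarative filtering expressed in SQL.
rows = conn.execute(
    "SELECT name, age FROM people WHERE age >= 18 ORDER BY name"
).fetchall()

# Analytics step: ordinary Python code applied to the query results.
ages = [age for _, age in rows]
mean_age = statistics.mean(ages)
adult_names = [name.upper() for name, _ in rows]

conn.close()
```

In Spark SQL the query result would be a distributed collection rather than a local list, but the shape of the program is the same: a SQL step followed by general-purpose code over its output.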
Michael will also discuss the Catalyst optimizer framework, which allows Spark SQL to automatically rewrite query plans so that they execute more efficiently.
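To give a rough sense of what rewriting a plan means, here is a toy rule-based rewriter over a tiny expression tree. It is a sketch of the general idea only (small rules applied to a tree until nothing changes), not Catalyst's actual API, which is written in Scala. All class and function names here (Lit, Attr, Add, transform_up, optimize) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lit:
    """A literal integer in an expression."""
    value: int


@dataclass(frozen=True)
class Attr:
    """A reference to a named column."""
    name: str


@dataclass(frozen=True)
class Add:
    """Addition of two sub-expressions."""
    left: object
    right: object


def transform_up(node, rule):
    """Rewrite the children first, then apply the rule to the node itself."""
    if isinstance(node, Add):
        node = Add(transform_up(node.left, rule), transform_up(node.right, rule))
    return rule(node)


def fold_constants(node):
    """Rule: replace Add(Lit a, Lit b) with Lit(a + b)."""
    if isinstance(node, Add) and isinstance(node.left, Lit) and isinstance(node.right, Lit):
        return Lit(node.left.value + node.right.value)
    return node


def drop_zero(node):
    """Rule: replace x + 0 or 0 + x with x."""
    if isinstance(node, Add):
        if node.left == Lit(0):
            return node.right
        if node.right == Lit(0):
            return node.left
    return node


def optimize(plan, rules):
    """Apply every rule repeatedly until the tree stops changing."""
    while True:
        new_plan = plan
        for rule in rules:
            new_plan = transform_up(new_plan, rule)
        if new_plan == plan:
            return plan
        plan = new_plan


# x + (1 + -1) simplifies to just x.
expr = Add(Attr("x"), Add(Lit(1), Lit(-1)))
optimized = optimize(expr, [fold_constants, drop_zero])
```

Each rule is small and only handles one pattern; the efficiency gain comes from composing many such rules and running them to a fixed point.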
Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.