Exploratory Data Analysis with Apache Spark

Hossein Falaki (Databricks Inc.)
Data Science
Location: 113

The first step of any data science project is exploring the data. When the data is large, exploration becomes a challenge: long processing times rule out interactive exploration, and visualization tools are difficult to use because there can be more data points than there are pixels on any display. We will demonstrate how to solve these problems with Apache Spark. By caching data in memory, Spark can bring query latency down to a range that supports interactive use. We will also demonstrate how to combine popular visualization tools, such as ggplot and matplotlib, with Spark to facilitate big data visualization. We use Spark to summarize, sample, and model the data; the results of these steps can be readily consumed by many visualization tools to aid exploration. Spark’s unified programming model and diverse programming interfaces make it possible to apply these techniques in a single environment and get insight from data easily. We will use a real, large dataset to demonstrate these techniques in a live demo.
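
As a rough illustration of why summarizing and sampling make large data plottable, here is a minimal sketch in plain Python. It is not Spark code and not the speaker's implementation: the function names `summarize` and `reservoir_sample`, the synthetic Gaussian data, and the bin and sample sizes are all assumptions made for this example. On a cluster, the same reductions would be expressed as distributed aggregations and sampling, but the idea is the same: shrink the data to something a plotting tool can draw.

```python
import random
from collections import Counter


def summarize(values, lo, hi, bins):
    """Count values per equal-width bin, giving exactly `bins` numbers to plot."""
    width = (hi - lo) / bins
    counts = Counter()
    for v in values:
        # Clamp the index so values at or beyond the edges land in the end bins.
        idx = max(0, min(int((v - lo) / width), bins - 1))
        counts[idx] += 1
    return [counts[i] for i in range(bins)]


def reservoir_sample(values, k, seed=0):
    """Draw a uniform random sample of k items in a single pass (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, v in enumerate(values):
        if i < k:
            sample.append(v)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = v
    return sample


# Hypothetical dataset: far more points than a chart has pixels.
data_rng = random.Random(42)
data = [data_rng.gauss(50.0, 10.0) for _ in range(100_000)]

# Two plot-ready reductions: 50 bar heights and 1,000 scatter points.
histogram = summarize(data, min(data), max(data), bins=50)
sample = reservoir_sample(data, k=1000)
```

Either result is small enough to hand to a plotting library such as matplotlib: `histogram` as bar heights, `sample` as scatter points.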

Hossein Falaki

Databricks Inc.

Hossein Falaki is a software engineer at Databricks working on the next big thing. Prior to that, he was a data scientist on Apple’s personal assistant, Siri. He graduated with a Ph.D. in Computer Science from UCLA, where he was a member of the Center for Embedded Networked Sensing (CENS).