Visual exploration of big data is challenging: processing times prohibit interactive visualization, and there can be more data points than there are pixels on any display. We will demonstrate how to use Apache Spark together with popular visual tools such as IPython Notebook and ggplot to overcome these challenges. First, by caching data in memory, we will bring query latency down into the range of human interaction. Second, we will demonstrate how to combine visual tools with Spark to apply three techniques that address the mismatch between data points and pixels:

a) “Summarize and visualize,” a technique used by many BI tools. We will show how to do it rapidly and interactively with Spark.

b) “Sample and visualize,” long used by statisticians. We will show which sampling techniques Spark supports and what the challenges and solutions are when sampling large datasets.

c) “Model and visualize,” made possible by Spark’s MLlib module.

Spark’s unified programming model and diverse programming interfaces make it possible to combine these techniques in a single environment to gain insight from data. We will use a real big dataset, Wikipedia traffic logs, to demonstrate these techniques in a live demo.
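To make the “sample and visualize” idea concrete, here is a minimal pure-Python sketch of reservoir sampling, the kind of one-pass technique needed when a dataset is too large to materialize before plotting. The function name and parameters are illustrative, not from the talk; in Spark itself the equivalent functionality is exposed through methods such as `RDD.sample` and `RDD.takeSample`.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Draw a uniform random sample of k items from a stream of
    unknown length in a single pass (classic reservoir sampling)."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace an existing item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# Downsample a million points to something a plotting tool can handle.
points = range(1_000_000)
subset = reservoir_sample(points, 1000)
```

The resulting `subset` is small enough to hand directly to ggplot or matplotlib, while remaining a statistically uniform sample of the full data.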
Hossein Falaki is a software engineer at Databricks working on the next big thing. Before that, he was a data scientist working on Apple’s personal assistant, Siri. He holds a Ph.D. in Computer Science from UCLA, where he was a member of the Center for Embedded Networked Sensing (CENS).