The first step of any data science project is exploring the data. When the data is large, exploration becomes a challenge: long processing times prohibit interactive exploration, and visualization tools are difficult to use because there can be more data points than there are pixels on any display. We will demonstrate how to solve these problems with Apache Spark. By caching data in memory, Spark can bring query latency down to a range suited to interactive use. We will also demonstrate how to combine popular visualization tools, such as ggplot and matplotlib, with Spark to facilitate big data visualization. We use Spark to summarize, sample, and model the data, and the results of these steps can be readily consumed by many visualization tools to aid exploration. Spark’s unified programming model and diverse programming interfaces make it possible to apply these techniques in a single environment and get insight from data easily. We will demonstrate these techniques in a live demo on a large real-world dataset.
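The summarize and sample steps mentioned in the abstract can be illustrated with a minimal sketch. This is a pure-Python stand-in using only the standard library, not Spark's API and not the speaker's code; the function names `summarize` and `reservoir_sample` are hypothetical. The point is that both steps reduce an arbitrarily large input to a fixed number of values that a plotting tool can draw. In Spark, the same reductions would presumably run as distributed aggregations.

```python
import random
from collections import Counter


def summarize(values, lo, hi, bins):
    """Count values per equal-width bin over [lo, hi].

    The result always has `bins` entries, however many values come in,
    so it can be drawn as a histogram. (Hypothetical helper.)
    """
    width = (hi - lo) / bins
    counts = Counter()
    for v in values:
        if lo <= v <= hi:
            # Clamp so that v == hi falls into the last bin.
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
    return [counts.get(i, 0) for i in range(bins)]


def reservoir_sample(values, k, seed=0):
    """Draw a uniform random sample of at most k items in one pass.

    Uses reservoir sampling, so the input is never held in memory
    all at once. (Hypothetical helper.)
    """
    rng = random.Random(seed)
    sample = []
    for i, v in enumerate(values):
        if i < k:
            sample.append(v)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = v
    return sample


if __name__ == "__main__":
    # Synthetic data only; stands in for a dataset too large to plot directly.
    data = [i % 100 for i in range(100_000)]
    print(summarize(data, 0, 100, 10))
    print(len(reservoir_sample(data, 500)))
```

Either output is small enough to pass straight to a plotting library such as matplotlib, for example as bar heights or as scatter points.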
Hossein Falaki is a software engineer at Databricks working on the next big thing. Prior to that, he was a data scientist on Siri, Apple’s personal assistant. He earned a Ph.D. in Computer Science from UCLA, where he was a member of the Center for Embedded Networked Sensing (CENS).
©2015, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with, and does not endorse or review, the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.