Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Exploration and visualization of large, complex datasets with R, Hadoop, and Spark

Stephen Elston (Quantia Analytics, LLC), Ryan Hafen (Hafen Consulting)
9:00am–12:30pm Tuesday, March 14, 2017
Secondary topics:  R
Average rating: 4.12 (8 ratings)

Who is this presentation for?

  • Data scientists, analysts, and architects

Prerequisite knowledge

  • Experience using R

Materials or downloads needed in advance

  • A laptop (Windows, Linux, or macOS) with 8 GB of RAM

    What you'll learn

    • Develop skills in exploration and visualization of large, complex datasets using R, Hadoop, and Spark

    Description

    Exploration and visualization of large, complex datasets present a significant challenge for data scientists. Divide and recombine (D&R) techniques provide scalable methods for exploring and visualizing otherwise intractable datasets. D&R divides data into meaningful subsets, performs embarrassingly parallel computations on the subsets, and combines the results in a statistically valid manner. The most important and meaningful chunks of massive datasets are then visualized.
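
As a rough illustration of the divide, apply, and recombine steps described above, here is a minimal Python sketch using only the standard library. It is not the datadr API: the function names (`divide`, `summarize`, `recombine`, `dr_mean`), the `site` and `value` fields, and the sample rows are all hypothetical. It assumes the statistic of interest is a mean, for which a valid recombination pools per-subset counts and totals rather than averaging the subset means.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def divide(rows, key):
    """Divide: group rows into subsets that share the same value of `key`."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[key]].append(row)
    return dict(subsets)


def summarize(subset, field):
    """Apply: compute a small summary for one subset, independently of the others."""
    values = [row[field] for row in subset]
    return {"n": len(values), "total": sum(values)}


def recombine(summaries):
    """Recombine: pool counts and totals so unequal subset sizes are respected."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    return total / n


def dr_mean(rows, key, field, workers=4):
    """Run the three steps; the per-subset work is embarrassingly parallel."""
    subsets = divide(rows, key)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        summaries = list(pool.map(lambda s: summarize(s, field), subsets.values()))
    return recombine(summaries)


# Hypothetical data: two subsets of unequal size.
rows = [
    {"site": "a", "value": 1.0},
    {"site": "a", "value": 2.0},
    {"site": "a", "value": 3.0},
    {"site": "a", "value": 4.0},
    {"site": "b", "value": 5.0},
    {"site": "b", "value": 6.0},
]

overall = dr_mean(rows, key="site", field="value")
```

With these rows, an unweighted average of the two subset means would differ from the mean of all six values, which is why the sketch carries the counts through to the recombination step. On a cluster, the per-subset step would run on distributed workers rather than local threads.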

    Stephen Elston and Ryan Hafen lead a series of hands-on exercises to help you develop skills in exploration and visualization of large, complex datasets using R, Hadoop, and Spark. The D&R approach is implemented in the DeltaRho project—a collection of R packages that provide a frontend and connectors to specify D&R analytic and visualization operations on a cluster. The datadr package provides a highly abstracted interface for performing D&R operations, enabling users to easily interact with distributed parallel backend computation environments such as Hadoop and Spark. The Trelliscope package provides a D&R approach for detailed, flexible, and interactive visualization of large, complex data.

    Topics include:

    • How to apply the divide and recombine approach to partition data into meaningful subsets, perform computations on the subsets, and combine the results in a statistically valid manner
    • How to create interactive visualizations based on partitions of large datasets that enable deep exploration and discovery of the most important aspects of the data by interactively ordering and filtering the multiple views
    • How to apply these methods on a local machine or as embarrassingly parallel computations to data subsets on a cluster
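
The interactive ordering and filtering mentioned in the topics above can be sketched in a similar spirit. The following Python example is an assumption-laden illustration, not Trelliscope itself: it computes a few summary metrics per subset and uses them to decide which subsets to look at first. The names `panel_metrics` and `select_panels`, the `load` field, and the sample data are hypothetical.

```python
from statistics import mean, pstdev


def panel_metrics(subsets, field):
    """Compute one row of summary metrics per subset (one prospective panel each)."""
    metrics = []
    for name, rows in subsets.items():
        values = [row[field] for row in rows]
        metrics.append({
            "subset": name,
            "n": len(values),
            "mean": mean(values),
            "spread": pstdev(values),  # population standard deviation
        })
    return metrics


def select_panels(metrics, sort_by, min_n=1, top=None):
    """Filter out small subsets, then order the rest by a metric, largest first."""
    kept = [m for m in metrics if m["n"] >= min_n]
    kept.sort(key=lambda m: m[sort_by], reverse=True)
    return kept if top is None else kept[:top]


# Hypothetical subsets, already divided by region.
subsets = {
    "north": [{"load": 10.0}, {"load": 30.0}],
    "south": [{"load": 20.0}, {"load": 22.0}, {"load": 21.0}],
    "east": [{"load": 50.0}],
}

metrics = panel_metrics(subsets, "load")
# Show the most variable subsets first, ignoring subsets with a single row.
to_view = select_panels(metrics, sort_by="spread", min_n=2)
```

Each selected subset would then be drawn as its own panel. The point of the design is that only the small table of metrics has to be sorted and filtered, not the underlying data.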

    Stephen Elston

    Quantia Analytics, LLC

    Stephen Elston is an experienced big data geek, data scientist, and software business leader. Steve is principal consultant at Quantia Analytics, LLC, where he leads the building of new business lines, manages P&L, and takes software products from concept and financing through development, intellectual property protection, sales, customer shipment, and support. Steve is also an instructor for the University of Washington data science program. Steve has over two decades of experience in visualization, predictive analytics, and machine learning, at scales from small to massive, using many platforms, including Hadoop, Spark, R, S/SPLUS, and Python. He has created solutions in fraud detection, capital markets, wireless systems, law enforcement, and streaming analytics for the IoT.

    Ryan Hafen

    Hafen Consulting

    Ryan Hafen is an independent statistical consultant and an adjunct assistant professor in the Statistics Department at Purdue University. Ryan’s research focuses on methodology, tools, and applications in exploratory analysis, statistical model building, and machine learning on large, complex datasets. He is the developer of the datadr and Trelliscope components of the Tessera project (now DeltaRho) as well as the rbokeh visualization package. Ryan’s applied work on analyzing large, complex data has spanned many domains, including power systems engineering, nuclear forensics, high-energy physics, biology, and cybersecurity. Ryan holds a BS in statistics from Utah State University, an MStat in mathematics from the University of Utah, and a PhD in statistics from Purdue University.