Exploring and visualizing large, complex datasets is a significant challenge for data scientists. Divide and recombine (D&R) techniques provide scalable methods for exploring and visualizing datasets that would otherwise be intractable. D&R divides data into meaningful subsets, performs embarrassingly parallel computations on the subsets, and combines the results in a statistically valid manner. The most important and meaningful subsets of massive datasets can then be visualized.
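The divide, apply, and recombine steps described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the datadr or Trelliscope API: the function names (`divide`, `summarize`, `recombine`, `divide_and_recombine`), the `site` and `value` fields, and the sample records are all hypothetical, and a local thread pool stands in for a cluster backend such as Hadoop or Spark.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def divide(records, key):
    """Divide: group records into subsets by a conditioning key."""
    subsets = defaultdict(list)
    for rec in records:
        subsets[key(rec)].append(rec)
    return dict(subsets)


def summarize(subset):
    """Apply: per-subset computation (count and mean of 'value').

    Each call depends only on its own subset, so the calls are
    embarrassingly parallel.
    """
    values = [rec["value"] for rec in subset]
    return {"n": len(values), "mean": mean(values)}


def recombine(per_subset):
    """Recombine: the count-weighted mean of subset means equals the overall mean."""
    total_n = sum(s["n"] for s in per_subset.values())
    return sum(s["n"] * s["mean"] for s in per_subset.values()) / total_n


def divide_and_recombine(records, key):
    subsets = divide(records, key)
    # A thread pool is a stand-in for a distributed backend (assumption).
    with ThreadPoolExecutor() as pool:
        results = dict(zip(subsets, pool.map(summarize, subsets.values())))
    return results, recombine(results)


# Hypothetical sample data, for illustration only.
records = [
    {"site": "a", "value": 1.0},
    {"site": "a", "value": 3.0},
    {"site": "b", "value": 10.0},
]
per_site, overall = divide_and_recombine(records, key=lambda r: r["site"])
```

Here `per_site` holds one summary per subset, which is the kind of per-subset output that could feed a panel-per-subset display, while `overall` is the recombined estimate. Weighting by subset size is one simple way to keep the recombination statistically valid for a mean; other statistics would need their own recombination rule.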
Stephen Elston and Ryan Hafen lead a series of hands-on exercises to help you develop skills in exploring and visualizing large, complex datasets using R, Hadoop, and Spark. The D&R approach is implemented in the DeltaRho project, a collection of R packages that provide a frontend and connectors for specifying D&R analytic and visualization operations on a cluster. The datadr package provides a highly abstracted interface for performing D&R operations, enabling users to easily interact with distributed parallel backend computation environments such as Hadoop and Spark. The Trelliscope package provides a D&R approach for detailed, flexible, and interactive visualization of large, complex data.
Stephen Elston is an experienced big data geek, data scientist, and software business leader. Steve is principal consultant at Quantia Analytics, LLC, where he leads the building of new business lines, manages P&L, and takes software products from concept and financing through development, intellectual property protection, sales, customer shipment, and support. Steve is also an instructor for the University of Washington data science program. Steve has over two decades of experience in visualization, predictive analytics, and machine learning, at scales from small to massive, using many platforms including Hadoop, Spark, R, S/S-PLUS, and Python. He has created solutions in fraud detection, capital markets, wireless systems, law enforcement, and streaming analytics for the IoT.
Ryan Hafen is an independent statistical consultant and an adjunct assistant professor in the Statistics Department at Purdue University. Ryan’s research focuses on methodology, tools, and applications in exploratory analysis, statistical model building, and machine learning on large, complex datasets. He is the developer of the datadr and Trelliscope components of the Tessera project (now DeltaRho) as well as the rbokeh visualization package. Ryan’s applied work on analyzing large, complex data has spanned many domains, including power systems engineering, nuclear forensics, high-energy physics, biology, and cybersecurity. Ryan holds a BS in statistics from Utah State University, an MStat in mathematics from the University of Utah, and a PhD in statistics from Purdue University.
©2017, O'Reilly Media, Inc.