Data science is a process of abstraction. To explain or predict a real phenomenon, the process starts with acquiring and refining the data. It then moves between three layers of abstraction: transformations (data abstraction), visualizations (visual abstraction), and modeling (symbolic abstraction). Together, the three layers build a truer (or closer) representation of the real phenomenon.
Data visualization (data-vis) helps us to understand the portrait and the shape of the data. The science of data-vis for exploratory data analysis is well developed for both static graphics (scatter-plot matrices, glyph-based approaches, geometric transforms like parallel coordinates) and interactive graphics (layering, brushing and linking, projections and tours). (For more information, see Amit Kapoor’s Strata + Hadoop World Singapore talk, Visualizing Multidimensional Data.) Though visualization is used in data science to understand the shape of the data, it’s not widely used for statistical models, which are evaluated based on numerical summaries.
Amit Kapoor demonstrates extending visualization to the statistical model (model-vis). Model visualization helps us understand the shape of the model and compare it to the shape of the data. It lets us see the fit of the model and where that fit can be improved. It also clarifies the role of the parameters: how the model changes when the parameters change, and how the parameters change when the input data changes.
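The three views described above can be sketched numerically before any plotting is done. The following is a minimal illustration and not taken from the talk: the synthetic data, the straight-line model, the helper `fit_line`, and the use of bootstrap resampling to perturb the input data are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a mildly curved trend plus noise.
x = np.linspace(0, 10, 200)
y = 1.5 * x + 0.2 * x**2 + rng.normal(scale=2.0, size=x.size)


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns [intercept, slope]."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# 1. Shape of the model: predictions over a grid, to overlay on the raw data.
intercept, slope = fit_line(x, y)
grid = np.linspace(x.min(), x.max(), 50)
model_curve = intercept + slope * grid

# 2. Fit of the model: residuals against x show where the model misses.
residuals = y - (intercept + slope * x)

# 3. Parameters versus input data: refit on bootstrap resamples and
#    collect the coefficients to see how much they move.
n_boot = 200
boot_coefs = np.empty((n_boot, 2))
for b in range(n_boot):
    idx = rng.integers(0, x.size, size=x.size)  # sample rows with replacement
    boot_coefs[b] = fit_line(x[idx], y[idx])

slope_spread = boot_coefs[:, 1].std()
```

Each array maps to one picture: `model_curve` drawn over a scatter plot of `x` and `y`, `residuals` plotted against `x`, and the rows of `boot_coefs` drawn as a bundle of fitted lines or a histogram of slopes. Any plotting library could be used for that last step.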
The science and tools for model-vis are still very underdeveloped. Amit looks at practical examples of doing model-vis in regression (linear, lasso), classification (logistic, trees, LDA), and clustering (hierarchical) problems that can help us better understand the model. This includes exploring model-vis approaches that:
Integrating these approaches for model-vis as a part of model evaluation strengthens a data scientist’s understanding of the model and leads to better model building, complementing data-vis for fitting better models as well as communicating the insight from the data science process.
Amit Kapoor is a data storyteller at narrativeViz, where he uses storytelling and data visualization as tools for improving communication, persuasion, and leadership through workshops and trainings conducted for corporations, nonprofits, colleges, and individuals. Interested in learning and teaching the craft of telling visual stories with data, Amit also teaches storytelling with data for executive courses as a guest faculty member at IIM Bangalore and IIM Ahmedabad. Amit’s background is in strategy consulting, using data-driven stories to drive change across organizations and businesses. Previously, he gained more than 12 years of management consulting experience with A.T. Kearney in India, Booz & Company in Europe, and startups in Bangalore. Amit holds a BTech in mechanical engineering from IIT, Delhi, and a PGDM (MBA) from IIM, Ahmedabad.
©2016, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.