Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

The dangers of statistical significance when studying weak effects in big data: From natural experiments to p-hacking

Robert Grossman (University of Chicago)
2:40pm–3:20pm Wednesday, March 15, 2017
Data science & advanced analytics
Location: 230 C
Level: Intermediate
Secondary topics:  Hardcore Data Science, Healthcare
Average rating: 3.73 (11 ratings)

Who is this presentation for?

  • Practicing data scientists who build statistical models and use machine learning

Prerequisite knowledge

  • General experience building statistical and analytic models over large datasets
  • A basic understanding of p-values

What you'll learn

  • Understand some of the dangers of building analytic models to discover weak effects in large datasets, along with some practical approaches that can work


When a large dataset contains a strong signal, many machine-learning algorithms will find it. When the effect is weak and the dataset is large, however, there are many ways to "discover" an effect that is in fact nothing more than noise. Robert Grossman explores three case studies and shares best practices that make it a bit less likely that you will be accused of p-hacking.

The first case study concerns mutations in breast cancer and some of the complexities of understanding rare mutations and combinations of rare mutations. In the second case study, Robert dives into different methods for understanding whether there is an effect on the health of newborns when pregnant women are exposed to particulate matter (solid and liquid particles suspended in air). The third case study looks at a well-known published paper offering evidence for ESP. From these three case studies, Robert extracts several techniques that have consistently proved useful and discusses how best to use them in practice.

Topics include:

  • Effect size, variance, and statistical power
  • Correcting for multiple experiments
  • p-hacking and the problems of forking paths
  • Randomized and natural experiments
  • The complexities of subgroups
  • Saying no to p-values: Hierarchical Bayesian models
  • BIC and AIC
  • The role of theoretical models, causal models, and simulation
  • Why small sample sizes still occur with big data
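
As a rough illustration of the "correcting for multiple experiments" topic above (a sketch, not material from the talk): the simulation below runs many two-group comparisons on pure noise, counts how many look significant at an uncorrected 0.05 threshold, and then applies a Bonferroni correction. The function names, the number of tests, the group size, and the use of a z-test with known unit variance are all assumptions chosen for the example.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def noise_p_values(n_tests=1000, n_per_group=200, seed=0):
    """Compare two groups drawn from the SAME N(0, 1) distribution, n_tests times.

    Every null hypothesis is true by construction, so any "significant"
    result is a false positive. (Hypothetical setup for illustration.)
    """
    rng = random.Random(seed)
    # Standard error of a difference in means, assuming known unit variance.
    se = math.sqrt(2 / n_per_group)
    p_values = []
    for _ in range(n_tests):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(a) / n_per_group - sum(b) / n_per_group
        p_values.append(two_sided_p(diff / se))
    return p_values


def count_significant(p_values, alpha=0.05, bonferroni=False):
    """Count p-values below alpha, optionally Bonferroni-corrected."""
    threshold = alpha / len(p_values) if bonferroni else alpha
    return sum(p < threshold for p in p_values)


p_values = noise_p_values()
naive = count_significant(p_values)
corrected = count_significant(p_values, bonferroni=True)
print(f"uncorrected 'discoveries' in pure noise: {naive} of {len(p_values)}")
print(f"after Bonferroni correction: {corrected}")
```

With an uncorrected threshold, roughly 5% of the comparisons are expected to pass by chance alone, which is why testing many subgroups or model variants without correction can surface effects that are only noise. Bonferroni is the simplest correction and is conservative; other procedures trade some of that strictness for power.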

Robert Grossman

University of Chicago

Robert Grossman is a faculty member and the chief research informatics officer in the Biological Sciences Division of the University of Chicago. Robert is the director of the Center for Data Intensive Science (CDIS) and a senior fellow at both the Computation Institute (CI) and the Institute for Genomics and Systems Biology (IGSB). He is also the founder and a partner of the Open Data Group, which specializes in building predictive models over big data. Robert has led the development of open source software tools for analyzing big data (Augustus), distributed computing (Sector), and high-performance networking (UDT). In 1996, he founded Magnify, Inc., which provides data-mining solutions to the insurance industry and was sold to ChoicePoint in 2005. He is also the chair of the Open Cloud Consortium, a not-for-profit that supports the research community by operating cloud infrastructure, such as the Open Science Data Cloud. He blogs occasionally about big data, data science, and data engineering at

Comments on this page are now closed.


Wilmer Masterson | DATA SCIENTIST
03/15/2017 3:34pm PDT

Where is the link for the slides?