When a large dataset contains a strong signal, many machine-learning algorithms will find it. When the effect is weak and the dataset is large, however, there are many ways to "discover" an effect that is in fact nothing more than noise. Robert Grossman explores three case studies and shares best practices that make it less likely you will be accused of p-hacking.
The first case study concerns mutations in breast cancer and some of the complexities of understanding rare mutations and combinations of rare mutations. In the second case study, Robert dives into different methods for determining whether exposure of pregnant women to particulate matter (solid and liquid particles suspended in air) affects the health of newborns. The third case study looks at a well-known published paper offering evidence for ESP. Robert extracts several techniques from these three case studies that have consistently proved useful and discusses how best to use them in practice.
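As a rough illustration of how noise alone can look like a discovery when many comparisons are run, here is a minimal Python sketch. It is not taken from the talk: the simulation setup, the function names (`two_sided_p`, `count_false_discoveries`), the z-test with an assumed unit variance, and the use of a Bonferroni correction are all assumptions chosen for illustration.

```python
import math
import random


def two_sided_p(sample_a, sample_b):
    """Two-sided z-test p-value for a difference in means.

    Assumes both samples come from distributions with unit variance
    (an illustrative simplification, not a general-purpose test).
    """
    n_a, n_b = len(sample_a), len(sample_b)
    diff = sum(sample_a) / n_a - sum(sample_b) / n_b
    z = diff / math.sqrt(1.0 / n_a + 1.0 / n_b)
    return math.erfc(abs(z) / math.sqrt(2.0))


def count_false_discoveries(n_tests=1000, n_per_group=200, alpha=0.05, seed=0):
    """Run n_tests comparisons on pure noise.

    Returns (naive_hits, bonferroni_hits): how many comparisons fall below
    alpha, and how many fall below alpha / n_tests.
    """
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_tests):
        # Both groups are drawn from the same distribution: no real effect.
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        p_values.append(two_sided_p(group_a, group_b))
    naive_hits = sum(p < alpha for p in p_values)
    bonferroni_hits = sum(p < alpha / n_tests for p in p_values)
    return naive_hits, bonferroni_hits


if __name__ == "__main__":
    naive, corrected = count_false_discoveries()
    print(f"'significant' at alpha=0.05: {naive} of 1000")
    print(f"after Bonferroni correction: {corrected} of 1000")
```

Because every comparison here is pure noise, roughly 5% of them should fall below the 0.05 threshold by chance, while the corrected threshold should flag few or none. Bonferroni is only one conservative option among several for handling multiple comparisons.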
Robert Grossman is a faculty member and the chief research informatics officer in the Biological Sciences Division of the University of Chicago. Robert is the director of the Center for Data Intensive Science (CDIS) and a senior fellow at both the Computation Institute (CI) and the Institute for Genomics and Systems Biology (IGSB). He is also the founder and a partner of the Open Data Group, which specializes in building predictive models over big data. Robert has led the development of open source software tools for analyzing big data (Augustus), distributed computing (Sector), and high-performance networking (UDT). In 1996, he founded Magnify, Inc., which provides data-mining solutions to the insurance industry and was sold to ChoicePoint in 2005. He is also the chair of the Open Cloud Consortium, a not-for-profit that supports the research community by operating cloud infrastructure, such as the Open Science Data Cloud. He blogs occasionally about big data, data science, and data engineering at Rgrossman.com.