Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

When is data science a house of cards? Replicating data science conclusions

June Andrews (Wise / GE Digital), Frances Haugen (Pinterest)
11:00am–11:40am Thursday, March 16, 2017
Data science & advanced analytics
Location: LL21 A
Level: Beginner
Average rating: *****
(5.00, 5 ratings)

Who is this presentation for?

  • Data scientists, data professionals, machine-learning engineers, product managers, business strategists, and executives

What you'll learn

  • Understand what aspects of analysis cause differences in conclusions

Description

Clustering problems appear frequently across fields as diverse as marketing and server farm operations, and we have known since the most basic clustering algorithms were first developed that slight changes in parameter selection could yield different results. ML clustering problems requiring human judgement introduce an opportunity for bias to influence final results. When we think of this in a positive light, we call it expertise; when we examine it in a negative light, we call it error.
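
As a rough illustration of that sensitivity, the sketch below runs a bare-bones one-dimensional k-means (Lloyd's algorithm) twice on the same toy data, changing only the starting centroids. The function name `kmeans_1d`, the data, and the starting values are all made up for this example and are not taken from the talk.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Minimal Lloyd's algorithm on 1-D data (illustrative only).

    Returns (labels, centroids) once the centroids stop moving.
    """
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            new_centroids.append(
                sum(members) / len(members) if members else centroids[j]
            )
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy data: three tight groups, but we ask for only two clusters.
points = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0, 13.0, 14.0, 15.0]

# Same data, same algorithm, same k; only the starting centroids differ.
labels_a, centroids_a = kmeans_1d(points, [2.0, 8.0])
labels_b, centroids_b = kmeans_1d(points, [8.0, 14.0])

print(labels_a)  # middle group is merged with the high group
print(labels_b)  # middle group is merged with the low group
```

Both runs converge, and by symmetry both partitions fit the data equally well, yet they disagree about where the middle group belongs. Nothing in the algorithm alone says which answer is "right."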

A constrained experiment conducted at Pinterest provided nine veteran data scientists with a tool allowing them to cluster the same set of data in a reproducible way that also required human judgement to write a small number of rules. The results were shocking. The nine data scientists each wrote well-justified rules that helped guide the clustering, but the final results were dramatically divergent. Is this the nonacademic world’s hidden replication crisis?
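
One common way to put a number on how far two clusterings of the same data diverge is the Rand index: the fraction of item pairs on which the two clusterings agree (both group the pair together, or both keep it apart). The sketch below is only an illustration; the abstract does not say how divergence was measured in the Pinterest experiment, and the function name `rand_index` and the two label lists are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree (0.0 to 1.0)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    # A pair counts as agreement when both clusterings put the two items
    # together, or both clusterings keep them apart.
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical cluster labels from two analysts for the same nine items.
analyst_1 = [0, 0, 0, 1, 1, 1, 1, 1, 1]
analyst_2 = [0, 0, 0, 0, 0, 0, 1, 1, 1]

score = rand_index(analyst_1, analyst_2)
print(score)  # 0.5: the analysts agree on only half of the item pairs

# Cluster names do not matter, only the grouping does.
relabeled = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
print(relabeled)  # 1.0
```

A score of 1.0 means the two groupings are identical up to renaming the clusters. With more than two analysts, the same function could be applied to every pair of analysts to see how widely the results spread.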

The results of Pinterest’s experiment have tremendous implications for the data science community. June Andrews and Frances Haugen explore the aspects of analysis that cause differences in conclusions and call into question techniques that are intended to improve reproducibility. This problem may be larger than originally thought, as most companies do not have a high-enough concentration of data scientists (nor sufficient staffing or time to allocate multiple data scientists to the same projects) to encounter divergent results like these.

Join June and Frances to learn how to begin addressing this problem.

June Andrews

Wise / GE Digital

June Andrews is a Principal Data Scientist at Wise/GE Digital working on a machine learning and data science platform for the Industrial Internet of Things, which includes aviation, trains, and power plants. Previously, she worked at Pinterest spearheading the Data Trustworthiness and Signals Program to create a healthy data ecosystem for machine learning. She has also led efforts at LinkedIn on growth, engagement, and social network analysis to increase economic opportunity for professionals. June holds degrees in applied mathematics, computer science, and electrical engineering from UC Berkeley and Cornell.

Frances Haugen

Pinterest

Frances Haugen is a data product manager at Pinterest focusing on ranking content in the Home Feed and Related Pins, and on the challenges of driving immediate user engagement without harming the long-term health of the Pinterest content ecosystem. Previously, Frances worked at Google, where she founded the Google+ Search team and built the first non-“Quality”-based search experience at Google. (It was time-based with light spam filtering.) She also cofounded the Google Boston Search team. Frances loves user-facing big data applications and finding ways to make mountains of information useful and delightful to the user. She was a member of the founding class of Olin College and holds a master’s degree from Harvard.