Clustering problems appear frequently across fields as diverse as marketing and server farm operations, and we have known since the most basic clustering algorithms were first developed that slight changes in parameter selection could yield different results. ML clustering problems requiring human judgement introduce an opportunity for bias to influence final results. When we think of this in a positive light, we call it expertise; when we examine it in a negative light, we call it error.
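As a small illustration of that sensitivity, here is a minimal sketch of a basic k-means (Lloyd's algorithm) on one-dimensional toy data. It is not the tool or data from the Pinterest experiment; the function name `kmeans_1d`, the data points, and the starting centroids are all invented for this example. The only thing that changes between the two runs is the choice of starting centroids.

```python
def kmeans_1d(points, start_centroids, iterations=20):
    """Lloyd's algorithm on 1-D data, starting from explicit centroids."""
    centroids = list(start_centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters


# Hypothetical toy data: three visible groups, but we ask for only two clusters.
points = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0, 13.0, 14.0, 15.0]

low_start = kmeans_1d(points, start_centroids=[1.0, 2.0])
high_start = kmeans_1d(points, start_centroids=[14.0, 15.0])

print(low_start)   # middle group is merged with the high group
print(high_start)  # middle group is merged with the low group
```

Both runs converge to a stable partition, yet the middle group lands in a different cluster depending on where the centroids start. Either result can be defended, and choosing between them is a judgement call.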
A constrained experiment conducted at Pinterest gave nine veteran data scientists a tool that let them cluster the same set of data reproducibly but still required human judgement, in the form of a small number of hand-written rules. The results were shocking. Each of the nine data scientists wrote well-justified rules to guide the clustering, but the final results diverged dramatically. Is this the nonacademic world’s hidden replication crisis?
The results of Pinterest’s experiment have tremendous implications for the data science community. June Andrews and Frances Haugen explore the aspects of analysis that cause differences in conclusions and call into question techniques that are intended to improve reproducibility. This problem may be larger than originally thought, because most companies do not have a high enough concentration of data scientists (or the staffing and time to assign several data scientists to the same project) to encounter divergent results like these.
Join June and Frances to learn how to begin addressing this problem.
June Andrews is a Principal Data Scientist at Wise/GE Digital working on a machine learning and data science platform for the Industrial Internet of Things, which includes aviation, trains, and power plants. Previously, she worked at Pinterest spearheading the Data Trustworthiness and Signals Program to create a healthy data ecosystem for machine learning. She has also led efforts at LinkedIn on growth, engagement, and social network analysis to increase economic opportunity for professionals. June holds degrees in applied mathematics, computer science, and electrical engineering from UC Berkeley and Cornell.
Frances Haugen is a data product manager at Pinterest focusing on ranking content in the Home Feed and Related Pins and on the challenges of driving immediate user engagement without harming the long-term health of the Pinterest content ecosystem. Previously, Frances worked at Google, where she founded the Google+ Search team and built the first non-“Quality”-based search experience at Google. (It was time-based with light spam filtering.) She also cofounded the Google Boston Search team. Frances loves user-facing big data applications and finding ways to make mountains of information useful and delightful to the user. She was a member of the founding class of Olin College and holds a master’s degree from Harvard.
©2017, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.