This talk will use the example of sentiment analysis to show that supervised machine learning has the potential to amplify the voices of the most privileged people in society.
A sentiment analysis algorithm is considered ‘table stakes’ for any serious text analytics platform in social media, finance, or security. To make the problem tractable and the results interpretable, these algorithms reduce the ‘sentiment’ of a text to a one-dimensional classification (very positive, fairly negative, etc.). As an example of supervised machine learning, I’ll briefly review how these algorithms are trained. I’ll explain this process qualitatively so you develop an intuition for what is going on, but I’ll also show Python code that will give you practical techniques you can apply to your own data.
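To give a flavor of what such training looks like, here is a minimal sketch of a supervised sentiment classifier. It is not the code from the talk: it assumes a bag-of-words naive Bayes model with add-one smoothing, and the four labeled sentences, the label names "pos" and "neg", and the function names train and predict are all invented for illustration.

```python
import math
from collections import Counter

# Toy labeled examples, invented for illustration only.
TRAIN = [
    ("i love this it is great", "pos"),
    ("what a great wonderful day", "pos"),
    ("i hate this it is awful", "neg"),
    ("what an awful terrible day", "neg"),
]


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest naive Bayes log score."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        # Start from the log prior for this label.
        score = math.log(label_counts[label] / total_docs)
        for word in text.split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids log(0).
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAIN)
print(predict(model, "a great day"))
print(predict(model, "this is awful"))
```

The point of the sketch is that the model knows nothing beyond the labeled examples it was given: whatever patterns, and whatever biases, are in the training data are what it learns.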
This one-dimensional, supervised approach means that sentiment analysis algorithms fail to measure what they claim to measure, but they don’t measure nothing. Rather, they learn to spot unsubtle expressions of extreme emotion. In fact, the words that a simple algorithm learns to be most predictive of sentiment tend to be used by a particularly privileged group of authors: men.
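One way to see which words a simple model treats as most predictive is to rank the vocabulary by smoothed log-odds between the two classes. The sketch below assumes that approach; it is not necessarily the method used in the talk, and the example sentences and the helper name log_odds are hypothetical. Relating the top-ranked words to who writes them would need author metadata, which this sketch does not include.

```python
import math
from collections import Counter

# Toy labeled examples, invented for illustration only.
LABELED = [
    ("absolutely brilliant best thing ever", "pos"),
    ("brilliant work i enjoyed it", "pos"),
    ("utter garbage worst thing ever", "neg"),
    ("garbage work i regretted it", "neg"),
]


def log_odds(examples, positive="pos"):
    """Smoothed log-odds of each word appearing in positive vs. other text."""
    pos, neg = Counter(), Counter()
    for text, label in examples:
        (pos if label == positive else neg).update(text.split())
    vocab = set(pos) | set(neg)
    # Add-one smoothing so words absent from one class do not give log(0).
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        word: math.log((pos[word] + 1) / pos_total)
        - math.log((neg[word] + 1) / neg_total)
        for word in vocab
    }


scores = log_odds(LABELED)
# Words with the largest absolute score are the strongest signals.
ranked = sorted(scores, key=lambda word: abs(scores[word]), reverse=True)
for word in ranked[:4]:
    print(f"{word:12s} {scores[word]:+.2f}")
```

In this toy data the strongest signals are the blunt, emphatic words, while words that appear equally in both classes score zero.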
From this specific example, I will draw out the ways in which a supervised machine-learning algorithm can embed biases that enhance privilege or are otherwise harmful: from training data, to figures of merit, to feature selection.
These issues are morally and legally important to everyone who is in the business of making inferences about people from data.
Mike Lee Williams is a research engineer at Cloudera Fast Forward Labs, where he builds prototypes that bring the latest ideas in machine learning and AI to life and helps Cloudera’s customers understand how to make use of these new technologies. Mike holds a PhD in astrophysics from Oxford.