Collections of tweets are overly rich in the sense that not all tweets are relevant to the task at hand (for instance, because they are posted by nonhuman accounts, contain spam, refer to irrelevant events, or point to an irrelevant sense of an ambiguous keyword used in data collection). This richness also makes tweet collections dynamic: there is no guarantee that collections gathered in different periods will have similar characteristics.
Ali Hürriyetoglu and Nelleke Oostdijk share the results of a study on combining unsupervised and supervised machine learning with linguistic insight to help people identify the tweets relevant to their needs, and offer an overview of their tool, Relevancer. Relevancer enables an expert (anybody able to make knowledgeable decisions about how to annotate tweet clusters in order to understand a tweet collection in a certain context) to analyze a tweet collection. Groups of related tweets (information threads), as defined by the expert, are detected using unsupervised machine learning, confirmed by the expert, and used to classify the remaining or new tweets using supervised machine learning. The tool requires expert feedback in the form of cluster annotation to complete the analysis. Experts can repeat the analysis if they collect new data with the same keywords, or if they decide on a different type of annotation once evaluating the automatically selected first set of coherent clusters has given them a better understanding of the collection. This method advances the state of the art in the efficient and complete understanding and management of a nonstandard, rich, and dynamic data type.
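The cluster, annotate, and classify loop described above can be illustrated with a small sketch. This is not Relevancer's implementation: greedy Jaccard clustering and a word-overlap scorer stand in for the unsupervised and supervised steps, and every function name, threshold, label, and example tweet below is a hypothetical chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a tweet and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(tweets, threshold=0.3):
    """Unsupervised step (stand-in): greedy single-pass clustering.

    A tweet joins the first cluster whose seed tweet is similar enough;
    otherwise it starts a new cluster. Returns lists of tweet indices.
    """
    clusters = []
    for i, tweet in enumerate(tweets):
        tokens = tokenize(tweet)
        for members in clusters:
            if jaccard(tokens, tokenize(tweets[members[0]])) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def train(tweets, clusters, cluster_labels):
    """Supervised step (stand-in): word counts per expert-assigned label.

    A label of None means the expert skipped that cluster.
    """
    model = {}
    for members, label in zip(clusters, cluster_labels):
        if label is None:
            continue
        counts = model.setdefault(label, Counter())
        for i in members:
            counts.update(tokenize(tweets[i]))
    return model


def classify(model, tweet):
    """Pick the label whose training vocabulary best overlaps the tweet."""
    tokens = tokenize(tweet)
    scores = {}
    for label, counts in model.items():
        total = sum(counts.values()) or 1
        scores[label] = sum(counts[t] for t in tokens) / total
    return max(scores, key=scores.get)


# Hypothetical collection gathered with the ambiguous keyword "flood".
tweets = [
    "Flood warning issued for the river valley tonight",
    "Flood warning issued for the river valley area",
    "A flood of emails hit my inbox this morning",
    "My inbox got a flood of emails this morning",
]
clusters = cluster(tweets)

# The expert inspects each cluster and annotates it (hypothetical labels).
expert_labels = ["relevant", "irrelevant"]
model = train(tweets, clusters, expert_labels)

# New or remaining tweets are then classified automatically.
new_label = classify(model, "River flood warning for the coast")
```

The point of the sketch is the division of labor: the expert annotates a handful of clusters rather than individual tweets, and those annotations become training data for everything else in the collection.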
The strength of this approach is its ability to scale to a large collection without sacrificing precision or recall. It does so by exploiting the intrinsic characteristics of the features that can be extracted from tweets, of the key terms used, and of the temporal distribution of content on social media. Finally, sharing responsibility for completeness and precision with the users of the tool ensures that they achieve and preserve the target performance they require.
Ali and Nelleke demonstrate how to work with Relevancer on four use cases, using tweet collections gathered with the keywords “flood,” “earthquake,” and “genocide.”
Ali Hürriyetoglu is a data scientist at Statistics Netherlands and a PhD candidate at Radboud University, where his research focuses on information extraction and social media analysis. He has a background in computer science. He has worked in the language technologies area for a number of organizations, including EU JRC and Appen, and recently completed a five-month traineeship at Netbase Solutions Inc. in Santa Clara, CA, where he focused on Turkish morphological analysis and sentiment analysis. Born to a family with Arabic origins in Turkey, Ali is fluent in Arabic, Turkish, English, German, Italian, and Dutch. He holds an undergraduate degree in computer engineering from Ege University in Izmir, Turkey, and a master’s degree in cognitive science from Middle East Technical University in Ankara, Turkey. During his undergraduate studies, he was an exchange student at Technische Hochschule Mittelhessen in Giessen, Germany.
Nelleke Oostdijk is an associate professor at Radboud University in Nijmegen, the Netherlands. A computational linguist with a keen interest in language use and variation, Nelleke has been involved in various projects directed at extracting information from social media data. More specifically, she has been exploring ways in which linguistic knowledge can be brought into play to improve on purely machine-learning approaches. In collaborations with different societal partners, she has helped demonstrate the strength of a hybrid approach when applied to a range of topics and domains, including detecting threatening tweets, identifying forum posts suggesting that specific food supplements might be contaminated, and detecting topics and events in tweets about natural disasters (earthquakes, floods, etc.) and emergencies.
©2017, O’Reilly UK Ltd