Good recommendations turn occasional users into daily users. When users don’t have to search for good content, they engage more. Collaborative filtering is one of the most successful tools for recommending the right content to the right audience, yet it requires plenty of interaction signal to work well. New users with little history get completely unpersonalized recommendations, and fresh content gets recommended to no one (or worse, to the wrong audience). This is known as the cold-start problem.
So what’s the solution? If a brand-new user watches a freshly uploaded biking video, you would probably want to recommend more biking content, and possibly suggest related topics such as fitness. In short, instead of direct content-to-content recommendations, you would go for content-to-topic-to-content.
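The content-to-topic-to-content idea can be made concrete with a tiny sketch: tag each item with topics, then recommend other items that share topics with what the user just watched. The catalog, item IDs, and topic labels below are all illustrative, not from the talk.

```python
# Toy catalog mapping content IDs to topic sets (all names are hypothetical).
content_topics = {
    "vid_biking_trails": {"biking", "outdoors"},
    "vid_road_bike_review": {"biking", "gear"},
    "vid_hiit_workout": {"fitness"},
    "vid_cat_compilation": {"cats"},
}

def recommend_by_topic(watched_id, catalog):
    """Content-to-topic-to-content: rank other items by topic overlap."""
    watched_topics = catalog[watched_id]
    scores = {}
    for cid, topics in catalog.items():
        if cid == watched_id:
            continue
        overlap = len(watched_topics & topics)
        if overlap:
            scores[cid] = overlap
    # Highest overlap first.
    return sorted(scores, key=scores.get, reverse=True)

print(recommend_by_topic("vid_biking_trails", content_topics))
```

A production system would of course score far more subtly (topic weights, recency, user affinity), but the indirection through topics is what lets a brand-new video reach the right audience immediately.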
The first step is to build a system that automatically figures out what your content is about. One option would be to just parse the text (i.e., the video title and description) and extract unusually frequent tuples of words (using tf–idf). However, plain text can be ambiguous (does “football” refer to American football or to soccer?), and it can be redundant (“sky diving” is the same as “parachuting”) and messy (“Mikael Jackson” is clearly “Michael Jackson” misspelled).
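To make the tf–idf option concrete, here is a minimal pure-Python sketch (in practice you would likely reach for a library such as scikit-learn's `TfidfVectorizer`). It scores terms that are frequent in one title but rare across the corpus; the toy corpus is invented for illustration.

```python
import math
from collections import Counter

# Toy corpus of video titles (illustrative).
docs = [
    "mountain biking trail ride",
    "road biking gear review",
    "cat video compilation",
]

def tfidf(doc_tokens, corpus_tokens):
    """Score terms that are frequent in this doc but rare in the corpus."""
    n_docs = len(corpus_tokens)
    tf = Counter(doc_tokens)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus_tokens if term in d)  # document frequency
        idf = math.log(n_docs / df) + 1.0                # smoothed idf
        scores[term] = (count / len(doc_tokens)) * idf
    return scores

corpus = [d.split() for d in docs]
scores = tfidf(corpus[0], corpus)
# "mountain" outranks "biking" because "biking" also appears in another doc.
top = max(scores, key=scores.get)
```

Note that this sketch still exhibits exactly the weaknesses described above: it would happily treat “sky diving” and “parachuting” as unrelated terms.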
Aurélien Géron shares a much better option: leveraging the power of knowledge graphs such as Wikidata, DBpedia, and Google’s Knowledge Graph (initially based on Freebase, which has since been sunsetted in favor of Wikidata). Each node in the graph represents a unique, unambiguous topic, and these topics are connected into a gigantic machine-queryable graph. This structure can be exploited to provide meaningful, consistent, browsable, and personalized content (e.g., list the most famous professional soccer players born in the user’s city). Many signals can be used to identify an item’s topics, from text (title, description, comments, anchors, search queries, etc.) to user behavior (e.g., topics explored during the same session) to audiovisual content analysis (using deep learning), and beyond.
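The soccer-players-born-in-the-user's-city example maps naturally onto a SPARQL query against Wikidata. The sketch below just builds the query string; to the best of my knowledge the Wikidata identifiers used (P106 = occupation, P19 = place of birth, Q937857 = association football player, Q90 = Paris) are the real ones, but verify them on wikidata.org before relying on them.

```python
# Substitute the user's city; wd:Q90 is Paris.
CITY = "wd:Q90"

query = f"""
SELECT ?player ?playerLabel WHERE {{
  ?player wdt:P106 wd:Q937857 ;   # occupation: association football player
          wdt:P19  {CITY} .       # place of birth: the chosen city
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT 10
"""

# To execute it, send the query to the public endpoint at
# https://query.wikidata.org/sparql (e.g., with the requests library),
# asking for JSON results.
```

This is the kind of “meaningful, consistent, browsable” lookup that plain-text tags simply cannot support: the query is unambiguous because every concept is a graph node, not a string.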
No tool is perfect, and knowledge graphs are no exception. In particular, although they are great at making good recommendations for new users and serving fresh content to the right audience, they are not ideal for new topics since it takes time for a new topic to be added to a knowledge graph. One solution is to use a mixed vocabulary including both KG topics and plain text.
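One way to picture a mixed vocabulary: resolve a phrase to a knowledge graph topic when a mapping exists, and fall back to the raw text otherwise. The entity table and topic ID below are toy stand-ins (a real system would use an entity linker and actual KG identifiers).

```python
# Hypothetical phrase-to-topic table; both synonyms collapse to one node.
known_entities = {
    "sky diving": "kg:topic_42",
    "parachuting": "kg:topic_42",
}

def to_vocab_token(phrase):
    """Return a KG topic ID if known, else a plain-text fallback token."""
    return known_entities.get(phrase, f"text:{phrase}")

tokens = [to_vocab_token(p) for p in ["parachuting", "fidget spinner"]]
# "parachuting" resolves to a KG node; a brand-new topic like
# "fidget spinner" survives as plain text until the graph catches up.
```

The plain-text tokens keep fresh topics recommendable immediately, and they can be promoted to proper graph nodes once the knowledge graph adds them.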
Beyond recommendations, better understanding what your content is actually about can help you drive your content strategy (e.g., do users engage with cooking videos?), enable context-aware ad targeting (e.g., display a makeup ad on beauty tips content), improve search (e.g., people searching for “Paris” could get only results about the city, plus a disambiguation box asking them whether they meant “Paris Hilton”, or the band “Paris”), structure the user experience (e.g., if the content is about a movie, show the main actor bios), help detect spam (e.g., why is the content unrelated to its title?), and much more.
Aurélien Géron is a machine learning consultant at Kiwisoft and author of the best-selling O’Reilly book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. Previously, he led YouTube’s video classification team, was a founder and CTO of Wifirst, and was a consultant in a variety of domains: finance (JPMorgan and Société Générale), defense (Canada’s DOD), and healthcare (blood transfusion). He has also published several technical books (on C++, WiFi, and internet architectures), and he is a lecturer at Dauphine University in Paris. He lives in Singapore with his wife and three children.
©2017, O’Reilly UK Ltd