Search Query Categorization at Scale

Alex Dorman (Magnetic), Michal Laclavik (Magnetic)
Data Science
Location: 113

Classification of short text into a predefined hierarchy of categories is a challenge. The need to categorize short texts arises in multiple domains: keywords and queries in online advertising, improvement of search engine results, analysis of tweets or messages in social networks, etc. We leverage community-moderated, freely available data sets (Wikipedia, DBpedia, Freebase) and open-source tools (Hadoop, Solr) to build a flexible and extensible classification model.

Magnetic is an online advertising company specializing in search retargeting and applying data science to online search behavior. We create custom real-time audience segments based on what users have searched for across the web. Targeting individual keywords found in a user's search history is a great way to build an audience, but manually selecting keywords can present an operational challenge. The ability to classify queries and keywords helps create larger audiences with less effort and better accuracy. Other use cases for keyword classification in online advertising include reporting on the size of inventory available by category and campaign performance optimization.

We will share our experiences building a real-world data science system that scales to production data volumes of more than 20 million keyword classifications per hour. We will also touch on some aspects of knowledge discovery, such as language detection, n-gram extraction, and entity recognition.

Alex Dorman


Alex Dorman, CTO at Magnetic, has more than twenty years of technology experience and fifteen years of experience managing engineering and data science teams. Alex has been using Hadoop technologies for the last 7+ years. Magnetic is an online advertising company and a leader in search retargeting. Before joining Magnetic, Alex built Big Data platforms and teams at Proclivity Media and ContextWeb/PulsePoint. Alex began his career at Intel Software Labs in Israel.

Michal Laclavik


Michal Laclavik, Sr. Data Scientist at Magnetic, has more than ten years of R&D experience in Semantic Technologies, Information Retrieval, and Big Data technologies. Michal has been using Hadoop in his research since 2008. Before joining Magnetic, he pursued his PhD at the Slovak Academy of Sciences and worked as a researcher on several EU-funded projects. Michal has multiple publications. He also lectures on Information Retrieval at the Slovak University of Technology.