Classifying short text into a predefined hierarchy of categories is challenging. The need to categorize short texts arises in multiple domains: keywords and queries in online advertising, improvement of search engine results, analysis of tweets or messages in social networks, etc. We leverage community-moderated, freely available data sets (Wikipedia, DBpedia, Freebase) and open-source tools (Hadoop, Solr) to build a flexible and extensible classification model.

Magnetic is an online advertising company specializing in search retargeting and applying data science to online search behavior. We create custom real-time audience segments based on what users have searched for across the web. Targeting individual keywords found in user search history is a great way to build an audience, but the need to select keywords manually can present an operational challenge. The ability to classify queries and keywords helps create larger audiences with less effort and better accuracy. Other use cases for keyword classification in online advertising include reporting on the size of inventory available by category and campaign performance optimization.

We will share our experiences building a real-world data science system that scales to production data volumes of more than 20 million keyword classifications per hour. We will also touch on some aspects of knowledge discovery, such as language detection, n-gram extraction, and entity recognition.
Alex Dorman, CTO at Magnetic, has more than twenty years of technology experience and fifteen years of experience managing engineering and data science teams. Alex has been using Hadoop technologies for more than seven years. Magnetic is an online advertising company and a leader in search retargeting. Before joining Magnetic, Alex built Big Data platforms and teams at Proclivity Media and ContextWeb/PulsePoint. Alex began his career at Intel Software Labs in Israel.
Michal Laclavik, Sr. Data Scientist at Magnetic, has more than ten years of R&D experience in Semantic Technologies, Information Retrieval, and Big Data technologies. Michal has been using Hadoop in his research since 2008. Before joining Magnetic, he pursued his PhD at the Slovak Academy of Sciences and worked as a researcher on several EU-funded projects. Michal has multiple publications. He also lectures on Information Retrieval at the Slovak University of Technology.