Text classification automates the task of filing documents into pre-defined categories based on a set of example documents. The first step in automating classification is to transform the documents into feature vectors. Though this step is highly domain-specific, Apache Mahout provides a lot of easy-to-use tooling to help you get started, most of which relies heavily on Apache Lucene for analysis, tokenisation and filtering.
This talk shows how to use faceting to quickly get an understanding of the fields in your documents. It walks you through the steps necessary to convert your text documents into feature vectors that Mahout classifiers can use, and includes a few anecdotes on drafting domain-specific features.
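As a rough illustration of the tokenise, filter and vectorise steps mentioned above, here is a minimal sketch in plain Python. It does not use Mahout or Lucene and is not taken from the talk: the stop-word list, the function names (`tokenize`, `filter_tokens`, `to_feature_vector`) and the vector size are all assumptions made for this example, and feature hashing is just one of several possible encodings.

```python
import re
import zlib
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def filter_tokens(tokens):
    """Drop stop words and single-character tokens."""
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]


def to_feature_vector(text, num_features=16):
    """Map a document to a fixed-length term-count vector via feature hashing."""
    vector = [0.0] * num_features
    counts = Counter(filter_tokens(tokenize(text)))
    for token, count in counts.items():
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += count
    return vector


doc = "The first step is to transform the documents to feature vectors."
vec = to_feature_vector(doc)
print(vec)
```

With a small `num_features`, distinct tokens can collide in the same slot; a real setup would pick a much larger dimension or use a dictionary-based encoding instead.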
Isabel Drost-Fromm is a member of the Apache Software Foundation. She is the founder of the Apache Hadoop Get Together in Berlin and was co-organiser of the first European NoSQL meetup as well as the Berlin Buzzwords conference. She co-founded Apache Mahout and is an active Apache Mahout committer. Isabel is actively engaged with the communities of various Apache projects, e.g. Apache Lucene and Apache Hadoop. She is a regular speaker at renowned conferences on topics related to free software development, scalability, Apache Lucene, Apache Hadoop and Apache Mahout. Currently Isabel Drost-Fromm works for Nokia Gate 5 GmbH as a software developer.