Making text classification trivial - combining Apache Lucene and Mahout

Tools & Technology
Location: King's Suite - Sandringham
Level: Intermediate
Average rating: 1.88 (8 ratings)
Slides:   external link

Text classification automates the task of filing documents into pre-defined categories based on a set of example documents. The first step in automating classification is to transform the documents into feature vectors. Though this step is highly domain-specific, Apache Mahout provides a lot of easy-to-use tooling to help you get started, most of which relies heavily on Apache Lucene for analysis, tokenisation and filtering.
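As a rough illustration of what "transforming documents into feature vectors" means, the sketch below lowercases and tokenises a text, drops stop words, and hashes the remaining tokens into a fixed-size term-frequency vector. It is plain Python and uses neither the Lucene nor the Mahout API; the stop-word list, the function names `analyse` and `to_feature_vector`, and the vector size are all assumptions made for the example.

```python
import re
import zlib

# Hypothetical, deliberately tiny stop-word list for the example.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def analyse(text):
    """Lowercase the text, split it into ASCII letter runs, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def to_feature_vector(text, num_features=16):
    """Hash each token into a fixed-size term-frequency vector."""
    vector = [0] * num_features
    for token in analyse(text):
        # crc32 is stable across runs, unlike Python's built-in hash().
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


if __name__ == "__main__":
    print(analyse("The price of the stock is rising"))
    print(to_feature_vector("The price of the stock is rising"))
```

Hashing keeps the vector length fixed regardless of vocabulary size, at the cost of occasional collisions between unrelated tokens. Domain-specific features would be added alongside these token counts.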

This talk shows how to use faceting to quickly get an understanding of the fields in your documents. It walks you through the steps necessary to convert your text documents into feature vectors that Mahout classifiers can use, and includes a few anecdotes on drafting domain-specific features.
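Faceting, in this context, amounts to counting how often each value of a field occurs across the document set, which quickly shows which fields are populated and how their values are distributed. A minimal sketch in plain Python follows; it does not use Lucene's faceting support, and the `facet_counts` helper and the toy documents are hypothetical.

```python
from collections import Counter


def facet_counts(documents, field):
    """Count how often each value of `field` occurs; skip documents lacking it."""
    return Counter(doc[field] for doc in documents if field in doc)


# Hypothetical toy documents, each represented as a dict of field values.
docs = [
    {"category": "sports", "lang": "en"},
    {"category": "politics", "lang": "en"},
    {"category": "sports", "lang": "de"},
    {"lang": "en"},
]

if __name__ == "__main__":
    print(facet_counts(docs, "category"))
    print(facet_counts(docs, "lang"))
```

Comparing the total of a field's counts with the number of documents also reveals how often that field is missing, which matters when deciding whether it is usable as a feature.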

Isabel Drost

Apache Software Foundation / Nokia Gate 5 GmbH

Isabel Drost-Fromm is a member of the Apache Software Foundation. She founded the Apache Hadoop Get Together in Berlin and co-organised the first European NoSQL meetup as well as the Berlin Buzzwords conference. She co-founded Apache Mahout and is an active Apache Mahout committer. Isabel is actively engaged with the communities of various Apache projects, e.g. Apache Lucene and Apache Hadoop. She is a regular speaker at renowned conferences on topics related to free software development, scalability, Apache Lucene, Apache Hadoop and Apache Mahout. Isabel Drost-Fromm currently works for Nokia Gate 5 GmbH as a software developer.

Comments

Shirley Bailes
13/11/2013 9:36 GMT

@louis v: We have replaced the PDF with the URL the speaker originally provided. This is working now.

louis v
13/11/2013 9:16 GMT

The PDF file seems to be corrupted.

Shirley Bailes
19/09/2013 16:28 BST

Thorsten: I cannot speak to Isabel’s Xing profile, but her blog opens fine for me: http://blog.isabel-drost.de/.

Thorsten Luedtke
19/09/2013 11:36 BST

http://blog.isabel-drost.de/ doesn’t work. The Xing profile is also dark…
