Large scale web mining

Ken Krugler (Scale Unlimited)
Data Science, Ballroom E
Please note: to attend, your registration must include Tutorials.

Attendees: Please read the instructions & prerequisites before arriving at the tutorial.

This tutorial will teach attendees about the key aspects of scalable web mining, via six modules:

1. Introduction

  • Why web data is valuable
  • Key challenges to web crawling
  • Realistic definitions for success

2. Focused Web Crawling

  • Reducing time & cost by focusing the crawl
  • Approaches to classifying and scoring pages
  • Solutions for scalable web crawling

3. Structured Data Extraction

  • Data mining essentials
  • Structured text extraction
  • Automated vs. manual extraction

4. Analyzing the Data

  • Making it searchable
  • Finding “interesting” text
  • Machine learning with Mahout

5. Barriers to Success

  • Polite crawling versus deep crawling
  • Spam, splogs, honeypots and nasty webmasters
  • Ajax, robots.txt and Facebook

6. Examples and Summary

  • Hotel reviews
  • Music pages
  • SEO analysis

Ken Krugler

Scale Unlimited

Veteran developer and entrepreneur with 25+ years of experience. Founder and President of TransPac Software, a 20-year leader in internationalization, mobile devices, and search consulting. Founder and CTO of Krugle, a vertical search engine and enterprise appliance for code and technical information. Co-founder of the Bixo web mining project. Committer for the Apache Tika project. Author and speaker on vertical search and web mining.

Comments on this page are now closed.


Ken Krugler
02/26/2012 10:39pm PST

Hi Jim – Mac is actually the preferred platform. Windows users need to install Cygwin to get a Linux-like environment for running Hadoop, but Mac already has that – you can just use the regular Terminal window.

Jim Grayson
02/26/2012 4:29pm PST

The lab instructions appear to assume a Windows user. Are the required tools already installed on a Mac? Is there a preference between a Windows and a Mac setup?


