Search and Real Time Analytics on Big Data

Beyond Hadoop Ballroom AB
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
Average rating: 2.31 (13 ratings)

MESSAGE TO ATTENDEES:

*Hi all,

You can find the nearly complete exercises and slides here:

https://s3.amazonaws.com/thinkbig-academy/Strata2013/RealTimeSearchAndAnalytics-master.zip

Please note that there are still a couple of exercises and a few slides in the deck that I am putting the final touches on, but this will give you a good idea of what we will be presenting. Please make sure to re-download this zip file the day prior to the tutorial.

Some important notes:

A few of the exercises include installing and working on a local installation of Solr (on your own computer). We will guide you through the installation process. I would suggest using a Linux distribution such as Ubuntu. I will be using Mac OS X. Windows users can set up a VM or use Cygwin.

Another exercise will use an Amazon EC2 cluster; I will provide the connection details on the day of the tutorial.

Further details are in the READMEs of the exercises themselves.

I look forward to seeing you all!

Ryan*

More and more clients are interested in understanding how they can make use of big data. One typical use case is running advanced ad hoc queries over massive data sets and receiving results in real time. An increasing number of products on the market address all kinds of data types and requirements, including DataStax Enterprise and Apache Solr/Lucene. We will break down and describe the distributed search landscape and show you how to use these interesting technologies in a hands-on tutorial.

Big Data → Distributed Search – applications in industry (10 minutes)

  • Search applications for structured data
  • Search applications for unstructured data
  • Geo-indexed search
  • Why distributed search? What happens as index size grows with data?

Example Use Case (40 minutes)

Use Case: Log Data
Requirements

  • Petabytes of semi-structured log data
  • Billions of parsed Solr documents
  • Apache Solr
  • Schema
  • Backups
  • Disaster Recovery
  • Duplicates
  • Joins

Technology Landscape (10-15 minutes)

Solr/Lucene out of the box

Integrated Solr Solutions

  • DataStax Enterprise (DSE) 2.0
  • Lily
  • Lucid Imagination
  • Kitenga
  • Katta

Non-Solr Solutions

  • Riak
  • MongoDB
  • Amazon CloudSearch
  • Google BigQuery

Ryan Tabora

Think Big Analytics

Ryan is a data developer at Think Big Analytics. He leads technical consulting projects for big data implementations at Fortune 500 clients. He has in-depth experience working with Solr/Lucene and the Hadoop stack.

Jason Rutherglen

DataStax

Jason works at DataStax as a senior Big Data engineer, architecting, developing, and supporting the DataStax Enterprise product line, which includes Solr integrated with Cassandra. His career has involved an array of technologies including search, Hadoop, Hive, mobile phones, cryptography, and natural language processing. Jason has been developing solutions with Lucene and Solr for more than 7 years and is a co-author of Programming Hive from O’Reilly. Jason frequently gives tutorials and speaks at conferences such as Strata, Cassandra Summit, ApacheCon, and others.

Comments

Ryan Tabora
02/26/2013 12:15pm PST

Hi all,

Thank you for attending.

I’ve put the slides on SlideShare here: http://www.slideshare.net/ratabora/real-time-search-and-analytics-on-big-data

I’m updating the zip as we speak.

Also, your review/rating (with comments) would be greatly appreciated. We want to make this a great presentation for you, and we can only do that with constructive feedback. On a side note, a few have commented on the room space being inadequate for a tutorial (no desks to put laptops on, too many people). Please do not include that bit in the review, as that is an issue of Strata’s and not of the presenters. We hope we provided something valuable to you.

Please contact me if you would like additional help with the exercises or just search in general.

Thanks, Ryan ryan.tabora@thinkbiganalytics.com

Ryan Tabora
02/24/2013 4:44am PST

A few people have asked whether or not Eclipse/M2E are required. They are not; they are just the tools I use to browse through the code. You can use whatever text editor you like, but to build/run the code you will need Maven 3.x.

Thanks, Ryan

Ryan Tabora
02/22/2013 8:42am PST

Hi all,

Very excited to talk to you all on Tuesday. My apologies for the many messages, but if you haven’t read any of my other messages, please read this one!

Prior to the talk:

1) Set up an Ubuntu VM if you are running Windows (I recommend VirtualBox). Linux-based OS or Mac OS should be fine.
2) Run exercises 1 and 3 prior to the class to verify everything works. The instructions are very detailed.
3) Message me if you have any problems! (ryan.tabora@thinkbiganalytics.com)

We’re still making minor tweaks to some of the slides, so please make sure to download on Monday to get the latest package.

Download everything here: https://s3.amazonaws.com/thinkbig-academy/Strata2013/RealTimeSearchAndAnalytics-master.zip

Thank you and see you on Tuesday!

Regards, Ryan

Ryan Tabora
02/21/2013 8:24am PST

Hi all,

You can find the nearly complete exercises and slides here:

https://s3.amazonaws.com/thinkbig-academy/Strata2013/RealTimeSearchAndAnalytics-master.zip

Please note that there are still a couple of exercises and a few slides in the deck that I am putting the final touches on, but this will give you a good idea of what we will be presenting. Please make sure to re-download this zip file the day prior to the tutorial.

Some important notes:

A few of the exercises include installing and working on a local installation of Solr (on your own computer). We will guide you through the installation process. I would suggest using a Linux distribution such as Ubuntu. I will be using Mac OS X. Windows users can set up a VM or use Cygwin.

Also, ideally you will have Eclipse, the M2E plugin, and Maven 3.0 installed prior to the tutorial! These instructions are in the top-level README file of the download.

Another exercise will use an Amazon EC2 cluster; I will provide the connection details on the day of the tutorial.

Further details are in the READMEs of the exercises themselves.

I look forward to seeing you all!

Ryan

Ryan Tabora
01/17/2013 8:10am PST

Hi all!

I am very excited to present some new material this year. We took all of the comments from our previous talk and hopefully this year it will be better than ever.

Just to give you a heads up, we will be hosting the presentation material (instructions, sample data, source code, presentation) at github.com/ratabora/RealTim.... At this time the repository is private, as we are still working on it. Once it is complete (I expect that to be by the end of January), I’ll open it to the public and you can download the files. Please make sure to download these files prior to the talk.

Regards, Ryan
