Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Using Apache Solr on Hadoop

Glynn Durham (Cloudera)
9:00am - 5:00pm Monday, March 28 & Tuesday, March 29

Location: 211 A

Participants should plan to attend both days of this 2-day training course. Training passes do not include access to tutorials on Tuesday.

Average rating: *****
(5.00, 3 ratings)

Prerequisite knowledge

This course is intended for developers and data engineers with at least basic familiarity with Hadoop and experience programming in a general-purpose language such as Java, C, C++, Perl, or Python. Participants should be comfortable with the Linux command line and should be able to perform basic tasks such as creating and removing directories, viewing and changing file permissions, executing scripts, and examining file output. No prior experience with Apache Solr or Cloudera Search is required, nor is any experience with HBase or SQL.


Cloudera Search combines the open standard for search, Apache Solr, with the proven scalability of Apache Hadoop. Using hands-on exercises, Glynn Durham guides participants through a 2-day training in using these tools to ingest, transform, index, and query data at scale and also demonstrates how to build interactive dashboards for analytics and integrate the search engine with external applications. No previous experience with Apache Solr required.

Course Objectives
Cloudera Search is an open source bundle of software that makes it possible to run the industry-standard search platform, Apache Solr, at a massive scale by leveraging Apache Hadoop. After successfully completing this two-day hands-on training, you will be able to:

  • Understand how Solr and Hadoop can work together to provide a scalable system for searching data stored in your cluster
  • Write Solr queries that quickly find the data you need
  • Design schemas that are appropriate for your use cases
  • Perform batch indexing of data stored in HDFS and HBase
  • Perform indexing of streaming data in near-real-time with Flume
  • Use Morphlines to transform data during the indexing process
  • Configure key features in Solr that can improve the usability and performance of your applications
  • Create an interactive dashboard for your data collection using Hue
  • Connect external applications to your search engine

Course Outline

Overview of Cloudera Search

  • What is Cloudera Search?
  • Helpful features
  • Use cases
  • Basic architecture

Performing basic queries

  • Executing a query in the admin UI
  • Basic syntax
  • Techniques for approximate matching
  • Controlling output

Writing more powerful queries

  • Relevancy and filters
  • Query parsers
  • Functions
  • Geospatial search
  • Faceting

Preparing to index documents

  • Overview of the indexing process
  • Generating configuration files
  • Schema design
  • Collection management
  • Using Morphlines to extract, transform, and load data into Solr

Batch indexing HDFS data with MapReduce

  • Overview of the HDFS batch indexing process
  • Using the MapReduce indexing tool
  • Testing and troubleshooting

Near-real-time indexing with Flume

  • Overview of the near-real-time indexing process
  • Introduction to Apache Flume
  • How to perform near-real-time indexing with Flume
  • Testing and troubleshooting

Indexing data in other languages and formats

  • Field types and analyzer chains
  • Word stemming, character mapping, and language support
  • Schema and analysis support in the admin UI
  • Metadata and content extraction with Apache Tika
  • Indexing binary file types with Solr Cell

Improving search quality and performance

  • Delivering relevant results
  • Helping users find information
  • Query performance and troubleshooting

Building user interfaces for search

  • Search UI overview
  • Building a user interface with Hue
  • Integrating search into custom applications

Glynn Durham


Glynn Durham is a senior instructor at Cloudera. Previously, he worked for Oracle, Forté Software, MySQL, and Cloudera, spending five or more years at each.

Comments on this page are now closed.


Glynn Durham
03/25/2016 6:54am PDT

You will download 2 PDF files. Hands-on exercises will be hosted on the web service Skytap. Or, if you’d like to host the exercise VM on your own machine, you will need to install VMware Fusion for Mac (30-day free trial) or VMware Player for Windows (free). I will supply the VM file to you on a USB thumb drive, so you’ll just copy that large file from the USB drive if you choose to go that way. Feel free to email me directly with further questions:

Yuri Chemolosov
03/25/2016 3:14am PDT

Do we need to pre-download any big chunks of code or data, as was advised in the email:


> Please check the training courses page on the website for downloads you need for the training. There WILL NOT be bandwidth for everybody to download during the training, so please prepare in advance.


Baoan Wang
03/24/2016 3:52pm PDT

Is there any software we need to load before the training? Thanks.

Sophia DeMartini
03/07/2016 2:54am PST

Hi Ozhan,

Yes, please bring a laptop with you to the training.


ozhan gulen
03/06/2016 11:37pm PST

Hello from Turkey,

Will we need to bring a laptop with us to attend this training?

Best regards,