Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Elasticsearch and Apache Lucene for Apache Spark and MLlib

Costin Leau (Elastic)
11:00am–11:40am Thursday, 03/31/2016
Data Innovations

Location: 210 D/H
Tags: real-time
Average rating: 4.15 (13 ratings)

Prerequisite knowledge

Attendees should have a basic understanding of machine-learning algorithms and Scala and familiarity with Elasticsearch, Spark, and Spark MLlib.


Spark’s MLlib makes it a snap to apply machine-learning algorithms to huge datasets. However, input data, and unstructured text in particular, always requires some preprocessing before it can be fed to your ML algorithms.

But how do you prepare the unstructured text you want to process? And what if it is not just in English, but also in Mandarin, Thai, or Arabic? Elasticsearch’s rich analysis capabilities, all powered by Lucene, make it well suited to processing and tokenizing data for machine-learning tasks in real time, whatever the language, as well as to searching through that data.

So how do we marry Spark with Elasticsearch? Costin Leau gives an overview of Elastic’s current efforts to enhance Elasticsearch’s existing integration with Spark, going beyond Spark core and Spark SQL by focusing on text processing and machine learning. You’ll leave with a thorough understanding of how Elasticsearch, Spark, and Spark’s MLlib can make it much easier to search through and analyze data, no matter the text-based input.


Costin Leau


Costin Leau is an engineer at Elasticsearch, where he leads big data efforts. An open source veteran, Costin led various Spring projects (Spring OSGi, GemFire, Redis, Hadoop) and authored an OSGi spec. He has spoken about Java, big data, and Elasticsearch-related topics at a number of conferences, including Strata, Spark, Hadoop Summit, JavaOne, Devoxx/Javapolis, JavaZone, and SpringOne.