Spark’s MLlib makes it a snap to apply machine-learning algorithms to huge datasets. However, input data usually needs some preprocessing before it can be fed to your ML algorithms, especially when it is unstructured text.
But how do you prepare the unstructured text you want to process? And what if it is not just in English, but also in Mandarin, Thai, or Arabic? Elasticsearch’s rich analysis capabilities, all powered by Lucene, make it well suited for processing and tokenizing data for machine-learning tasks in real time, whatever the language, and for searching through that data as well.
So how do we marry Spark with Elasticsearch? Costin Leau gives an overview of Elastic’s current efforts to enhance Elasticsearch’s existing integration with Spark, going beyond Spark core and Spark SQL by focusing on text processing and machine learning. You’ll leave with a thorough understanding of how Elasticsearch, Spark, and Spark’s MLlib can make it much easier to search through and analyze data, whatever form the text-based input takes.
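To make the preprocessing step concrete, here is a minimal, self-contained sketch of what "analyzing" text into ML-ready features can look like. It is not the integration covered in the session: the `analyze` function is a hypothetical, naive stand-in for an Elasticsearch analyzer, and `hashing_tf` is a simplified imitation of the hashed term-frequency idea, not MLlib's implementation. The stop-word list, function names, and feature count are all assumptions chosen for illustration.

```python
import re
import zlib

# Hypothetical, tiny English stop-word list (an assumption for this sketch).
STOP_WORDS = frozenset({"a", "an", "the", "and", "or", "of", "to", "in", "is", "it"})


def analyze(text):
    """Naive stand-in for a text analyzer: lowercase, split on
    non-word characters, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def hashing_tf(tokens, num_features=16):
    """Map tokens to a fixed-length term-frequency vector by hashing
    each token to a bucket index (the 'hashing trick')."""
    vector = [0] * num_features
    for token in tokens:
        # crc32 is used because it is stable across runs, unlike hash().
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


if __name__ == "__main__":
    tokens = analyze("The quick brown fox")
    print(tokens)
    print(hashing_tf(tokens))
```

Note that this whitespace-and-punctuation tokenizer does not segment languages written without spaces between words, such as Mandarin or Thai; a run of such text comes back as a single token. That gap is the kind of problem language-aware analysis is meant to address.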
Costin Leau is an engineer at Elasticsearch, where he leads big data efforts. An open source veteran, Costin led various Spring projects (Spring OSGi, GemFire, Redis, Hadoop) and authored an OSGi spec. He has spoken about Java, big data, and Elasticsearch-related topics at a number of conferences, including Strata, Spark, Hadoop Summit, JavaOne, Devoxx/Javapolis, JavaZone, and SpringOne.