Presented By O'Reilly and Cloudera
December 5–6, 2016: Training
December 6–8, 2016: Tutorials & Conference
Singapore

Building and tuning machine-learning apps using Spark ML and GraphX Libraries

Vartika Singh (Cloudera), Jayant Shekhar (Sparkflows Inc.)
9:00am–12:30pm Tuesday, December 6, 2016
Spark & beyond
Location: 323 Level: Intermediate
Average rating: 3.00 (7 ratings)

Prerequisite Knowledge

  • Some programming experience in either Scala or Python
  • A basic understanding of machine learning (useful but not required)

Materials or downloads needed in advance

We are excited to have you at the tutorial on ‘Building and tuning machine-learning apps using Spark ML and GraphX Libraries’.

During the session, you will follow demos of Spark ML and GraphX based on the provided source code.

If you would also like to run the source code yourself, the install instructions and code are at github.com/WhiteFangBuck/strata-2016-singapore

Because time is limited, it will be hard to set things up during the session, so please complete the installation in advance.

What you'll learn

  • Understand the basics of machine-learning algorithms
  • Be able to draw a relation to the corresponding API implementation in the Spark ML/GraphX libraries in order to start designing and building basic apps

Description

Vartika Singh and Jayant Shekhar offer a hands-on tutorial that exposes you to techniques for building and tuning machine-learning apps using Spark ML libraries, including building pipelines, tuning parameters, and graph processing with GraphX.

Vartika and Jayant cover several classes of ML algorithms, including regression, classification, clustering, graph, and deep learning algorithms, in the Spark MLlib, ML, and GraphX libraries. They discuss the use cases and nuances of feature extraction, parameter tuning, statistical analysis for optimization, and dimensionality reduction as they apply to these algorithms. Along the way, you’ll do some hands-on coding, solving various problems using these algorithms and techniques.

Vartika Singh

Cloudera

Vartika Singh is a solutions architect at Cloudera with over 12 years of experience applying machine learning techniques to big data problems.

Jayant Shekhar

Sparkflows Inc.

Jayant Shekhar is the founder of Sparkflows Inc., which enables machine learning on large datasets using Spark ML and intelligent workflows. Jayant focuses on Spark, streaming, and machine learning and is a contributor to Spark. Previously, Jayant was a principal solutions architect at Cloudera working with companies both large and small in various verticals on big data use cases, architecture, algorithms, and deployments. Prior to Cloudera, Jayant worked at Yahoo, where he was instrumental in building out the large-scale content/listings platform using Hadoop and big data technologies. Jayant also worked at eBay, building out a new shopping platform, K2, using Nutch and Hadoop among others, as well as KLA-Tencor, building software for reticle inspection stations and defect analysis systems. Jayant holds a bachelor’s degree in computer science from IIT Kharagpur and a master’s degree in computer engineering from San Jose State University.

Comments on this page are now closed.

Comments

Jayant Shekhar
12/07/2016 3:18pm +08

Thanks Rajesh!

In Spark 2.0, MLlib still supports LabeledPoint and RDDs, but the RDD-based APIs have entered maintenance mode.

http://spark.apache.org/docs/latest/mllib-data-types.html

http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression

With the DataFrame/Dataset-based API, you do not need LabeledPoint. Instead, you provide the names of the label and features columns in the DataFrame.
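
As a rough illustration of the point about column names, here is a minimal Scala sketch of the DataFrame-based API, in which the label and features columns are identified by name rather than wrapped in LabeledPoint. The column names `target` and `inputs`, the object name `ColumnNamesSketch`, the toy data, and the local master setting are all assumptions made for this example and are not taken from the tutorial code. It assumes Spark 2.x with spark-mllib on the classpath.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of the tutorial repository.
object ColumnNamesSketch {

  // Fits a linear regression on a tiny DataFrame, naming the columns explicitly.
  def fit(spark: SparkSession): LinearRegressionModel = {
    // Toy data (an assumption for illustration): target = 2 * x + 1.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0)),
      (3.0, Vectors.dense(1.0)),
      (5.0, Vectors.dense(2.0)),
      (7.0, Vectors.dense(3.0))
    )).toDF("target", "inputs")

    // No LabeledPoint here: the estimator is told which columns to read.
    val lr = new LinearRegression()
      .setLabelCol("target")
      .setFeaturesCol("inputs")
      .setMaxIter(50)

    lr.fit(training)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]") // local mode, assumed for a quick trial run
      .appName("column-names-sketch")
      .getOrCreate()

    val model = fit(spark)
    println(s"coefficients = ${model.coefficients}, intercept = ${model.intercept}")

    spark.stop()
  }
}
```

In spark-shell, where a `spark` session already exists, pasting the object with `:paste` and calling `ColumnNamesSketch.fit(spark)` should be enough to try it.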

Rajesh Sampathkumar
12/06/2016 8:56pm +08

Thanks for your workshop today. For regression in the RDD-based Spark MLlib, we used to have the LabeledPoint abstraction, and RDDs could be constructed from it. Is there an equivalent in Spark 2.0 and up?

Jayant Shekhar
12/05/2016 11:20am +08

The downloads are large (about 200 MB), so it would be great if you could download them beforehand.

The code refers to Scala 2.11.8.

If you are using IntelliJ, install the Scala plugin.

If you are using Eclipse, use Scala IDE for Eclipse, available at http://scala-ide.org/download/sdk.html

Jayant Shekhar
12/05/2016 10:57am +08

Detailed download and install instructions are here:

https://github.com/WhiteFangBuck/strata-2016-singapore

We look forward to seeing you all tomorrow.

The tutorial can be run either in spark-shell or in the IntelliJ IDE. For IntelliJ, the Scala language plugin must be installed.

11/02/2016 12:50am +08

Please let us know which downloads are needed in advance (Eclipse, Scala version, etc.).