Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Building machine-learning apps with Spark: MLlib, ML Pipelines, and GraphX

Vartika Singh (Cloudera), Jayant Shekhar (Sparkflows Inc.)
9:00am–12:30pm Tuesday, 09/27/2016
Spark & beyond
Location: 1 E 07/1 E 08 Level: Intermediate
Average rating: ***..
(3.11, 19 ratings)

Prerequisite knowledge

  • A basic understanding of Spark
  • General experience with machine-learning algorithms
Materials or downloads needed in advance

  • A Unix- or Windows-based machine with Spark 2.0 downloaded

      Installation instructions:

      1. Download Spark

    • Download Spark 2.0
    • Direct download link

      2. Install Spark on Mac

    • tar -zxvf spark-2.0.0-bin-hadoop2.7.tgz
    • ln -sf spark-2.0.0-bin-hadoop2.7 spark
    • Add the Spark bin directory to the PATH environment variable:
    • export PATH=${JAVA_HOME}/bin:/Users/your_username/spark-2.0.0-bin-hadoop2.7/bin:$PATH
      3. Install Spark on Windows

    • Extract the files from the downloaded spark tgz file
    • Add the Spark bin directory to the PATH: …\spark-2.0.0-bin-hadoop2.7\bin

      4. Run spark-shell (you may see Java exceptions related to the winutils binary; these can be ignored)

      5. Install Git
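
Once Spark is installed, a quick sanity check (assuming the Spark bin directory is on your PATH; `spark-submit` ships alongside `spark-shell`) confirms the install before the session:

```shell
# Print the Spark version to confirm the binaries are on the PATH
spark-submit --version

# Start the interactive shell; at the scala> prompt, try a tiny job:
#   sc.parallelize(1 to 100).sum()
spark-shell
```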

What you'll learn

  • Explore building machine-learning applications using an IDE/shell
  • Understand the nuances of ML algorithms as implemented in Spark and learn how to effectively tune them using various features and parameters available via Spark libraries
Description

    Vartika Singh and Jayant Shekhar walk you through techniques for building and tuning machine-learning apps using Spark MLlib and Spark ML Pipelines, as well as graph processing with GraphX. Vartika and Jayant cover the regression, classification, clustering, and deep learning algorithms in the Spark MLlib, ML, and GraphX libraries, along with the nuances of feature extraction, parameter tuning, statistical analysis for optimization, and dimensionality reduction as they apply to these algorithms and the use cases they address. Through hands-on coding, you'll learn to solve a variety of problems using these algorithms and techniques.
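
As a taste of the hands-on material, a minimal Spark 2.0 ML Pipeline in Scala might look like the following. This is an illustrative sketch, not code from the session; the dataset, app name, and column names are made up for demonstration:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PipelineSketch").master("local[*]").getOrCreate()

    // Toy training data: (id, text, label)
    val training = spark.createDataFrame(Seq(
      (0L, "spark mllib is great", 1.0),
      (1L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    // Chain feature extraction and a classifier into a single Pipeline,
    // so fit() runs every stage in order on the training data
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    val model = pipeline.fit(training)
    model.transform(training).select("id", "prediction").show()

    spark.stop()
  }
}
```

Parameters such as `setRegParam` are the kind of knobs the session's tuning discussion covers; in practice they would be searched over with a `CrossValidator` rather than hard-coded.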


    Vartika Singh


    Vartika Singh is a solutions architect at Cloudera with over 12 years of experience applying machine learning techniques to big data problems.


    Jayant Shekhar

    Sparkflows Inc.

    Jayant Shekhar is the founder of Sparkflows Inc., which enables machine learning on large datasets using Spark ML and intelligent workflows. Jayant focuses on Spark, streaming, and machine learning and is a contributor to Spark. Previously, Jayant was a principal solutions architect at Cloudera working with companies both large and small in various verticals on big data use cases, architecture, algorithms, and deployments. Prior to Cloudera, Jayant worked at Yahoo, where he was instrumental in building out the large-scale content/listings platform using Hadoop and big data technologies. Jayant also worked at eBay, building out a new shopping platform, K2, using Nutch and Hadoop among others, as well as KLA-Tencor, building software for reticle inspection stations and defect analysis systems. Jayant holds a bachelor’s degree in computer science from IIT Kharagpur and a master’s degree in computer engineering from San Jose State University.

    Comments on this page are now closed.


    Vartika Singh
    09/29/2016 8:13am EDT

    Thank you Brandon!

    Brandon Reese
    09/28/2016 7:51pm EDT

    Many Windows users were having issues with spark-shell. Here is what worked for me:

    1. download winutils.exe from
    2. move it to c:\hadoop\bin
    3. run from admin command prompt:
    a. C:\hadoop\bin\winutils.exe chmod 777 /tmp/hive
    4. run command for spark-shell with extra conf parameter
    a. spark-shell --driver-memory 2G --executor-memory 3G --executor-cores 2 --conf spark.sql.warehouse.dir=file:///c:/tmp/spark-warehouse

    Jayant Shekhar
    09/27/2016 7:38pm EDT

    Hi Ram,

    The slides are available here:


    Jayant Shekhar
    09/27/2016 7:37pm EDT

    Hi Seung,

    You can use OneHotEncoder for the categorical columns. However, you will find a good discussion here on whether k-means is directly applicable to categorical variables:
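
A sketch of that encoding step in Spark 2.0 Scala (the `color`/`value` columns and toy data are illustrative, not from the session's exercises):

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

object CategoricalEncoding {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("CategoricalEncoding").master("local[*]").getOrCreate()

    // Toy data with one categorical and one numeric column
    val df = spark.createDataFrame(Seq(
      ("red", 1.0), ("blue", 2.0), ("red", 3.0)
    )).toDF("color", "value")

    // Map string categories to numeric indices...
    val indexed = new StringIndexer()
      .setInputCol("color").setOutputCol("colorIndex")
      .fit(df).transform(df)

    // ...then expand each index into a sparse one-hot vector
    val encoded = new OneHotEncoder()
      .setInputCol("colorIndex").setOutputCol("colorVec")
      .transform(indexed)

    // Assemble everything into the single features column KMeans expects
    val features = new VectorAssembler()
      .setInputCols(Array("colorVec", "value")).setOutputCol("features")
      .transform(encoded)
    features.show(truncate = false)

    spark.stop()
  }
}
```

The caveat from the linked discussion still applies: Euclidean distance over one-hot vectors treats all category pairs as equally far apart, so k-means on encoded categoricals should be interpreted with care.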

    09/27/2016 6:12pm EDT

    I couldn’t attend the session today. Can you send me the link to your slides.

    Seung Hwan Lee
    09/27/2016 11:09am EDT

    OK, I see; maybe I saw another file :P. If there are categorical columns, should I use StringIndexer and OneHotEncoder in k-means?

    Vartika Singh
    09/27/2016 10:32am EDT

    We have uploaded the slides and the addendum code for topic modeling. This code is called TopicModelingWithStemmer.scala. It has not been compile- or runtime-tested yet and is included mainly to demonstrate the Stemmer and NGram APIs. Please email us separately if you would like further help.

    Vartika Singh
    09/27/2016 9:44am EDT

    Well, there is really no need for One-hot encoding, as none of the columns have categorical values.

    If you really would like to, then go ahead and try using VectorIndexer.

    Seung Hwan Lee
    09/27/2016 9:08am EDT

    Why not use StringIndexer and OneHotEncoder in the k-means example for categorical values? (The linear regression example uses both of them.)

    09/26/2016 8:36pm EDT

    Thanks to you both… I had 1.6 and updated to jdk1.8.0_101.jdk. I am glad that was simple :)

    Jayant Shekhar
    09/26/2016 7:25pm EDT

    Looking forward to seeing you tomorrow Kulsoom!

    Can you ensure you have Java 7+?

    Thank you

    Vartika Singh
    09/26/2016 7:23pm EDT

    Hello Kulsoom!

    It’s great that you can join us!

    Could you check what your java version is?

    It has to be higher than JDK 1.7_67.

    09/26/2016 6:56pm EDT

    Hi, I am looking forward to this session. I wanted to know what version of Java is needed. When I run spark-shell I get this message: "Exception in thread “main” java.lang.UnsupportedClassVersionError: org/apache/spark/launcher/Main : Unsupported major.minor version 51.0"
    Thank you