Presented By O'Reilly and Cloudera
Make Data Work
31 May–1 June 2016: Training
1 June–3 June 2016: Conference
London, UK

Building machine-learning apps with Spark: MLlib, ML Pipelines, and GraphX

Jayant Shekhar (Sparkflows Inc.), Vartika Singh (Cloudera), Krishna Sankar (U.S. Bank)
9:00–12:30 Wednesday, 1/06/2016
Spark & beyond
Location: Capital Suite 12
Level: Intermediate
Average rating: 2.80 (15 ratings)

Prerequisite knowledge

Attendees should know how to program in Scala, Java, or Python.

Materials or downloads needed in advance

The tutorial will be run primarily as an interactive demo. If you would like to follow along, please complete the following setup in advance:
  • Install Scala IDE for Eclipse
  • Import the code
  • Run the Maven build
  • Source code and install instructions can be found here.


    Jayant Shekhar, Vartika Singh, and Krishna Sankar explore techniques for building machine-learning apps using Spark ML as well as the principles of graph processing with Spark GraphX. Jayant, Vartika, and Krishna cover the various algorithms available in Spark ML—including those for doing basic statistics, classification and regression, collaborative filtering, clustering, dimensionality reduction, and frequent pattern mining, as well as streaming k-means clustering—and walk attendees through demos of the provided source code, solving various problems using these algorithms. They will also outline use cases for graph processing and offer an overview of programming with Spark GraphX followed by coding for different graph processing problems using GraphX.

    Topics include:
    How to apply Spark ML libraries for:

    • Feature extraction and transformation
    • Classification, regression, and clustering
    • Streaming k-means
    • Model selection

    How to use GraphX to:

    • Build property graphs
    • Run graph algorithms
    • Use the Pregel API to solve graph problems

    Jayant Shekhar

    Sparkflows Inc.

    Jayant Shekhar is the founder of Sparkflows Inc., which enables machine learning on large datasets using Spark ML and intelligent workflows. Jayant focuses on Spark, streaming, and machine learning and is a contributor to Spark. Previously, Jayant was a principal solutions architect at Cloudera working with companies both large and small in various verticals on big data use cases, architecture, algorithms, and deployments. Prior to Cloudera, Jayant worked at Yahoo, where he was instrumental in building out the large-scale content/listings platform using Hadoop and big data technologies. Jayant also worked at eBay, building out a new shopping platform, K2, using Nutch and Hadoop among others, as well as KLA-Tencor, building software for reticle inspection stations and defect analysis systems. Jayant holds a bachelor’s degree in computer science from IIT Kharagpur and a master’s degree in computer engineering from San Jose State University.


    Vartika Singh


    Vartika Singh is a field data science architect at Cloudera. Previously, Vartika was a data scientist applying machine learning algorithms to real-world use cases ranging from clickstream to image processing. She has 12 years of experience designing and developing solutions and frameworks utilizing machine learning techniques.


    Krishna Sankar


    Krishna Sankar is a distinguished engineer for artificial intelligence and machine learning at U.S. Bank, focusing on augmented intelligence, digital humans, and areas such as AI explainability. Earlier stints include senior data scientist at Volvo Cars, chief data scientist at, data scientist at Tata America Intl, director of data science at a bioinformatics startup, and distinguished engineer at Cisco. He has spoken at various conferences, including ML tutorials at Strata SJC and London 2016, Spark Summit, Strata-Sparkcamp, OSCON, PyCon, and PyData; writes about Nash Equilibrium, Isaac Asimov and Robots Rules; and has guest lectured at the Naval Postgraduate School.
    His occasional blog posts include "NeurIPS2018 – Conference Summary," "Deep Thinking by Garry Kasparov: The Education Of A Machine," and "Ask not if AlphaZero can beat humans in Go – Ask if AlphaZero can teach humans to be a Go champion." His other passions are semantic Go engines, flying drones (he is working toward a drone pilot license, FAA UAS Pilot), and Lego robotics; you will find him at the Detroit FLL World Competition as a Robots Design Judge.