Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Building machine-learning apps with Spark: MLlib, ML Pipelines, and GraphX (Half Day)

Jayant Shekhar (Sparkflows Inc.), Amandeep Khurana (Cloudera), Krishna Sankar (U.S.Bank), Vartika Singh (Cloudera)
9:00am–12:30pm Tuesday, 03/29/2016
Spark & Beyond

Location: 210 A/E
Average rating: 2.80 (45 ratings)

Materials or downloads needed in advance

Participants must have IntelliJ (version 14.1+) or ScalaIDE for Eclipse and Maven (version 3.3+) installed on a UNIX-based machine.
  1. Scala IDE - either IntelliJ or ScalaIDE for Eclipse is needed
  2. Maven
  3. Apache Zeppelin
    • Download the Zeppelin source
    • Compile Zeppelin:
      • mvn clean package -DskipTests -Pspark-1.6 -Phadoop-2.6 -Ppyspark
    • Run the Zeppelin daemon:
      • ./bin/ start|stop|status|restart
      • ./bin/ start
    • Open the Zeppelin UI in a browser:
      • localhost:8080
  4. Git (nice to have)


Jayant Shekhar, Amandeep Khurana, Krishna Sankar, and Vartika Singh guide participants through techniques for building machine-learning apps using Spark MLlib and Spark ML and demonstrate the principles of graph processing with Spark GraphX. Jayant, Amandeep, Krishna, and Vartika begin with the use cases for machine learning with Apache Spark. You’ll explore the various algorithms available in Spark MLlib and Spark ML, including those for doing basic statistics, classification and regression, collaborative filtering, clustering, dimensionality reduction, and frequent pattern mining. Along the way, you’ll solve problems using the mentioned algorithms and cover streaming k-means clustering. You’ll also learn use cases for graph processing and get an overview of programming with Spark GraphX, followed by hands-on coding examples of graph-processing problems using GraphX.
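Clustering is among the algorithms listed above, and the session covers streaming k-means. As a taste of what that involves, here is a minimal, dependency-free Python sketch of the batch k-means update (Lloyd's algorithm) that Spark MLlib's KMeans, and its streaming variant, apply at cluster scale. This is an illustrative toy, not Spark API code; the function name and the sample data are made up for the example.

```python
# Minimal k-means (Lloyd's algorithm) on 2-D points. Spark MLlib
# distributes these same two steps (assign, then update) across a
# cluster; this toy version just shows the update rule.

def kmeans(points, centroids, iterations=10):
    """Alternate assignment and centroid-update steps for a fixed
    number of iterations. `points` and `centroids` are lists of
    (x, y) tuples."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = {i: [] for i in range(len(centroids))}
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                            + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster,
        # leaving a centroid in place if its cluster is empty.
        centroids = [
            (sum(p[0] for p in ps) / len(ps),
             sum(p[1] for p in ps) / len(ps))
            if ps else centroids[i]
            for i, ps in clusters.items()
        ]
    return centroids

# Two well-separated blobs converge to their means in a few iterations.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = kmeans(data, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The streaming variant covered in the session folds each new mini-batch of points into the running centroids with a decay factor instead of iterating over a fixed dataset.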

Photo of Jayant Shekhar

Jayant Shekhar

Sparkflows Inc.

Jayant Shekhar is the founder of Sparkflows Inc., which enables machine learning on large datasets using Spark ML and intelligent workflows. Jayant focuses on Spark, streaming, and machine learning and is a contributor to Spark. Previously, Jayant was a principal solutions architect at Cloudera, working with companies both large and small in various verticals on big data use cases, architecture, algorithms, and deployments. Prior to Cloudera, Jayant worked at Yahoo, where he was instrumental in building out the large-scale content/listings platform using Hadoop and big data technologies. Jayant also worked at eBay, building out a new shopping platform, K2, using Nutch and Hadoop among other technologies, and at KLA-Tencor, where he built software for reticle inspection stations and defect analysis systems. Jayant holds a bachelor’s degree in computer science from IIT Kharagpur and a master’s degree in computer engineering from San Jose State University.

Photo of Amandeep Khurana

Amandeep Khurana


Amandeep Khurana is a solutions architect at Cloudera, where he’s involved in the entire lifecycle of Hadoop adoption for customers, from use-case discovery to taking systems to production. Amandeep is also a coauthor of HBase In Action, a book geared toward building applications using HBase. Prior to Cloudera, Amandeep was at Amazon Web Services, where he was part of the Elastic MapReduce team and built the first version of EMR’s HBase offering.

Photo of Krishna Sankar

Krishna Sankar


Krishna Sankar is a distinguished engineer for artificial intelligence and machine learning at U.S. Bank, focusing on augmented intelligence, digital humans, and areas like AI explainability. Earlier stints include senior data scientist with Volvo Cars, chief data scientist, data scientist at Tata America Intl, director of data science at a bioinformatics startup, and distinguished engineer at Cisco. He has spoken at various conferences, including ML tutorials at Strata San Jose and London 2016, Spark Summit, Strata-Sparkcamp, OSCON, PyCon, and PyData; writes about Nash equilibrium, Isaac Asimov, and Robots Rules; and has been a guest lecturer at the Naval Postgraduate School.
His occasional blog posts include NeurIPS 2018 — Conference Summary; Deep Thinking by Garry Kasparov: The Education Of A Machine; and Ask not if AlphaZero can beat humans in Go — Ask if AlphaZero can teach humans to be a Go champion. His other passions are semantic Go engines, flying drones (he is working toward an FAA UAS pilot license), and Lego robotics; you will find him at the Detroit FLL World Competition as a robot design judge.

Photo of Vartika Singh

Vartika Singh


Vartika Singh is a solutions architect at Cloudera with over 12 years of experience applying machine learning techniques to big data problems.

Comments on this page are now closed.


Picture of Jayant Shekhar
Jayant Shekhar
03/29/2016 10:02am PDT

Hi Craig, uploaded the latest MLlib.pdf to the repo. Thanks!

Picture of Krishna Sankar
03/29/2016 9:35am PDT

We will post all slides, duly updated.

Craig Rubendall
03/29/2016 8:42am PDT

The pdf files in the github repo don’t match what was presented, for example, the mllib.pdf doesn’t contain the section on linear regression. Is there a separate location for the slides in their entirety?

Picture of Krishna Sankar
03/29/2016 3:14am PDT

We have two PDFs in the GitHub repo, and all presentations have been uploaded. They will show up in speaker slides and videos.

sanjeev taran
03/29/2016 3:05am PDT

Can you please post the link to the presentation slides?

Picture of Krishna Sankar
03/28/2016 11:39pm PDT

The github for data & code is
Cheers & see you all soon

Picture of Krishna Sankar
03/28/2016 12:08pm PDT

Hi Yi, please make sure the download went through fine and that you were able to decompress the files. Change directory to zeppelin-0.5.6-incubating/ and then try the mvn … command. Sometimes the ‘-’ or other characters change when copied; you can copy the appropriate flags from the Zeppelin site.
Which OS are you using? Did you check the pom.xml file?

Yi He
03/28/2016 11:45am PDT

[WARNING] The requested profile "spark-1.6" could not be activated because it does not exist.
[WARNING] The requested profile "hadoop-2.6" could not be activated because it does not exist.
[WARNING] The requested profile "pyspark" could not be activated because it does not exist.


I am getting the errors above.

Picture of Jayant Shekhar
Jayant Shekhar
03/28/2016 11:23am PDT

Hi Yi, no, we do not need to install Spark 1.6, Hadoop 2.6, or PySpark.


Yi He
03/28/2016 10:58am PDT

Compile zeppelin
mvn clean package -DskipTests -Pspark-1.6 -Phadoop-2.6 -Ppyspark

Per the instruction above, do we also need to install Spark 1.6, Hadoop 2.6, and PySpark? If yes, can you provide a good tutorial?

Picture of Jayant Shekhar
Jayant Shekhar
03/27/2016 6:17am PDT

Code and data for this session are now posted at:

We will make minor updates to it before the session.

Picture of Jayant Shekhar
Jayant Shekhar
03/27/2016 4:59am PDT

If you run into issues installing Zeppelin on Windows, we will also have those exercises running in IntelliJ/ScalaIDE for Eclipse.

Picture of Jayant Shekhar
Jayant Shekhar
03/24/2016 7:21am PDT

Yes, we will have the code and data on GitHub, Andrew. We will keep the size of the data light for the tutorial. Having an IDE as mentioned above with the Scala SDK installed would be great. Then there is Zeppelin.

Picture of Krishna Sankar
03/24/2016 6:09am PDT

We will have the data and code on GitHub and will post the link soon.

03/24/2016 3:04am PDT

Just confirming – we do not need to download any Spark libraries beyond those in the instructions above, and we do not need to download or pre-create any sample data for use in any of the examples?