Presented By O'Reilly and Cloudera
Make Data Work
Feb 17–20, 2015 • San Jose, CA

Spark Camp: An Introduction to Apache Spark with Hands-on Tutorials

Paco Nathan, Holden Karau (Independent), Krishna Sankar (U.S. Bank), Reza Zadeh (Matroid | Stanford), Denny Guang-yeu Lee (Databricks), Chris Fregly (Amazon Web Services)
9:00am–5:00pm Wednesday, 02/18/2015
Hadoop & Beyond
Location: LL21 E/F
Average rating: 3.71 (17 ratings)

Materials or downloads needed in advance

Laptop, with Java JDK 6/7/8 installed. Please avoid using either Brew or Cygwin to install Spark.



Some experience coding in Python, SQL, Java, or Scala, plus some familiarity with Big Data issues/concepts.

What’s required for a laptop to use in the tutorial?

  • laptop with wifi and browser, and reasonably current hardware (2+ GB RAM)
  • MacOSX, Windows, Linux — all work fine
  • make sure you don’t have corporate security controls that prevent use of network
  • have Java JDK 6/7/8 installed
  • have Python 2.7 installed

NB: do not install Spark with Homebrew or Cygwin

We will provide USB sticks with the necessary data+code. To save time, participants who want to download in advance can find the USB contents here

Also, please see the Apache Spark developer certification exam being held at Strata on Fri Feb 20:

Tutorial Description

Spark Camp, organized by the creators of the Apache Spark project at Databricks, will be a day-long hands-on introduction to the Spark platform, including Spark Core, the Spark Shell, Spark Streaming, Spark SQL, MLlib, GraphX, and more. We will start with an overview of use cases and demonstrate writing simple Spark applications. We will cover each of the main components of the Spark stack via a series of technical talks targeted at developers who are new to Spark. Intermixed with the talks will be periods of hands-on lab work. Attendees will download and use Spark on their own laptops, and will also learn how to configure and deploy Spark in distributed big data environments, including common Hadoop distributions and Mesos.

Developer Certification for Apache Spark
O’Reilly has partnered with Databricks, creators of Spark, to offer the Developer Certification for Apache Spark. The next Spark certification exam takes place at Strata + Hadoop World in San Jose on Friday, February 20. Learn more.


Paco Nathan

O’Reilly author (Just Enough Math and Enterprise Data Workflows with Cascading) and a “player/coach” who has led innovative data teams building large-scale apps. Director of Community Evangelism for Apache Spark at Databricks, and advisor to Amplify Partners. Expert in machine learning, cluster computing, and enterprise use cases for Big Data. Interests: Spark, Ag+Data, Open Data, Mesos, PMML, Cascalog, Scalding, Clojure, Python, chatbots, NLP.


Holden Karau


Holden Karau is a software development engineer at Databricks and is active in open source. She is the author of a book on Spark and has assisted with Spark workshops. Prior to Databricks she worked on a variety of search and classification problems at Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science.


Krishna Sankar


Krishna Sankar is a Distinguished Engineer for Artificial Intelligence & Machine Learning at U.S. Bank, focusing on augmented intelligence, digital humans, and areas like AI explainability. Earlier stints include Senior Data Scientist with Volvo Cars, Chief Data Scientist at, Data Scientist at Tata America Intl, Director of Data Science at a bioinformatics startup, and Distinguished Engineer at Cisco. He has spoken at various conferences, including ML tutorials at Strata San Jose and Strata London 2016, Spark Summit, Strata-Sparkcamp, OSCON, PyCon, and PyData; writes about Nash equilibrium, Isaac Asimov, and robot rules; and has been guest lecturing at the Naval Postgraduate School. His occasional blogs can be found at
His posts include NeurIPS 2018 — Conference Summary, Deep Thinking by Garry Kasparov: The Education of a Machine, and Ask not if AlphaZero can beat humans in Go — Ask if AlphaZero can teach humans to be a Go champion. His other passions are semantic Go engines, flying drones (working towards an FAA UAS drone pilot license), and Lego Robotics; you will find him at the Detroit FLL World Competition as a Robot Design Judge.


Reza Zadeh

Matroid | Stanford

Reza Zadeh is a consulting professor at Stanford within ICME, conducting research and teaching courses for doctoral students, and a technical advisor at Databricks. He focuses on discrete applied mathematics, machine learning theory and applications, and large-scale distributed computing.


Denny Guang-yeu Lee


Denny Lee is a Developer Advocate at Databricks. He is a hands-on distributed systems and data sciences engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premises and cloud environments. He also has a Master’s in Biomedical Informatics from Oregon Health and Science University and has architected and implemented powerful data solutions for enterprise healthcare customers. His current technical focuses include distributed systems, Apache Spark, deep learning, machine learning, and genomics.


Chris Fregly

Amazon Web Services

Chris Fregly is a senior developer advocate focused on AI and machine learning at Amazon Web Services (AWS). Chris shares knowledge with fellow developers and data scientists through his Advanced Kubeflow AI Meetup and regularly speaks at AI and ML conferences across the globe. Previously, Chris was a founder at PipelineAI, where he worked with many startups and enterprises to deploy machine learning pipelines using many open source and AWS products including Kubeflow, Amazon EKS, and Amazon SageMaker.

Comments on this page are now closed.


Paco Nathan
02/24/2015 10:13am PST

Great to meet so many of you at Spark Camp. On behalf of the instructors, we really enjoyed this!

We’d like to get your feedback about Spark Camp, to help improve it. Here’s a Google Form for a survey:

Also, if you want to extend your account for the cloud-based notebook that we used in the tutorial, please let us know via this survey.

Many thanks!

Paco Nathan
02/19/2015 12:54am PST

Follow-up from yesterday. Here are some great talks that go into detail about Spark SQL:

“The Spark SQL Optimizer and External Data Sources API”
Michael Armbrust

“What’s coming for Spark in 2015”
Patrick Wendell

Plus, in general check the archives from talks/meetups on the Apache Spark channel on YouTube:

Paco Nathan
02/19/2015 12:35am PST

Hi Anoop,

Thank you kindly! And feedback on the UI is much appreciated. I like the breadcrumbs approach.

I just checked with one of my personal accounts (not Databricks), and it is difficult to find/use the login buttons from UserVoice; they’re up in the top/right corner. I will update the slide, and I’ve notified the UX team to fix that. Thanks for catching that!

Paco Nathan
02/19/2015 12:28am PST

Hi Patrick,

For the folder exports: this is also the case for IPython notebook — code + markdown + results are represented in JSON. You can use a JSON pretty printer, such as piping through “python -m json.tool”

However, if you export individual notebooks, those are in the source language, i.e., Scala, Python, etc.
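As a quick illustration of the pretty-printing step (the sample JSON here is made up; any exported notebook JSON works the same way):

```shell
# Pipe compact JSON through Python's built-in pretty-printer
echo '{"commands": [{"command": "print sc.version"}]}' | python -m json.tool
```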

Anoop Johnson
02/18/2015 10:12pm PST

Great session yesterday. Thanks! I have some feedback about the Databricks Cloud UI (posting it here since does not allow me to)

It would be great if the UI had breadcrumbs that show the current path. Sometimes I end up having to do many clicks to navigate through the workspace, and breadcrumbs would make the navigation much easier.

02/18/2015 3:32pm PST

Thanks Paco. After unzip, each file is actually a JSON doc. It’s not human-readable and needs parsing to extract commands and results.

Paco Nathan
02/18/2015 1:39pm PST

The formatting on these comments munged that note, but the *.dbc download is actually a ZIP file.

Try using “unzip -l _SparkCamp.dbc”

Paco Nathan
02/18/2015 1:38pm PST

Hi Patrick,

The extension looks proprietary, but it’s actually a JAR, i.e., ZIP format. Try this:

bash-3.2$ unzip -l _SparkCamp.dbc
Archive:  _SparkCamp.dbc
  Length     Date    Time   Name
 --------    ----    ----   ----
  2544435  02-18-15 01:20   _SparkCamp/
    15670  02-18-15 01:20   _SparkCamp/08.graphx.scala
   104968  02-18-15 01:20   _SparkCamp/demo_mllib_iris.scala

Krishna Sankar
02/18/2015 1:37pm PST

Export – Source File works at the notebook level.

02/18/2015 1:30pm PST

Any way to download the whole set of _SparkCamp notebooks?
I tried to export them from DBC, but they’re in the proprietary DBC Archive format.

Krishna Sankar
02/18/2015 8:00am PST

Thanks Paco. Here’s the solution I showed; it’s just one way – a good start at best. Nothing fancy:

# Databricks notebook source exported at Wed, 18 Feb 2015 23:57:11 UTC
# Coding Exercise 1 – Wordcount + join
# Krishna Sankar (2/18/15)
# Not optimized for scale et al. Just to give a start

# COMMAND ----------
# Always a good practice to have this
import datetime
print "Last ran @ %s" % datetime.datetime.now()

# COMMAND ----------
# Again, a good practice
print sc.version

# COMMAND ----------
lines_01 = sc.textFile('/mnt/paco/intro/CHANGES.txt')

# COMMAND ----------
from operator import add
wc_01 = lines_01.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)

# COMMAND ----------
# If you want to see how the words are distributed.
# Collect over a large dataset can potentially exhaust the memory.
wc_01.filter(lambda x: x[0] == 'spark').collect()

# COMMAND ----------
lines_02 = sc.textFile('/mnt/paco/intro/')
wc_02 = lines_02.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)

# COMMAND ----------
wc_01.join(wc_02).filter(lambda x: x[0] == 'Spark').collect()

# COMMAND ----------
wc_02.sortByKey().take(10)  # collect()

# COMMAND ----------
# By mistake I used 'spark' in lowercase. Interesting, because a normal join
# won't return anything, as only one file has 'spark'.
wc_02.filter(lambda x: x[0] == 'spark').collect()

# COMMAND ----------
# To catch 'spark', we need the fullOuterJoin!
wc_01.fullOuterJoin(wc_02).sortByKey().filter(lambda x: x[0] == 'spark').collect()
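The inner-vs-full-outer distinction above can be seen with plain Python dicts (no Spark needed; the word counts below are made up for illustration):

```python
# Hypothetical word counts from two files; lowercase 'spark' occurs in only one
wc_01 = {"Spark": 4, "apache": 2}
wc_02 = {"Spark": 7, "spark": 1, "data": 3}

# Inner join: keeps only keys present in BOTH dicts, so 'spark' disappears
inner = {k: (wc_01[k], wc_02[k]) for k in wc_01 if k in wc_02}

# Full outer join: keeps every key, padding the missing side with None
full = {k: (wc_01.get(k), wc_02.get(k)) for k in set(wc_01) | set(wc_02)}

print(sorted(inner))   # only 'Spark' survives the inner join
print(full["spark"])   # (None, 1) -- lowercase 'spark' survives the full outer join
```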

Paco Nathan
02/18/2015 5:59am PST

Krishna Sankar will give that talk — my apologies for mistype.

Paco Nathan
02/18/2015 5:58am PST

Hi Hieu,

Yes, in fact Krishna will give that talk, up immediately next.

Hieu Ho
02/18/2015 5:54am PST

Can you make the solution to the workflow assignment available for comparison? Thanks

Paco Nathan
02/18/2015 12:34am PST


Thank you much!

Paco Nathan
02/17/2015 10:26am PST

Hi Carnot,

It’s not necessary to download anything in advance. We hope to have fixed that part, which was difficult at previous conferences. And no, we won’t be using VMs.

See you tomorrow -


carnot antonio romero
02/17/2015 9:31am PST

So just to be clear: there is no advance download of Spark itself or the exercises? I was expecting to have to download a VM or something similarly huge.

Paco Nathan
02/17/2015 4:39am PST

Hi Alaa,

We will work with you tomorrow about that. See you there!


Alaa Zubaidi
02/17/2015 4:25am PST

Trying to prepare for tomorrow… I got the following error on my laptop:

D:\PDF\Spark\spark-training\simple-app>..\spark\bin\spark-submit --class "SimpleApp" --master local[*] target\scala-2.10\simple-project_2.10-1.0.jar
Exception in thread "main" java.lang.NoSuchMethodError: scala.collection.immutab
        at org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArgum
        at org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArg
        at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArgume
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:70)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Paco Nathan
02/16/2015 9:47am PST

Hi Ozlem,

SQL experience will help a lot. It would be best to have some Python, but most of the coding exercises provide code samples that you can edit, or you can cut & paste from earlier example code to complete the exercise. So extensive Python experience is not needed at all.

Ozlem Gorur
02/16/2015 9:43am PST

Can someone without Python or Java knowledge, but with Hive and SQL experience, attend Spark Camp?

Paco Nathan
02/14/2015 9:47am PST

Hi Roland,

You got it to run correctly. Those are “warnings” on the console, not exceptions.

In class, we’ll show how to turn down the log level to get rid of some of that noise; however, it is often useful when debugging.
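For reference, one common way to do this (a sketch: the file name follows Spark’s conf/ convention, and WARN is just an example threshold) is to copy conf/log4j.properties.template to conf/log4j.properties and raise the root level:

```properties
# conf/log4j.properties -- show only WARN and above on the console
log4j.rootCategory=WARN, console
```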

See you there next week!

Roland Hochmuth
02/14/2015 9:43am PST

In prep for the camp, I ran spark-submit, received the following errors, and am wondering how to resolve them.

Rolands-MacBook-Pro-2:simple-app rolandhochmuth$ ../spark/bin/spark-submit --class "SimpleApp" --master local[*] target/scala-2.10/simple-project_2.10-1.0.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
2015-02-14 18:25:50.428 java[27528:1703] Unable to load realm info from SCDynamicStore
15/02/14 18:26:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
15/02/14 18:26:50 WARN LoadSnappy: Snappy native library not loaded
Lines with a: 83, Lines with b: 38
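Those counts come from SimpleApp’s logic of counting lines containing “a” and “b” (per the Spark quick start); a plain-Python sketch of that logic, with made-up sample lines:

```python
# Plain-Python analogue of SimpleApp's line counts (no Spark required);
# the sample lines are made up for illustration
lines = ["apache spark", "big data", "hands-on lab"]
num_a = sum(1 for line in lines if "a" in line)
num_b = sum(1 for line in lines if "b" in line)
print("Lines with a: %d, Lines with b: %d" % (num_a, num_b))
```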

Paco Nathan
02/11/2015 3:55am PST

Hi Kyle,

Certainly, yes. See you there!

Kyle Davis
02/11/2015 3:44am PST

Hey Paco

I am an equities researcher attending the conference because I have an interest in the Hortonworks/Cloudera and Spark ecosystems. I do not really want to participate in the hands-on learning aspect, so will it be OK if I am just an observer?

Paco Nathan
02/11/2015 12:40am PST

Thank you Sean -

There will be an update on the at the tutorial. We will have USBs to hand out. Yes, the file layout for the Apache Spark download changed in the 1.2.x release. We’ll cover that in the tutorial.

See you next week!

Sean Boisen
02/11/2015 12:27am PST

I downloaded, extracted, and ran the commands for building and using the simple-app. However, the file lists two folders that don’t appear to be included: streaming and website.

Paco Nathan
01/07/2015 6:17am PST

We show examples mostly in Scala, Python, SQL, plus a few in Java.

01/07/2015 6:06am PST

What language will be used for the workshop? Scala, Java, or Python?