Presented By O'Reilly and Cloudera
Make Data Work
Feb 17–20, 2015 • San Jose, CA

Spark Camp: An Introduction to Apache Spark with Hands-on Tutorials

Paco Nathan (O'Reilly Media), Holden Karau (IBM), Krishna Sankar (Volvo Cars), Reza Zadeh (Stanford | Matroid), Denny Lee (Microsoft), Chris Fregly (PipelineIO)
9:00am–5:00pm Wednesday, 02/18/2015
Hadoop & Beyond
Location: LL21 E/F
Average rating: ***..
(3.71, 17 ratings)

Materials or downloads needed in advance

Laptop, with Java JDK 6/7/8 installed. Please avoid using either Brew or Cygwin to install Spark.



Some experience coding in Python, SQL, Java, or Scala, plus some familiarity with Big Data issues/concepts.

What’s required for a laptop to use in the tutorial?

  • laptop with wifi and browser, and reasonably current hardware (2GB+ RAM)
  • MacOSX, Windows, Linux — all work fine
  • make sure you don’t have corporate security controls that prevent use of network
  • have Java JDK 6/7/8 installed
  • have Python 2.7 installed

NB: do not install Spark with Homebrew or Cygwin

We will provide USB sticks with the necessary data+code. To save time, if people participating in the tutorial want to download in advance, the USB contents are here

Also, please see the Apache Spark developer certification exam being held at Strata on Fri Feb 20:

Tutorial Description

Spark Camp, organized by the creators of the Apache Spark project at Databricks, will be a day-long, hands-on introduction to the Spark platform including Spark Core, the Spark Shell, Spark Streaming, Spark SQL, MLlib, GraphX, and more. We will start with an overview of use cases and demonstrate writing simple Spark applications. We will cover each of the main components of the Spark stack via a series of technical talks targeted at developers who are new to Spark. Intermixed with the talks will be periods of hands-on lab work. Attendees will download and use Spark on their own laptops, as well as learn how to configure and deploy Spark in distributed big data environments including common Hadoop distributions and Mesos.

Developer Certification for Apache Spark
O’Reilly has partnered with Databricks, creators of Spark, to offer the Developer Certification for Apache Spark. The next Spark certification exam takes place at Strata + Hadoop World in San Jose on Friday, February 20.

Paco Nathan

O'Reilly Media

O’Reilly author (Just Enough Math and Enterprise Data Workflows with Cascading) and a “player/coach” who’s led innovative Data teams building large-scale apps. Director of Community Evangelism for Apache Spark with Databricks, advisor to Amplify Partners. Expert in machine learning, cluster computing, and Enterprise use cases for Big Data. Interests: Spark, Ag+Data, Open Data, Mesos, PMML, Cascalog, Scalding, Clojure, Python, Chatbots, NLP.

Holden Karau


Holden Karau is a software development engineer at Databricks and is active in open source. She is the author of a book on Spark and has assisted with Spark workshops. Prior to Databricks she worked on a variety of search and classification problems at Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science.

Krishna Sankar

Volvo Cars

Krishna Sankar is a consulting data scientist working on retail analytics, social media data science, and forays into deep learning, as well as codeveloping the DeepLearnR package interfacing R over TensorFlow/Skflow. Previously, Krishna was a chief data scientist at, where he focused on optimizing user experience via inference, intelligence, and interfaces. Earlier stints include principal architect/data scientist at Tata America Intl., director of data science at a bioinformatics startup, and distinguished engineer at Cisco. He is a frequent speaker at conferences, including Spark Summit, Spark Camp, OSCON, PyCon, and PyData, on topics such as predicting NFL winners, Spark, data science, machine learning, and social media analysis, as well as a guest lecturer at the Naval Postgraduate School. Krishna’s occasional blogs can be found at His other passion is Lego robotics. You will find him at the St. Louis First Lego League World Competition as a robot design judge.

Reza Zadeh

Stanford | Matroid

Consulting professor at Stanford within ICME, conducting research and teaching courses targeting doctorate students. Technical Advisor at Databricks. I focus on Discrete Applied Mathematics, Machine Learning Theory and Applications, and Large-Scale Distributed Computing.

Denny Lee


Denny Lee is a Principal Program Manager at Microsoft. He is a hands-on distributed systems and data sciences engineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems for both on-premises and cloud. His key focuses surround solving complex large scale data problems – providing not only architectural direction but the hands-on implementation of these systems.

He has extensive experience in building greenfield teams as well as acting as a turnaround and change catalyst. Prior to joining Azure DocumentDB, Denny worked as a Technology Evangelist at Databricks, Senior Director of Data Sciences Engineering at Concur, and was part of the incubation team that built Hadoop on Windows and Azure (currently known as Microsoft HDInsight).

Chris Fregly


Chris Fregly is a research scientist at PipelineIO, a San Francisco-based streaming machine learning and artificial intelligence startup. Previously, Chris was a distributed systems engineer at Netflix, a data solutions engineer at Databricks, and a founding member of the IBM Spark Technology Center in San Francisco. Chris is a regular speaker at conferences and meetups throughout the world. He’s also an Apache Spark contributor, a Netflix Open Source committer, founder of the Global Advanced Spark and TensorFlow meetup, author of the upcoming book Advanced Spark, and creator of the upcoming O’Reilly video series Deploying and Scaling Distributed TensorFlow in Production.

Comments on this page are now closed.


Paco Nathan
02/24/2015 10:13am PST

Great to meet so many of you at Spark Camp. On behalf of the instructors, we really enjoyed this!

We’d like to get your feedback about Spark Camp, to help improve it. Here’s a Google Form for a survey:

Also, if you want to extend your account for the cloud-based notebook that we used in the tutorial, please let us know via this survey.

Many thanks!

Paco Nathan
02/19/2015 12:54am PST

Follow-up from yesterday. Here are some great talks that go into detail about Spark SQL:

“The Spark SQL Optimizer and External Data Sources API”
Michael Armbrust

“What’s coming for Spark in 2015”
Patrick Wendell

Plus, in general check the archives from talks/meetups on the Apache Spark channel on YouTube:

Paco Nathan
02/19/2015 12:35am PST

Hi Anoop,

Thank you kindly! And feedback on the UI is much appreciated. I like the breadcrumbs approach.

I just checked on with one of my personal accounts (not Databricks) and it is difficult to find/use the login buttons from UserVoice. That’s up in the top/right corner. I will update the slide, but I’ve notified the UX team to fix that. Thanks for catching that!

Paco Nathan
02/19/2015 12:28am PST

Hi Patrick,

For the folder exports: this is also the case for IPython notebook — code + markdown + results are represented in JSON. You can use a JSON pretty printer, such as piping through “python -m json.tool”

However, if you export individual notebooks, those are in the source language, i.e., Scala, Python, etc.
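
The `python -m json.tool` suggestion can also be applied from inside a script. What follows is only a minimal sketch: it assumes each file in the folder export is a JSON document, and the sample record with its field names (`command`, `results`) is hypothetical, since the actual export schema is not shown in this thread.

```python
import json


def pretty(raw_json):
    """Re-indent a JSON string, much as `python -m json.tool` does."""
    return json.dumps(json.loads(raw_json), indent=4, sort_keys=True)


# Hypothetical stand-in for one exported notebook cell; the real
# export schema may differ.
sample = '{"results": "1.2.1", "command": "print sc.version"}'
formatted = pretty(sample)
print(formatted)
```

Reading the text of an unzipped file and passing it to `pretty` should give a readable view of whatever fields the export actually contains.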

Anoop Johnson
02/18/2015 10:12pm PST

Great session yesterday. Thanks! I have some feedback about the Databricks cloud UI – (posting it here since does not allow me to)

It would be great if the UI could have breadcrumbs that show the current path. Sometimes I end up having to do many clicks to navigate through the workspace, and a breadcrumb would make the navigation much easier.

02/18/2015 3:32pm PST

Thanks Paco. After unzipping, each file is actually a JSON document. It’s not human readable and needs parsing to extract the commands and results.

Paco Nathan
02/18/2015 1:39pm PST

The formatting on these comments munged that note, but the *.dbc download is actually a ZIP file.

Try using “unzip -l _SparkCamp.dbc”
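
Since the download is an ordinary ZIP archive, the same listing can be produced with Python's standard `zipfile` module. This is a sketch under that assumption only; the in-memory archive and the entry name `_SparkCamp/example.scala` are made up for illustration and are not contents of the real file.

```python
import io
import zipfile


def list_archive(fileobj):
    """Return (name, size) pairs for each entry, similar to `unzip -l`."""
    with zipfile.ZipFile(fileobj) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]


# Build a tiny in-memory archive to stand in for a real .dbc download.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("_SparkCamp/example.scala", "// placeholder\n")
buf.seek(0)

entries = list_archive(buf)
for name, size in entries:
    print("%8d  %s" % (size, name))
```

Passing the path of the downloaded file in place of `buf` should work the same way, provided the file really is a ZIP archive.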

Paco Nathan
02/18/2015 1:38pm PST

Hi Patrick,

The extension looks proprietary, but it’s actually a JAR, i.e., ZIP format. Try this:

bash-3.2$ unzip -l _SparkCamp.dbc
Archive: _SparkCamp.dbc
Length Date Time Name
--------- -------- ----- ----
2544435 02-18-15 01:20 _SparkCamp/
15670 02-18-15 01:20 _SparkCamp/08.graphx.scala
104968 02-18-15 01:20 SparkCamp/demo_mllibiris.scala

Krishna Sankar
02/18/2015 1:37pm PST

Export – Source File works at the notebook level.

02/18/2015 1:30pm PST

Anyway to download a whole set of _SparkCamp Notebooks?
Tried to export it from DBC, but it’s in proprietary DBC Archive format.

Krishna Sankar
02/18/2015 8:00am PST

Thanks Paco. The solution I showed is below. It is just one way, a good start at best. Nothing fancy:

# Databricks notebook source exported at Wed, 18 Feb 2015 23:57:11 UTC
# Coding Exercise 1 – Wordcount + join
# Krishna Sankar (2/18/15)
# Not optimized for scale et al. Just to give a start

# COMMAND ----------

# Always a good practice to have this
import datetime
# Assumption: the argument after "%" was cut off in the post; the current time is used here.
print("Last ran @ %s" % datetime.datetime.now())

# COMMAND ----------

# Again, a good practice
print(sc.version)

# COMMAND ----------

lines_01 = sc.textFile('/mnt/paco/intro/CHANGES.txt')

# COMMAND ----------


# COMMAND ----------

from operator import add
wc_01 = lines_01.flatMap(lambda x : x.split(' ')).map(lambda x : (x,1)).reduceByKey(add)

# COMMAND ----------


# COMMAND ----------


# COMMAND ----------

# If you want to see how the words are distributed
# Collect over a large dataset can potentially exhaust the memory

# COMMAND ----------

wc_01.filter(lambda x : x[0] == 'spark').collect()

# COMMAND ----------

lines_02 = sc.textFile('/mnt/paco/intro/')
wc_02 = lines_02.flatMap(lambda x : x.split(' ')).map(lambda x : (x,1)).reduceByKey(add)

# COMMAND ----------

wc_01.join(wc_02).filter(lambda x : x[0] == 'Spark').collect()

# COMMAND ----------

wc_02.sortByKey().take(10) #collect()

# COMMAND ----------

# By mistake I used the 'spark' with lowercase. Interesting because normal join won't give anything as only one file has 'spark'

# COMMAND ----------

wc_02.filter(lambda x : x[0] == 'spark').collect()

# COMMAND ----------

# To catch 'spark', we need the fullOuterJoin !

# COMMAND ----------

wc_01.fullOuterJoin(wc_02).sortByKey().filter(lambda x : x[0] == 'spark').collect()
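
The difference between the plain join and the full outer join in the notebook above can be seen without a cluster. The sketch below is a plain-Python analogue, not Spark code: the helper names (`word_count`, `inner_join`, `full_outer_join`) and the sample lines are invented for illustration, and only the join semantics are meant to mirror `RDD.join` and `RDD.fullOuterJoin`.

```python
from collections import Counter


def word_count(lines):
    """Count space-separated tokens, like flatMap + map + reduceByKey."""
    return Counter(word for line in lines for word in line.split(' '))


def inner_join(left, right):
    """Keep only keys present on both sides, as an inner join does."""
    return {k: (left[k], right[k]) for k in left if k in right}


def full_outer_join(left, right):
    """Keep every key; a side with no match is filled with None."""
    keys = set(left) | set(right)
    return {k: (left.get(k), right.get(k)) for k in keys}


# Made-up stand-ins for the two inputs; lowercase 'spark' is in only one.
wc_a = word_count(["Spark is fast", "Spark runs on Mesos"])
wc_b = word_count(["spark shell", "Spark SQL"])

inner = inner_join(wc_a, wc_b)
outer = full_outer_join(wc_a, wc_b)

print(inner.get('spark'))  # None: the inner join dropped the key
print(outer['spark'])      # (None, 1): kept, with the missing side as None
```

A key that appears in only one word count disappears from the inner join, which is why filtering the joined result for lowercase 'spark' returns nothing, while the full outer join keeps it with `None` on the side that lacks it.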

Paco Nathan
02/18/2015 5:59am PST

Krishna Sankar will give that talk — my apologies for mistype.

Paco Nathan
02/18/2015 5:58am PST

Hi Hieu,

Yes, in fact Krishna will give that talk, up immediately next.

Hieu Ho
02/18/2015 5:54am PST

can you make available the solution to the workflow assignment for comparison? Thanks

Paco Nathan
02/18/2015 12:34am PST


Thank you much!

Paco Nathan
02/17/2015 10:26am PST

Hi Carnot,

It’s not necessary to download. We hope to have fixed that, which was difficult at previous conferences. No, we won’t be using VMs.

See you tomorrow -


carnot antonio romero
02/17/2015 9:31am PST

So just to be clear: there is no advance download of Spark itself or the exercises? I was expecting to have to download a VM or something similarly huge.

Paco Nathan
02/17/2015 4:39am PST

Hi Alaa,

We will work with you tomorrow about that. See you there!


Alaa Zubaidi
02/17/2015 4:25am PST

Trying to prepare for tomorrow.. got the following error on my laptop:

D:\PDF\Spark\spark-training\simple-app>..\spark\bin\spark-submit -class “SimpleA
pp” -master local[*] target\scala-2.10\simple-project_2.10-1.0.jar
Exception in thread “main” java.lang.NoSuchMethodError: scala.collection.immutab
at org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArgum
at org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArg
at org.apache.spark.deploy.SparkSubmitArguments.(SparkSubmitArgume
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:70)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Paco Nathan
02/16/2015 9:47am PST

Hi Ozlem,

SQL experience will help lots. It would be best to have some Python, but many of the coding exercises provide code samples that you can edit, or you can cut & paste from earlier example code to complete the exercise. So lots of Python experience is not needed at all.

Ozlem Gorur
02/16/2015 9:43am PST

Can someone without Python or Java knowledge but with Hive and SQL experience attend the Spark Camp?

Paco Nathan
02/14/2015 9:47am PST

Hi Roland,

You got it to run correctly. Those are “warnings” on the console, not exceptions.

In class, we’ll show how to turn down the log level, to get rid of some of that noise — however, often in debugging it is useful.

See you there next week!

Roland Hochmuth
02/14/2015 9:43am PST

In prep for the camp I ran spark-submit and received the following errors, and am wondering how to resolve them.

Rolands-MacBook-Pro-2:simple-app rolandhochmuth$ ../spark/bin/spark-submit --class "SimpleApp" --master local[*] target/scala-2.10/simple-project_2.10-1.0.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
2015-02-14 18:25:50.428 java[27528:1703] Unable to load realm info from SCDynamicStore
15/02/14 18:26:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
15/02/14 18:26:50 WARN LoadSnappy: Snappy native library not loaded
Lines with a: 83, Lines with b: 38

Paco Nathan
02/11/2015 3:55am PST

Hi Kyle,

Certainly, yes. See you there!

Kyle Davis
02/11/2015 3:44am PST

Hey Paco

I am an equities researcher attending the conference because I have interest in the Hortonworks/Cloudera and Spark ecosystems. I do not really want to participate in the hands on learning aspect so will it be ok if I am just an observer?

Paco Nathan
02/11/2015 12:40am PST

Thank you Sean -

There will be an update on the at the tutorial. We will have USBs to hand out. Yes, the file layout for the Apache Spark download changed in the 1.2.x release. We’ll cover that in the tutorial.

See you next week!

Sean Boisen
02/11/2015 12:27am PST

I downloaded, extracted, and ran the commands for building and using the simple-app. However, the file lists two folders that don’t appear to be included in streaming and website.

Paco Nathan
01/07/2015 6:17am PST

We show examples mostly in Scala, Python, SQL, plus a few in Java.

01/07/2015 6:06am PST

What language will be used for the workshop? Scala, Java, or Python?