Presented By O'Reilly and Cloudera
Make Data Work
Feb 17–20, 2015 • San Jose, CA

Spark Camp: An Introduction to Apache Spark with Hands-on Tutorials

Paco Nathan (O'Reilly Media), Holden Karau (IBM), Krishna Sankar (Volvo Cars), Reza Zadeh (Stanford University), Denny Lee (Concur Technologies), Chris Fregly (Flux Capacitor AI)
9:00am–5:00pm Wednesday, 02/18/2015
Hadoop & Beyond
Location: LL21 E/F
Average rating: 3.71 (17 ratings)

Materials or downloads needed in advance

Laptop, with Java JDK 6, 7, or 8 installed. Please avoid using either Homebrew or Cygwin to install Spark.

Description

Prerequisites

Some experience coding in Python, SQL, Java, or Scala, plus some familiarity with Big Data issues/concepts.

What’s required for a laptop to use in the tutorial?

  • laptop with wifi and a browser, and reasonably current hardware (2+ GB RAM)
  • Mac OS X, Windows, Linux — all work fine
  • make sure you don’t have corporate security controls that prevent use of the network
  • have Java JDK 6, 7, or 8 installed
  • have Python 2.7 installed

NB: do not install Spark with Homebrew or Cygwin

We will provide USB sticks with the necessary data and code. To save time, participants who want to download the materials in advance can find the USB contents here

Also, please see the Apache Spark developer certification exam being held at Strata on Fri Feb 20: http://www.oreilly.com/go/sparkcert

Tutorial Description

Spark Camp, organized by the creators of the Apache Spark project at Databricks, will be a day-long, hands-on introduction to the Spark platform, including Spark Core, the Spark Shell, Spark Streaming, Spark SQL, MLlib, GraphX, and more. We will start with an overview of use cases and demonstrate writing simple Spark applications. We will cover each of the main components of the Spark stack via a series of technical talks targeted at developers who are new to Spark. Intermixed with the talks will be periods of hands-on lab work. Attendees will download and use Spark on their own laptops, as well as learn how to configure and deploy Spark in distributed big data environments, including common Hadoop distributions and Mesos.
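For a concrete sense of what one of those simple Spark applications looks like, here is a rough PySpark sketch in the spirit of the quick-start-style "simple-app" referenced in the comments below. It assumes a local Spark 1.x install with PySpark available, and README.md is only a placeholder for any text file on your machine:

from pyspark import SparkConf, SparkContext

# Configure a local Spark context; the app name and local[*] master are illustrative choices
conf = SparkConf().setAppName("SimpleApp").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Load any text file; this path is a placeholder
lines = sc.textFile("README.md").cache()

# Count lines containing "a" and lines containing "b", then print the totals
num_a = lines.filter(lambda line: "a" in line).count()
num_b = lines.filter(lambda line: "b" in line).count()
print "Lines with a: %i, Lines with b: %i" % (num_a, num_b)

sc.stop()

A script along these lines would typically be launched with bin/spark-submit, much like the pre-conference exercise discussed in the comments below.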

Developer Certification for Apache Spark
O’Reilly has partnered with Databricks, creators of Spark, to offer the Developer Certification for Apache Spark. The next Spark certification exam takes place at Strata + Hadoop World in San Jose on Friday, February 20. Learn more.

Photo of Paco Nathan

Paco Nathan

O'Reilly Media

O’Reilly author (Just Enough Math and Enterprise Data Workflows with Cascading) and a “player/coach” who’s led innovative data teams building large-scale apps. Director of Community Evangelism for Apache Spark at Databricks, and advisor to Amplify Partners. Expert in machine learning, cluster computing, and enterprise use cases for Big Data. Interests: Spark, Ag+Data, Open Data, Mesos, PMML, Cascalog, Scalding, Clojure, Python, Chatbots, NLP.

Photo of Holden Karau

Holden Karau

IBM

Holden Karau is a software development engineer at Databricks and is active in open source. She is the author of a book on Spark and has assisted with Spark workshops. Prior to Databricks she worked on a variety of search and classification problems at Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science.

Photo of Krishna Sankar

Krishna Sankar

Volvo Cars

Krishna Sankar is a consulting data scientist working on retail analytics, social media data science, and forays into deep learning, as well as codeveloping the DeepLearnR package interfacing R over TensorFlow/Skflow. Previously, Krishna was a chief data scientist at Blackarrow.tv, where he focused on optimizing user experience via inference, intelligence, and interfaces. Earlier stints include principal architect/data scientist at Tata America Intl., director of data science at a bioinformatics startup, and distinguished engineer at Cisco. He is a frequent speaker at conferences, including Spark Summit, Spark Camp, OSCON, PyCon, and PyData, on topics such as predicting NFL winners, Spark, data science, machine learning, and social media analysis, as well as a guest lecturer at the Naval Postgraduate School. Krishna’s occasional blogs can be found at Doubleclix.wordpress.com. His other passion is Lego robotics. You will find him at the St. Louis First Lego League World Competition as a robot design judge.

Photo of Reza Zadeh

Reza Zadeh

Stanford University

Consulting professor at Stanford within ICME, conducting research and teaching courses targeting doctorate students. Technical Advisor at Databricks. I focus on Discrete Applied Mathematics, Machine Learning Theory and Applications, and Large-Scale Distributed Computing.

Photo of Denny Lee

Denny Lee

Concur Technologies

I am a hands-on data architect and developer/hacker with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems, both on-premises and in the cloud. My key focus is solving complex, large-scale data problems – providing not only architectural direction but also hands-on implementation of these systems. I have experience building greenfield teams as well as serving as a turnaround/change catalyst.

My current technical interests include Apache Spark, Big Data, Machine Learning, Graph databases, Cloud Infrastructure, and Distributed Systems Robustness.

Photo of Chris Fregly

Chris Fregly

Flux Capacitor AI

Chris Fregly is a Research Scientist at Flux Capacitor AI – a streaming analytics and machine learning startup in San Francisco. Chris is an Apache Spark Contributor, Netflix Open Source Committer, organizer of the global Advanced Spark and TensorFlow Meetup, and author of the upcoming book, Advanced Spark. Previously, Chris was an engineer at Databricks and Netflix – as well as a founding member of the IBM Spark Technology Center.

Comments on this page are now closed.

Comments

Picture of Paco Nathan
02/24/2015 6:13pm PST

Great to meet so many of you at Spark Camp. On behalf of the instructors, we really enjoyed this!

We’d like to get your feedback about Spark Camp, to help improve it. Here’s a Google Form for a survey: http://goo.gl/forms/s67ml4sN23

Also, if you want to extend your account for the cloud-based notebook that we used in the tutorial, please let us know via this survey.

Many thanks!
Paco

Picture of Paco Nathan
02/19/2015 8:54am PST

Follow-up from yesterday. Here are some great talks that go into detail about Spark SQL:

“The Spark SQL Optimizer and External Data Sources API”
Michael Armbrust
http://youtu.be/GQSNJAzxOr8

“What’s coming for Spark in 2015”
Patrick Wendell
http://youtu.be/YWppYPWznSQ

Plus, in general check the archives from talks/meetups on the Apache Spark channel on YouTube:
https://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w

Picture of Paco Nathan
02/19/2015 8:35am PST

Hi Anoop,

Thank you kindly! And feedback on the UI is much appreciated. I like the breadcrumbs approach.

I just checked on http://feedback.databricks.com/ with one of my personal accounts (not Databricks) and it is difficult to find/use the login buttons from UserVoice. They’re up in the top-right corner. I will update the slide, but I’ve notified the UX team to fix that. Thanks for catching that!

Picture of Paco Nathan
02/19/2015 8:28am PST

Hi Patrick,

For the folder exports: this is also the case for IPython notebook — code + markdown + results are represented in JSON. You can use a JSON pretty printer, such as piping through “python -m json.tool”

However, if you export individual notebooks, those are in the source language, i.e., Scala, Python, etc.
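For example, a quick Python sketch that does the same thing as json.tool for a single exported notebook file (the filename below is only a placeholder):

import json

# Load one exported notebook file; the name here is only an example
with open("08.graphx.json") as f:
    notebook = json.load(f)

# Pretty-print the notebook contents (commands, markdown, results) with indentation
print json.dumps(notebook, indent=2, sort_keys=True)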

02/19/2015 6:12am PST

Great session yesterday. Thanks! I have some feedback about the Databricks Cloud UI – (posting it here since http://feedback.databricks.com/ does not allow me to)

It would be great if the UI could have breadcrumbs that show the current path. Sometimes I end up having to do many clicks to navigate through the workspace, and breadcrumbs would make the navigation much easier.

02/18/2015 11:32pm PST

Thanks Paco. After unzip, each file is actually a JSON doc. It’s not human readable; it needs parsing to extract the commands and results.

Picture of Paco Nathan
02/18/2015 9:39pm PST

The formatting on these comments munged that note, but the *.dbc download is actually a ZIP file.

Try using “unzip -l _SparkCamp.dbc”

Picture of Paco Nathan
02/18/2015 9:38pm PST

Hi Patrick,

The extension looks proprietary, but it’s actually a JAR, i.e., ZIP format. Try this:

bash-3.2$ unzip -l _SparkCamp.dbc
Archive:  _SparkCamp.dbc
  Length     Date    Time    Name
 --------    ----    ----    ----
  2544435  02-18-15  01:20   _SparkCamp/
    15670  02-18-15  01:20   _SparkCamp/08.graphx.scala
   104968  02-18-15  01:20   _SparkCamp/demo_mllib_iris.scala
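For reference, here is a rough Python equivalent of that shell listing, using only the standard library; since a .dbc archive is ZIP-formatted, the zipfile module can read it directly (the folder name for extraction is arbitrary):

import zipfile

# Open the archive shown above and list its contents, similar to "unzip -l"
with zipfile.ZipFile("_SparkCamp.dbc") as dbc:
    for info in dbc.infolist():
        print info.file_size, info.filename
    # Optionally unpack everything into a local folder
    dbc.extractall("_SparkCamp_contents")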

Picture of Krishna Sankar
02/18/2015 9:37pm PST

Export – Source File works at the notebook level.

02/18/2015 9:30pm PST

Any way to download the whole set of _SparkCamp notebooks?
Tried to export it from DBC, but it’s in the proprietary DBC Archive format.

Picture of Krishna Sankar
02/18/2015 4:00pm PST

Thanks Paco. The solution I showed is below. It is just one way – a good start at best. Nothing fancy:

# Databricks notebook source exported at Wed, 18 Feb 2015 23:57:11 UTC
# Coding Exercise 1 – Wordcount + join
# Krishna Sankar (2/18/15)
# Not optimized for scale et al. Just to give a start

# COMMAND ----------
# Always a good practice to have this
import datetime
print "Last ran @ %s" % datetime.datetime.now()

# COMMAND ----------
# Again, a good practice
print sc.version

# COMMAND ----------
lines_01 = sc.textFile('/mnt/paco/intro/CHANGES.txt')

# COMMAND ----------
lines_01.count()

# COMMAND ----------
from operator import add
wc_01 = lines_01.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)

# COMMAND ----------
wc_01.count()

# COMMAND ----------
wc_01.take(10)

# COMMAND ----------
# If you want to see how the words are distributed
# Collect over a large dataset can potentially exhaust the memory
wc_01.sortByKey().collect()

# COMMAND ----------
wc_01.filter(lambda x: x[0] == 'spark').collect()

# COMMAND ----------
lines_02 = sc.textFile('/mnt/paco/intro/README.md')
lines_02.count()
wc_02 = lines_02.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
wc_02.count()

# COMMAND ----------
wc_01.join(wc_02).filter(lambda x: x[0] == 'Spark').collect()

# COMMAND ----------
wc_02.sortByKey().take(10)  # collect()

# COMMAND ----------
# By mistake I used 'spark' with lowercase. Interesting, because a normal join won't give anything, as only one file has 'spark'

# COMMAND ----------
wc_02.filter(lambda x: x[0] == 'spark').collect()

# COMMAND ----------
# To catch 'spark', we need the fullOuterJoin!

# COMMAND ----------
wc_01.fullOuterJoin(wc_02).sortByKey().filter(lambda x: x[0] == 'spark').collect()

Picture of Paco Nathan
02/18/2015 1:59pm PST

Krishna Sankar will give that talk — my apologies for the mistype.

Picture of Paco Nathan
02/18/2015 1:58pm PST

Hi Hieu,

Yes, in fact Krisha will give that talk, up immediately next.

02/18/2015 1:54pm PST

can you make available the solution to the workflow assignment for comparison? Thanks

Picture of Paco Nathan
02/18/2015 8:34am PST

SLIDES TO DOWNLOAD FOR TODAY:

http://training.databricks.com/workshop/sparkcamp.pdf

Thank you much!

Picture of Paco Nathan
02/17/2015 6:26pm PST

Hi Carnot,

It’s not necessary to download anything in advance. We hope to have fixed that, which was difficult at previous conferences. No, we won’t be using VMs.

See you tomorrow -

Paco

02/17/2015 5:31pm PST

So just to be clear: there is no advance download of Spark itself or the exercises? I was expecting to have to download a VM or something similarly huge.

Picture of Paco Nathan
02/17/2015 12:39pm PST

Hi Alaa,

We will work with you tomorrow about that. See you there!

Paco

02/17/2015 12:25pm PST

Hi,
Trying to prepare for tomorrow.. got the following error on my laptop:

D:\PDF\Spark\spark-training\simple-app>..\spark\bin\spark-submit --class "SimpleApp" --master local[*] target\scala-2.10\simple-project_2.10-1.0.jar
Exception in thread "main" java.lang.NoSuchMethodError: scala.collection.immutable.$colon$colon.hd$1()Ljava/lang/Object;
        at org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:227)
        at org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:220)
        at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:75)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:70)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Picture of Paco Nathan
02/16/2015 5:47pm PST

Hi Ozlem,

SQL experience will help a lot. It would be best to have some Python, but many of the coding exercises provide code samples that you can edit, or cut and paste from earlier example code, to complete the exercise. So lots of Python experience is not needed at all.

02/16/2015 5:43pm PST

Can someone without Python or Java knowledge, but with Hive and SQL experience, attend the Spark Camp?

Picture of Paco Nathan
02/14/2015 5:47pm PST

Hi Roland,

You got it to run correctly. Those are “warnings” on the console, not exceptions.

In class, we’ll show how to turn down the log level, to get rid of some of that noise — however, often in debugging it is useful.

See you there next week!

Picture of Roland Hochmuth
02/14/2015 5:43pm PST

In prep for the camp I ran spark-submit and received the following errors, and am wondering how to resolve them.

Rolands-MacBook-Pro-2:simple-app rolandhochmuth$ ../spark/bin/spark-submit --class "SimpleApp" --master local[*] target/scala-2.10/simple-project_2.10-1.0.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
2015-02-14 18:25:50.428 java[27528:1703] Unable to load realm info from SCDynamicStore
15/02/14 18:26:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
15/02/14 18:26:50 WARN LoadSnappy: Snappy native library not loaded
Lines with a: 83, Lines with b: 38

Picture of Paco Nathan
02/11/2015 11:55am PST

Hi Kyle,

Certainly, yes. See you there!

02/11/2015 11:44am PST

Hey Paco

I am an equities researcher attending the conference because I have an interest in the Hortonworks/Cloudera and Spark ecosystems. I do not really want to participate in the hands-on learning aspect, so will it be OK if I am just an observer?

Picture of Paco Nathan
02/11/2015 8:40am PST

Thank you Sean -

There will be an update on the USB.zip at the tutorial. We will have USBs to hand out. Yes, the file layout for the Apache Spark download changed in the 1.2.x release. We’ll cover that in the tutorial.

See you next week!
Paco

Picture of Sean Boisen
02/11/2015 8:27am PST

I downloaded usb.zip, extracted, and ran the commands for building and using the simple-app. However, the README.md file lists two folders that don’t appear to be included in usb.zip: streaming and website.

Picture of Paco Nathan
01/07/2015 2:17pm PST

We show examples mostly in Scala, Python, SQL, plus a few in Java.

01/07/2015 2:06pm PST

What language will be used for the workshop? Scala, Java, or Python?