Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Spark foundations: Prototyping Spark use cases on Wikipedia datasets

Brian Clapper (Databricks)
9:00am - 5:00pm Monday, March 28 & Tuesday, March 29

Location: 211D

Participants should plan to attend both days of this 2-day training course. Training passes do not include access to tutorials on Tuesday.

Average rating: 4.32 (19 ratings)

Prerequisite knowledge

Participants must have a laptop with an up-to-date version of Chrome or Firefox (Internet Explorer not supported). You should have a basic understanding of software development; some experience coding in Python, Java, SQL, Scala, or R; and Scala programming basics (check out Scala Basics and Atomic Scala).


The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. In this 2-day course, Brian Clapper employs hands-on exercises using various Wikipedia datasets to illustrate the variety of ideal programming paradigms Spark makes possible. By the end of the training, you’ll be able to create proofs of concept and prototype applications using Spark.

The course will consist of about 50% lecture and 50% hands-on labs. All participants will have access to Databricks Community Edition after class to continue working on labs and assignments.

Note that most of the hands-on labs will be taught in Scala. (PySpark architecture and code examples will be covered briefly.)

Who should attend?

People with less than 2 months of hands-on experience with Spark

Datasets explored in class:


Day 1

9:00 AM – 9:30 AM
Introduction to Wikipedia and Spark
Demo: Logging into Databricks and a tour of the user interface

  • Overview of the six Wikipedia data sources
  • Overview of Apache Spark APIs, libraries, and cluster architecture

9:30 AM – 10:30 AM
DataFrames and Spark SQL
Datasets used: Pageviews and Clickstream

  • Use a SQLContext to create a DataFrame from different data sources (S3, JSON, RDBMS, HDFS, Cassandra, etc.)
  • Run some common operations on DataFrames to explore them
  • Cache a DataFrame into memory
  • Correctly size the number of partitions in a DataFrame, including the size of each partition
  • Use the Spark CSV library from Spark Packages to read structured files
  • Mix SQL and DataFrame queries
  • Write a user-defined function (UDF)
  • Join two DataFrames
  • Overview of how Spark SQL’s Catalyst optimizer converts logical plans to optimized physical plans
  • Create visualizations using matplotlib, Databricks, and Google Visualizations
  • Use the Spark UI’s new SQL tab to troubleshoot performance issues (like input read size, identifying stage boundaries, and Cartesian products)
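The "mix SQL and DataFrame queries" and "write a UDF" ideas above can be sketched outside Spark with Python's built-in sqlite3 module, which also supports registering a user-defined function. This is a concept illustration only, not the Spark API; the table and data are invented stand-ins for the Pageviews dataset:

```python
import sqlite3

# Toy stand-in for the Pageviews dataset (table name and rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (project TEXT, page TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO pageviews VALUES (?, ?, ?)",
    [("en", "Apache_Spark", 2100),
     ("en", "Scala_(programming_language)", 900),
     ("de", "Apache_Spark", 300)])

# Register a UDF, loosely analogous to registering a udf() in Spark SQL.
conn.create_function("normalize", 1, lambda s: s.replace("_", " ").lower())

# Mix the UDF into a declarative SQL query.
rows = conn.execute(
    "SELECT normalize(page), views FROM pageviews "
    "WHERE project = 'en' ORDER BY views DESC").fetchall()
print(rows)  # highest-traffic English pages, normalized titles
```

In Spark the same pattern applies: register a UDF once, then use it freely in both SQL strings and DataFrame expressions.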

10:30 AM – 11:00 AM
Break

11:00 AM – 12:30 PM
DataFrames and Spark SQL (cont.)

12:30 PM – 1:30 PM
Lunch

1:30 PM – 3:00 PM
Spark core architecture

  • Driver and executor JVMs
  • Local mode
  • Resource managers (standalone, YARN, Mesos)
  • How to optimally configure Spark (# of slots, JVM sizes, garbage collection, etc.)
  • PySpark architecture (different serialization, extra Python processes, UDFs are slower, etc.)
  • Reading Spark logs and stdout on drivers vs. executors
  • Spark UI: exploring the user interface to understand what’s going on behind the scenes of your application (# of tasks, memory of executors, slow tasks, Spark master/worker UIs, etc.)
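As a rough illustration of the configuration knobs listed above (slot counts, JVM sizes, garbage collection), here is a hedged spark-submit sketch; the numbers are hypothetical and should be sized to your cluster, but the flags themselves are standard spark-submit options:

```shell
# Illustrative only: 8 executors with 4 task slots each on YARN,
# G1 garbage collection, and an explicit default parallelism.
spark-submit \
  --master yarn \
  --num-executors 8 \
  --executor-cores 4 \
  --executor-memory 8g \
  --driver-memory 4g \
  --conf spark.default.parallelism=64 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  my-app.jar
```

A common starting point is to keep executor heaps modest (single-digit gigabytes) to limit GC pauses, then scale out with more executors rather than up.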

3:00 PM – 3:30 PM
Break

3:30 PM – 5:00 PM
Resilient distributed datasets
Datasets used: Pagecounts and English Wikipedia

  • When to use DataFrames vs. RDDs (type-safety, memory pressure, optimizations, i/o)
  • Two ways to create an RDD using a SparkContext: parallelize and read from an external data source
  • Common transformations and actions
  • Narrow vs. wide transformations and performance implications (pipelining, shuffle)
  • How transformations lazily build up a directed acyclic graph (DAG)
  • How a Spark application breaks down to Jobs > Stages > Tasks
  • Repartitioning an RDD (repartition vs. coalesce)
  • Different memory persistence levels for RDDs (memory, disk, serialization, etc.)
  • Different types of RDDs (HadoopRDD, ShuffledRDD, MapPartitionsRDD, PairRDD, etc.)
  • Spark UI: how to interpret the new DAG visualization, how to troubleshoot common performance issues like GroupByKey vs. ReduceByKey by looking at shuffle read/write info
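The groupByKey-vs-reduceByKey point above comes down to map-side combining: reduceByKey aggregates within each partition before the shuffle, so fewer records cross the network. A Spark-free sketch of that idea in plain Python, using invented word-count data:

```python
from collections import Counter
from itertools import chain

# Two hypothetical partitions of (word, 1) pairs from a word count.
partitions = [
    [("spark", 1), ("scala", 1), ("spark", 1)],
    [("spark", 1), ("scala", 1)],
]

# groupByKey-style shuffle: every record crosses the network.
records_shuffled = len(list(chain.from_iterable(partitions)))

# reduceByKey-style: combine within each partition first (map-side combine),
# then shuffle only one record per key per partition.
local = []
for part in partitions:
    acc = Counter()
    for key, value in part:
        acc[key] += value
    local.append(acc)
records_after_combine = sum(len(acc) for acc in local)

# Final merge on the reduce side.
totals = Counter()
for acc in local:
    totals.update(acc)
print(records_shuffled, records_after_combine, dict(totals))
```

On this toy data the saving is small (5 records vs. 4), but with few distinct keys and many records per partition, the combine step shrinks shuffle writes dramatically, which is exactly what the shuffle read/write columns in the Spark UI reveal.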

Day 2

9:00 AM – 9:30 AM
Review of Day 1

  • DataFrames and Spark SQL
  • Spark architecture
  • RDDs

9:30 AM – 10:30 AM
Shared variables (accumulators and broadcast variables)

  • Common use cases for shared variables
  • How accumulators can be used to implement distributed counters in parallel
  • Using broadcast variables to keep a read-only variable cached on each machine
  • Broadcast variables internals: BitTorrent implementation
  • Differences between broadcast variables and closures/lambdas (across stages vs. per stage)
  • Configuring the autoBroadcastJoinThreshold in Spark SQL to do more efficient joins
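The distributed-counter idea behind accumulators can be modeled without Spark: each task increments a task-local copy, and the driver merges the per-task deltas. This is a plain-Python toy model (the "bad record" data is invented), not the Spark accumulator API itself:

```python
# Toy model of accumulator semantics: per-task local updates, merged on the driver.
partitions = [["ok", "BAD", "ok"], ["BAD", "ok"]]

def task(records):
    bad = 0                       # task-local accumulator copy
    good = []
    for r in records:
        if r == "BAD":
            bad += 1              # local increment, like acc.add(1) in Spark
        else:
            good.append(r)
    return good, bad

results = [task(p) for p in partitions]            # one result per partition
bad_records = sum(delta for _, delta in results)   # deltas merged on the "driver"
clean = [r for good, _ in results for r in good]
print(bad_records, clean)
```

This also shows why Spark documents accumulator values as reliable only when read on the driver after an action: the merged total does not exist on any single worker.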

10:30 AM – 11:00 AM
Break

11:00 AM – 12:00 PM
Graph processing with GraphX
Datasets used: Clickstream

  • Use cases for graph processing
  • Graph processing fundamentals: vertex, edge (unidirectional, bidirectional), labels
  • Common graph algorithms: in-degree, out-degree, PageRank, subgraph, shortest path, triplets
  • GraphX internals: How Spark stores large graphs in RDDs (VertexRDD, EdgeRDD, and routing table RDD)

12:00 PM – 12:30 PM
Spark Streaming
Datasets used: Live edits stream of multiple languages

  • Architecture of Spark Streaming: receivers, batch interval, block interval, direct pull
  • How the microbatch mechanism in Spark Streaming breaks up the stream into tiny batches and processes them
  • How to use a StreamingContext to create input DStreams (discretized streams)
  • Common transformations and actions on DStreams (map, filter, count, union, join, etc.)
  • Creating live, dynamically updated visualizations in Databricks (that update every 2 seconds)
  • Spark UI: how to use the new Spark Streaming UI to understand the performance of batch size vs. processing latency
  • Receiver vs. direct pull approach
  • High availability guidelines (WAL, checkpointing)
  • Window operations: apply transformations over a sliding window of data
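The microbatch and window ideas above can be simulated in a few lines of plain Python: batches arrive one interval at a time, and a window operation aggregates over the last N intervals. This is a concept sketch with invented edit events, not the DStream API:

```python
from collections import Counter, deque

# Simulated microbatches of Wikipedia edit events (language codes),
# one list per batch interval; the data is made up.
batches = [["en", "en"], ["de"], ["en", "fr"], ["en"]]

# Sliding window of 3 batch intervals, sliding by 1 interval,
# similar in spirit to a windowed count on a DStream.
window = deque(maxlen=3)
window_counts = []
for batch in batches:
    window.append(batch)  # oldest batch falls out automatically
    window_counts.append(Counter(lang for b in window for lang in b))
print(window_counts[-1])  # counts over the 3 most recent intervals
```

Keeping the window a small multiple of the batch interval is what keeps processing latency below batch size, the trade-off the Spark Streaming UI visualizes.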

12:30 PM – 1:30 PM
Lunch

1:30 PM – 2:30 PM
Spark Streaming (cont.)

2:30 PM – 3:00 PM
Spark machine learning
Datasets used: English Wikipedia w/ edits

  • Common use cases of machine learning with Spark
  • When to use Spark MLlib (w/ RDDs) vs. Spark ML (w/ DataFrames)
  • ML Pipelines concepts: DataFrames, transformer, estimator, pipeline, parameter
  • Basic statistics with MLlib
  • Tf-idf (term frequency-inverse document frequency)
  • Streaming machine learning (k-means, linear regression, logistic regression)
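The tf-idf item above is easy to compute by hand on a tiny corpus. This sketch uses the basic unsmoothed definition (tf times log of inverse document frequency) on invented text standing in for Wikipedia articles:

```python
import math
from collections import Counter

# Tiny corpus standing in for Wikipedia articles (invented text).
docs = [
    "spark makes big data simple".split(),
    "spark streaming processes data in near real time".split(),
    "wikipedia is a free encyclopedia".split(),
]

num_docs = len(docs)
# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)               # term frequency in this doc
    idf = math.log(num_docs / doc_freq[term])     # unsmoothed inverse doc frequency
    return tf * idf

# "wikipedia" appears in only one document, so it scores higher
# than "spark", which appears in two.
print(tf_idf("wikipedia", docs[2]), tf_idf("spark", docs[0]))
```

MLlib's HashingTF/IDF pipeline computes the same quantities at scale, with a smoothed idf variant and hashed term indices instead of raw strings.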

3:00 PM – 3:30 PM
Break

3:30 PM – 4:30 PM
Spark machine learning (cont.)

4:30 PM – 5:00 PM
Spark R&D (optional)

  • Project Tungsten
  • New Datasets API
  • Upcoming developments: DataFrames in Streaming and GraphX, new MLlib algorithms, etc.
  • Berkeley Data Analytics Stack (Succinct, IndexedRDD, BlinkDB, SampleClean)

Brian Clapper


Brian Clapper is a senior instructor and curriculum developer at Databricks, with more than 32 years' experience as a software developer. He has worked for a stock exchange, the US Navy, a large software company, several startups, and small companies; most recently, he spent 7 years as an independent consultant and trainer. Brian is fluent in many languages, including Scala, Java, Python, Ruby, C#, and C, and is highly familiar with current web application technologies, including frameworks like Play!, Ruby on Rails, and Django, and frontend technologies like jQuery, EmberJS, and AngularJS. He founded the Philly Area Scala Enthusiasts in 2010 and has been a co-organizer of the Northeast Scala Symposium since 2011; he was also a co-organizer of Scalathon in 2011 and 2012. He maintains a substantial GitHub repository of open source projects and is fond of saying that, even after many years as a software developer, programming is still one of his favorite activities.

Comments on this page are now closed.


Jonathan Shimonovich
03/21/2016 3:37pm PDT

Will the training slides/docs be available afterwards for self studying?

Chenguang Yang
03/21/2016 2:47am PDT

What are the hardware and software requirements for my laptop?

Sophia DeMartini
02/01/2016 7:51am PST

Hi Anirban and Vivek,

Thank you for your interest in the Spark Training. We do not currently have a waitlist for this, but if you email me at, I will let you know if a spot does happen to open up in the future.
vivekanand praturi
01/31/2016 5:18am PST

Very interested in attending this, but it got sold out! How will we know if there is a last-minute opening due to cancellations?

Anirban Das
01/28/2016 3:48am PST

Tried to register, but this event is sold out now. Will you be holding a similar training in the Bay Area in the near future? I am very interested in attending.

Sameer Farooqui
01/04/2016 7:02am PST

Hi John, we’re just about to update the outlines this week. Basically, Spark Camp will be a fast-paced crash course for students with at least a month of experience with Spark. I will be teaching Spark Camp with help from a couple of engineers at Databricks. We are expecting ~150 students in class, so there will be very limited time for Q&A with the instructors.

The 2-day Spark Foundations class is for students new to Spark. It will go into more depth over the 2 days and move at a somewhat slower pace. We will limit the 2-day class to ~45 students, so there will be more opportunity for Q&A with the instructor. The instructor will be Brian Clapper.

John Teifel
12/18/2015 3:42am PST


What is the added value of attending the Spark Foundations Training on Monday and Tuesday compared to attending only the Spark Camp Tutorial on Tuesday? It looks like Sameer Farooqui will be speaking at both and it looks like there are extremely similar topics.