Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

In-Person Training
Spark foundations: Prototyping Spark use cases on Wikipedia datasets

Jacob Parr (Databricks)
Monday, March 13 & Tuesday, March 14, 9:00am - 5:00pm
Spark & beyond
Location: 212 C
Secondary topics: Streaming

Participants should plan to attend both days of this 2-day training course. Platinum and Training passes do not include access to tutorials on Tuesday.

The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Jacob Parr employs hands-on exercises using various Wikipedia datasets to illustrate the variety of ideal programming paradigms Spark makes possible.

What you'll learn, and how you can apply it

  • Understand the variety of ideal programming paradigms Spark makes possible


Prerequisites

  • A basic understanding of software development
  • Some experience coding in Python, Java, SQL, Scala, or R
  • Familiarity with Scala programming basics (check out Scala Basics and Atomic Scala)

By the end of the training, you’ll be able to create proofs of concept and prototype applications using Spark.

The course will consist of about 50% lecture and 50% hands-on labs. All participants will have access to Databricks Community Edition after class to continue working on labs and assignments.

Note that most of the hands-on labs will be taught in Scala. (PySpark architecture and code examples will be covered briefly.)

Who should attend?

People with less than two months of hands-on experience with Spark

Datasets explored in class:

  • Pageviews
  • Clickstream
  • Pagecounts
  • English Wikipedia
  • English Wikipedia w/ edits
  • Live edits stream of multiple languages

Day 1

9:00am – 9:30am
Introduction to Wikipedia and Spark
Demo: Logging into Databricks and a tour of the user interface

  • Overview of the six Wikipedia data sources
  • Overview of Apache Spark APIs, libraries, and cluster architecture

9:30am – 10:30am
DataFrames and Spark SQL
Datasets used: Pageviews and Clickstream

  • Use a SQLContext to create a DataFrame from different data sources (S3, JSON, RDBMS, HDFS, Cassandra, etc.)
  • Run common operations on DataFrames to explore them
  • Cache a DataFrame into memory
  • Correctly size the number of partitions in a DataFrame, including the size of each partition
  • Use the Spark CSV library from Spark Packages to read structured files
  • Mix SQL and DataFrame queries
  • Write a user-defined function (UDF)
  • Join two DataFrames
  • Overview of how Spark SQL’s Catalyst optimizer converts logical plans to optimized physical plans
  • Create visualizations using matplotlib, Databricks, and Google Visualizations
  • Use the Spark UI’s new SQL tab to troubleshoot performance issues (like input read size, identifying stage boundaries, and Cartesian products)
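
The partition-sizing bullet above is mostly arithmetic. A back-of-the-envelope sketch in Scala (the 2 GB input and the ~128 MB-per-partition target are illustrative rules of thumb, not Spark requirements):

```scala
// Back-of-the-envelope partition sizing for a DataFrame.
val datasetBytes    = 2L * 1024 * 1024 * 1024   // 2 GB of input data (assumed)
val targetPartition = 128L * 1024 * 1024        // ~128 MB per partition (rule of thumb)

// Round up so a trailing partial chunk still gets its own partition.
val numPartitions = ((datasetBytes + targetPartition - 1) / targetPartition).toInt
// 2 GB / 128 MB => 16 partitions
```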

10:30am – 11:00am
Morning break

11:00am – 12:30pm
DataFrames and Spark SQL (cont.)

12:30pm – 1:30pm
Lunch

1:30pm – 3:00pm
Spark core architecture

  • Driver and executor JVMs
  • Local mode
  • Resource managers (standalone, YARN, Mesos)
  • How to optimally configure Spark (# of slots, JVM sizes, garbage collection, etc.)
  • PySpark architecture (different serialization, extra Python processes, UDFs are slower, etc.)
  • Reading Spark logs and stdout on drivers versus executors
  • Spark UI: Exploring the user interface to understand what’s going on behind the scenes of your application (# of tasks, memory of executors, slow tasks, Spark master/worker UIs, etc.)

3:00pm – 3:30pm
Afternoon break

3:30pm – 5:00pm
Resilient distributed datasets
Datasets used: Pagecounts and English Wikipedia

  • When to use DataFrames versus RDDs (type safety, memory pressure, optimizations, I/O)
  • Two ways to create an RDD using a SparkContext: Parallelize and read from an external data source
  • Common transformations and actions
  • Narrow versus wide transformations and performance implications (pipelining, shuffle)
  • How transformations lazily build up a directed acyclic graph (DAG)
  • How a Spark application breaks down to Jobs > Stages > Tasks
  • Repartitioning an RDD (repartition versus coalesce)
  • Different memory persistence levels for RDDs (memory, disk, serialization, etc.)
  • Different types of RDDs (HadoopRDD, ShuffledRDD, MapPartitionsRDD, PairRDD, etc.)
  • Spark UI: How to interpret the new DAG visualization, how to troubleshoot common performance issues like GroupByKey versus ReduceByKey by looking at shuffle read/write info
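
The GroupByKey-versus-ReduceByKey trade-off above can be previewed without a cluster. This plain-Scala sketch (simulated partitions, not the Spark API) counts how many records would cross the shuffle boundary under each approach:

```scala
// Simulated input: two partitions of (word, 1) pairs.
val partitions = Seq(
  Seq("a" -> 1, "b" -> 1, "a" -> 1),
  Seq("a" -> 1, "b" -> 1, "b" -> 1)
)

// groupByKey: every record is shuffled, then values are combined on the reduce side.
val groupByKeyShuffled = partitions.map(_.size).sum   // all 6 records cross the network

// reduceByKey: each partition pre-aggregates locally (map-side combine),
// so only one record per key per partition is shuffled.
val localCombined       = partitions.map(_.groupMapReduce(_._1)(_._2)(_ + _))
val reduceByKeyShuffled = localCombined.map(_.size).sum   // only 4 records shuffled

// Both approaches produce the same final counts.
val counts = localCombined.flatten.groupMapReduce(_._1)(_._2)(_ + _)
```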

Day 2

9:00am – 9:30am
Review of Day 1

  • DataFrames and Spark SQL
  • Spark architecture
  • RDDs

9:30am – 10:30am
Shared variables (accumulators and broadcast variables)

  • Common use cases for shared variables
  • How accumulators can be used to implement distributed counters in parallel
  • Using broadcast variables to keep a read-only variable cached on each machine
  • Broadcast variables internals: BitTorrent implementation
  • Differences between broadcast variables and closures/lambdas (across stages versus per stage)
  • Configuring the autoBroadcastJoinThreshold in Spark SQL to do more efficient joins
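
As a concept sketch of the distributed-counter bullet (plain Scala, not the Spark accumulator API): each simulated task counts locally, and the driver merges the partial results, which is essentially what an accumulator does:

```scala
// Records spread across three simulated partitions.
val partitions = Seq(
  Seq("ok", "bad", "ok"),
  Seq("ok", "ok"),
  Seq("bad", "ok")
)

// Each "task" counts bad records locally -- no shared mutable state
// is touched while a partition is being processed.
val perTaskCounts = partitions.map(part => part.count(_ == "bad"))

// The "driver" merges the partial counts, like reading accumulator.value.
val badRecords = perTaskCounts.sum
```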

10:30am – 11:00am
Morning break

11:00am – 12:00pm
Graph processing with GraphX
Datasets used: Clickstream

  • Use cases for graph processing
  • Graph processing fundamentals: Vertex, edge (unidirectional, bidirectional), labels
  • Common graph algorithms: In-degree, out-degree, PageRank, subGraph, shortest path, triplets
  • GraphX internals: How Spark stores large graphs in RDDs (VertexRDD, EdgeRDD, and routing table RDD)
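
In-degree and out-degree are easy to preview with a plain-Scala edge list (a made-up toy graph; GraphX's VertexRDD/EdgeRDD storage is not shown):

```scala
// A tiny directed graph as an edge list: (source, destination).
// Think of the Wikipedia clickstream: an edge means "readers click from A to B".
val edges = Seq(
  ("Spark", "Hadoop"),
  ("Spark", "Scala"),
  ("Scala", "Hadoop"),
  ("Hadoop", "Spark")
)

// out-degree: edges leaving each vertex; in-degree: edges arriving at it.
val outDegree = edges.groupMapReduce(_._1)(_ => 1)(_ + _)
val inDegree  = edges.groupMapReduce(_._2)(_ => 1)(_ + _)
```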

12:00pm – 12:30pm
Spark Streaming
Datasets used: Live edits stream of multiple languages

  • Architecture of Spark Streaming: Receivers, batch interval, block interval, direct pull
  • How the microbatch mechanism in Spark Streaming breaks up the stream into tiny batches and processes them
  • How to use a StreamingContext to create input DStreams (discretized streams)
  • Common transformations and actions on DStreams (map, filter, count, union, join, etc.)
  • Creating live, dynamically updated visualizations in Databricks (that update every two seconds)
  • Spark UI: How to use the new Spark Streaming UI to understand the performance of batch size versus processing latency
  • Receiver versus direct pull approach
  • High-availability guidelines (WAL, checkpointing)
  • Window operations: Apply transformations over a sliding window of data
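
Window operations can be mimicked with Scala's own sliding: given per-batch event counts, a window covering three batches that slides by one batch yields a running windowed total (plain collections, not the DStream API):

```scala
// Events counted per micro-batch (one count per batch interval).
val perBatchCounts = List(4, 7, 3, 5, 6)

val windowLength  = 3  // window covers the last 3 batches
val slideInterval = 1  // emit a result every batch

// Each output is the total over a sliding window of recent batches,
// analogous to reducing over a window of windowLength batches.
val windowedTotals =
  perBatchCounts.sliding(windowLength, slideInterval).map(_.sum).toList
```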

12:30pm – 1:30pm
Lunch

1:30pm – 2:30pm
Spark Streaming (cont.)

2:30pm – 3:00pm
Spark machine learning
Datasets used: English Wikipedia w/ edits

  • Common use cases of machine learning with Spark
  • When to use Spark MLlib (w/ RDDs) versus Spark ML (w/ DataFrames)
  • ML Pipelines concepts: DataFrames, transformer, estimator, pipeline, parameter
  • Basic statistics with MLlib
  • Tf-idf (term frequency-inverse document frequency)
  • Streaming machine learning (k-means, linear regression, logistic regression)
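
A minimal tf-idf computation over a toy corpus in plain Scala (standard log(N/df) idf; MLlib's HashingTF/IDF implementation differs in the details):

```scala
import scala.math.log

// Toy corpus: each document is a bag of words.
val docs = Seq(
  Seq("spark", "streaming", "spark"),
  Seq("spark", "graphx"),
  Seq("spark", "mllib", "mllib")
)

// Term frequency within one document (raw count / document length).
def tf(term: String, doc: Seq[String]): Double =
  doc.count(_ == term).toDouble / doc.size

// Inverse document frequency: terms in fewer documents score higher.
def idf(term: String): Double = {
  val docFreq = docs.count(_.contains(term))
  log(docs.size.toDouble / docFreq)
}

def tfidf(term: String, doc: Seq[String]): Double = tf(term, doc) * idf(term)

// "spark" appears in every document, so its idf -- and tf-idf -- is 0;
// "mllib" is rare, so it scores higher in the document containing it.
val sparkScore = tfidf("spark", docs(0))
val mllibScore = tfidf("mllib", docs(2))
```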

3:00pm – 3:30pm
Afternoon break

3:30pm – 4:30pm
Spark machine learning (cont.)

4:30pm – 5:00pm
Spark R&D (optional)

  • Project Tungsten
  • New Datasets API
  • Upcoming developments: DataFrames in Streaming and GraphX, new MLlib algorithms, etc.
  • Berkeley Data Analytics Stack (Succinct, IndexedRDD, BlinkDB, SampleClean)

About your instructor

Jacob Parr is the owner of JParr Productions, where he writes courseware, leads one-on-one training for companies like Databricks, Nike, Comcast, Cisco, AOL, and Moody’s Analytics, and speaks at conferences like Spark Summit. Jacob became interested in software development at the age of 11, and just two years later, he began programming his own video games—he’s been developing software ever since. Over his 20-year career, he has worked in software testing and test automation at Sierra On-Line (aka The ImagiNation Network, aka AOL Entertainment); developed software for Sierra Telephone, first as an engineer and eventually as an architect and senior developer; and built custom software for websites, ecommerce systems, real-estate applications, and even the occasional enterprise tax consultant. His background includes telecommunications, billing systems, service order systems, trouble ticketing systems, and enterprise integration, and he has built everything from Swing apps to monoliths to REST and microservices architectures. He participates in a number of open source projects. Jacob lives in Oakhurst, CA, with his lovely wife. Now empty nesters with three adult children, they enjoy spoiling their Boston terriers. He loves to play practical jokes, fly drones, chase his nephews and nieces with an arsenal of Nerf guns, and work on his N-scale train set. In what little spare time remains, he loves to (you guessed it) work on his pet software projects.

Twitter: @SireInsectus

Conference registration

Get the Platinum pass or the Training pass to add this course to your package.

Comments on this page are now closed.


03/01/2017 7:26am PST

Do I need to bring my own laptop for the training? Do I need specific software installed?

Sophia DeMartini
02/21/2017 3:23am PST

Hi Brent,

We unfortunately do not record the trainings, as they’re more focused on hands-on learning. If you have more questions about how to figure out which training to attend, please email me at

Thank you,

Brent Johnson
02/20/2017 10:54am PST

I am torn as to which of two training sessions to attend. Is there a means by which I can register for one session and later watch the video of another?