Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Spark camp: Exploring Wikipedia with Spark (Full Day)

Sameer Farooqui (Databricks)
9:00am–5:00pm Tuesday, 03/29/2016
Spark & Beyond

Location: LL21 E/F
Average rating: 3.90 (41 ratings)

Prerequisite knowledge

Participants must have a laptop with an up-to-date version of Chrome or Firefox installed (Internet Explorer is not supported). You should have a basic understanding of software development; some experience coding in Python, Java, SQL, Scala, or R; and familiarity with Scala programming basics (check out Scala Basics and Atomic Scala).

Each participant is required to set up a Databricks account for use during the tutorial. To ensure a swift and effective start, your account must be set up before the tutorial begins.


The class will consist of about 60% lecture and 40% hands-on labs + demos. Note that the hands-on labs in class will be taught in Scala. All students will have access to Databricks for one month after class to continue working on labs + assignments.

Datasets explored in class:


9:00 AM – 9:30 AM
Introduction to Wikipedia and Spark

  • Overview of the five Wikipedia data sources
  • Overview of Apache Spark APIs, libraries, and cluster architecture
  • Demo: Logging into Databricks and a tour of the user interface

9:30 AM – 10:30 AM
DataFrames and Spark SQL
Datasets used: Pageviews and Clickstream

  • How to use a SQLContext to create a DataFrame from different data sources (S3, JSON, RDBMS, HDFS, Cassandra, etc.)
  • Run some common operations on DataFrames to explore them
  • Cache a DataFrame in memory
  • Correctly size the number of partitions in a DataFrame, including the size of each partition
  • Mix SQL and DataFrame queries
  • Join two DataFrames
  • Overview of how Spark SQL’s Catalyst optimizer converts logical plans to optimized physical plans
  • Create visualizations using Databricks and Google Visualizations
  • Use the Spark UI’s new SQL tab to troubleshoot performance issues (like input read size, identifying stage boundaries, and Cartesian products)
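As a rough illustration of the partition-sizing topic above, here is a minimal sketch in plain Python (not Spark code, and not taken from the class labs, which are taught in Scala). The function name `suggest_partitions`, the 128 MB target partition size, and the rounding to a multiple of the core count are all assumptions chosen for illustration, not a rule stated by the instructor.

```python
import math

def suggest_partitions(total_bytes, target_partition_bytes=128 * 1024 * 1024, cores=1):
    """Suggest a partition count for a dataset of total_bytes.

    Assumption: each partition should hold at most about
    target_partition_bytes, and the count is rounded up to a multiple of
    the available cores so every core has work in each wave of tasks.
    """
    by_size = math.ceil(total_bytes / target_partition_bytes)
    waves = math.ceil(max(by_size, 1) / cores)
    return waves * cores

# Hypothetical input: a 1.2 GB dataset processed on 8 cores.
dataset_bytes = int(1.2 * 1024 ** 3)
partitions = suggest_partitions(dataset_bytes, cores=8)
print(partitions, "partitions of about", dataset_bytes // partitions, "bytes each")
```

In Spark itself, the resulting number would typically be passed to a repartitioning call; the right target size depends on the workload and cluster.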

10:30 AM – 11:00 AM
Break

11:00 AM – 12:00 PM
Spark core architecture

  • Driver and executor JVMs
  • Local mode
  • Resource managers (standalone)
  • How to optimally configure Spark (# of slots, JVM sizes, garbage collection, etc.)
  • Reading Spark logs and stdout on driver vs. executors
  • Spark UI: Exploring the user interface to understand what’s going on behind the scenes of your application (# of tasks, memory of executors, slow tasks, Spark master/worker UIs, etc.)

12:00 PM – 1:00 PM
Lunch

1:00 PM – 2:00 PM
Resilient distributed datasets
Dataset used: Pagecounts

  • When to use DataFrames vs. RDDs (type-safety, memory pressure, optimizations, I/O)
  • Narrow vs. wide transformations and performance implications (pipelining, shuffle)
  • How transformations lazily build up a directed acyclic graph (DAG)
  • How a Spark application breaks down to Jobs > Stages > Tasks
  • Repartitioning an RDD (repartition vs. coalesce)
  • Different memory persistence levels for RDDs (memory, disk, serialization, etc.)
  • Different types of RDDs (HadoopRDD, ShuffledRDD, MapPartitionsRDD, PairRDD, etc.)
  • Spark UI: How to interpret the new DAG visualization and how to troubleshoot common performance issues, like groupByKey vs. reduceByKey, by looking at shuffle read/write info
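To make the groupByKey vs. reduceByKey comparison above concrete, the following plain-Python sketch simulates how many records cross the shuffle in each case. It is not Spark code and not part of the class labs; the function names and the sample records are hypothetical, and partitions are modeled as simple lists of (key, value) pairs.

```python
from collections import defaultdict

def shuffle_group_by_key(partitions):
    """Simulate groupByKey followed by a sum: every record is shuffled as-is."""
    shuffled = [record for part in partitions for record in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

def shuffle_reduce_by_key(partitions):
    """Simulate reduceByKey: values are combined per partition before the shuffle."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value          # map-side combine
        shuffled.extend(local.items())   # at most one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

# Hypothetical page-count records spread over two partitions.
partitions = [
    [("Spark", 3), ("Scala", 1), ("Spark", 2)],
    [("Spark", 4), ("Scala", 5), ("Scala", 1)],
]
grouped_totals, grouped_records = shuffle_group_by_key(partitions)
reduced_totals, reduced_records = shuffle_reduce_by_key(partitions)
print(grouped_totals, grouped_records)
print(reduced_totals, reduced_records)
```

Both approaches produce the same totals, but the second shuffles fewer records, which is the kind of difference the shuffle read/write columns in the Spark UI can reveal.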

2:00 PM – 2:30 PM
Graph processing
Dataset used: Clickstream

  • Use cases for graph processing
  • Graph processing fundamentals: Vertex, Edge (unidirectional, bidirectional), Labels
  • Common graph algorithms and operations: in-degree, out-degree, PageRank, subgraph, shortest path, triplets
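The degree counts and PageRank listed above can be sketched on a toy directed graph in plain Python. This is not the API of any Spark graph library and not the lab code; the edge list, the `pagerank` function, and its damping factor of 0.85 and 50 iterations are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical clickstream-style edges: (source page, destination page).
edges = [
    ("Main_Page", "Apache_Spark"),
    ("Main_Page", "Scala"),
    ("Scala", "Apache_Spark"),
    ("Apache_Spark", "Main_Page"),
]

out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dst for _, dst in edges)

def pagerank(edges, damping=0.85, iterations=50):
    """Iterative PageRank with ranks normalized to sum to 1."""
    vertices = sorted({v for edge in edges for v in edge})
    n = len(vertices)
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)
    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        new_ranks = {v: (1.0 - damping) / n for v in vertices}
        for src, targets in out_links.items():
            if targets:
                share = ranks[src] / len(targets)
                for dst in targets:
                    new_ranks[dst] += damping * share
            else:
                # A vertex with no outgoing edges spreads its rank evenly.
                for v in vertices:
                    new_ranks[v] += damping * ranks[src] / n
        ranks = new_ranks
    return ranks

ranks = pagerank(edges)
print(in_degree, out_degree)
print(ranks)
```

Library implementations may scale ranks differently (for example, so they do not sum to 1), so compare orderings rather than raw values.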

2:30 PM – 3:00 PM
Spark Streaming
Dataset used: Live edit stream from multiple languages

  • Architecture of Spark Streaming: Receivers, batch interval, block interval, direct pull
  • How the micro-batch mechanism in Spark Streaming breaks up the stream into tiny batches and processes them
  • How to use a StreamingContext to create Input DStreams (Discretized Streams)
  • Common transformations and actions on DStreams (map, filter, count, union, join, etc.)
  • Creating live, dynamically updated visualizations in Databricks (that update every 2 seconds)
  • Spark UI: How to use the new Spark Streaming UI to understand the performance of batch size vs. processing latency
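The micro-batch idea described above can be illustrated with a small plain-Python sketch that groups timestamped events into fixed-length batches. It is not Spark Streaming code; the `micro_batches` function, the sample edit events, and the 2-second batch interval are hypothetical choices for illustration.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, payload) events into consecutive fixed-length batches.

    Intervals with no events yield an empty batch, so the result has one
    entry per interval from time 0 up to the latest event.
    """
    batches = {}
    for timestamp, payload in events:
        index = int(timestamp // batch_interval)
        batches.setdefault(index, []).append(payload)
    if not batches:
        return []
    return [batches.get(i, []) for i in range(max(batches) + 1)]

# Hypothetical edit events: (seconds since start, language of the edited wiki).
edits = [(0.4, "en"), (1.1, "de"), (1.9, "en"), (2.5, "fr"), (6.3, "en")]
batches = micro_batches(edits, batch_interval=2)
counts = [len(batch) for batch in batches]
print(batches)
print(counts)
```

Each batch can then be processed like a small static dataset, which is why batch size and processing latency have to be balanced against each other.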

3:00 PM – 3:30 PM
Break

3:30 PM – 3:45 PM
Guest talk: Choosing an optimal storage backend for your Spark use case
Vida Ha

  • Data Storage Tips for Optimal Spark Performance
  • HDFS file/block sizes
  • Compression formats (gzip, Snappy, bzip2, LZO, LZ4, etc.)
  • Working with CSV, JSON, XML file types

3:45 PM – 4:45 PM
Machine Learning

  • Datasets: English Wikipedia + Edits (optional)
  • Common use cases of Machine Learning with Spark
  • When to use spark.mllib (with RDDs) vs. spark.ml (with DataFrames)
  • ML Pipelines concepts: DataFrames, Transformer, Estimator, Pipeline, Parameter
  • Using TF-IDF and K-means to cluster 5 million articles into 100 clusters
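As a small illustration of the TF-IDF step mentioned above, here is a plain-Python sketch on three tiny tokenized documents. It is not the Spark ML pipeline used in class; the `tf_idf` function, the sample documents, and the smoothed IDF form log((N + 1) / (df + 1)) are assumptions chosen for illustration, and the K-means clustering step is not shown.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Weight = raw term count in the document times a smoothed inverse
    document frequency, log((N + 1) / (df + 1)), where N is the number of
    documents and df is the number of documents containing the term.
    """
    n = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        term_counts = Counter(doc)
        vectors.append({
            term: count * math.log((n + 1) / (doc_freq[term] + 1))
            for term, count in term_counts.items()
        })
    return vectors

# Hypothetical tokenized article snippets.
documents = [
    ["spark", "scala"],
    ["spark", "python"],
    ["spark", "graph", "graph"],
]
vectors = tf_idf(documents)
print(vectors)
```

A term that appears in every document ("spark" here) gets weight 0, while rarer and repeated terms get larger weights; vectors like these are what a clustering algorithm would then group.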

Sameer Farooqui


Sameer Farooqui is a client services engineer at Databricks, where he works with customers on Apache Spark deployments. Sameer works with the Hadoop ecosystem, Cassandra, Couchbase, and the general NoSQL domain. Prior to Databricks, he worked globally as a freelance big data consultant and trainer, teaching big data courses. Before that, Sameer was a systems architect at Hortonworks, an emerging data platforms consultant at Accenture R&D, and an enterprise consultant for Symantec/Veritas (specializing in VCS, VVR, and SF-HA).

Comments on this page are now closed.


04/01/2016 8:25am PDT

I am still not seeing the files that were supposed to be added to the Labs before we export so we can then import into Community Edition.

Jamini Samantaray
03/29/2016 3:00am PDT

Can someone put the link to the presentation here?

Sameer Farooqui
03/28/2016 2:11pm PDT

No need to download any of the datasets. The links are just there for future reference for you.

03/28/2016 1:49pm PDT

Regarding the “Datasets explored in class”:

Do we need to download these ahead of time?
If so, exactly which files?

For example, if I follow the link to English Wikipedia (54 GB), I see a bunch of links. But I don’t see anything that is consistent with “54 GB”.

And the “Clickstream (1.2 GB)” one just won’t finish loading (on OS X), tried several times.

Stephen Dillon
01/24/2016 9:58pm PST

Will the Spark developer certification exam be offered at this Spark camp?

Sanjay Subramanian
12/10/2015 2:49am PST

It would be useful if you could state the problems this course will address.
Example 1 – We will build a topic modeler that will predict a topic given a Wikipedia page as input