Presented By O'Reilly and Cloudera
December 5-6, 2016: Training
December 6–8, 2016: Tutorials & Conference

Spark camp: Exploring Wikipedia with Spark

Sameer Farooqui (Databricks)
9:00am–5:00pm Tuesday, December 6, 2016
Spark & beyond
Location: 328/329
Tags: real-time
Average rating: *****
(5.00, 1 rating)

The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Through hands-on examples, Sameer Farooqui explores various Wikipedia datasets to illustrate a variety of ideal programming paradigms.

The class will consist of about 60% lecture and 40% hands-on labs + demos. Note that the hands-on labs in class will be taught in Scala. All students will have access to Databricks for one month after class to continue working on labs and assignments.

Datasets explored in class: Pageviews, Clickstream, Pagecounts, the live edits stream, and English Wikipedia.

Introduction to Wikipedia and Spark

  • Overview of the five Wikipedia data sources
  • Overview of Apache Spark APIs, libraries, and cluster architecture
  • Demo: Logging into Databricks and a tour of the user interface

DataFrames and Spark SQL
Datasets used: Pageviews and Clickstream

  • Use a SQLContext to create a DataFrame from different data sources (S3, JSON, RDBMS, HDFS, Cassandra, etc.)
  • Run common operations on DataFrames to explore them
  • Cache a DataFrame into memory
  • Correctly size the number of partitions in a DataFrame, including the size of each partition
  • Mix SQL and DataFrame queries
  • Join two DataFrames
  • Overview of how Spark SQL’s Catalyst optimizer converts logical plans to optimized physical plans
  • Create visualizations using Databricks and Google Visualizations
  • Use the Spark UI’s new SQL tab to troubleshoot performance issues (like input read size, identifying stage boundaries, and Cartesian products)
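As a taste of the DataFrame material above, here is a minimal sketch in Scala (the language used in the labs). The file path and column names are hypothetical, and a Spark 1.6-style SQLContext is assumed:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("pageviews").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Create a DataFrame from a JSON source (path is hypothetical)
val pageviews = sqlContext.read.json("/data/pageviews.json")
pageviews.printSchema()

// Cache the DataFrame in memory for repeated queries
pageviews.cache()

// Mix the DataFrame API with SQL on the same data
pageviews.filter(pageviews("requests") > 1000).show(5)
pageviews.registerTempTable("pageviews")
sqlContext.sql("SELECT project, SUM(requests) FROM pageviews GROUP BY project").show()
```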


Spark core architecture

  • Driver and executor JVMs
  • Local mode
  • Resource managers (standalone)
  • How to optimally configure Spark (# of slots, JVM sizes, garbage collection, etc.)
  • Reading Spark logs and stdout on the driver versus executors
  • Spark UI: Exploring the user interface to understand what’s going on behind the scenes of your application (# of tasks, memory of executors, slow tasks, Spark master/worker UIs, etc.)


Resilient distributed datasets
Dataset used: Pagecounts

  • When to use DataFrames vs. RDDs (type-safety, memory pressure, optimizations, I/O)
  • Narrow versus wide transformations and performance implications (pipelining, shuffle)
  • How transformations lazily build up a directed acyclic graph (DAG)
  • How a Spark application breaks down to Jobs > Stages > Tasks
  • Repartitioning an RDD (repartition versus coalesce)
  • Different memory persistence levels for RDDs (memory, disk, serialization, etc.)
  • Different types of RDDs (HadoopRDD, ShuffledRDD, MapPartitionsRDD, PairRDD, etc.)
  • Spark UI: How to interpret the new DAG visualization, how to troubleshoot common performance issues like GroupByKey versus ReduceByKey by looking at shuffle read/write info
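The GroupByKey-versus-ReduceByKey point above can be sketched with a Pagecounts-style RDD (the input path and line format are assumptions):

```scala
// Pagecounts-style lines: "project page requests bytes" (format assumed)
val pagecounts = sc.textFile("/data/pagecounts")  // a HadoopRDD under the hood

// Narrow transformation: map each line to a (project, requests) pair
val byProject = pagecounts.map { line =>
  val fields = line.split(' ')
  (fields(0), fields(2).toLong)
}

// Wide transformation: reduceByKey combines partial sums map-side before
// the shuffle, so far less data crosses the network than with
// groupByKey().mapValues(_.sum)
val totals = byProject.reduceByKey(_ + _)

// Choose a persistence level explicitly
totals.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK)

// Shrink the partition count without a full shuffle
val compact = totals.coalesce(4)
```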

Graph processing
Dataset used: Clickstream

  • Use cases for graph processing
  • Graph processing fundamentals: Vertex, edge (unidirectional, bidirectional), labels
  • Common graph algorithms: in-degree, out-degree, PageRank, subgraph, shortest paths, triplets
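The graph operations above can be sketched with GraphX; the tiny clickstream-style edge list here is invented for illustration:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical click edges: (source article id, target article id, click count)
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 100L), Edge(2L, 3L, 40L), Edge(1L, 3L, 25L)))
val vertices = sc.parallelize(Seq((1L, "Spark"), (2L, "Hadoop"), (3L, "Scala")))

val graph = Graph(vertices, edges)

graph.inDegrees.collect()          // in-degree per article
val ranks = graph.pageRank(0.001)  // PageRank over the click graph
graph.triplets.take(3)             // (source, edge, destination) views
```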

Spark Streaming
Dataset used: Live edits stream from multiple languages

  • Architecture of Spark Streaming: Receivers, batch interval, block interval, direct pull
  • How the microbatch mechanism in Spark Streaming breaks up the stream into tiny batches and processes them
  • How to use a StreamingContext to create input DStreams (Discretized Streams)
  • Common transformations and actions on DStreams (map, filter, count, union, join, etc.)
  • Creating a live, dynamically updated visualization in Databricks (that updates every two seconds)
  • Spark UI: How to use the new Spark Streaming UI to understand the performance of batch size versus processing latency
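A minimal sketch of the DStream ideas above; the socket source and hostname are assumptions standing in for the live edits feed:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Two-second batch interval, matching the dashboard refresh mentioned above
val ssc = new StreamingContext(sc, Seconds(2))

// Hypothetical TCP source replaying Wikipedia edit events, one per line
val edits = ssc.socketTextStream("localhost", 9999)

// Common DStream transformations and actions
val enEdits = edits.filter(_.contains("en.wikipedia"))
enEdits.count().print()  // edits per two-second microbatch

ssc.start()
ssc.awaitTermination()
```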


Guest talk: Choosing an optimal storage backend for your Spark use case—Vida Ha

  • Data storage tips for optimal Spark performance
  • HDFS file/block sizes
  • Compression formats (gzip, Snappy, bzip2, LZO, LZ4, etc.)
  • Working with CSV, JSON, and XML file types

Machine learning
Datasets used: English Wikipedia and Live edits (optional)

  • Common use cases of machine learning with Spark
  • When to use Spark MLlib (w/ RDDs) versus Spark ML (w/ DataFrames)
  • ML Pipelines concepts: DataFrames, transformer, estimator, pipeline, parameter
  • Using TF-IDF and K-means to cluster 5 million articles into 100 clusters
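The TF-IDF plus K-means clustering step can be sketched as a Spark ML Pipeline over DataFrames; the `articles` DataFrame and its "text" column are assumptions:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// articles: a DataFrame with a "text" column of article bodies (assumed)
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val kmeans = new KMeans().setK(100).setFeaturesCol("features")

// Transformers and estimators chained into a single Pipeline estimator
val pipeline = new Pipeline().setStages(Array(tokenizer, tf, idf, kmeans))
val model = pipeline.fit(articles)
```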

Sameer Farooqui


Sameer Farooqui is a client services engineer at Databricks, where he works with customers on Apache Spark deployments. Sameer works with the Hadoop ecosystem, Cassandra, Couchbase, and the broader NoSQL domain. Prior to Databricks, he worked globally as a freelance big data consultant and trainer. Before that, Sameer was a systems architect at Hortonworks, an emerging data platforms consultant at Accenture R&D, and an enterprise consultant for Symantec/Veritas (specializing in VCS, VVR, and SF-HA).

Comments on this page are now closed.


12/19/2016 9:10pm +08

Hi, after the session I am trying to open the .dbc file, but I am unable to. How do I open it?

12/05/2016 11:05pm +08

Do I need to know Scala for this session?
Also, is there a version of this class (or slides) in Python?

Julius Novan Cahyadi
11/29/2016 8:23pm +08


Are there any prerequisite installations/applications that I need to prepare?
Should I download all of the above datasets first?

Thank you.

Sophia DeMartini
11/22/2016 8:07am +08

Hi Roee and Duc,

Yes, please bring your own laptop so that you can follow along with the material being presented in the tutorial.

Thank you,

11/22/2016 3:10am +08

Should I bring my own laptop?

11/07/2016 7:56pm +08

Do I need a laptop to do the hands-on labs?