Presented By O'Reilly and Cloudera
Make Data Work
31 May–1 June 2016: Training
1 June–3 June 2016: Conference
London, UK

Spark camp: Exploring Wikipedia with Spark

Sameer Farooqui (Databricks)
9:00–17:00 Wednesday, 1/06/2016
Spark & beyond
Location: Capital Suite 8/9
Average rating: 4.44 (16 ratings)

Prerequisite knowledge

Attendees should have a basic understanding of software development, some experience coding in Python, Java, SQL, Scala, or R, and basic familiarity with Scala (check out Scala Basics and Atomic Scala).

Materials or downloads needed in advance

Attendees need a laptop with an up-to-date version of Chrome or Firefox. (Internet Explorer is not supported.)

Please note that it is a requirement for the Spark Camp tutorial that each attendee have a Databricks account for use during the tutorial. To ensure a swift and effective start to the tutorial, your account must be set up before the tutorial begins.


The real power and value proposition of Apache Spark lies in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Through hands-on examples, Sameer Farooqui explores various Wikipedia datasets to illustrate the programming paradigm best suited to each workload.

The class will consist of about 30% lecture and 70% hands-on labs and demos. Note that the hands-on labs in class will be taught in Scala. All students will have access to Databricks Community Edition after class to continue working on labs and assignments.


30 mins: Introduction to Wikipedia and Spark

  • Overview of the five Wikipedia data sources
  • Overview of Apache Spark APIs, libraries, and cluster architecture
  • Demo: Logging into Databricks and a tour of the user interface

30 mins: Analyzing traffic patterns in the past hour to Wikipedia articles

  • Datasets used: Pagecounts
  • Spark API used: DataFrames
  • How to use a SQLContext to create a DataFrame from data on S3
  • Run common operations on a DataFrame to explore it
  • View the schema of a DataFrame
  • Use the following transformations: select(), distinct(), groupBy(), sum(), orderBy(), filter(), limit()
  • Use the following actions: show(), count()
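
The transformations and actions above might be chained as in the following sketch, assuming a Databricks notebook where `sc` and `sqlContext` are predefined; the S3 path and column names (`project`, `article`, `requests`) are illustrative, not the actual lab dataset schema:

```scala
import sqlContext.implicits._

// Illustrative path: an hourly pagecounts extract stored on S3
val pagecounts = sqlContext.read.parquet("s3a://my-bucket/wikipedia/pagecounts/")

// View the schema and sample the data (show() and count() are actions)
pagecounts.printSchema()
pagecounts.select("project").distinct().show()

// Top 10 most-requested English Wikipedia articles in this hour
pagecounts
  .filter($"project" === "en")
  .groupBy("article")
  .sum("requests")
  .orderBy($"sum(requests)".desc)
  .limit(10)
  .show()

pagecounts.count()  // total number of rows
```

Note that the transformations build up a query lazily; nothing executes until an action such as `show()` or `count()` is called.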

30 mins: Analyzing desktop vs. mobile visitors (from a data engineer perspective)

  • Datasets used: Pageviews
  • Spark API used: DataFrames
  • Learn how actions kick off jobs and stages
  • Understand how DataFrame partitions relate to compute tasks
  • Use Spark UI to monitor details of job execution (input read, shuffle, storage UI, SQL visualization)
  • Cache a DataFrame to memory (and learn how to unpersist it)
  • Use the following transformations: orderBy(), filter()
  • How to size the number of partitions in a DataFrame
  • Catalyst optimizer: how DataFrame queries are converted from a logical plan to a physical plan
  • Configuration option: spark.sql.shuffle.partitions
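
A minimal sketch of the caching and partitioning ideas above, assuming a `pageviews` DataFrame already loaded in a Databricks notebook:

```scala
// Mark the DataFrame for caching; it is materialized on the first action
pageviews.cache()
pageviews.count()          // triggers a job; subsequent actions read from memory

// Each DataFrame partition maps to one task within a stage
pageviews.rdd.partitions.size

// Control how many partitions a shuffle (e.g. groupBy) produces
sqlContext.setConf("spark.sql.shuffle.partitions", "8")

// Free the cached blocks when done
pageviews.unpersist()
```

The Spark UI's Storage tab shows how many partitions were cached and their in-memory size, and the SQL tab visualizes the physical plan Catalyst produced.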

Morning break: 10:30am–11:00am

30 mins: Analyzing desktop vs. mobile visitors (from a data analyst perspective)

  • Datasets used: Pageviews
  • Spark API used: DataFrames
  • Learn how to use the SQL functions package (sum)
  • Cast a string col type into a timestamp col type
  • Browse the Spark SQL API docs
  • Learn how to use “date-time functions”
  • Create and use a user-defined function (UDF)
  • Join two DataFrames
  • Make Databricks and Matplotlib visualizations
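
These steps could look roughly like the sketch below; the column names (`timestamp`, `requests`) and the two DataFrames (`desktopDF`, `mobileDF`) are hypothetical stand-ins for the lab data:

```scala
import org.apache.spark.sql.functions._
import sqlContext.implicits._

// Cast a string column into a timestamp column
val withTs = pageviews.withColumn("ts", $"timestamp".cast("timestamp"))

// A date-time function from the SQL functions package: day-of-week name
val byDay = withTs.select(date_format($"ts", "E").as("day"), $"requests")
byDay.groupBy("day").sum("requests").show()

// A user-defined function (UDF)
val isWeekend = udf((day: String) => day == "Sat" || day == "Sun")
byDay.filter(isWeekend($"day")).show()

// Join two DataFrames on a shared column
val joined = desktopDF.join(mobileDF, Seq("ts"))
```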

20 mins: Group exercise: “Monday mystery”

  • Datasets used: Pageviews
  • Spark API used: DataFrames
  • Work with a partner to solve the Monday mystery.

10 mins: Q&A

  • Open Q&A

Lunch: Noon–1:00pm

45 mins: Analyzing Wikipedia clickstream with DataFrames and SQL

  • Datasets used: Clickstream
  • Spark API used: DataFrames, Spark SQL
  • Learn how to use the Spark CSV library to read structured files
  • Use %sh to run shell commands
  • Learn about Spark’s architecture and JVM sizing
  • Use jps to list Java Virtual Machines
  • Repartition a DataFrame
  • Use the following DataFrame operations: printSchema(), select(), show(), count(), groupBy(), sum(), limit(), orderBy(), filter(), withColumnRenamed(), join(), withColumn()
  • Create a Google visualization to understand the clickstream traffic for the “Apache Spark” article
  • Use explain() to inspect query plans for DataFrames and SQL
  • Troubleshoot UDFs using the SQL UI
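
Reading the clickstream with the Spark CSV library might be sketched as follows, assuming a tab-separated file; the path is illustrative, while `prev_title`, `curr_title`, and `n` are the column names used in the published Wikipedia clickstream dataset:

```scala
import sqlContext.implicits._

val clickstream = sqlContext.read
  .format("com.databricks.spark.csv")   // spark-csv package (pre-Spark 2.0)
  .option("header", "true")
  .option("delimiter", "\t")
  .option("inferSchema", "true")
  .load("/databricks-datasets/wikipedia/clickstream.tsv")  // illustrative path

// Where did readers of the "Apache Spark" article come from?
clickstream
  .filter($"curr_title" === "Apache_Spark")
  .groupBy("prev_title").sum("n")
  .orderBy($"sum(n)".desc)
  .show(10)

// Inspect the logical and physical plans Catalyst generated
clickstream.filter($"curr_title" === "Apache_Spark").explain(true)
```

`inferSchema` triggers an extra pass over the file; repartitioning afterward can rebalance the resulting DataFrame across the cluster.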

30 mins: Analyzing the Wikipedia pagecounts with RDDs, Datasets, and DataFrames

  • Datasets used: Pagecounts
  • Spark APIs used: RDDs, Datasets, DataFrames
  • Understand the difference between RDDs, DataFrames, and Datasets
  • Learn how to convert your RDD code to Datasets
  • Learn performance and memory storage differences between RDDs (row format) and DataFrames/Datasets (Tungsten binary format)
  • Learn advanced persistence options like MEMORY_AND_DISK
  • Learn how to define a case class to organize data in an RDD or Dataset into objects
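
A sketch of moving from an RDD of raw lines to a typed Dataset via a case class, assuming the space-separated pagecounts format (`project page requests bytes`); the path is illustrative:

```scala
import sqlContext.implicits._
import org.apache.spark.storage.StorageLevel

// Case class gives each record a typed structure
case class PageCount(project: String, article: String, requests: Long)

val rdd = sc.textFile("dbfs:/tmp/pagecounts/")  // illustrative path
  .map { line =>
    val f = line.split(" ")
    PageCount(f(0), f(1), f(2).toLong)  // third field is the request count
  }

// Convert to a Dataset: typed like an RDD, stored in Tungsten binary format
val ds = rdd.toDS()

// Advanced persistence: spill to disk what doesn't fit in memory
ds.persist(StorageLevel.MEMORY_AND_DISK)
ds.filter(_.project == "en").count()
```

Comparing the cached sizes of `rdd` and `ds` in the Spark UI's Storage tab illustrates the memory savings of the Tungsten binary format over Java object rows.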

30 mins: Analyzing Wikipedia clickstream with GraphFrames

  • Datasets used: Clickstream
  • Spark API used: GraphFrames
  • Use cases for graph processing
  • Graph processing fundamentals (nodes, edges)
  • Learn how to view data from a graph perspective
  • Quick tour of graph algorithms: inDegree, outDegree, SubGraph, Shortest Path, Pagerank
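
Building a graph from the clickstream could be sketched as follows with the GraphFrames package; the column names (`prev_id`, `curr_id`, `curr_title`, `n`) are illustrative of the clickstream schema:

```scala
import org.graphframes.GraphFrame
import sqlContext.implicits._

// Vertices: articles (GraphFrames requires an "id" column)
val vertices = clickstream
  .select($"curr_id".as("id"), $"curr_title".as("title"))
  .distinct()

// Edges: click transitions (requires "src" and "dst" columns)
val edges = clickstream
  .select($"prev_id".as("src"), $"curr_id".as("dst"), $"n")

val g = GraphFrame(vertices, edges)

// Most-linked-to articles by incoming click edges
g.inDegrees.orderBy($"inDegree".desc).show(5)

// PageRank over the click graph
val ranks = g.pageRank.resetProbability(0.15).maxIter(10).run()
ranks.vertices.orderBy($"pagerank".desc).show(5)
```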

15 mins: Natural language processing fundamentals with English Wikipedia

  • Datasets used: English Wikipedia
  • Spark API used: DataFrames
  • Explore a sample of the English Wikipedia snapshot
  • Extracting a bag of words for each Wikipedia article with RegExTokenizer
  • Stop words removal
  • Finding the most common words in the English language
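
These NLP steps might be sketched with the spark.ml feature transformers as below, assuming a `wikiDF` DataFrame with a `text` column holding each article's body:

```scala
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}
import org.apache.spark.sql.functions.explode
import sqlContext.implicits._

// Split each article into a bag of words on non-word characters
val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .setPattern("\\W+")

// Drop common stop words ("the", "and", ...)
val remover = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("filtered")

val words = remover.transform(tokenizer.transform(wikiDF))

// Most common remaining words across the sample
words.select(explode($"filtered").as("word"))
  .groupBy("word").count()
  .orderBy($"count".desc)
  .show(20)
```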

Afternoon break: 3:00pm–3:30pm

45 mins: Build an ML pipeline to cluster 100,000 articles into 100 clusters

  • Datasets used: English Wikipedia
  • Spark APIs used: DataFrames and spark.ml
  • Supervised vs. unsupervised learning
  • spark.mllib vs. spark.ml
  • ML algorithms covered: TF-IDF, k-means
  • Transformers and estimators
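
An ML pipeline chaining TF-IDF featurization into k-means clustering could be sketched as follows, again assuming a `wikiDF` with a `filtered` column of tokenized, stop-word-free words (as produced earlier):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF}
import org.apache.spark.ml.clustering.KMeans

// Transformer: term frequencies via feature hashing
val hashingTF = new HashingTF()
  .setInputCol("filtered").setOutputCol("tf")
  .setNumFeatures(10000)

// Estimator: learns document frequencies, outputs TF-IDF vectors
val idf = new IDF().setInputCol("tf").setOutputCol("features")

// Estimator: unsupervised clustering into 100 clusters
val kmeans = new KMeans().setK(100).setSeed(42L)

// A Pipeline is itself an estimator; fit() returns a PipelineModel (a transformer)
val pipeline = new Pipeline().setStages(Array(hashingTF, idf, kmeans))
val model = pipeline.fit(wikiDF)

val clustered = model.transform(wikiDF)  // adds a "prediction" cluster column
```

This illustrates the core spark.ml design: estimators (`IDF`, `KMeans`, `Pipeline`) learn from data via `fit()`, while transformers (`HashingTF`, the fitted `PipelineModel`) map one DataFrame to another via `transform()`.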

30 mins: Analyze the live edits stream of multiple Wikipedia languages

  • Datasets used: Wikipedia live edits streams
  • Spark API used: Spark Streaming, RDD, DataFrames
  • Spark Streaming microbatch architecture: receivers, batch interval, block interval
  • How to create stable streaming applications
  • How to use a StreamingContext to create input DStreams (discretized streams)
  • Common transformations and actions on DStreams (map, filter, count, union, join, etc.)
  • Spark UI: How to use the Spark Streaming UI to understand the performance of batch size vs. processing latency
  • Creating live, dynamically updated visualizations in Databricks (that update every few seconds)
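
A minimal DStream sketch, assuming a hypothetical socket source that emits one Wikipedia edit event per line (the host, port, and line format are illustrative):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Batch interval: a new microbatch RDD every 2 seconds
val ssc = new StreamingContext(sc, Seconds(2))

// Hypothetical receiver: edit events as text lines over a socket
val edits = ssc.socketTextStream("localhost", 9999)

// DStream transformations mirror RDD transformations
val enEdits = edits.filter(_.contains("en.wikipedia"))
enEdits.count().print()   // output operation: edits per batch

ssc.start()
ssc.awaitTerminationOrTimeout(30 * 1000)  // run for 30 seconds, then stop
```

The Streaming tab of the Spark UI then shows per-batch input size and processing time; keeping processing time below the batch interval is what keeps the application stable.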

15 mins: Conclusion and Q&A

  • Future of Spark 2.0: Streaming DataFrames, architecture evolution, performance benefits
  • Future of Wikipedia: Rise of smartphones, defending against sock puppets, bias, paid editors

Sameer Farooqui


Sameer Farooqui is a client services engineer at Databricks, where he works with customers on Apache Spark deployments. His experience spans the Hadoop ecosystem, Cassandra, Couchbase, and the broader NoSQL domain. Prior to Databricks, he worked globally as a freelance big data consultant and trainer. Before that, Sameer was a systems architect at Hortonworks, an emerging data platforms consultant at Accenture R&D, and an enterprise consultant for Symantec/Veritas (specializing in VCS, VVR, and SF-HA).

Comments on this page are now closed.


Gianfranco Cecconi
25/01/2016 8:43 GMT

Hi Sameer, all, it is not clear what the prerequisites to attend are. Of course the session is technical, but what do we need to be knowledgeable of to make the best of the training? Thanks.