Presented By O'Reilly and Cloudera
Make Data Work
December 1–3, 2015 • Singapore

Spark Camp: Exploring Wikipedia with Spark (Tackling a unified use case)

Sameer Farooqui (Databricks), Paco Nathan, Reynold Xin (Databricks)
9:00am–5:00pm Tuesday, 12/01/2015
Spark & Beyond
Location: 328-329 Level: Intermediate
Average rating: 4.00 (20 ratings)

Prerequisite Knowledge

Basic programming experience in an object-oriented or functional language (the class will mostly be taught in Scala).

This technical class is designed as an introduction for engineers, data scientists, and analysts with less than a month or so of experience with Spark.



Computer Requirements

All of the hands-on labs for the class will be run on the Databricks platform in a browser. Please bring a laptop with an up-to-date version of Chrome or Firefox (Internet Explorer and Safari are not supported). You do not need Spark, Scala, or Python installed on your laptop. The operating system of your laptop does not matter.


The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. In class we will explore various Wikipedia datasets while applying the ideal programming paradigm for each analysis. The class will consist of about 50% lecture and 50% hands-on labs and demos.

Topics covered include:

  • Overview: Wikipedia and Spark
  • Analyze data using:
      • DataFrames + Spark SQL
      • RDDs
      • Spark Streaming
      • MLlib (Machine Learning)
      • GraphX
  • Leveraging knowledge of Spark’s architecture for performance tuning and debugging
  • How and when to use advanced Spark features:
      • Accumulators
      • Broadcast variables
      • Memory persistence levels
      • Spark UI details

Tutorial assisted by Paco Nathan and Reynold Xin

Sameer Farooqui


Sameer Farooqui is a client services engineer at Databricks, where he works with customers on Apache Spark deployments. Sameer works with the Hadoop ecosystem, Cassandra, Couchbase, and the general NoSQL domain. Prior to Databricks, he worked globally as a freelance big data consultant and trainer and taught big data courses. Before that, Sameer was a systems architect at Hortonworks, an emerging data platforms consultant at Accenture R&D, and an enterprise consultant for Symantec/Veritas (specializing in VCS, VVR, and SF-HA).

Paco Nathan

Paco Nathan is known as a “player/coach” with core expertise in data science, natural language processing, machine learning, and cloud computing. He has 35+ years of experience in the tech industry, at companies ranging from Bell Labs to early-stage startups. His recent roles include director of the Learning Group at O’Reilly and director of community evangelism at Databricks and Apache Spark. Paco is the cochair of Rev conference and an advisor for Amplify Partners, Deep Learning Analytics, Recognai, and Primer. He was named one of the "top 30 people in big data and analytics" in 2015 by Innovation Enterprise.

Reynold Xin


Reynold Xin is a cofounder and chief architect at Databricks as well as an Apache Spark PMC member and release manager for Spark’s 2.0 release. Prior to Databricks, Reynold was pursuing a PhD at the UC Berkeley AMPLab, where he worked on large-scale data processing.

Comments on this page are now closed.


Sameer Farooqui
12/08/2015 11:53am +08

Jordan, you can still log in to your shard with the same user/pass for another 2 weeks or so.

Jordan Jordanov
12/08/2015 1:27am +08

Hi Sameer,
You mentioned during the event that we can use the environment for some time to repeat the exercises.
Shall we use the same user/pass from the conf again?

Jagadeesh Potturi
11/30/2015 10:14pm +08

Thanks Sameer.
Glad that we have received an email from your team with a URL to learn Scala basics.

Sameer Farooqui
11/28/2015 2:00pm +08

Hi Jagadeesh, most of the class will be taught in Scala with a bit of Python.

Jagadeesh Potturi
11/27/2015 2:22am +08

Hello Team,

Which programming language will you be using during the training, Java or Scala? I would prefer Java :), but I'm happy to learn Scala.

Thank you,

Eugene Teo
11/25/2015 8:12am +08

Awesome. Thanks Sameer.

Sameer Farooqui
11/25/2015 2:04am +08

Hi Eugene, no need to download or install anything special for class. You just need Chrome or Firefox installed. We will run the labs on Databricks notebooks in the browser, backed by Spark clusters in Amazon EC2.

Eugene Teo
11/24/2015 10:22pm +08

I’m excited about the Spark Camp, and I want to be prepared. What do I need to download and install on my laptop before attending this class?

Thanks, Eugene

Sameer Farooqui
11/20/2015 4:07pm +08

Hi Umanga, the class is intended for engineers with less than 1 month of hands-on experience with Spark. So, I would consider it a beginner to medium level class.

Umanga Bista
10/14/2015 9:08pm +08

Is this camp intended just for beginners or for advanced users as well?