Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Spark camp: Exploring Wikipedia with Spark

Zoltan Toth
9:00am–5:00pm Tuesday, 09/27/2016
Spark & beyond
Location: Hall 1B

Each Spark Camp attendee must have a pre-established Databricks account for use during the tutorial. To ensure efficient administration of the tutorial, O’Reilly and Databricks require your consent to permit O’Reilly to share with Databricks the first name, last name and email address you provided during the conference registration process. Databricks’ use of this information, including the set-up of a Databricks account for your use during the tutorial, is governed by its privacy policy.

Average rating: 2.90 (10 ratings)

What you'll learn

  • Learn a variety of ideal programming paradigms for Spark

Description

    Sponsored by:

    The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Through hands-on examples, Zoltan Toth explores various Wikipedia datasets to illustrate a variety of ideal programming paradigms.

    The class will consist of about 60% lecture and 40% hands-on labs and demos. Note that the hands-on labs in class will be taught in Scala. All students will have access to Databricks for one month after class to continue working on labs and assignments.

    Datasets explored in class:

    • Pageviews
    • Clickstream
    • Pagecounts
    • Live edits stream
    • English Wikipedia

    9:00am – 9:30am
    Introduction to Wikipedia and Spark

    • Overview of the five Wikipedia data sources
    • Overview of Apache Spark APIs, libraries, and cluster architecture
    • Demo: Logging into Databricks and a tour of the user interface

    9:30am – 10:30am
    DataFrames and Spark SQL
    Datasets used: Pageviews and Clickstream

    • Use an SQLContext to create a DataFrame from different data sources (S3, JSON, RDBMS, HDFS, Cassandra, etc.)
    • Run common operations on DataFrames to explore them
    • Cache a DataFrame into memory
    • Correctly size the number of partitions in a DataFrame, including the size of each partition
    • Mix SQL and DataFrame queries
    • Join two DataFrames
    • Overview of how Spark SQL’s Catalyst optimizer converts logical plans to optimized physical plans
    • Create visualizations using Databricks and Google Visualizations
    • Use the Spark UI’s new SQL tab to troubleshoot performance issues (like input read size, identifying stage boundaries, and Cartesian products)
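
    The DataFrame workflow above can be sketched in Scala (the language used for the labs). This is a minimal sketch assuming Spark 1.6-style APIs with an SQLContext; the file path and the column names (article, requests) are illustrative placeholders, not the actual course dataset schema.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PageviewsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pageviews").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Create a DataFrame from a JSON source (path and schema are illustrative)
    val pageviews = sqlContext.read.json("/data/pageviews.json")

    // Cache the DataFrame in memory so repeated queries avoid re-reading the source
    pageviews.cache()

    // Mix DataFrame-API and SQL queries over the same data
    pageviews.registerTempTable("pageviews")
    sqlContext.sql(
      """SELECT article, SUM(requests) AS total
        |FROM pageviews
        |GROUP BY article
        |ORDER BY total DESC LIMIT 10""".stripMargin).show()
  }
}
```

    Registering a temp table is what lets the same cached DataFrame be queried from both the fluent API and plain SQL.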

    10:30am – 11:00am
    Break

    11:00am – 12:00pm
    Spark core architecture

    • Driver and executor JVMs
    • Local mode
    • Resource managers (standalone)
    • How to optimally configure Spark (# of slots, JVM sizes, garbage collection, etc.)
    • Reading Spark logs and stdout on driver versus executors
    • Spark UI: Exploring the user interface to understand what’s going on behind the scenes of your application (# of tasks, memory of executors, slow tasks, Spark master/worker UIs, etc.)
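
    The configuration knobs above are typically set at submit time. A sketch of a spark-submit invocation against a standalone cluster; the master host, memory sizes, core counts, class name, and jar are placeholders to be tuned per workload, not recommended values.

```shell
# Illustrative spark-submit for a standalone cluster (all values are placeholders)
spark-submit \
  --master spark://master:7077 \
  --driver-memory 2g \
  --executor-memory 8g \
  --total-executor-cores 16 \
  --conf spark.executor.cores=4 \
  --conf spark.default.parallelism=64 \
  --class com.example.PageviewsSketch \
  pageviews-sketch.jar
```

    executor-memory and executor.cores together determine the number of task slots per executor JVM, which is exactly the sizing trade-off this session covers.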

    12:00pm – 1:00pm
    Lunch

    1:00pm – 2:00pm
    Resilient distributed datasets
    Dataset used: Pagecounts

    • When to use DataFrames versus RDDs (type-safety, memory pressure, optimizations, i/o)
    • Narrow versus wide transformations and performance implications (pipelining, shuffle)
    • How transformations lazily build up a directed acyclic graph (DAG)
    • How a Spark application breaks down to jobs > stages > tasks
    • Repartitioning an RDD (repartition versus coalesce)
    • Different memory persistence levels for RDDs (memory, disk, serialization, etc.)
    • Different types of RDDs (HadoopRDD, ShuffledRDD, MapPartitionsRDD, PairRDD, etc.)
    • Spark UI: How to interpret the new DAG visualization, how to troubleshoot common performance issues like GroupByKey versus ReduceByKey by looking at shuffle read/write info
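
    The groupByKey-versus-reduceByKey point can be sketched as follows, assuming an existing SparkContext `sc` in local mode; the sample lines only loosely imitate the pagecounts format (project, title, requests).

```scala
// Assuming an existing SparkContext sc (e.g., local[*] mode)
val lines = sc.parallelize(Seq("en Main_Page 3", "en Spark 2", "de Main_Page 1"))
val pairs = lines.map { line =>
  val fields = line.split(" ")
  (fields(0), fields(2).toLong)   // (project, requests)
}

// reduceByKey combines values map-side before the shuffle: small shuffle write
val perProject = pairs.reduceByKey(_ + _)

// groupByKey ships every value across the shuffle, then sums: larger shuffle write
val perProjectSlow = pairs.groupByKey().mapValues(_.sum)
```

    Both produce the same per-project totals; the shuffle read/write columns in the Spark UI make the difference visible.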

    2:00pm – 2:30pm
    Graph processing
    Datasets used: Clickstream

    • Use cases for graph processing
    • Graph processing fundamentals: Vertex, edge (unidirectional, bidirectional), labels
    • Common graph algorithms: in-degree, out-degree, PageRank, subgraph, shortest path, triplets
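
    The in-degree and PageRank ideas above can be sketched with GraphX, again assuming an existing SparkContext `sc`; the article titles and click counts are illustrative, not actual Clickstream rows.

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Vertices are articles; edges are clicks from one article to another
// (edge attributes are illustrative click counts)
val vertices = sc.parallelize(Seq(
  (1L, "Main_Page"), (2L, "Apache_Spark"), (3L, "Scala_(programming_language)")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 120L), Edge(2L, 3L, 40L), Edge(1L, 3L, 15L)))
val graph = Graph(vertices, edges)

// In-degree: how many articles link into each article
val inDegrees = graph.inDegrees.collect()

// PageRank, iterated until convergence within the given tolerance
val ranks = graph.pageRank(0.001).vertices
```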

    2:30pm – 3:00pm
    Spark Streaming
    Datasets used: Live edits stream from multiple languages

    • Architecture of Spark Streaming: Receivers, batch interval, block interval, direct pull
    • How the microbatch mechanism in Spark Streaming breaks up the stream into tiny batches and processes them
    • How to use a StreamingContext to create input DStreams (Discretized Streams)
    • Common transformations and actions on DStreams (map, filter, count, union, join, etc.)
    • Creating a live, dynamically updated visualization in Databricks (that updates every two seconds)
    • Spark UI: How to use the new Spark Streaming UI to understand the performance of batch size versus processing latency
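
    A minimal DStream sketch of the above, assuming an existing SparkContext `sc`; the socket source is a stand-in for the live Wikipedia edits feed used in class, and the line format (language code first) is an assumption.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// 2-second batch interval: each micro-batch covers two seconds of edits
val ssc = new StreamingContext(sc, Seconds(2))
val edits = ssc.socketTextStream("localhost", 9999)

// Count edits per language within each micro-batch
val perLanguage = edits
  .map(line => (line.split(" ")(0), 1L))
  .reduceByKey(_ + _)
perLanguage.print()

ssc.start()
ssc.awaitTermination()
```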

    3:00pm – 3:30pm
    Break

    3:30pm – 3:45pm
    Guest talk: Choosing an optimal storage backend for your Spark use case

    • Data storage tips for optimal Spark performance
    • HDFS file/block sizes
    • Compression formats (gzip, Snappy, bzip2, LZO, LZ4, etc.)
    • Working with CSV, JSON, and XML file types
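
    One concrete consequence of compression choice, sketched with illustrative paths and an assumed pagecounts-style line format (project, title, requests):

```scala
// Assuming an existing SparkContext sc; paths are illustrative.
// Spark reads gzip transparently, but gzip is not splittable:
// each .gz file becomes a single partition, limiting parallelism.
val gz = sc.textFile("/data/pagecounts-20160101.gz")

// bzip2 is splittable, so one large file can be read by many tasks in
// parallel, at the cost of slower decompression than Snappy or LZ4.
val bz = sc.textFile("/data/pagecounts-20160101.bz2")

// Hand-parsing a whitespace-delimited line (fields: project, title, requests)
val fields = "en Main_Page 3".split(" ")
val requests = fields(2).toLong
```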

    3:45pm – 4:45pm
    Machine learning
    Datasets used: English Wikipedia and Live edits (optional)

    • Common use cases of machine learning with Spark
    • When to use Spark MLlib (w/ RDDs) versus Spark ML (w/ DataFrames)
    • ML Pipelines concepts: DataFrames, transformer, estimator, pipeline, parameter
    • Using TF-IDF and K-means to cluster 5 million articles into 100 clusters
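
    The TF-IDF plus K-means clustering above maps naturally onto an ML Pipeline. A sketch assuming a DataFrame `articles` with a "text" column of article bodies; the column names are illustrative.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Each stage is a transformer or estimator reading one column, writing another
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val kmeans = new KMeans().setK(100).setFeaturesCol("features")

// Chain the stages into one Pipeline and fit the whole thing in a single call
val pipeline = new Pipeline().setStages(Array(tokenizer, tf, idf, kmeans))
val model = pipeline.fit(articles)
```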

    Zoltan Toth

    Zoltan Toth is a freelance data engineer and trainer with over 15 years of experience developing data-intensive applications. Zoltan spends most of his time helping companies kick off and mature their data analytics infrastructure and regularly gives Hadoop, big data, and ​Spark trainings. Zoltan built Prezi’s big data infrastructure and later led Prezi’s data engineering team, scaling it to serve 60 million users backed by a data volume over a petabyte. He also worked on big data and Spark-integration projects with RapidMiner, a global leader in predictive analytics. Besides working with data analytics architectures, Zoltan teaches at Central European University, one of the best independent universities in Europe.

    Comments on this page are now closed.


    Pierre Galland
    09/27/2016 5:39am EDT

    Would it be possible to have better sound? It sounds like the microphone or the sound system is not set up very well.

    Zoltan Toth
    09/25/2016 8:48pm EDT

    If you haven’t already, you will get an email with instructions: You will need a Databricks Community Edition account (free) and Firefox or Chrome installed on your laptop.
    Scala knowledge is not required for this course. You will need to have some basic programming experience though to make sure you can understand a few lines of code here and there.

    Gary Baker
    09/25/2016 5:58pm EDT

    I registered for this tutorial. Do I get account information via email, or do I need to create a Databricks account myself?

    09/25/2016 11:42am EDT


    I registered this week for this tutorial. Are there prerequisite files/programs we need to install on our machines before the tutorial?

    Pradeep Gharat
    09/13/2016 6:08am EDT

    Is Scala a prerequisite for this course? Is it recommended for people with Python experience?