Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Spark foundations: Prototyping Spark use cases on Wikipedia datasets

Brian Clapper (Databricks)
9:00am - 5:00pm Monday, September 26 & Tuesday, September 27
Location: 1 C04 / 1 C05

All training courses take place 9:00am - 5:00pm, Monday, September 26 through Tuesday, September 27, and are limited in size to maintain a high level of hands-on learning and instructor interaction.

Participants should plan to attend both days of training. Training passes do not include access to tutorials on Tuesday.

Each Spark Camp attendee must have a pre-established Databricks account for use during the tutorial. To ensure efficient administration of the tutorial, O’Reilly and Databricks require your consent to permit O’Reilly to share with Databricks the first name, last name and email address you provided during the conference registration process. Databricks’ use of this information, including the set-up of a Databricks account for your use during the tutorial, is governed by its privacy policy.

Average rating: *****
(5.00, 1 rating)

Prerequisite knowledge

  • A basic understanding of software development
  • Some experience coding in Python, Java, SQL, Scala, or R
  • Familiarity with Scala programming basics (check out Scala Basics and Atomic Scala)

Materials or downloads needed in advance

  • A laptop with an up-to-date version of Chrome or Firefox (Internet Explorer is not supported)

What you'll learn

  • Understand the variety of ideal programming paradigms Spark makes possible

Description

    The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Brian Clapper employs hands-on exercises using various Wikipedia datasets to illustrate the variety of ideal programming paradigms Spark makes possible. By the end of the training, you’ll be able to create proofs of concept and prototype applications using Spark.

    The course will consist of about 50% lecture and 50% hands-on labs. All participants will have access to Databricks Community Edition after class to continue working on labs and assignments.

    Note that most of the hands-on labs will be taught in Scala. (PySpark architecture and code examples will be covered briefly.)

    Who should attend?

    People with less than two months of hands-on experience with Spark

    Datasets explored in class: Pageviews, Clickstream, Pagecounts, English Wikipedia, English Wikipedia with edits, and the live edits stream

    Day 1

    9:00am – 9:30am
    Introduction to Wikipedia and Spark
    Demo: Logging into Databricks and a tour of the user interface

    • Overview of the six Wikipedia data sources
    • Overview of Apache Spark APIs, libraries, and cluster architecture

    9:30am – 10:30am
    DataFrames and Spark SQL
    Datasets used: Pageviews and Clickstream

    • Use a SQLContext to create a DataFrame from different data sources (S3, JSON, RDBMS, HDFS, Cassandra, etc.)
    • Run common operations on DataFrames to explore the data
    • Cache a DataFrame into memory
    • Correctly size the number of partitions in a DF, including the size of each partition
    • Use the Spark CSV library from Spark Packages to read structured files
    • Mix SQL and DataFrame queries
    • Write a user-defined function (UDF)
    • Join two DataFrames
    • Overview of how Spark SQL’s Catalyst optimizer converts logical plans to optimized physical plans
    • Create visualizations using matplotlib, Databricks, and Google Visualizations
    • Use the Spark UI’s new SQL tab to troubleshoot performance issues (like input read size, identifying stage boundaries, and Cartesian products)

    10:30am – 11:00am

    11:00am – 12:30pm
    DataFrames and Spark SQL (cont.)

    12:30pm – 1:30pm

    1:30pm – 3:00pm
    Spark core architecture

    • Driver and executor JVMs
    • Local mode
    • Resource managers (standalone, YARN, Mesos)
    • How to optimally configure Spark (# of slots, JVM sizes, garbage collection, etc.)
    • PySpark architecture (different serialization, extra Python processes, UDFs are slower, etc.)
    • Reading Spark logs and stdout on drivers versus executors
    • Spark UI: Exploring the user interface to understand what’s going on behind the scenes of your application (# of tasks, memory of executors, slow tasks, Spark master/worker UIs, etc.)
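    As a sketch of the configuration knobs above, here is a hypothetical spark-submit invocation; the resource manager, memory sizes, core counts, and GC flag are illustrative values, not recommendations:

```shell
# --master selects the resource manager (yarn here; spark://host:port for
# standalone, mesos://host:port for Mesos, local[*] for local mode).
# --executor-cores sets the number of task slots per executor JVM.
# All sizes below are hypothetical; tune them for your cluster.
spark-submit \
  --master yarn \
  --driver-memory 2g \
  --executor-memory 8g \
  --executor-cores 4 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  myapp.jar
```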

    3:00pm – 3:30pm

    3:30pm – 5:00pm
    Resilient distributed datasets
    Datasets used: Pagecounts and English Wikipedia

    • When to use DataFrames versus RDDs (type-safety, memory pressure, optimizations, i/o)
    • Two ways to create an RDD using a SparkContext: parallelize and read from an external data source
    • Common transformations and actions
    • Narrow versus wide transformations and performance implications (pipelining, shuffle)
    • How transformations lazily build up a directed acyclic graph (DAG)
    • How a Spark application breaks down to Jobs > Stages > Tasks
    • Repartitioning an RDD (repartition versus coalesce)
    • Different memory persistence levels for RDDs (memory, disk, serialization, etc.)
    • Different types of RDDs (HadoopRDD, ShuffledRDD, MapPartitionsRDD, PairRDD, etc.)
    • Spark UI: How to interpret the new DAG visualization, how to troubleshoot common performance issues like groupByKey versus reduceByKey by looking at shuffle read/write info

    Day 2

    9:00am – 9:30am
    Review of Day 1

    • DataFrames and Spark SQL
    • Spark architecture
    • RDDs

    9:30am – 10:30am
    Shared variables (accumulators and broadcast variables)

    • Common use cases for shared variables
    • How accumulators can be used to implement distributed counters in parallel
    • Using broadcast variables to keep a read-only variable cached on each machine
    • Broadcast variables internals: BitTorrent implementation
    • Differences between broadcast variables and closures/lambdas (across stages versus per stage)
    • Configuring the autoBroadcastJoinThreshold in Spark SQL to do more efficient joins

    10:30am – 11:00am

    11:00am – 12:00pm
    Graph processing with GraphX
    Datasets used: Clickstream

    • Use cases for graph processing
    • Graph processing fundamentals: Vertex, edge (unidirectional, bidirectional), labels
    • Common graph algorithms and operators: in-degree, out-degree, PageRank, subgraph, shortest path, triplets
    • GraphX internals: How Spark stores large graphs in RDDs (VertexRDD, EdgeRDD, and routing table RDD)
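    GraphX itself is a Scala API, but the PageRank iteration it implements can be sketched in plain Python on a hypothetical three-page link graph (damping factor 0.85, as in the classic formulation):

```python
def pagerank(links, iterations=20, d=0.85):
    """Iteratively propagate rank along out-links until (approximate) convergence."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        contribs = {page: 0.0 for page in links}
        for page, outlinks in links.items():
            # each page splits its rank evenly among the pages it links to
            for dest in outlinks:
                contribs[dest] += ranks[page] / len(outlinks)
        ranks = {page: (1 - d) + d * c for page, c in contribs.items()}
    return ranks

# Hypothetical clickstream-style link graph: A links to B and C, etc.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(ranks)  # C, linked from both A and B, accumulates the highest rank
```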

    12:00pm – 12:30pm
    Spark Streaming
    Datasets used: Live edits stream of multiple languages

    • Architecture of Spark Streaming: Receivers, batch interval, block interval, direct pull
    • How the microbatch mechanism in Spark Streaming breaks up the stream into tiny batches and processes them
    • How to use a StreamingContext to create input DStreams (discretized streams)
    • Common transformations and actions on DStreams (map, filter, count, union, join, etc.)
    • Creating live, dynamically updated visualizations in Databricks (that update every 2 seconds)
    • Spark UI: How to use the new Spark Streaming UI to understand the performance of batch size versus processing latency
    • Receiver versus direct pull approach
    • High-availability guidelines (WAL, checkpointing)
    • Window operations: Apply transformations over a sliding window of data
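    The microbatch-plus-window idea can be illustrated without a cluster: a plain-Python sketch of summing per-batch counts over a sliding window, in the spirit of DStream window operations (the per-batch edit counts are hypothetical):

```python
from collections import deque

def windowed_counts(batches, window_length, slide_interval):
    """Sum per-batch counts over a sliding window of window_length batches,
    emitting a result every slide_interval batches (like DStream.window)."""
    window = deque(maxlen=window_length)   # oldest batch falls out automatically
    results = []
    for i, batch in enumerate(batches, start=1):
        window.append(batch)
        if i % slide_interval == 0:
            results.append(sum(window))
    return results

# Edits arriving in six micro-batches; window of 3 batches, sliding every 2
print(windowed_counts([4, 1, 5, 2, 0, 3], window_length=3, slide_interval=2))
```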

    12:30pm – 1:30pm

    1:30pm – 2:30pm
    Spark Streaming (cont.)

    2:30pm – 3:00pm
    Spark machine learning
    Datasets used: English Wikipedia w/ edits

    • Common use cases of machine learning with Spark
    • When to use Spark MLlib (w/ RDDs) versus Spark ML (w/ DataFrames)
    • ML Pipelines concepts: DataFrames, transformer, estimator, pipeline, parameter
    • Basic statistics with MLlib
    • Tf-idf (term frequency-inverse document frequency)
    • Streaming machine learning (k-means, linear regression, logistic regression)
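    As a taste of the tf-idf topic, here is the classic formulation in plain Python (Spark MLlib computes a hashed variant via HashingTF and IDF, with a slightly different IDF smoothing; the documents below are made up):

```python
import math

def tf_idf(term, doc, corpus):
    """Score a term highly if frequent in this doc but rare across the corpus."""
    tf = doc.count(term) / len(doc)                      # term frequency in the document
    n_containing = sum(1 for d in corpus if term in d)   # document frequency
    idf = math.log(len(corpus) / (1 + n_containing))     # inverse document frequency
    return tf * idf

docs = [
    ["spark", "rdd", "spark"],
    ["wikipedia", "edit", "stream"],
    ["spark", "streaming"],
]

# "rdd" is rare in the corpus, so it outscores the common term "spark"
print(tf_idf("rdd", docs[0], docs), tf_idf("spark", docs[0], docs))
```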

    3:00pm – 3:30pm

    3:30pm – 4:30pm
    Spark machine learning (cont.)

    4:30pm – 5:00pm
    Spark R&D (optional)

    • Project Tungsten
    • New Datasets API
    • Upcoming developments: DataFrames in Streaming and GraphX, new MLlib algorithms, etc.
    • Berkeley Data Analytics Stack (Succinct, IndexedRDD, BlinkDB, SampleClean)

    Brian Clapper


    Brian Clapper is a senior instructor and curriculum developer at Databricks, with more than 32 years' experience as a software developer. He has worked for a stock exchange, the US Navy, a large software company, several startups, and small companies; most recently, he spent seven years as an independent consultant and trainer. Brian is fluent in many languages, including Scala, Java, Python, Ruby, C#, and C, and is highly familiar with current web application technologies, including frameworks like Play!, Ruby on Rails, and Django, and frontend technologies like jQuery, EmberJS, and AngularJS. He founded the Philly Area Scala Enthusiasts in 2010 and, since 2011, has been a co-organizer of the Northeast Scala Symposium; he was also a co-organizer of Scalathon in 2011 and 2012. Brian maintains a substantial GitHub repository of open source projects and is fond of saying that, even after many years as a software developer, programming is still one of his favorite activities.

    Comments on this page are now closed.


    Brian Clapper
    09/25/2016 2:34pm EDT

    You need nothing other than:

    (a) an up-to-date version of Chrome or Firefox,
    (b) a PDF viewer of some kind (if you want to follow along in your own copy of the slides), and
    (c) a WiFi-capable laptop (as network access is absolutely required).

    That’s it.

    09/25/2016 1:15pm EDT

    I am attending this training, but I primarily use a Windows laptop with the Chrome browser. Is that okay, or must I have a Mac?

    Brian Clapper
    09/23/2016 5:00pm EDT

    There’s no need to download anything. Materials will be supplied in class.

    Zach Beniash
    09/23/2016 3:18pm EDT

    Is there a need to download any material (such as Datasets explored in class) in advance, or is it enough to have a laptop with an up-to-date version of Chrome or Firefox?

    Ram swarna
    09/19/2016 8:32am EDT

    Am curious to know if there is a wait list as well.

    - Ram.

    Sophia DeMartini
    09/16/2016 1:07pm EDT

    Hi Charles,

    There is not an official waitlist, but if you email me at speakers"at", I can keep an eye on the training, and if anyone drops out, I’ll let you know.

    Thank you,

    Charles Durai
    09/16/2016 11:17am EDT

    The training is sold out. Is there a waiting list ?

    Brian Clapper
    09/09/2016 10:06am EDT

    Databricks has many opportunities for training. We have public classes, usually in the San Francisco area, but not always. We also run training classes at every Spark Summit. If you’re on the East Coast, the next East Coast Spark Summit will probably be in February (though, to my knowledge, that’s not yet finalized). We also offer in-person training, though that option implies that you have a group of people at the same company that need training.

    Gary Baker
    09/09/2016 9:55am EDT

    A correction to my question: I'm attending this conference, but not the training.

    Gary Baker
    09/09/2016 9:54am EDT

    It's unfortunate that I'm not able to attend this conference. Is there any other Databricks training, either on the same topics or any Spark-related Databricks training, happening outside of this conference?