Presented By O'Reilly and Cloudera
December 5–6, 2016: Training
December 6–8, 2016: Tutorials & Conference
Singapore

Spark camp: Exploring Wikipedia with Spark

Sameer Farooqui (Databricks)
9:00am–5:00pm Tuesday, December 6, 2016
Spark & beyond
Location: 328/329
Tags: real-time

Sponsored by:
Databricks

The real power and value proposition of Apache Spark lies in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Through hands-on examples, Sameer Farooqui explores various Wikipedia datasets to illustrate a variety of programming paradigms suited to these workloads.

The class will consist of about 60% lecture and 40% hands-on labs and demos. Note that the hands-on labs in class will be taught in Scala. All students will have access to Databricks for one month after class to continue working on labs and assignments.

Datasets explored in class:

  • Pageviews
  • Clickstream
  • Pagecounts
  • Live edits stream
  • English Wikipedia

Schedule

9:00am–9:30am
Introduction to Wikipedia and Spark

  • Overview of the five Wikipedia data sources
  • Overview of Apache Spark APIs, libraries, and cluster architecture
  • Demo: Logging into Databricks and a tour of the user interface

9:30am–10:30am
DataFrames and Spark SQL
Datasets used: Pageviews and Clickstream

  • Use an SQLContext to create a DataFrame from different data sources (S3, JSON, RDBMS, HDFS, Cassandra, etc.)
  • Run some common operations on DataFrames to explore them
  • Cache a DataFrame into memory
  • Correctly size the number of partitions in a DataFrame, including the size of each partition
  • Mix SQL and DataFrame queries
  • Join two DataFrames
  • Overview of how Spark SQL’s Catalyst optimizer converts logical plans to optimized physical plans
  • Create visualizations using Databricks and Google Visualizations
  • Use the Spark UI’s new SQL tab to troubleshoot performance issues (like input read size, identifying stage boundaries, and Cartesian products)

10:30am–11:00am
MORNING BREAK

11:00am–12:00pm
Spark core architecture

  • Driver and executor JVMs
  • Local mode
  • Resource managers (standalone)
  • How to optimally configure Spark (# of slots, JVM sizes, garbage collection, etc.)
  • Reading Spark logs and stdout on driver versus executors
  • Spark UI: Exploring the user interface to understand what’s going on behind the scenes of your application (# of tasks, memory of executors, slow tasks, Spark master/worker UIs, etc.)

12:00pm–1:00pm
LUNCH

1:00pm–2:00pm
Resilient distributed datasets
Dataset used: Pagecounts

  • When to use DataFrames vs. RDDs (type-safety, memory pressure, optimizations, I/O)
  • Narrow versus wide transformations and performance implications (pipelining, shuffle)
  • How transformations lazily build up a directed acyclic graph (DAG)
  • How a Spark application breaks down to Jobs > Stages > Tasks
  • Repartitioning an RDD (repartition versus coalesce)
  • Different memory persistence levels for RDDs (memory, disk, serialization, etc.)
  • Different types of RDDs (HadoopRDD, ShuffledRDD, MapPartitionsRDD, PairRDD, etc.)
  • Spark UI: How to interpret the new DAG visualization and troubleshoot common performance issues (such as groupByKey versus reduceByKey) by looking at shuffle read/write info
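
The groupByKey versus reduceByKey item above can be illustrated with a minimal sketch. It is plain Python, not Spark's API: the function names, the sample records, and the two-partition layout are all hypothetical, and the "shuffle" is only a count of records that would have to leave their partition. The point it shows is that combining values inside each partition first means fewer records cross the shuffle.

```python
from collections import defaultdict


def shuffle_group_by_key(partitions):
    """Simulate groupByKey followed by a sum: every record crosses the shuffle."""
    shuffled = [record for part in partitions for record in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def shuffle_reduce_by_key(partitions):
    """Simulate reduceByKey: sum inside each partition, then shuffle the partial sums."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        # Only one record per key per partition leaves the partition.
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


# Hypothetical (page title, hits) records split across two partitions.
partitions = [
    [("Spark", 3), ("Hadoop", 1), ("Spark", 2)],
    [("Spark", 4), ("Hadoop", 5), ("Hadoop", 1)],
]
grouped_totals, grouped_records = shuffle_group_by_key(partitions)
reduced_totals, reduced_records = shuffle_reduce_by_key(partitions)
```

Both functions produce the same totals; only the number of simulated shuffle records differs, which is the kind of gap the shuffle read/write columns in the Spark UI would be expected to surface.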

2:00pm–2:30pm
GraphX
Dataset used: Clickstream

  • Use cases for graph processing
  • Graph processing fundamentals: Vertex, edge (unidirectional, bidirectional), labels
  • Common graph algorithms and operators: in-degree, out-degree, PageRank, subgraph, shortest path, triplets

2:30pm–3:00pm
Spark Streaming
Datasets used: Live edits stream from multiple languages

  • Architecture of Spark Streaming: Receivers, batch interval, block interval, direct pull
  • How the microbatch mechanism in Spark Streaming breaks up the stream into tiny batches and processes them
  • How to use a StreamingContext to create input DStreams (Discretized Streams)
  • Common transformations and actions on DStreams (map, filter, count, union, join, etc.)
  • Creating a live, dynamically updated visualization in Databricks (that updates every two seconds)
  • Spark UI: How to use the new Spark Streaming UI to understand the performance of batch size versus processing latency
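
As a rough illustration of the microbatch idea in the list above, the sketch below groups timestamped events into fixed-width batches. It is plain Python rather than the Spark Streaming API: `micro_batches`, the sample events, and the two-second batch interval are assumptions made for the example.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, payload) events into consecutive fixed-width batches.

    Intervals with no events still yield an empty batch, so the output has one
    entry per interval from the start of the stream to the last event.
    """
    by_index = {}
    for timestamp, payload in events:
        batch_index = int(timestamp // batch_interval)
        by_index.setdefault(batch_index, []).append(payload)
    if not by_index:
        return []
    return [by_index.get(i, []) for i in range(max(by_index) + 1)]


# Hypothetical edit events: (seconds since stream start, language edition).
events = [(0.4, "en"), (1.1, "de"), (1.9, "en"), (2.3, "fr"), (7.0, "en")]
batches = micro_batches(events, batch_interval=2)
counts_per_batch = [len(batch) for batch in batches]
```

Each inner list stands in for the small dataset that a per-batch transformation such as a count would run over once per interval.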

3:00pm–3:30pm
AFTERNOON BREAK

3:30pm–3:45pm
Guest talk: Choosing an optimal storage backend for your Spark use case—Vida Ha

  • Data storage tips for optimal Spark performance
  • HDFS file/block sizes
  • Compression formats (gzip, Snappy, bzip2, LZO, LZ4, etc.)
  • Working with CSV, JSON, and XML file types

3:45pm–4:45pm
Machine learning
Datasets used: English Wikipedia and Live edits (optional)

  • Common use cases of machine learning with Spark
  • When to use Spark MLlib (w/ RDDs) versus Spark ML (w/ DataFrames)
  • ML Pipelines concepts: DataFrames, transformer, estimator, pipeline, parameter
  • Using TF-IDF and K-means to cluster 5 million articles into 100 clusters
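
For readers unfamiliar with TF-IDF, here is a minimal sketch of the textbook weighting (term frequency times log of inverse document frequency) in plain Python. It is not the Spark ML implementation, which may use a different (for example, smoothed) formula; the `tf_idf` function and the three sample strings are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(tokenized)
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical miniature "articles".
docs = ["spark runs on clusters", "spark sql queries", "graph processing on clusters"]
weights = tf_idf(docs)
```

Terms that appear in fewer documents receive higher weights, so the resulting vectors emphasize what is distinctive about each article; vectors of this kind are what a clustering algorithm such as K-means would then group.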

Sameer Farooqui

Databricks

Sameer Farooqui is a client services engineer at Databricks, where he works with customers on Apache Spark deployments. Sameer works with the Hadoop ecosystem, Cassandra, Couchbase, and the general NoSQL domain. Prior to Databricks, he worked globally as a freelance big data consultant and trainer and taught big data courses. Before that, Sameer was a systems architect at Hortonworks, an emerging data platforms consultant at Accenture R&D, and an enterprise consultant for Symantec/Veritas (specializing in VCS, VVR, and SF-HA).

Comments

12/19/2016 9:10pm SGT

Hi, post-session I am trying to open the .dbc file but I am unable to. How do I open it?

12/05/2016 11:05pm SGT

Do I need to know Scala for this session?
Also is there a version/slides for this class in Python?

Julius Novan Cahyadi
11/29/2016 8:23pm SGT

Hi,

Are there any prerequisite installations/applications that I need to prepare?
Should I download all of the above datasets first?

Thank you.

Sophia DeMartini
11/22/2016 8:07am SGT

Hi Roee and Duc,

Yes, please bring your own laptop so that you can follow along with the material being presented in the tutorial.

Thank you,
Sophia

11/22/2016 3:10am SGT

Hi,
Should I bring my own laptop?

11/07/2016 7:56pm SGT

Do I need to have a laptop to do the hands-on labs?