Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY

Spark Development Bootcamp

Laurent Weichberger (OmPoint Innovations, LLC)
9:00am–5:00pm Tuesday, 09/29/2015
Location: 3D 01/12
Average rating: ***..
(3.86, 7 ratings)
Slides:   1-PDF    2-PDF    3-PDF    4-PDF 

Materials or downloads needed in advance

Students, please arrive to class with:

  • Laptop
  • A basic understanding of software development
  • Some experience coding in Python, Java, SQL, or Scala
  • A modern operating system (Windows, OS X, or Linux) and a modern browser (Internet Explorer is not supported)

Please also review the presentation slides (PDF).



This three-day curriculum features advanced lectures and hands-on technical exercises for Spark usage in data exploration, analysis, and building big data applications.

Spark is a unified framework for big data analytics. Spark provides one integrated API for use by developers, data scientists, and analysts to perform diverse tasks, such as batch analytics, stream processing, and statistical modeling, that would previously have required separate processing engines. Spark supports a wide range of popular languages including Python, R, Scala, SQL, and Java. Spark can read from diverse data sources and scale to thousands of nodes.

In this class, you will learn how to build and manage Spark applications using Spark’s core programming APIs and its standard libraries. Hands-on exercises will be done in Scala. Course materials emphasize design patterns and best practices for leveraging Spark in the context of other popular, complementary frameworks for building and managing enterprise data workflows.

Those who attend the training will have opportunities during the tutorial to meet and have discussions with members of the Spark development community, including Q&A sessions and discussions about real-world use cases. You will receive a free Databricks account for the duration of training.

Course Learning Objectives

After taking this class you will be able to:

  • Build a data pipeline using Spark DataFrames and Spark SQL
  • Understand Spark concepts, architecture, and applications
  • Execute SQL queries on large scale data using Spark
  • Explore and visualize your data by entering and running code in Notebooks
  • Train and use an ML model on real data with MLlib, Spark’s machine learning library
  • Tune Spark job performance and troubleshoot errors using logs and administration UIs
  • Find answers to common questions using Spark documentation and discussion forums
  • Write and monitor a Spark Streaming job to analyze data with sub-second latency
  • Understand common use-cases and business applications of Spark
  • Recognize all of the topics tested by the Spark Developer Certification and know what further work is required to prepare to take and pass the exam



Outline of topics covered in class

  • History of Big Data & Apache Spark
    – Introduction to the Spark Shell and the training environment
    – Just enough Scala for Spark
    – Introduction to Spark DataFrames and Spark SQL
    – Introduction to RDDs
    – Lazy Evaluation
    – Transformations and Actions
    – Caching
    – Using the Spark UIs
  • Data Sources: reading from Parquet, S3, Cassandra, HDFS, and your local file system
  • Spark’s Architecture
  • Programming with Accumulators and Broadcast Variables
  • Debugging and tuning Spark jobs using Spark’s admin UIs
  • Memory & Persistence
  • Advanced programming with RDDs (understanding the shuffle phase, partitioning, etc.)
  • Visualization: matplotlib, gg_plot, dashboards, exploration and visualization in notebooks
  • Introduction to Spark Streaming
  • Introduction to MLlib and GraphX
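
For a taste of the lazy evaluation, transformations and actions, and caching topics in the outline above, here is a minimal sketch in plain Scala. It is an analogy only and is not taken from the course materials: it does not use Spark, and a standard-library collection view stands in for an RDD. The names LazyEvaluationSketch, Trace, and run are made up for illustration, and Scala 2.13 or later is assumed.

```scala
// Plain-Scala analogy only: a collection view stands in for an RDD.
// Assumes Scala 2.13 or later; no Spark dependency is needed.
object LazyEvaluationSketch {

  // Hypothetical record of what was observed at each step.
  final case class Trace(
      callsAfterBuild: Int,        // after describing the pipeline
      callsAfterFirstAction: Int,  // after forcing it once
      callsAfterSecondAction: Int, // after forcing it a second time
      callsAfterCachedReuse: Int,  // after materializing once and reusing
      evens: List[Int],
      total: Int,
      cachedTotal: Int
  )

  def run(): Trace = {
    var calls = 0 // how many times the mapping function has actually run

    // "Transformations": this only describes the work; nothing runs yet.
    val squaresOfEvens = (1 to 10).view
      .map { n => calls += 1; n * n }
      .filter(_ % 2 == 0)
    val afterBuild = calls

    // "Action": asking for a concrete result runs the whole pipeline.
    val evens = squaresOfEvens.toList
    val afterFirst = calls

    // A second "action" recomputes from scratch, much as an uncached RDD would.
    val total = squaresOfEvens.sum
    val afterSecond = calls

    // "Caching": materialize once, then reuse the stored copy.
    val cached = squaresOfEvens.toVector
    val cachedTotal = cached.sum // reads stored values; the map does not rerun
    val afterCachedReuse = calls

    Trace(afterBuild, afterFirst, afterSecond, afterCachedReuse,
      evens, total, cachedTotal)
  }

  def main(args: Array[String]): Unit =
    println(run())
}
```

In Spark itself the same idea appears as transformations such as map and filter, which only record what to do, and actions such as collect and count, which trigger the computation; persisting an RDD plays the role that the materialized Vector plays here.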

Laurent Weichberger

OmPoint Innovations, LLC

Laurent Weichberger is in constant motion as the Big Data Bear and Sr. Technical Instructor for Datameer, Inc. Laurent has been teaching Java since 2000, and started his work in Big Data during 2012, when he worked for Hortonworks and Cloudera. He was the Director of Training at DataStax, and later became Director of Practice at Couchbase. More recently he spent the better part of 2015 working for Databricks writing and teaching about Spark, and he is now focused full time on promoting the wondrous Datameer software worldwide.

Comments on this page are now closed.


Bhairav Mehta
09/28/2015 7:25pm EDT

Any chance to enroll in this session??? I would love to find out tomorrow morning.

Nigel Noyes
09/24/2015 3:48pm EDT

Any chance to be added to this course? I just got company approval to attend.

Jeremy Cunningham
09/22/2015 10:54am EDT

The title of this training was originally something like “Advanced Spark”. I have been using Spark for several months and already went through a 5-day intro from Cloudera. What I need is data science with Spark: using the estimators, working with one-hot encoding, etc. What I don’t need is word count and descriptions of what an RDD is. The new title sounds more like the latter. Will we be covering, not just touching on but actually working with, several estimators and data science tasks? I really don’t mean to be rude, but I don’t want to take up a spot from someone who needs to learn Spark, and I don’t want to fly out there and waste my time.

Pushkar Bhirud
09/21/2015 9:19am EDT

Any chance some more spots will open up for this?

Tahir Mehmood
08/31/2015 11:06am EDT

Why are we not using Python? I was actually looking forward to this and have been learning a lot of Python.

Carolyn Duby
08/30/2015 6:45am EDT

Do we need to install anything on the laptop that we bring to class or do we just use the browser? I have a work laptop but I can only install packages available in Landesk. Looking forward to class.

Viral Parikh
08/28/2015 10:45am EDT

@sophia – thanks! I emailed you, and I do not check this site or link often. It would be great if you could email me. Thank you in advance!

Sophia DeMartini
08/28/2015 7:26am EDT

Yes, will do!

Ashutosh Sharma
08/28/2015 6:51am EDT

@Sophia – please let me know in case of any cancellations? I want to enroll for this.

Sophia DeMartini
08/27/2015 8:55am EDT

Hi Viral – unfortunately, there isn’t a waitlist, but if something changes, or if another spot opens up, we’ll make sure to let you know.

Viral Parikh
08/27/2015 8:47am EDT

@Laurent @Sophia – Is there any way to get on the waiting list?

Sophia DeMartini
08/20/2015 4:15pm EDT

@Rahul, @Amit, and @Stephanie -

Our Registration Coordinator has/will be reaching out to each of you to discuss getting you signed up (we were able to add a few more spaces).

As far as a waitlist goes, we do not have one, unfortunately. It’s first come, first served.


Laurent Weichberger
08/19/2015 7:09pm EDT

@Rahul, I hear they are working on figuring that out, there may not be a precedent for this at Strata… stand by.

Rahul Joglekar
08/19/2015 6:43pm EDT

@Laurent – Do we know if there is a waiting list?

Amit Juneja
08/19/2015 7:36am EDT

Thanks Ben! It seems day 1 is sold out. Can I still sign up for all three days?

Ben Lorica
08/19/2015 6:48am EDT

Hi Amit,

This is a 3-day course.

Amit Juneja
08/19/2015 6:46am EDT

Is this a one day or a three day course? Where can I find details on the “intensive 3-day” course?

Laurent Weichberger
08/18/2015 5:30pm EDT

@All, our lab exercises will be performed in the Databricks notebook environment. As the pre-reqs say, you will need a browser, since this is essentially a web application environment with an embedded Scala REPL that talks to a backend single-node Spark cluster.

Laurent Weichberger
08/18/2015 5:27pm EDT

@Grecia, your question prompted us to post this more comprehensive description. I trust you have the answer you need now?

Laurent Weichberger
08/18/2015 5:25pm EDT

@Suraj, yes, you need to be proficient in either Python, Java, or Scala to get the most out of this class; however, lab exercises will be in Scala only.

Laurent Weichberger
08/18/2015 5:24pm EDT

@Jack, we will be doing all the lab exercises in Scala. The new outline (above) should answer the topics question, yes?

Laurent Weichberger
08/18/2015 5:22pm EDT

@Stephanie, I will ask about getting you on a waiting list (if there is one), stand by please.

Stephanie Rivera
08/16/2015 11:37am EDT

Is there any way to get on the waiting list? By the time I got employer approval this was full, but it was the driving reason for why I am allowed to go.

Stephanie Rivera
08/06/2015 7:54am EDT

Does anyone plan to answer these questions before Aug 14?

Jinyuan Zhou
07/10/2015 10:04am EDT

I have some working experience with Spark. Can you give more details about what advanced topics will be covered and what programming language you will be using in the training? Thanks.

Suraj Eyanni
06/29/2015 12:56pm EDT

What are the pre-requisites for this class? Do I need to be proficient in Python, Scala, or any other programming language to get the most out of this class?


Grecia Lapizco
06/04/2015 9:08am EDT

Can you please provide more details about the class content? (e.g. preview of topics/lessons plans)