Make Data Work
Oct 15–17, 2014 • New York, NY

Spark Camp

Paco Nathan, Michael Armbrust (Databricks), Tathagata Das (Databricks), Matei Zaharia (Databricks), Reynold Xin (Databricks), Ameet Talwalkar (Carnegie Mellon University | Determined AI), Holden Karau (Independent), Joseph Bradley (Databricks), Sameer Farooqui (Databricks), Patrick Wendell (Databricks)
9:00am–5:00pm Wednesday, 10/15/2014
Hadoop & Beyond
Location: Hall A 23/24
Average rating: 3.75 (20 ratings)

Spark Camp: An Introduction to Apache Spark with Hands-on Tutorials

Spark Camp, organized by the creators of the Apache Spark project at Databricks, will be a day-long, hands-on introduction to the Spark platform, including Spark Core, the Spark Shell, Spark Streaming, Spark SQL, MLlib, and more. We will start with an overview of use cases and demonstrate writing simple Spark applications. We will cover each of the main components of the Spark stack via a series of technical talks targeted at developers who are new to Spark. Intermixed with the talks will be periods of hands-on lab work. Attendees will download and use Spark on their own laptops, as well as learn how to configure and deploy Spark in distributed big data environments, including common Hadoop distributions and Mesos.

Spark Camp is also happening at Strata Conference in Barcelona, November 19-21.

Paco Nathan

Paco Nathan is known as a “player/coach” with core expertise in data science, natural language processing, machine learning, and cloud computing. He has 35+ years of experience in the tech industry, at companies ranging from Bell Labs to early-stage startups. His recent roles include director of the Learning Group at O’Reilly and director of community evangelism at Databricks and Apache Spark. Paco is the cochair of Rev conference and an advisor for Amplify Partners, Deep Learning Analytics, Recognai, and Primer. He was named one of the "top 30 people in big data and analytics" in 2015 by Innovation Enterprise.

Michael Armbrust


Michael Armbrust is the lead developer of the Spark SQL and Structured Streaming projects at Databricks. Michael’s interests broadly include distributed systems, large-scale structured storage, and query optimization. Michael holds a PhD from UC Berkeley, where his thesis focused on building systems that allow developers to rapidly build scalable interactive applications and specifically defined the notion of scale independence.

Tathagata Das


Tathagata Das is an Apache Spark committer and a member of the PMC. He is the lead developer behind Spark Streaming, which he started while a PhD student in the UC Berkeley AMPLab, and is currently employed at Databricks. Prior to Databricks, Tathagata worked at the AMPLab, conducting research about data-center frameworks and networks with Scott Shenker and Ion Stoica.

Matei Zaharia


Matei Zaharia started the Spark project at UC Berkeley and is currently CTO of Databricks. He serves as Spark’s vice president at Apache. In spring 2015, he is also beginning an assistant professor position at MIT.

Reynold Xin


Reynold Xin is a cofounder and chief architect at Databricks as well as an Apache Spark PMC member and release manager for Spark’s 2.0 release. Prior to Databricks, Reynold was pursuing a PhD at the UC Berkeley AMPLab, where he worked on large-scale data processing.

Ameet Talwalkar

Carnegie Mellon University | Determined AI

Ameet Talwalkar is cofounder and chief scientist at Determined AI and an assistant professor in the School of Computer Science at Carnegie Mellon University. His research addresses scalability and ease-of-use issues in the field of statistical machine learning, with applications in computational genomics. Ameet led the initial development of the MLlib project in Apache Spark. He is the coauthor of the graduate-level textbook Foundations of Machine Learning (MIT Press) and teaches an award-winning MOOC on edX, Distributed Machine Learning with Apache Spark.

Holden Karau


Holden Karau is a transgender Canadian software engineer working in the Bay Area. Previously, she worked at IBM, Alpine, Databricks, Google (twice), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She’s a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.

Joseph Bradley


Joseph Bradley is a software engineer working on machine learning at Databricks. Joseph is an Apache Spark committer and PMC member. Previously, he was a postdoc at UC Berkeley. Joseph holds a PhD in machine learning from Carnegie Mellon University, where he focused on scalable learning for probabilistic graphical models, examining trade-offs between computation, statistical efficiency, and parallelization.

Sameer Farooqui


Sameer Farooqui is a client services engineer at Databricks, where he works with customers on Apache Spark deployments. Sameer works with the Hadoop ecosystem, Cassandra, Couchbase, and the general NoSQL domain. Prior to Databricks, he worked as a freelance big data consultant and trainer globally and taught big data courses. Before that, Sameer was a systems architect at Hortonworks, an emerging data platforms consultant at Accenture R&D, and an enterprise consultant for Symantec/Veritas (specializing in VCS, VVR, and SF-HA).

Patrick Wendell


Patrick Wendell is a cofounder of Databricks as well as a founding committer and PMC member of Apache Spark. Patrick has acted as release manager for several Spark releases in addition to maintaining several subsystems of Spark’s core engine. At Databricks, Patrick directs the company’s maintenance and development of Spark.

Patrick holds an MS in computer science from UC Berkeley, where his research focused on low-latency scheduling for large-scale analytics workloads, and a BSE in computer science from Princeton University.

Comments on this page are now closed.


Paco Nathan
10/16/2014 5:37am EDT

Hi Muni, we look forward to seeing you in the next one :) We are giving away those USB sticks at our booth #344, and the material is online at under “code+data”

Muni Xu
10/15/2014 3:22pm EDT

Hi Paco,

I’m sorry to have missed Spark Camp today, but I’d still like to review the awesome materials you guys provided. Do you still have an extra USB stick that I could get? Or have you also put the materials on GitHub or any other public place?

Thanks in advance!

Paco Nathan
10/13/2014 11:44am EDT

Thank you Robert -

As mentioned below, this training has different tracks depending on how much a person has worked with Spark already.

For Spark Camp, one does not need Scala background, either for the Intro or Advanced tracks. We present coding examples in Python, Java, Scala. Understanding the model used by Spark is much more important than any specific language, and in fact the API calls are intended to be much the same throughout different languages. Scala becomes important if you want to work on Spark internals.

In terms of Hadoop: speaking as someone who’s led Data Science teams working on large-scale problems, the use cases are not tied to Hadoop per se, and the use cases are the important parts. Even so, this tutorial is not intended as an “Intro to Data Science” course; there are many other resources for that throughout the Strata program. We will focus on Spark and how to leverage it.

In terms of the Apache Spark certificate: this establishes industry standards for measuring and validating Spark technical expertise. A tutorial will not meet that bar. In other words, Spark Camp was not designed as prep for the cert; it was designed as an intensive, hands-on course, presented by lead committers on Spark. You get access to them directly, for Q&A, etc. Having said that, if you are comfortable coding the advanced exercises in Spark Camp, then the exam probably won’t be a huge surprise. But it will be a challenge nonetheless. We set a high bar.

The certificate exam tests for several points:

  • understanding breadth of Spark API usage across Scala, Java, Python
  • applying best practices to avoid runtime issues and performance bottlenecks
  • distinguishing Spark features and practices from MapReduce usage
  • integrating SQL, Streaming, ML, Graph atop the Spark unified engine
  • solving typical use cases with Spark in Scala, Java, Python

Definitely, some experience building real-world apps in Spark is essential to pass that. Other good ways to prep for the cert… if you:

  • read the Apache Spark user email list regularly and could field say 80% of the newbie questions
  • have mastered the material released so far in the Learning Spark book
  • have taken at least two of our Spark professional workshops

Robert Chirwa
10/13/2014 11:19am EDT

In case anyone missed this game changer, “Spark Breaks Previous Large-Scale Sort Record”

Congratulations to the folks at Databricks. This further pumps me up for Spark Camp and learning from the minds behind the feat above.

In terms of knowing Scala for the cert preparation, is there a good resource that you could recommend along the lines of “Just Enough Scala (for Spark)”? :-) I mean a relevant, minimalistic guide to Scala that keeps one focused on the aspects relevant to using the spark-shell. I am already proficient with Python, if that helps guide your answer.

I know this is part of the Hadoop World conference. What depth of Hadoop knowledge would be advisable coming into Spark Camp? Fair to ask, since Spark bills itself as being accessible to data scientists who do not necessarily know about the Hadoop Distributed File System (HDFS).

Paco Nathan
10/10/2014 10:19am EDT

Great question, Robert. We have material at Spark Camp that’s intended for different audience segments, depending on how much prior experience a person has with Spark. The more experience you come in with, the more you will probably get out of it. For example, among the instructors there will be several of the project leads for Apache Spark. You can discuss the nuances of Spark directly with them.

The “Learning Spark” book is a great place to start. The authors will be among the instructors for Spark Camp, and the six chapters released so far in early eval are great preparation. Other good resources online are listed at Reviewing the Spark Summit videos is another great way to prep.

BTW, this approach is also a great way to prepare for the new Apache Spark developer certification exam, from O’Reilly + Databricks

Robert Chirwa
10/10/2014 9:05am EDT

For those of us who would like to prepare for Spark Camp and maximize our experience, could you recommend any specific readings or Spark tutorials/webcasts/videos to prime us? Is reading “Learning Spark” a sufficient place for a Spark newbie to start? Thanks in advance.

Paco Nathan
10/07/2014 1:23pm EDT

Great question, Rajesh -
What’s required for a laptop to use in the tutorial?

  • reasonably current hardware (+2GB)
  • MacOSX, Windows, Linux — all work fine
  • make sure you don’t have corporate security controls that prevent use of the network
  • have JDK 6/7/8 installed
  • do not install Spark with Homebrew or Cygwin

We will provide USB sticks with the necessary data+code

Rajesh Haran
10/07/2014 1:14pm EDT

What is preferred for the training, a Linux desktop or Windows 7? Should any prerequisite software be installed?

Paco Nathan
10/07/2014 9:46am EDT

Great question. No, actually it’s best not to install Spark in advance. We will provide USB sticks with the required download, plus data used in the exercises.

In particular, avoid using Homebrew on MacOSX or Cygwin on Windows — those will conflict with many Big Data frameworks.

Nixon Patel
10/07/2014 8:18am EDT

Do I need to have Spark installed on my laptop for this class? What preparation is required to make the most of this class?

aravindan marimuthu
09/24/2014 7:20am EDT

We are evaluating Spark for our project. Is it possible for you to add me to this class? If it is completely sold out, please add my name to the wait list and keep me posted if any seat becomes available.

Paco Nathan
09/02/2014 2:26pm EDT

Best to have some programming experience in Python, Scala, or Java — plus some familiarity with Big Data use cases.

Part is in SQL. Other parts involve: streaming, machine learning, graph queries.

Thomas Dinsmore
09/02/2014 2:20pm EDT

What prior experience and training do you recommend for this class?

Paco Nathan
08/15/2014 5:11pm EDT

That’s a good point, Steve. The content for Spark Camp parallels what we run at Spark Summit.

Except, of course: the Spark Camp event at Strata will include updates from subsequent releases, new case studies, much more work on extending the set of sample apps, more example integrations, etc.

Steven Mckinney
08/15/2014 11:26am EDT

Wondering how this course will differ from the intro and advanced training at Spark Summit 2014?