Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Spark camp: Apache Spark 2.0 for analytics and text mining with Spark ML

Andy Konwinski (Databricks)
9:00am–5:00pm Tuesday, March 14, 2017
Spark & beyond
Location: San Jose Ballroom, Marriott
Secondary topics:  Streaming, Text
Average rating: 4.43 (7 ratings)

What you'll learn

  • Explore Apache Spark 2.0 core concepts with a focus on Spark's machine-learning library



Andy Konwinski introduces you to Apache Spark 2.0 core concepts with a focus on Spark’s machine-learning library, using text mining on real-world data as the primary end-to-end use case.

Join Andy to explore and wrangle data using Spark’s Dataset and DataFrame abstractions. You’ll use the Spark ML API to build an ML pipeline that transforms free text into useful features via Spark ML’s Transformer abstraction (e.g., one-hot encoding and term frequency counting), and you’ll learn about model selection, training/fitting, and validation/inspection, as well as parameter tuning with grid search.
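As a rough preview, a text-mining pipeline of the kind described above might look like the following minimal sketch (the tiny inline dataset, column names, and parameter values are illustrative assumptions, not the actual courseware):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("text-pipeline").getOrCreate()
import spark.implicits._

// Toy training data, purely illustrative: "text" and "label" columns.
val training = Seq(
  ("spark makes big data simple", 1.0),
  ("unrelated marketing email", 0.0)
).toDF("text", "label")

// Transformers: split free text into tokens, then count term frequencies.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")

// Estimator: a simple classifier on the extracted features.
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Grid search over hyperparameters, selected via cross-validation.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1000, 10000))
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)

// Model selection, training/fitting, and tuning happen in one call.
val model = cv.fit(training)
```

Note how the Pipeline treats feature extraction and model fitting as one unit, so the grid search can tune both feature parameters (numFeatures) and model parameters (regParam) together.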

The class will consist of approximately 50% hands-on programming labs in Scala and 50% lecture and discussion.


Andy Konwinski


Andy Konwinski is a founder and VP at Databricks. He has been working on Spark since the early days of the project, starting during his PhD in the UC Berkeley AMPLab, and has contributed as a software engineer to Spark’s performance evaluation components, testing infrastructure, documentation, and more. He was also a creator of the Apache Mesos project, contributed to the Hadoop Job Scheduler, and led the creation of the UC Berkeley AMP Camps and the Spark Summits. Andy coauthored Learning Spark from O’Reilly.

Comments on this page are now closed.


Andy Konwinski | FOUNDER
03/12/2017 9:01am PST

@Cathy Farrell, @Asha Saini, and @Patrick Lu, Re prereqs/prep:

You need to bring a laptop to class with Firefox or Chrome installed. That’s it! We will start the day by logging into Databricks Community Edition together and importing the courseware. Databricks Community Edition is a free service that makes it easy to run a real Apache Spark cluster (and choose between many different versions), edit/run/debug/etc. real Spark 2.x code, and more.

After class, you can export your work as Scala source code if you want to try running it somewhere else (e.g., on your laptop with Apache Spark installed). And you will continue to have free access to the courseware in your Community Edition account forever if you want to keep using that as your learning environment going forward.


@Jaime As and @Peter Schmidt, Re Scala experience required:

You don’t need production experience with Scala, but you should have used it before class and understand at least the basics of Scala’s (1) first-class functions (a.k.a. closures or lambdas) and (2) collections APIs (e.g., List and its functional operators like map(), reduce(), flatMap(), filter(), etc.).
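For a concrete sense of the level Andy is describing, here is a short illustrative sketch (the variable names and data are made up, not from the course) of those two basics:

```scala
// (1) A first-class function (closure) capturing `threshold` from
// its surrounding scope.
val threshold = 3
val isLong: String => Boolean = word => word.length > threshold

val words = List("spark", "ml", "scala", "fun")

// (2) Functional operators on a List.
val long    = words.filter(isLong)     // keeps "spark" and "scala"
val lengths = words.map(_.length)      // List(5, 2, 5, 3)
val chars   = words.flatMap(_.toList)  // all characters, flattened
val total   = lengths.reduce(_ + _)    // 15
```

If you can read and predict the output of a snippet like this, you have enough Scala for the labs.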

For a Spark-oriented primer on Scala, see the crash course created by Holden Karau, my awesome coauthor on the O’Reilly book Learning Spark. Also check out the Scala reference from an NLP class taught at the University of Texas at Austin in 2013 by Dan Garrette.


I’m excited to see you all in the classroom in 2 days!

Cathy Farrell
03/10/2017 3:15am PST

Is there anything we should install prior to the session?

03/08/2017 3:28am PST

What are the prerequisites for this class?
Is there any setup for the programming labs (GitHub, materials, etc.)?

01/30/2017 3:14am PST

Are we going to use a cloud-based Scala/Spark notebook? Or do we need to set up a local Spark node on our own laptops before the class?

01/22/2017 11:59pm PST

How proficient in Scala does one need to be for this tutorial?