Presented By O'Reilly and Cloudera
December 5–6, 2016: Training
December 6–8, 2016: Tutorials & Conference

Just enough Scala for Spark

Dean Wampler (Anyscale)
9:00am–12:30pm Tuesday, December 6, 2016
Spark & beyond
Location: 321/322
Level: Intermediate

Prerequisite Knowledge

  • A basic familiarity with Spark and Java

What you'll learn

  • Discover why Scala is an ideal programming language for data engineers using Spark
  • Learn the core features of Scala necessary to write Spark code
  • Pick up tips and tricks for effective Scala


Apache Spark is written in Scala. Although Spark provides a Java API, many data engineers are adopting Scala since it’s the “native” language for Spark—and because Spark code written in Scala is much more concise than comparable Java code. Most data scientists, however, continue to use Python and R.

If you want to learn Scala for Spark, this is the tutorial for you. Dean Wampler offers an overview of the core features of Scala you need to use Spark effectively, using hands-on exercises with the Spark APIs. You’ll learn the most important Scala syntax, idioms, and APIs for Spark development.

Topics include:

  • Classes, methods, and functions
  • Immutable versus mutable values
  • Type inference
  • Pattern matching
  • Scala collections and the common operations on them (the basis of the RDD API)
  • Other Scala types like case classes, tuples, and options
  • Domain-specific languages in Scala
  • Effective use of the Spark shell (Scala interpreter)
  • Common mistakes (e.g., serialization errors) and how to avoid them
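
As a rough illustration of several of the topics above (immutable values, type inference, case classes, options, tuples, pattern matching, and collection operations), here is a minimal sketch in plain Scala. It is not taken from the tutorial materials: the Reading type, its fields, and the sample data are hypothetical, and it uses only the standard collections library rather than Spark itself. Operations of this style (collect, map, groupBy) are the kind the listing describes as the basis of the RDD API.

```scala
// Hypothetical record type for illustration; not from the tutorial materials.
// A case class gets equality, a readable toString, and pattern-matching support.
case class Reading(sensor: String, celsius: Option[Double])

object JustEnoughScala {
  // `val` declares an immutable value; the type List[Reading] is inferred.
  val readings = List(
    Reading("a", Some(21.5)),
    Reading("b", None),          // Option models a missing measurement
    Reading("a", Some(22.5)),
    Reading("c", Some(19.0))
  )

  // Pattern matching on the case class and on Option: keep only readings
  // that have a value, and turn each into a (sensor, celsius) tuple.
  val pairs: List[(String, Double)] = readings.collect {
    case Reading(sensor, Some(celsius)) => (sensor, celsius)
  }

  // Ordinary collection operations: group the tuples by sensor name,
  // then compute the mean temperature for each group.
  val meanBySensor: Map[String, Double] =
    pairs.groupBy(_._1).map { case (sensor, values) =>
      val temps = values.map(_._2)
      sensor -> (temps.sum / temps.size)
    }

  def main(args: Array[String]): Unit =
    meanBySensor.toSeq.sortBy(_._1).foreach { case (sensor, mean) =>
      println(s"$sensor: $mean")
    }
}
```

The definitions can also be pasted into the Scala interpreter or the Spark shell one at a time to inspect the inferred types of readings, pairs, and meanBySensor.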

Dean Wampler


Dean Wampler is an expert in streaming data systems, focusing on applications of machine learning and artificial intelligence (ML/AI). He’s head of developer relations at Anyscale, which is developing Ray for distributed Python, primarily for ML/AI. Previously, he was an engineering VP at Lightbend, where he led the development of Lightbend CloudFlow, an integrated system for building and running streaming data applications with Akka Streams, Apache Spark, Apache Flink, and Apache Kafka. Dean is the author of Fast Data Architectures for Streaming Applications, Programming Scala, and Functional Programming for Java Developers, and he’s the coauthor of Programming Hive, all from O’Reilly. He’s a contributor to several open source projects. A frequent conference speaker and tutorial teacher, he’s also the co-organizer of several conferences around the world and several user groups in Chicago. He earned his PhD in physics from the University of Washington.