Presented By O'Reilly and Cloudera
December 5-6, 2016: Training
December 6–8, 2016: Tutorials & Conference

Scala and the JVM as a big data platform: Lessons from Apache Spark

Dean Wampler (Anyscale)
12:05pm–12:45pm Wednesday, December 7, 2016
Spark & beyond
Location: Summit 2 Level: Advanced
Average rating: 3.80 (5 ratings)

Prerequisite Knowledge

  • Advanced JVM experience and prior experience using Spark, MapReduce, or similar big data tools

What you'll learn

  • Understand why Scala is a great, if imperfect, language for data developers
  • Learn how the Scala interpreter (the Spark shell) works and how to use it more effectively
  • Understand how the JVM's object model and use of memory are suboptimal for big data
  • See how the Tungsten project is working around these JVM limitations
  • Explore how Scala and the JVM could be improved in the future to be even better tools for big data


Apache Spark is implemented in Scala, and its user-facing Scala API is very similar to Scala’s own Collections API. The power and concision of this API have already brought many developers to Scala. The core abstractions in Spark have created a flexible, extensible platform for applications like streaming, SQL queries, machine learning, and more. Scala offers many advantages over Java:
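
To illustrate the similarity, here is a sketch (plain Scala, no Spark dependency) of a word-count-style pipeline written against the Collections API; with Spark, replacing the input `Seq` with an RDD (e.g., from `sc.textFile`) would leave the chain of method calls essentially unchanged:

```scala
// A word-count pipeline using Scala's Collections API. The Spark
// RDD version uses the same flatMap/filter/map chain; the comment
// below marks the one step Spark would express differently.
object CollectionsLikeSpark {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("""\W+"""))          // tokenize on non-word characters
      .filter(_.nonEmpty)
      .map(word => (word.toLowerCase, 1))
      .groupBy(_._1)                        // Spark: reduceByKey(_ + _)
      .map { case (w, pairs) => (w, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("Spark and Scala", "Scala and the JVM"))
    counts.toSeq.sortBy(_._1).foreach(println)
  }
}
```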

  • A pragmatic balance of object-oriented and functional programming
  • An interpreter mode that allows the same sort of exploratory programming that data scientists have enjoyed with Python and other languages; Scala-centric notebooks are also now available
  • A rich collections library that enables composition of operations for concise, powerful code
  • Natural expression of tuples, which are very convenient for working with data
  • Pattern matching that makes data deconstruction fast and intuitive
  • Type inference that provides safety and feedback to the developer, yet requires minimal typing of actual type signatures
  • Idioms that lend themselves to the construction of small domain-specific languages, which are useful for building concise and intuitive libraries for domain experts
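
Several of these features combine naturally in everyday data-handling code. The following small sketch (a hypothetical example, not from any particular library) parses "name,age" records using tuples, pattern matching, and type inference together:

```scala
// Parse "name,age" lines into (String, Int) tuples.
// Pattern matching deconstructs the split result, a guard validates
// the age field, and the result types are inferred throughout.
object RecordDemo {
  def parse(line: String): Option[(String, Int)] =
    line.split(",") match {
      case Array(name, age) if age.trim.nonEmpty && age.trim.forall(_.isDigit) =>
        Some((name.trim, age.trim.toInt))
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    val records = Seq("dean,55", "bad record", "ada,36").flatMap(parse)
    // Tuples destructure directly in the for-comprehension pattern:
    for ((name, age) <- records) println(s"$name is $age")
  }
}
```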

Spark, like almost all open source big data tools, leverages the JVM, an excellent general-purpose platform for scalable computing. However, the JVM's management of objects is suboptimal for high-performance data crunching. Dean Wampler gives an overview of Spark, explaining ongoing improvements and what could be done to make Scala and the JVM even better tools for big data. For example, the way objects are organized in memory, and the impact that layout has on garbage collection, can be improved for the special case of big data. Hence, the Spark project recently started Project Tungsten to build internal optimizations using the following techniques:

  • Custom data layouts that use memory very efficiently with cache-awareness
  • Manual memory management, both on-heap and off-heap, to minimize “garbage” and GC pressure
  • Code generation to create optimal implementations of certain heavily used expressions from user code
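
To make the first two techniques concrete, here is a toy illustration (not Spark's actual code) of the idea behind custom layouts and manual memory management: pack fixed-width records into one contiguous buffer instead of allocating a heap object per record. Swapping `ByteBuffer.allocate` for `allocateDirect` would place the data off-heap, entirely outside the garbage collector's purview:

```scala
import java.nio.ByteBuffer

// Tungsten-style packed layout (illustrative only): store
// (id: Long, value: Double) records contiguously, 16 bytes each,
// with no per-record object headers and no per-record garbage.
final class PackedRecords(capacity: Int) {
  private val RecordSize = 8 + 8  // Long + Double, in bytes
  private val buf = ByteBuffer.allocate(capacity * RecordSize)
  private var count = 0

  def append(id: Long, value: Double): Unit = {
    buf.putLong(id).putDouble(value)  // relative writes advance the position
    count += 1
  }

  // Absolute reads: compute each field's offset from the record index.
  def id(i: Int): Long      = buf.getLong(i * RecordSize)
  def value(i: Int): Double = buf.getDouble(i * RecordSize + 8)
  def size: Int = count
}
```

By contrast, a `Seq` of case-class instances stores each record as a separate heap object, with an object header and a pointer per record, scattered across the heap. The packed layout uses exactly 16 bytes per record, scans sequentially in cache-friendly order, and creates no garbage for the collector to trace.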

Dean Wampler


Dean Wampler is an expert in streaming data systems, focusing on applications of machine learning and artificial intelligence (ML/AI). He’s head of developer relations at Anyscale, which is developing Ray for distributed Python, primarily for ML/AI. Previously, he was an engineering VP at Lightbend, where he led the development of Lightbend CloudFlow, an integrated system for building and running streaming data applications with Akka Streams, Apache Spark, Apache Flink, and Apache Kafka. Dean is the author of Fast Data Architectures for Streaming Applications, Programming Scala, and Functional Programming for Java Developers, and he’s the coauthor of Programming Hive, all from O’Reilly. He’s a contributor to several open source projects. A frequent conference speaker and tutorial teacher, he’s also the co-organizer of several conferences around the world and several user groups in Chicago. He earned his PhD in physics from the University of Washington.