Presented By O'Reilly and Cloudera
Make Data Work
Feb 17–20, 2015 • San Jose, CA

Why Spark Is the Next Top (Compute) Model

Dean Wampler (Anyscale)
10:40am–11:20am Friday, 02/20/2015
Spark in Action
Location: 230 C

Spark is an open-source computation platform for Big Data. All the major Hadoop vendors have embraced Spark as a replacement for MapReduce, the venerable standard for writing Hadoop jobs. This talk explains why this change was necessary.

MapReduce has several major deficiencies that needed to be fixed:

  • The MapReduce programming model is limited and difficult to use, requiring special expertise. The API for Hadoop MapReduce, in particular, is very low-level, which limits developer productivity.
  • Hadoop MapReduce has significant performance issues, especially when complex workflows require sequencing several MapReduce jobs. Iterative algorithms are also infeasible without “workarounds”, which is a problem for the many machine learning algorithms that use iteration for training, such as gradient descent and backpropagation.
  • MapReduce does not support event-stream processing. It is limited to its original purpose: offline (batch-mode) analysis of data sets.
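
To make the iteration problem in the second bullet concrete, here is a minimal sketch of gradient descent in plain Python. It does not use Hadoop or Spark; the data set, learning rate, and function names are hypothetical and chosen only for illustration. The point is that every iteration makes a full pass over the same data, so a system that runs each pass as a separate job and rereads its input from disk pays that cost on every iteration.

```python
# Plain-Python sketch (no Hadoop or Spark involved): fit y = w * x by gradient descent.
# The points, learning rate, and iteration count are hypothetical illustration values.

def gradient(w, points):
    """One full pass over the data: mean gradient of the squared error of y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in points) / len(points)

def fit(points, learning_rate=0.05, iterations=200):
    """Repeat the full pass many times, nudging w against the gradient each time."""
    w = 0.0
    for _ in range(iterations):
        # Each loop iteration scans the entire data set again.
        w -= learning_rate * gradient(w, points)
    return w

points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated from y = 2x
w = fit(points)
```

Here the 200 passes run over an in-memory list, so they are cheap. Chaining one batch job per pass, each rereading its input, is the kind of workaround the bullet refers to.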

We’ll see how Spark addresses all three concerns. It provides a high-level API that enables large MapReduce programs to be rewritten as small “scripts”. An integrated SQL query engine provides the best of both worlds: SQL-based queries for asking questions and a “Turing-complete”, general-purpose programming model for other chores. Spark has excellent performance, up to 100x faster than comparable MapReduce programs on some workloads. Finally, Spark supports stream processing.
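
As a rough illustration of what a small “script” built from high-level primitives looks like, here is a word count written in plain Python. This is only a sketch: it runs on a local list rather than a cluster, it is not Spark's API, and the function name `word_count` is hypothetical. The comments point to the Spark RDD operations (`flatMap`, `reduceByKey`) that play the analogous roles.

```python
from collections import Counter
from itertools import chain

def word_count(lines):
    """Count word occurrences by composing a few small transformations."""
    # Split every line into lowercase words and flatten the result
    # (analogous in spirit to Spark's flatMap).
    words = chain.from_iterable(line.lower().split() for line in lines)
    # Group identical words and sum their occurrences
    # (analogous in spirit to mapping to (word, 1) pairs, then reduceByKey).
    return dict(Counter(words))

counts = word_count(["Spark is fast", "spark is small"])
```

The whole job fits in a few lines because each step is a general-purpose transformation that composes with the next. An equivalent Hadoop MapReduce program typically needs separate mapper, reducer, and driver classes.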

We’ll also see that the secret to Spark’s success is its roots in the Scala programming language and the world of Functional Programming, which together provide powerful, composable primitives that make it easier for developers to create a wide variety of high-performance applications.

We’ll demonstrate these points in the context of several example applications.

Dean Wampler


Dean Wampler is a Big Data Specialist for Typesafe. He builds scalable, distributed, “Big Data” applications using the Typesafe Reactive Platform, Spark, Hadoop, and other tools. He is the author of Programming Scala, Second Edition, the co-author of Programming Hive, and the author of Functional Programming for Java Developers, all from O’Reilly. Dean is a contributor to several open-source projects, and he is the organizer of several Big Data and Scala user groups in Chicago. Dean can be found on Twitter @deanwampler.