Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Just enough Scala for Spark

Dean Wampler (Lightbend)
9:00am–12:30pm Tuesday, 09/27/2016
Spark & beyond
Location: 1 E 15/1 E 16
Level: Intermediate
Average rating: *****
(5.00, 4 ratings)

Prerequisite knowledge

  • A basic familiarity with Spark and Java

Materials or downloads needed in advance

  • A laptop
  • Before the tutorial, clone or download the tutorial GitHub repo and follow the setup instructions in the README.md file.

What you'll learn

  • Discover why Scala is an ideal programming language for data engineers using Spark
  • Learn the core features of Scala necessary to write Spark code
  • Pick up tips and tricks for effective Scala

Description

    Apache Spark is written in Scala. Although Spark provides a Java API, many data engineers are adopting Scala since it’s the “native” language for Spark—and because Spark code written in Scala is much more concise than comparable Java code. Most data scientists, however, continue to use Python and R. If you want to learn Scala for Spark, this is the tutorial for you. Dean Wampler offers an overview of the core features of Scala you need to use Spark effectively, using hands-on exercises with the Spark APIs. You’ll learn the most important Scala syntax, idioms, and APIs for Spark development.

    Topics include:

    • Classes, methods, and functions
    • Immutable versus mutable values
    • Type inference
    • Pattern matching
    • Scala collections and the common operations on them (the basis of the RDD API)
    • Other Scala types like case classes, tuples, and options
    • Domain-specific languages in Scala
    • Effective use of the Spark shell (Scala interpreter)
    • Common mistakes (e.g., serialization errors) and how to avoid them
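
    Several of the topics above can be seen together in a few lines of plain Scala. The following is only a minimal sketch, not code from the tutorial materials: the names JustEnoughScalaSketch, Rating, describe, and wordCounts are hypothetical, and it uses ordinary Scala collections rather than Spark so that it runs without a cluster.

```scala
// Hypothetical illustration; not taken from the tutorial materials.
object JustEnoughScalaSketch {

  // A case class: an immutable record for which the compiler generates
  // equality, toString, and pattern-matching support.
  final case class Rating(session: String, stars: Option[Int])

  // Pattern matching on a case class that holds an Option (Some or None).
  def describe(rating: Rating): String = rating match {
    case Rating(session, Some(stars)) => s"$session: $stars stars"
    case Rating(session, None)        => s"$session: not rated"
  }

  // Word count with ordinary Scala collections. Spark's RDD API offers
  // similarly named operations such as flatMap, map, and filter.
  def wordCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.toLowerCase.split("""\W+""")) // lines to words
      .filter(word => word.nonEmpty)                      // drop empty tokens
      .groupBy(word => word)                              // word -> all its occurrences
      .map { case (word, occurrences) => (word, occurrences.size) } // tuple pattern

  def main(args: Array[String]): Unit = {
    // val declares an immutable value; the types are inferred.
    val lines  = Seq("Spark is written in Scala", "Scala is concise")
    val counts = wordCounts(lines)

    // Sort by descending count, then alphabetically, and print each pair.
    counts.toSeq.sortBy { case (word, n) => (-n, word) }.foreach(println)

    println(describe(Rating("Just enough Scala for Spark", Some(5))))
    println(describe(Rating("Another session", None)))
  }
}
```

    The same chain of flatMap, filter, and map calls is roughly what a word count looks like against an RDD, which is one reason the collections API is worth learning before the Spark API.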

    Dean Wampler

    Lightbend

    Dean Wampler is an expert in streaming data systems, focusing on applications of ML/AI. Formerly, he was the vice president of fast data engineering at Lightbend, where he led the development of Lightbend CloudFlow, an integrated system for building and running streaming data applications with Akka Streams, Apache Spark, Apache Flink, and Apache Kafka. Dean is the author of Fast Data Architectures for Streaming Applications, Programming Scala, and Functional Programming for Java Developers, and the coauthor of Programming Hive, all from O’Reilly. He’s a contributor to several open source projects. A frequent Strata speaker, he’s also the co-organizer of several conferences around the world and several user groups in Chicago. He has a Ph.D. in Physics from the University of Washington.


    Comments

    Amir Bar Or
    09/27/2016 6:31am EDT

    Getting org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 (TID 4, localhost): java.net.ConnectException: Connection refused

    The Scala code was working, but not Spark. I'm working on a Mac. Is there anything I missed?

    Dean Wampler
    09/26/2016 9:11am EDT

    Correction. http://localhost:9000 does work, even for Docker, but only if you run the image using the full command in the README, which also tunnels port 9000 to localhost.

    Dean Wampler
    09/26/2016 8:21am EDT

    Thanks for the catch! I’ll update the README in a little while.

    Igor Alekseev
    09/26/2016 6:38am EDT

    Argh, now the port got dropped. http:// “docker-vm-ip” : 9000

    Igor Alekseev
    09/26/2016 6:37am EDT

    Sorry, the angular brackets got removed in the URL. It’ll be something like http://docker-vm-ip

    Igor Alekseev
    09/26/2016 6:35am EDT

    “However you started Spark Notebook, open your browser to localhost:9000. The UI has a “SPARK NOTEBOOK” banner and shows several directories and notebooks for sample applications that come with Spark Notebook.”

    On a Mac (and probably on Windows), if you're running Docker, you’ll need to get the VM's IP first, e.g., with “docker-machine ls”. The notebook will be available at “http://:9000”.

    Dean Wampler
    09/24/2016 5:59am EDT

    I created a Gitter channel for the tutorial, if you find it useful.

    Dean Wampler
    09/21/2016 5:11pm EDT

    The tutorial material is ready! https://github.com/deanwampler/JustEnoughScalaForSpark

    Dean Wampler
    09/20/2016 12:06pm EDT

    Hi, Brian. Thanks for asking. I’m a little behind with the final edits. I’ll push updates, including final instructions in the README, tomorrow (Wednesday). Also, I’ll be using spark-notebook.io for the actual notebook.
    Dean

    09/20/2016 9:43am EDT

    Hi there. The README.md just says “Details are TODO.”

    The notebook itself links to https://render.githubusercontent.com/view/README.md (probably the wrong link) to set up the Jupyter/Toree environment, but that goes nowhere.