July 20–24, 2015
Portland, OR

Apache Spark Tutorial, with deep-dives on SparkR and Data Sources API, plus Telco applications case studies

Paco Nathan (O'Reilly Media), Haichuan Wang (Huawei), Jacky Li (Huawei Technologies), Vimal Das Kammath V (Huawei)
1:30pm–5:00pm Tuesday, 07/21/2015
Average rating: 3.33 (3 ratings)
Slides: 1 PDF

Prerequisite Knowledge

  • some experience coding: Python, SQL, Scala, or R
  • some familiarity with Big Data issues and concepts

Materials or downloads needed in advance

Each tutorial participant must bring a laptop with wifi, a browser, and reasonably current hardware (2+ GB RAM):

  • Mac OS X, Windows, Linux -- all work fine
  • have JDK 7 or JDK 8 installed
  • disable corporate security controls that block network use

In advance of the tutorial, be sure to download and install the Spark binary pre-built for Hadoop 2.x.

Temporary free accounts for Databricks will be provided to all tutorial participants, to run Apache Spark through cloud-based notebooks on Amazon AWS.

Description

Sponsored by:
Huawei

This tutorial provides a hands-on introduction to Apache Spark, with coding exercises for Spark apps in Python, Scala, R, and SQL. We will review the Spark core API, show how to build a pipeline with SQL + DataFrames, and look through the broader Spark ecosystem: Tungsten, Streaming, MLlib, and GraphX.
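
To give a flavor of the coding exercises, here is a minimal Scala sketch of a DataFrame + SQL pipeline (Spark 1.4-era API; the file path and column names are made up for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // In spark-shell, sc and sqlContext are already provided; shown here for completeness.
    val sc = new SparkContext(new SparkConf().setAppName("oscon-sketch"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical JSON input with "city" and "temp" fields.
    val readings = sqlContext.read.json("data/readings.json")
    readings.registerTempTable("readings")

    // Mix SQL and the DataFrame API over the same data.
    val byCity = sqlContext.sql(
      "SELECT city, AVG(temp) AS avg_temp FROM readings GROUP BY city")
    byCity.filter(byCity("avg_temp") > 30).show()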

We’ll follow with a detailed introduction to SparkR: a lightweight frontend that enables users to run R analysis tasks on a Spark cluster. We will present successful efforts to improve SparkR’s functionality and performance, plus a real-world application of SparkR for modeling data centers. This work is a collaboration among UIUC, Purdue University, and Huawei; part of it has been contributed back to Spark and merged into the Spark 1.4 release.

SparkR Performance: SparkR is generally limited by two factors, data transfer between the Spark and R processes and interpretation overhead within the R process. We combined operation vectorization and data permutation to transform the looping-over-data execution into a sequence of vector function invocations, which dramatically reduced the interpretation overhead and improved performance by up to 20x within each single R instance. As a result, the whole SparkR system can run much faster without any modification to the original application code.
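
As a conceptual illustration of that cost model (plain Scala, not SparkR internals): looping over data pays one function dispatch, and in R one interpreter round trip, per element, whereas a vectorized call pays it once per vector:

    val xs = Array.tabulate(1000000)(_.toDouble)

    // Looping-over-data: the function is dispatched once per element.
    val slow = xs.map(x => math.log(x + 1.0))

    // Vectorized: one call over the whole vector; the loop runs inside the function.
    def vectorizedLog(v: Array[Double]): Array[Double] = {
      val out = new Array[Double](v.length)
      var i = 0
      while (i < v.length) { out(i) = math.log(v(i) + 1.0); i += 1 }
      out
    }
    val fast = vectorizedLog(xs)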

Data Center App: efficient scheduling of VMs in a data center can reduce the number of physical servers needed, which in turn reduces energy needs and other capital costs. We used SparkR to model VM workloads as time series, extracting low- and high-frequency features of the workload, then adopting a data-driven approach to achieve efficient, proactive scheduling.
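
As a rough sketch of the kind of feature extraction involved (the utilization numbers below are made up, and the real modeling in the talk is done in SparkR): a sliding-window mean separates a VM's low-frequency trend from its high-frequency bursts:

    // Hypothetical CPU-utilization samples for one VM.
    val workload = Array(0.42, 0.45, 0.61, 0.58, 0.72, 0.69, 0.40, 0.38)

    val window = 4
    // Low-frequency component: sliding-window mean (the long-term trend).
    val lowFreq = workload.indices.map { i =>
      val slice = workload.slice(math.max(0, i - window + 1), i + 1)
      slice.sum / slice.length
    }
    // High-frequency component: what remains after removing the trend (short-term bursts).
    val highFreq = workload.indices.map(i => workload(i) - lowFreq(i))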

In the second half of the tutorial, we will explore use of the Spark Data Sources API, and related Catalyst optimizations — comparing implementations for both HBase and Cube, along with their performance trade-offs. These extensions, developed by Huawei, have been contributed as open source via Spark Packages.
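
For context on what such an extension looks like, here is a minimal, hypothetical Data Sources API relation in Scala (Spark 1.x sources API); it is not the actual HBase or Cube connector, just the extension points they build on:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    // A toy relation with a fixed schema and in-memory rows.
    class DemoRelation(@transient override val sqlContext: SQLContext)
        extends BaseRelation with TableScan {

      override def schema: StructType =
        StructType(StructField("id", IntegerType) :: StructField("name", StringType) :: Nil)

      // TableScan returns every row; a richer connector can implement PrunedFilteredScan
      // so that Catalyst pushes column pruning and filters into the source.
      override def buildScan(): RDD[Row] =
        sqlContext.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, "b")))
    }

    // Spark looks for a class named DefaultSource in the package passed to .format(...).
    class DefaultSource extends RelationProvider {
      override def createRelation(sqlContext: SQLContext,
                                  parameters: Map[String, String]): BaseRelation =
        new DemoRelation(sqlContext)
    }

With a DefaultSource on the classpath, sqlContext.read.format("com.example.demo").load() (the package name here is hypothetical) returns a DataFrame that Catalyst can optimize like any other; the HBase and Cube implementations build their Catalyst-related optimizations on top of these same interfaces.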

We’ll conclude with a review of Spark use cases and solutions in Telco scenarios, looking at how Spark gets applied in production throughout Huawei.

Examples and coding exercises will be run on a mix of Databricks notebooks (browser-based, no download required) and command line REPLs.

This tutorial is sponsored by Huawei.

Paco Nathan

O'Reilly Media

Paco Nathan is a speaker, instructor, and author (Just Enough Math, Intro to Apache Spark, and Enterprise Data Workflows with Cascading). His expertise and interests span machine learning, cluster computing, cloud, functional programming, Spark, Mesos, Cassandra, Kafka, streaming analytics, approximation algorithms, advanced math applications, ag+data, open data, Scala, Python, Clojure, R, and NLP.

Haichuan Wang

Huawei

Haichuan Wang is a research scientist at the Huawei US R&D center. Haichuan holds a PhD in computer science from the University of Illinois at Urbana-Champaign (UIUC), where he worked with Professor David Padua on parallel computing, compilers, and runtime systems. Haichuan was a research staff member at IBM Research, where he conducted research on parallel programming models and performance tooling for the Java language.

Jacky Li

Huawei Technologies

Jacky Li joined Huawei in 2004. Since then, he has been engaged in research and development work on telecommunications protocols, network service systems, and network data analysis and visualization. In recent years, he has been dedicated to seeking opportunities for innovation in network data analysis using open source big data processing and analytics technology, such as Apache Hadoop, Spark, and Tachyon.

Vimal Das Kammath V

Huawei

An avid researcher and senior technical architect in the R&D space at Huawei India, Vimal has been involved in the development, design, and architecture of a telco-grade business intelligence platform for Huawei since he started in 2004, with a focus on high-performance query engines, interactive analytics, and rich visualizations. Apart from generating intellectual property, with six patents filed in the past two years, he has been developing innovative solutions using big data technologies such as Apache Hadoop and Spark for extracting insights from large volumes of telco data.

Comments on this page are now closed.

Comments

Paco Nathan
07/21/2015 4:21am PDT

Hi Srivatsa,

It depends on which languages are intended. Scala and SQL are packaged with the Spark binary that you’ve downloaded.

For Python or R, you’ll need their binaries installed, respectively.

Srivatsa Radhakrishna
07/21/2015 4:17am PDT

Hi, I have downloaded everything under http://10.10.32.101/oscon/nathan/. From a setup perspective, is there anything else I need before attending the tutorial?

Paco Nathan
07/20/2015 1:48pm PDT

Hi Colin,

No pre-registration is required. We’ve got room for up to 80 people tomorrow.

Colin Williams
07/20/2015 11:55am PDT

Is there no pre-reg for this event?