Each tutorial participant must bring a laptop with wifi, a browser, and reasonably current hardware (2+ GB RAM).
In advance of the tutorial, be sure to download and install the binary for Spark pre-built for Hadoop 2.x.
Temporary free accounts for Databricks will be provided to all tutorial participants, to run Apache Spark through cloud-based notebooks on Amazon AWS.
This tutorial provides a hands-on introduction to Apache Spark, with coding exercises for Spark apps in Python, Scala, R, and SQL. We will review the Spark core API and how to build a pipeline with SQL + DataFrames, then look through the broader Spark ecosystem: Tungsten, Streaming, MLlib, and GraphX.
We’ll follow with a detailed introduction to SparkR: a light-weight front-end which enables users to run R analysis tasks on a Spark cluster. In this talk, we will present some successful efforts to improve SparkR’s functionality and performance, plus a real-world application of SparkR to modeling data centers. This is a collaborative project among UIUC, Purdue University, and Huawei. Part of that work has been contributed back to Spark and merged into the Spark 1.4 release.
SparkR Performance: generally limited by two factors, data transmission between the Spark and R processes and interpretation within the R process. We combined operation vectorization and data permutation to transform the looping-over-data execution into a sequence of vector function invocations, which dramatically reduced the interpretation overhead and improved performance by up to 20x in each single R instance. As a result, the whole SparkR system can run much faster without any modifications to the original application code.
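As a rough illustration of the idea only (not the SparkR implementation), the sketch below contrasts looping over records with permuting the data into columns and making one call per column set. The function names and the `a + 2*b` computation are hypothetical, and plain Python lists stand in for R vectors; in R the column-wise function would be a native vector operation, which is where the saving in interpretation overhead would come from.

```python
from typing import List, Sequence, Tuple

Row = Tuple[float, float]


def weighted_sum(a: float, b: float) -> float:
    """Hypothetical scalar function, applied to one record at a time."""
    return a + 2.0 * b


def weighted_sum_vec(col_a: Sequence[float], col_b: Sequence[float]) -> List[float]:
    """Hypothetical vector counterpart: takes whole columns in one call."""
    return [a + 2.0 * b for a, b in zip(col_a, col_b)]


def per_row(rows: Sequence[Row]) -> List[float]:
    # Looping-over-data execution: one function invocation per record.
    return [weighted_sum(a, b) for a, b in rows]


def per_column(rows: Sequence[Row]) -> List[float]:
    if not rows:
        return []
    # Data permutation: regroup row-oriented records into column vectors.
    col_a, col_b = (list(col) for col in zip(*rows))
    # Vectorized execution: a single invocation over the whole columns.
    return weighted_sum_vec(col_a, col_b)


rows = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
print(per_row(rows))
print(per_column(rows))
```

Both paths compute the same values; only the number of function invocations and the data layout differ.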
Data Center App: efficient scheduling of VMs in a data center can reduce the number of physical servers needed, and in turn reduce energy needs and other capital costs. We used SparkR to model VM workload as time series, extracted low and high frequency features of the workload, then adopted a data-driven approach to achieve efficient, pro-active scheduling.
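One simple way to picture splitting a workload series into low and high frequency parts is a trailing moving average plus its residual. This is a minimal sketch under that assumption; the abstract does not say which decomposition was used, and the function names, the window size, and the sample values are all hypothetical.

```python
from typing import List, Sequence, Tuple


def moving_average(series: Sequence[float], window: int) -> List[float]:
    """Trailing moving average; early points average over what is available."""
    if window < 1:
        raise ValueError("window must be at least 1")
    out: List[float] = []
    running = 0.0
    for i, x in enumerate(series):
        running += x
        if i >= window:
            # Drop the sample that just left the window.
            running -= series[i - window]
        out.append(running / min(i + 1, window))
    return out


def split_frequencies(
    series: Sequence[float], window: int
) -> Tuple[List[float], List[float]]:
    """Return (low, high): the smoothed trend and the residual around it."""
    low = moving_average(series, window)
    high = [x - l for x, l in zip(series, low)]
    return low, high


# Hypothetical CPU-load samples for a single VM.
load = [1.0, 3.0, 5.0, 7.0]
low, high = split_frequencies(load, window=2)
print(low)
print(high)
```

The low-frequency part tracks the slow trend in demand, while the high-frequency part captures short bursts; a scheduler could weigh the two differently when placing VMs.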
In the second half of the tutorial, we will explore use of the Spark Data Sources API, and related Catalyst optimizations — comparing implementations for both HBase and Cube, along with their performance trade-offs. These extensions, developed by Huawei, have been contributed as open source via Spark Packages.
We’ll conclude with a review of Spark use cases and solutions in Telco scenarios, looking at how Spark gets applied in production throughout Huawei.
Examples and coding exercises will be run on a mix of Databricks notebooks (browser-based, no download required) and command line REPLs.
This tutorial is sponsored by Huawei.
Paco Nathan is a speaker/instructor/author (Just Enough Math, Intro to Apache Spark, and Enterprise Data Workflows with Cascading). Paco’s expertise/interests: machine learning, cluster computing, cloud, functional programming, Spark, Mesos, Cassandra, Kafka, streaming analytics, approximation algorithms, advanced math applications, ag+data, open data, Scala, Python, Clojure, R, NLP.
Haichuan Wang is a research scientist at the Huawei US R&D center. Haichuan holds a PhD in computer science from the University of Illinois at Urbana-Champaign (UIUC), where he worked with Professor David Padua on parallel computing, compilers, and runtime. Haichuan was a research staff member at IBM Research, where he conducted research on parallel programming models and performance tooling for the Java language.
Jacky Li joined Huawei in 2004 and has been engaged in research and development work on telecommunications protocols, network service systems, network data analysis, and visualization. In recent years, he has been dedicated to seeking opportunities for innovation in network data analysis using open source big data processing and analytics technology, like Apache Hadoop, Spark, and Tachyon.
An avid researcher and senior technical architect in the R&D space at Huawei India, Vimal has been involved in the development, design, and architecture of a telco-grade business intelligence platform for Huawei since his start in 2004, with a focus on a high performance query engine, interactive analytics, and rich visualizations. Apart from generating intellectual property, with 6 patents filed in the past two years, he has been developing innovative solutions using big data technologies like Apache Hadoop and Spark for extracting insights from large volumes of telco data.