Participants should plan to attend both days of this 2-day training. Training passes do not include access to tutorials on Tuesday.
The real power and value proposition of Apache Spark lies in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Andy Huang uses hands-on exercises with various Wikipedia datasets to illustrate the variety of programming paradigms Spark makes possible. By the end of the training, you’ll be able to create proofs of concept and prototype applications using Spark.
The course will consist of about 50% lecture and 50% hands-on labs. All participants will have access to Databricks Community Edition after class to continue working on labs and assignments.
Note that most of the hands-on labs will be taught in Scala. (PySpark architecture and code examples will be covered briefly.)
This training is intended for people with less than two months of hands-on experience with Spark.
Introduction to Wikipedia and Spark
Demo: Logging into Databricks and a tour of the user interface
DataFrames and Spark SQL
Datasets used: Pageviews and Clickstream
Use a SQLContext to create a DataFrame from different data sources (S3, JSON, RDBMS, HDFS, Cassandra, etc.)
DataFrames and Spark SQL (cont.)
Spark core architecture
Resilient distributed datasets
Datasets used: Pagecounts and English Wikipedia
Review of Day 1
Shared variables (accumulators and broadcast variables)
Datasets used: Clickstream
Datasets used: Live edits stream of multiple languages
Spark Streaming (cont.)
Spark machine learning
Datasets used: English Wikipedia with edits
Spark machine learning (cont.)
Spark R&D (optional)
Andy Huang is a managing consultant in the big data analytics practice at Servian, a leading consulting company in Australia and New Zealand, where he works with clients in telco, banking, and financial services on big data analytics projects. Andy’s project portfolio includes use of Spark for data integration, streaming, and large-scale machine learning. He also leads solution architecture and implementation and evangelizes Apache Spark in the region.
©2016, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with, and does not endorse or review, the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.