The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Stephane Rion employs hands-on exercises using various Wikipedia datasets to illustrate the variety of programming paradigms Spark makes possible. By the end of the training, attendees will be able to create proofs of concept and prototype applications using Spark.
The course will consist of about 50% lecture and 50% hands-on labs. All attendees will have access to Databricks for one month after class to continue working on labs and assignments.
Note that most of the hands-on labs in class will be taught in Scala. (PySpark architecture and code examples will be covered briefly.)
People with less than two months of hands-on experience with Spark
Datasets explored in class:
30 mins: Introduction to Wikipedia and Spark
Demo: Logging into Databricks and a tour of the user interface
2 hours: DataFrames and Spark SQL
Datasets used: Pageviews and Clickstream
1.5 hours: Spark core architecture
1.5 hours: Resilient distributed datasets
Datasets used: Pagecounts and English Wikipedia
30 mins: Review of Day 1
1 hour: Shared variables (accumulators and broadcast variables)
1 hour: GraphX
Datasets used: Clickstream
1.5 hours: Spark Streaming
Datasets used: Live edits stream of multiple languages
1.5 hours: Spark machine learning
Datasets used: English Wikipedia with edits
30 mins (optional): Spark R&D
Stephane Rion is a senior data scientist at Big Data Partnership, where he helps clients get insight into their data by developing scalable analytical solutions in industries such as finance, gaming, and social services. Stephane has a strong background in machine learning and statistics with over 6 years’ experience in data science and 10 years’ experience in mathematical modeling. He has solid hands-on skills in machine learning at scale with distributed systems like Apache Spark, which he has used to develop production-grade applications. In addition to Scala with Spark, Stephane is fluent in R and Python, which he uses daily to explore data, run statistical analysis, and build statistical models. He was the first Databricks-certified Spark instructor in EMEA. Stephane enjoys splitting his time between working on data science projects and teaching Spark classes, which he feels is the best way to remain at the forefront of the technology and capture how people are attempting to use Spark within their businesses.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.