The real power and value proposition of Apache Spark lies in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualization. Through hands-on examples, Zoltan Toth explores various Wikipedia datasets to illustrate a variety of programming paradigms.
The class will consist of about 60% lecture and 40% hands-on labs and demos. Note that the hands-on labs in class will be taught in Scala. All students will have access to Databricks for one month after class to continue working on labs and assignments.
9:00am – 9:30am
Introduction to Wikipedia and Spark
9:30am – 10:30am
DataFrames and Spark SQL
Datasets used: Pageviews and Clickstream
10:30am – 11:00am
11:00am – 12:00pm
Spark core architecture
12:00pm – 1:00pm
1:00pm – 2:00pm
Resilient distributed datasets
Dataset used: Pagecounts
2:00pm – 2:30pm
Dataset used: Clickstream
2:30pm – 3:00pm
Datasets used: Live edits stream from multiple languages
3:00pm – 3:30pm
3:30pm – 3:45pm
Guest talk: Choosing an optimal storage backend for your Spark use case
3:45pm – 4:45pm
Datasets used: English Wikipedia and Live edits (optional)
Zoltan Toth is a freelance data engineer and trainer with over 15 years of experience developing data-intensive applications. He spends most of his time helping companies kick off and mature their data analytics infrastructure and regularly delivers Hadoop, big data, and Spark training. Zoltan built Prezi’s big data infrastructure and later led Prezi’s data engineering team, scaling it to serve 60 million users with a data volume of over a petabyte. He also worked on big data and Spark-integration projects with RapidMiner, a global leader in predictive analytics. Besides working with data analytics architectures, Zoltan teaches at Central European University, one of the best independent universities in Europe.
©2016, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.