The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Through hands-on examples, Sameer Farooqui explores various Wikipedia datasets to illustrate a variety of ideal programming paradigms.
The class will consist of about 60% lecture and 40% hands-on labs + demos. Note that the hands-on labs in class will be taught in Scala. All students will have access to Databricks for one month after class to continue working on labs and assignments.
Introduction to Wikipedia and Spark
DataFrames and Spark SQL
Datasets used: Pageviews and Clickstream
Spark core architecture
Resilient distributed datasets
Dataset used: Pagecounts
Datasets used: Clickstream
Datasets: Live edits stream from multiple languages
Guest talk: Choosing an optimal storage backend for your Spark use case—Vida Ha
Sameer Farooqui is a client services engineer at Databricks, where he works with customers on Apache Spark deployments. Sameer works with the Hadoop ecosystem, Cassandra, Couchbase, and general NoSQL domain. Prior to Databricks, he worked as a freelance big data consultant and trainer globally and taught big data courses. Before that, Sameer was a systems architect at Hortonworks, an emerging data platforms consultant at Accenture R&D, and an enterprise consultant for Symantec/Veritas (specializing in VCS, VVR, and SF-HA).
Comments on this page are now closed.
©2016, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.