Brian Bloechle demonstrates how to implement typical data science workflows using Apache Spark. You’ll learn how to wrangle and explore data using Spark SQL DataFrames and how to build, evaluate, and tune machine learning models using Spark MLlib.
Outline
Demonstrations and exercises will be conducted in Python using Cloudera Data Science Workbench.
Brian Bloechle is an industrial mathematician and data scientist as well as a technical instructor at Cloudera.
Glynn Durham is a senior instructor at Cloudera. Previously, he worked for Oracle, Forté Software, MySQL, and Cloudera, spending five or more years at each.
Get the Platinum pass or the Training pass to add this course to your package.
Comments
Kapil, all you need is an up-to-date web browser. I have no personal experience with Chromebooks, but I suspect that Chrome on a Chromebook will be fine. I use it regularly on Mac and Windows without issue.
Will a Chromebook work for this training?
Kate, I do not have any personal experience with your setup, but it should be fine if you are comfortable with the relatively small screen size. You do not need a local installation of Python.
Do you think a Surface Pro 4 would work for this training? I do have the Google Chrome desktop program (not the app version) installed. Also, do we need a local installation of Python, or will all coding be done through a Chrome/Firefox session?
This course is focused on learning the Spark API via PySpark. We will not be covering the sparklyr R API.
Is the entire training focused on using Python, or will it be possible to complete it using R tools such as sparklyr?
(1) This course focuses on the newer DataFrame-based Spark MLlib API (the pyspark.ml package) and uses the BinaryClassificationEvaluator from pyspark.ml.evaluation.
(2) We will use the grid search functionality that is provided by the Spark MLlib API, which is inspired by similar functionality in scikit-learn.
Two questions:
1. Does this training use built-in functions, for example BinaryClassificationMetrics from pyspark.mllib.evaluation, to compute AUC?
2. Does this training use the readily available grid search methods from scikit-learn (via an integration package for Spark) to tune hyperparameters?
Thanks!