Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

In-Person Training
Data science and machine learning with Apache Spark

Brian Bloechle (Cloudera)
Monday, March 5 & Tuesday, March 6, 9:00am - 5:00pm
Location: 212 C

Participants should plan to attend both days of this 2-day training course. Platinum and Training passes do not include access to tutorials on Tuesday.

Brian Bloechle demonstrates how to implement typical data science workflows using Apache Spark. You'll learn how to wrangle and explore data using Spark SQL DataFrames and how to build, evaluate, and tune machine learning models using Spark MLlib.

What you'll learn, and how you can apply it

  • Learn how to use Spark SQL DataFrames to load, explore, transform, join, and analyze data and Spark MLlib to build, evaluate, and tune machine learning models

This training is for you because...

  • You're a data scientist who wants to learn how to use Spark to scale your process up to large distributed datasets.
  • You're a data engineer, data analyst, or developer who wants to learn how to implement typical data science and machine learning workflows in Spark.

Prerequisites:

  • A working knowledge of Python
  • A basic understanding of data analysis, statistical modeling, and machine learning

Hardware and/or installation requirements:

  • A laptop with a modern version of Chrome or Firefox installed

Outline

  • Introduction to Spark SQL DataFrames
  • Reading and writing DataFrames
  • Transforming and joining DataFrames
  • Grouping and exploring DataFrames
  • Introduction to Spark MLlib
  • Extracting and transforming features
  • Building and evaluating regression, classification, and clustering models
  • Tuning hyperparameters and validating models
  • Working with machine learning pipelines

Demonstrations and exercises will be conducted in Python using Cloudera Data Science Workbench.

About your instructor

Brian Bloechle is an industrial mathematician and data scientist as well as a technical instructor at Cloudera.

Conference registration

Get the Platinum pass or the Training pass to add this course to your package.


Comments

Brian Bloechle | TECHNICAL INSTRUCTOR
02/08/2018 8:55pm PST

Kate, I do not have any personal experience with your setup, but it should be fine if you are comfortable with the relatively small screen size. You do not need a local installation of Python.

Kate Ardinger | ACTUARIAL TECHNICIAN
02/07/2018 9:44pm PST

Do you think a Surface Pro 4 would work for this training? I do have the Google Chrome desktop program (not the app version) installed. Also, do we need a local installation of python, or will all coding be done through a Chrome/Firefox session?

Brian Bloechle | TECHNICAL INSTRUCTOR
02/04/2018 10:26pm PST

This course is focused on learning the Spark API via PySpark. We will not be covering the sparklyr R API.

carmelo iaria | FOUNDER & CEO
02/03/2018 8:50am PST

Is the entire training focused on using Python, or will it be possible to complete it using R tools such as sparklyr?

Brian Bloechle | TECHNICAL INSTRUCTOR
12/12/2017 8:17pm PST

(1) This course focuses on the new Spark MLlib API and uses the BinaryClassificationEvaluator from pyspark.ml.evaluation.

(2) We will use the grid search functionality that is provided by the Spark MLlib API, which is inspired by similar functionality in scikit-learn.

Sacchin Lahoti | DATA SCIENCE LEAD
12/12/2017 2:57am PST

Two questions:

1. Does this training use built-in functions, for example “from pyspark.mllib.evaluation import BinaryClassificationMetrics”, to compute AUC?
2. Does this training use the readily available grid search methods from scikit-learn (via the integration package for Spark) for tuning hyperparameters?

Thanks!