Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

In-Person Training
Data science and machine learning with Apache Spark

Brian Bloechle (Cloudera), Glynn Durham (Cloudera)
Monday, March 5 & Tuesday, March 6, 9:00am - 5:00pm
Average rating: 5.00 (1 rating)

Participants should plan to attend both days of this 2-day training course. Platinum and Training passes do not include access to tutorials on Tuesday.

Brian Bloechle demonstrates how to implement typical data science workflows using Apache Spark. You'll learn how to wrangle and explore data using Spark SQL DataFrames and how to build, evaluate, and tune machine learning models using Spark MLlib.

What you'll learn, and how you can apply it

  • Learn how to use Spark SQL DataFrames to load, explore, transform, join, and analyze data, and how to use Spark MLlib to build, evaluate, and tune machine learning models

This training is for you because...

  • You're a data scientist who wants to learn how to use Spark to scale your process up to large distributed datasets.
  • You're a data engineer, data analyst, or developer who wants to learn how to implement typical data science and machine learning workflows in Spark.

Prerequisites:

  • A working knowledge of Python
  • A basic understanding of data analysis, statistical modeling, and machine learning

Hardware and/or installation requirements:

  • A laptop with a modern version of Chrome or Firefox installed

Outline

  • Introduction to Spark SQL DataFrames
  • Reading and writing DataFrames
  • Transforming and joining DataFrames
  • Grouping and exploring DataFrames
  • Introduction to Spark MLlib
  • Extracting and transforming features
  • Building and evaluating regression, classification, and clustering models
  • Tuning hyperparameters and validating models
  • Working with machine learning pipelines

Demonstrations and exercises will be conducted in Python using Cloudera Data Science Workbench.
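
For reference, the outline roughly corresponds to a short PySpark workflow like the sketch below. The sketch is illustrative only and is not taken from the course materials; the file path and column names (flights.csv, carrier, arr_delay, dep_delay, distance) are hypothetical.

    # Illustrative sketch only (not course material): a typical DataFrame and
    # MLlib workflow. The dataset, path, and column names are hypothetical.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.evaluation import RegressionEvaluator

    spark = SparkSession.builder.appName("workflow-sketch").getOrCreate()

    # Read, transform, and explore a DataFrame (Spark SQL).
    flights = spark.read.csv("flights.csv", header=True, inferSchema=True)
    flights.groupBy("carrier").agg(F.avg("arr_delay").alias("avg_delay")).show()

    # Assemble features and fit a simple regression model (Spark MLlib).
    assembler = VectorAssembler(inputCols=["dep_delay", "distance"], outputCol="features")
    train, test = assembler.transform(flights).randomSplit([0.8, 0.2], seed=42)
    model = LinearRegression(featuresCol="features", labelCol="arr_delay").fit(train)

    # Evaluate on the held-out split.
    evaluator = RegressionEvaluator(labelCol="arr_delay", metricName="rmse")
    print(evaluator.evaluate(model.transform(test)))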

About your instructors

Brian Bloechle is an industrial mathematician and data scientist as well as a technical instructor at Cloudera.

Glynn Durham is a senior instructor at Cloudera. He has worked for Oracle, Forté Software, MySQL, and Cloudera, spending five or more years at each.

Conference registration

Get the Platinum pass or the Training pass to add this course to your package.

Comments

Brian Bloechle | TECHNICAL INSTRUCTOR
02/20/2018 10:21am PST

Kapil, all you need is an up-to-date web browser. I do not have any personal experience with the Chromebook, but I suspect that Chrome on the Chromebook will be fine. I use it regularly on Mac and Windows without issue.

Kapil Dahiya | SOFTWARE ARCHITECT
02/20/2018 3:50am PST

Will a Chromebook work for this training?

Brian Bloechle | TECHNICAL INSTRUCTOR
02/08/2018 8:55pm PST

Kate, I do not have any personal experience with your setup, but it should be fine if you are comfortable with the relatively small screen size. You do not need a local installation of Python.

Kate Ardinger | ACTUARIAL TECHNICIAN
02/07/2018 9:44pm PST

Do you think a Surface Pro 4 would work for this training? I do have the Google Chrome desktop program (not the app version) installed. Also, do we need a local installation of Python, or will all coding be done through a Chrome/Firefox session?

Brian Bloechle | TECHNICAL INSTRUCTOR
02/04/2018 10:26pm PST

This course is focused on learning the Spark API via PySpark. We will not be covering the sparklyr R API.

Carmelo Iaria | FOUNDER & CEO
02/03/2018 8:50am PST

Is the entire training focused on using Python, or will it be possible to complete it using R tools such as sparklyr?

Brian Bloechle | TECHNICAL INSTRUCTOR
12/12/2017 8:17pm PST

(1) This course focuses on the new Spark MLlib API and uses the BinaryClassificationEvaluator from pyspark.ml.evaluation.

(2) We will use the grid search functionality that is provided by the Spark MLlib API, which is inspired by similar functionality in scikit-learn.
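
For readers unfamiliar with these APIs, a minimal sketch follows. It is illustrative only and not drawn from the course materials; a training DataFrame named train, with label and features columns, is assumed to exist.

    # Illustrative sketch only (not course material): the MLlib evaluator and
    # grid search APIs mentioned above. A DataFrame `train` with "label" and
    # "features" columns is assumed to exist.
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

    lr = LogisticRegression(featuresCol="features", labelCol="label")
    evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")

    # Grid search over regularization strength with 3-fold cross-validation.
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()
    cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=3)
    best_model = cv.fit(train).bestModel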

Sacchin Lahoti | DATA SCIENCE LEAD
12/12/2017 2:57am PST

Two questions:

1. Does this training use built-in functions, for example “from pyspark.mllib.evaluation import BinaryClassificationMetrics”, to compute AUC?
2. Does this training use readily available grid search methods from scikit-learn (using an integration package for Spark) for tuning hyperparameters?

Thanks!