Presented By O’Reilly and Cloudera

Make Data Work

21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

In-Person Training
Data science and machine learning with Apache Spark (SOLD OUT)

Behzad Bordbar (Cloudera)

Monday, 21 May & Tuesday, 22 May, 9:00–17:00

Location: Capital Suite 1

Participants should plan to attend both days of this 2-day training course. Platinum and Training passes do not include access to tutorials on Tuesday.

Behzad Bordbar demonstrates how to implement typical data science workflows using Apache Spark. You'll learn how to wrangle and explore data using Spark SQL DataFrames and how to build, evaluate, and tune machine learning models using Spark MLlib.

What you'll learn, and how you can apply it

Learn how to use Spark SQL DataFrames to load, explore, transform, join, and analyze data, and how to use Spark MLlib to build, evaluate, and tune machine learning models.

This training is for you because...

You're a data scientist who wants to learn how to use Spark to scale your process up to large, distributed datasets.
You're a data engineer, data analyst, or developer who wants to learn how to implement typical data science and machine learning workflows in Spark.

Prerequisites:

A working knowledge of Python
A basic understanding of data analysis, statistical modeling, and machine learning

Hardware and/or installation requirements:

A laptop with a modern version of Chrome or Firefox installed

Demonstrations and exercises will be conducted in Python using Cloudera Data Science Workbench.

Outline

Introduction to Spark SQL DataFrames
Reading and writing DataFrames
Transforming and joining DataFrames
Grouping and exploring DataFrames
Introduction to Spark MLlib
Extracting and transforming features
Building and evaluating regression, classification, and clustering models
Tuning hyperparameters and validating models
Working with machine learning pipelines

About your instructor

Behzad Bordbar is a mathematician, software engineer, and big data technical instructor at Cloudera, where he teaches courses on Hadoop, Hive, Impala, and Spark. Behzad has worked in academia for over 12 years and has been a visiting scientist at HP, BT, and IBM.

Conference registration

Get the Platinum pass or the Training pass to add this course to your package.

Contact Us

View a complete list of Strata Data Conference contacts

©2018, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com
