Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Data science and machine learning with Apache Spark (Day 2)

Brian Bloechle (Cloudera), Glynn Durham (Cloudera)
Location: 212 C

Brian Bloechle demonstrates how to implement typical data science workflows using Apache Spark. You’ll learn how to wrangle and explore data using Spark SQL DataFrames and how to build, evaluate, and tune machine learning models using Spark MLlib.

Outline

  • Introduction to Spark SQL DataFrames
  • Reading and writing DataFrames
  • Transforming and joining DataFrames
  • Grouping and exploring DataFrames
  • Introduction to Spark MLlib
  • Extracting and transforming features
  • Building and evaluating regression, classification, and clustering models
  • Tuning hyperparameters and validating models
  • Working with machine learning pipelines

Demonstrations and exercises will be conducted in Python using Cloudera Data Science Workbench.

Brian Bloechle

Cloudera

Brian Bloechle is an industrial mathematician and data scientist as well as a technical instructor at Cloudera.

Glynn Durham

Cloudera

Glynn Durham is a senior instructor at Cloudera. Over his career, he has worked for Oracle, Forté Software, MySQL, and Cloudera, spending five or more years at each.