Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

In-Person Training
Apache Spark for machine learning and data science

Monday, September 25 & Tuesday, September 26, 9:00am - 5:00pm
Machine Learning, Spark & beyond
Location: 1A 15/16/17
Secondary topics: Streaming
Best Price ends June 29

This course will sell out—sign up today!

Participants should plan to attend both days of this 2-day training course. Platinum and Training passes do not include access to tutorials on Tuesday.

The Data Science with Apache Spark workshop will show how to use Apache Spark to perform exploratory data analysis (EDA), develop machine learning pipelines, and use the APIs and algorithms available in the Spark MLlib DataFrames API. It is designed for software developers, data analysts, data engineers, and data scientists.

What you'll learn, and how you can apply it

A deeper understanding of how to perform machine learning on Spark, including a close look at most of the algorithms supported by the Spark MLlib APIs.

This training is for you because...

You're a software developer, data analyst, data engineer, or data scientist.

Prerequisites:

Some experience coding in Python or Scala, a basic understanding of data science topics and terminology, and some experience using Spark are required. Familiarity with the concept of a DataFrame is helpful. Each data science technique will get a brief conceptual review before it is used. Labs and demos will be available in both Python and Scala.

Hardware and/or installation requirements:

A laptop with an up-to-date version of Chrome or Firefox (Internet Explorer not supported)

The workshop will also cover parallelizing machine learning algorithms at a conceptual level. It will take a pragmatic approach, focusing on using Apache Spark for data analysis and building models with MLlib while limiting the time spent on machine learning theory and the internal workings of Spark, although we will look at Spark’s source code a couple of times.
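
As a rough illustration of what parallelizing a machine learning algorithm can mean at a conceptual level, the pure-Python sketch below splits a toy dataset into partitions, computes a partial gradient on each partition independently, and sums the partial results before every update. This is the data-parallel pattern in miniature. It does not use Spark, and the function names (partial_gradient, combine, fit), the toy data, and the learning rate are all assumptions made for illustration rather than course material.

```python
from functools import reduce

def partial_gradient(partition, w, b):
    """Sum the squared-error gradients for y ~ w*x + b over one partition."""
    gw = gb = 0.0
    for x, y in partition:
        err = (w * x + b) - y
        gw += err * x
        gb += err
    return gw, gb, len(partition)

def combine(left, right):
    """Merge two partial results; sums are associative, so order does not matter."""
    return left[0] + right[0], left[1] + right[1], left[2] + right[2]

def fit(partitions, steps=2000, lr=0.05):
    """Gradient descent where each step is a map over partitions plus a reduce."""
    w = b = 0.0
    for _ in range(steps):
        # In a cluster, each partial_gradient call could run on a different worker.
        partials = (partial_gradient(p, w, b) for p in partitions)
        gw, gb, n = reduce(combine, partials)
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# Toy data generated from y = 2x + 1, split into three "partitions".
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]]
partitions = [data[0:2], data[2:4], data[4:6]]

w, b = fit(partitions)
print(round(w, 3), round(b, 3))
```

Because only the small (gw, gb, n) tuples are combined, the full dataset never has to sit on one machine; only the model parameters are shared between steps.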

We’ll work through examples using public datasets that show how Apache Spark can help you iterate faster and develop models on massive datasets. This workshop will provide you with the tools to be productive using Spark on practical data analysis tasks and machine learning problems. You’ll learn how to use familiar Python libraries with Spark’s distributed and scalable engine. After completing this workshop, you should be comfortable using DataFrames, the DataFrames MLlib API, and related documentation. These building blocks will enable you to use Apache Spark to solve a variety of data analysis and machine learning tasks.

Topics covered include:

  • Extract, Transform, Load (ETL) and Exploratory Data Analysis (EDA)
  • DataFrames
  • Feature Extraction and Transformation using MLlib
  • MLlib Pipelines: Transformers and Estimators
  • Cross validation
  • Model Parallel vs Data Parallel
  • Reusing existing code with Spark (examples in Python)
  • Tokenizer, Bucketizer, OneHotEncoder, Normalizer, HashingTF, IDF, StandardScaler, VectorAssembler, StringIndexer, PolynomialExpansion
  • Clustering, Classification, and Regression
  • K-means, Logistic Regression, Decision Trees, and Random Forests
  • Evaluation Metrics
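
To make the Transformer and Estimator distinction from the topic list concrete, here is a minimal pure-Python sketch of the pattern. It is not Spark's API: the classes (Scaler, ScalerModel, Clip, Pipeline, PipelineModel) are hypothetical stand-ins that work on plain lists instead of DataFrames. The idea it tries to show is that an Estimator learns something from data in fit() and returns a Transformer, while a Transformer only maps data to data.

```python
from statistics import mean, pstdev

class ScalerModel:
    """Transformer: applies statistics learned earlier; it learns nothing itself."""
    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma

    def transform(self, rows):
        return [(x - self.mu) / self.sigma for x in rows]

class Scaler:
    """Estimator: fit() inspects the data and returns a fitted Transformer."""
    def fit(self, rows):
        # Fall back to 1.0 so constant columns do not divide by zero.
        return ScalerModel(mean(rows), pstdev(rows) or 1.0)

class Clip:
    """Stateless Transformer: needs no fitting."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def transform(self, rows):
        return [min(max(x, self.lo), self.hi) for x in rows]

class PipelineModel:
    """Fitted pipeline: every stage is a Transformer, applied in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

class Pipeline:
    """Estimator made of stages; fitting it fits each Estimator stage in turn."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data as transformed so far.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)

train = [2.0, 4.0, 6.0]
model = Pipeline([Scaler(), Clip(-1.0, 1.0)]).fit(train)
print(model.transform([4.0, 100.0]))  # prints [0.0, 1.0]
```

The fitted model reuses the mean and standard deviation learned from the training rows when it sees new data, which is the property that makes this split useful for cross validation: statistics are learned on the training fold only and then applied unchanged to the held-out fold.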

Conference registration

Get the Platinum pass or the Training pass to add this course to your package. Best Price ends June 29.