Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

In-Person Training
Apache Spark for machine learning and data science

Joseph Kambourakis (Databricks)
Monday, September 25 & Tuesday, September 26, 9:00am - 5:00pm
Machine Learning, Spark & beyond
Location: 1A 15/16/17
Secondary topics: Streaming

Participants should plan to attend both days of this 2-day training course. Platinum and Training passes do not include access to tutorials on Tuesday.

In this training course, you'll learn how to use Apache Spark to perform exploratory data analysis (EDA), develop machine learning pipelines, and use the APIs and algorithms available in the Spark MLlib DataFrames API.

What you'll learn, and how you can apply it

  • Learn how to perform machine learning on Spark
  • Explore the algorithms supported by the Spark MLlib APIs

This training is for you because...

  • You're a software developer, data analyst, data engineer, or data scientist who wants to use Apache Spark for machine learning and data science.

Prerequisites:

  • Experience coding in Python or Scala and using Spark
  • A basic understanding of data science topics and terminology
  • Familiarity with DataFrames (useful but not required)

Hardware and/or installation requirements:

  • A laptop with an up-to-date version of Chrome or Firefox (Internet Explorer not supported)

This course also covers parallelizing machine learning algorithms at a conceptual level.
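To make the idea of parallelizing a machine learning algorithm concrete, here is a minimal plain-Python sketch of the data-parallel pattern. It is not Spark code and is not taken from the course materials: the function names, the toy dataset, and the learning rate are all assumptions chosen for illustration. Each partition computes a partial gradient independently, and the partial results are then combined, which is roughly the shape a distributed engine would give the same computation.

```python
from functools import reduce

# Hypothetical helper: gradient contribution of ONE partition for the
# model y = w * x + b under squared error. In a distributed engine,
# each worker would run this on its own slice of the data.
def partial_gradient(partition, w, b):
    grad_w = 0.0
    grad_b = 0.0
    for x, y in partition:
        err = (w * x + b) - y
        grad_w += err * x
        grad_b += err
    return grad_w, grad_b, len(partition)

# Combine two partial results; addition is associative, so the order
# in which partitions are merged does not change the math.
def combine(a, b):
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]

# One gradient-descent step: compute per-partition gradients, merge
# them, then apply the averaged update once.
def data_parallel_step(partitions, w, b, lr):
    grad_w, grad_b, n = reduce(
        combine, (partial_gradient(p, w, b) for p in partitions)
    )
    return w - lr * grad_w / n, b - lr * grad_b / n

# Toy data generated from y = 2x + 1, split into three partitions.
data = [(float(x), 2.0 * x + 1.0) for x in range(8)]
partitions = [data[0:3], data[3:6], data[6:8]]

w, b = 0.0, 0.0
for _ in range(2000):
    w, b = data_parallel_step(partitions, w, b, lr=0.05)

print(round(w, 4), round(b, 4))
```

In this sketch every partition sees the same model parameters and a different slice of the data. A model-parallel design would instead split the parameters themselves across workers.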

The workshop takes a pragmatic approach, with a focus on using Apache Spark for data analysis and building models using MLlib, while limiting the time spent on machine learning theory and the internal workings of Spark. You’ll work through examples using public datasets to learn how to apply Apache Spark to help you iterate faster and develop models on massive datasets and how to use familiar Python libraries with Spark’s distributed and scalable engine. You’ll leave with the tools and knowledge you need to get started using Spark for practical data analysis tasks and machine learning problems, as well as a firm understanding of DataFrames, the DataFrames MLlib API, and related documentation.

Topics covered include:

  • Extract, transform, load (ETL) and exploratory data analysis (EDA)
  • DataFrames
  • Feature extraction and transformation using Spark MLlib
  • Spark ML Pipelines: Transformers and estimators
  • Cross-validation
  • Model parallel versus data parallel
  • Reusing existing code with Spark
  • Tokenizer, Bucketizer, OneHotEncoder, Normalizer, HashingTF, IDF, StandardScaler, VectorAssembler, StringIndexer, PolynomialExpansion
  • Clustering, classification, and regression
  • K-means, logistic regression, decision trees, and random forests
  • Evaluation metrics
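As a rough illustration of the transformer and estimator idea listed above, the following plain-Python sketch mimics the contract without using Spark: a transformer only has transform(), while an estimator has fit(), which learns something from the data and returns a fitted transformer. Every class name here (Lowercase, LabelIndexer, and so on) is hypothetical and simplified. These are not Spark MLlib classes, and the real API operates on DataFrames rather than lists of dictionaries.

```python
class Lowercase:
    """Transformer: stateless, so it only needs transform()."""

    def transform(self, rows):
        return [{**r, "text": r["text"].lower()} for r in rows]


class LabelIndexer:
    """Estimator: fit() learns a label-to-index mapping from the data."""

    def fit(self, rows):
        # Simplification: indices follow alphabetical order of labels.
        labels = sorted({r["label"] for r in rows})
        return LabelIndexerModel({lab: i for i, lab in enumerate(labels)})


class LabelIndexerModel:
    """Fitted transformer produced by LabelIndexer.fit()."""

    def __init__(self, mapping):
        self.mapping = mapping

    def transform(self, rows):
        return [{**r, "label_index": self.mapping[r["label"]]} for r in rows]


class Pipeline:
    """Chains stages; fitting it yields a pipeline of fitted transformers."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimator becomes a transformer
            rows = stage.transform(rows)  # feed output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


class PipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [
    {"text": "Spark ML", "label": "tech"},
    {"text": "Arsenal WIN", "label": "sport"},
]
model = Pipeline([Lowercase(), LabelIndexer()]).fit(train)
scored = model.transform([{"text": "New MATCH", "label": "sport"}])
print(scored)
```

The point of the split is that everything learned during fit() is frozen inside the fitted model, so the same steps can be replayed unchanged on new data.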

About your instructor

Joseph has over ten years of experience teaching and over five years of experience in data science and analytics. He has taught in over a dozen countries around the world and has been featured on Japanese television and in Saudi newspapers. He holds a BS in Electrical and Computer Engineering from Worcester Polytechnic Institute and an MBA with a focus in analytics from Bentley University. Before joining Databricks, Joseph was an instructor with Cloudera and a technical sales engineer with IBM. He is a rabid Arsenal FC supporter and competitive Magic: The Gathering player. He lives with his wife and daughter in Needham, MA.

Twitter for mouthorjoe

Conference registration

Get the Platinum pass or the Training pass to add this course to your package.
