Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

In-Person Training
Apache Spark for machine learning and data science

Joseph Kambourakis (Databricks)
Monday, September 25 & Tuesday, September 26, 9:00am - 5:00pm
Secondary topics: Streaming

Participants should plan to attend both days of this 2-day training course. Platinum and Training passes do not include access to tutorials on Tuesday.

Joseph Kambourakis walks you through using Apache Spark to perform exploratory data analysis (EDA), developing machine learning pipelines, and using the APIs and algorithms available in the Spark MLlib DataFrames API.

What you'll learn, and how you can apply it

  • Learn how to perform machine learning on Spark
  • Explore the algorithms supported by the Spark MLlib APIs

This training is for you because...

  • You're a software developer, data analyst, data engineer, or data scientist who wants to use Apache Spark for machine learning and data science.


Prerequisites:

  • Experience coding in Python or Scala and using Spark
  • A basic understanding of data science topics and terminology
  • Familiarity with DataFrames (useful but not required)

Hardware and/or installation requirements:

  • A laptop with an up-to-date version of Chrome or Firefox (Internet Explorer not supported)

Joseph also covers parallelizing machine learning algorithms at a conceptual level.
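
As a conceptual illustration only, the difference between data-parallel and model-parallel training can be sketched in plain Python. This is not Spark code and is not taken from the course materials: the toy dataset, the function names, and the use of threads as stand-ins for cluster workers are all assumptions made for the example. "Model parallel" is taken here to mean fitting several independent models at once, which is one common usage; the course may define the term differently.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy dataset (hypothetical): y = 2x exactly, so the best slope is 2.0.
DATA = [(float(x), 2.0 * x) for x in range(1, 9)]


def partial_sums(partition):
    """Per-partition statistics for a no-intercept least-squares fit."""
    sxy = sum(x * y for x, y in partition)
    sxx = sum(x * x for x, _ in partition)
    return sxy, sxx


def fit_data_parallel(data, n_partitions=4):
    """Data parallel: one model, with the data split across workers."""
    partitions = [data[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        parts = list(pool.map(partial_sums, partitions))
    # Combine step: add up the small per-partition summaries.
    sxy = sum(p[0] for p in parts)
    sxx = sum(p[1] for p in parts)
    return sxy / sxx


def fit_ridge(data, penalty):
    """Fit one no-intercept ridge model on the full dataset."""
    sxy, sxx = partial_sums(data)
    return sxy / (sxx + penalty)


def fit_model_parallel(data, penalties):
    """Model parallel (as assumed here): many models, each sees all the data."""
    with ThreadPoolExecutor(max_workers=len(penalties)) as pool:
        slopes = list(pool.map(lambda p: fit_ridge(data, p), penalties))
    return dict(zip(penalties, slopes))


slope = fit_data_parallel(DATA)
models = fit_model_parallel(DATA, [0.0, 1.0, 10.0])
```

In the data-parallel function each worker returns only two numbers per partition, so the combine step stays cheap however large the partitions are. In the model-parallel function every worker needs the whole dataset, which suits small data with many candidate settings.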

Joseph takes a pragmatic approach, focusing on using Apache Spark for data analysis and building models with MLlib while limiting the time spent on machine learning theory and the internal workings of Spark. You’ll work through examples using public datasets to learn how to apply Apache Spark to iterate faster and develop models on massive datasets, and how to use familiar Python libraries with Spark’s distributed and scalable engine. You’ll leave with the tools and knowledge you need to get started using Spark for practical data analysis tasks and machine learning problems, as well as a firm understanding of DataFrames, the DataFrames MLlib API, and related documentation.

Topics include:

  • Extract, transform, load (ETL) and exploratory data analysis (EDA)
  • DataFrames
  • Feature extraction and transformation using Spark MLlib
  • Spark ML pipelines: Transformers and estimators
  • Cross-validation
  • Model parallel versus data parallel
  • Reusing existing code with Spark
  • Tokenizer, Bucketizer, OneHotEncoder, Normalizer, HashingTF, IDF, StandardScaler, VectorAssembler, StringIndexer, PolynomialExpansion
  • Clustering, classification, and regression
  • k-means, logistic regression, decision trees, and random forests
  • Evaluation metrics
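
The transformer and estimator topics above follow a pattern that can be sketched without a cluster: a transformer maps rows to rows, an estimator's fit step learns something from the data and returns a transformer, and a pipeline chains the two. The classes below are plain-Python stand-ins with hypothetical names (ToyTokenizer, ToyStringIndexer, and so on), not Spark MLlib classes, and lists of dicts stand in for DataFrames.

```python
class ToyTokenizer:
    """Transformer: needs no fitting, just maps rows to rows."""

    def transform(self, rows):
        # Lowercase and split on whitespace; return new rows, leave input alone.
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class ToyStringIndexerModel:
    """Transformer produced by fitting ToyStringIndexer."""

    def __init__(self, index):
        self.index = index

    def transform(self, rows):
        return [dict(r, label_index=self.index[r["label"]]) for r in rows]


class ToyStringIndexer:
    """Estimator: learns a label -> index mapping from the data."""

    def fit(self, rows):
        counts = {}
        for r in rows:
            counts[r["label"]] = counts.get(r["label"], 0) + 1
        # Most frequent label gets index 0; ties are broken alphabetically
        # (a choice made for this sketch).
        ordered = sorted(counts, key=lambda k: (-counts[k], k))
        return ToyStringIndexerModel({lab: i for i, lab in enumerate(ordered)})


class ToyPipelineModel:
    """Transformer: applies the fitted stages in order."""

    def __init__(self, transformers):
        self.transformers = transformers

    def transform(self, rows):
        for t in self.transformers:
            rows = t.transform(rows)
        return rows


class ToyPipeline:
    """Estimator: fits each stage on the output of the previous ones."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted first; plain transformers are used as-is.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return ToyPipelineModel(fitted)


train = [
    {"text": "Spark is fast", "label": "pos"},
    {"text": "Slow job", "label": "neg"},
    {"text": "Spark scales well", "label": "pos"},
]
model = ToyPipeline([ToyTokenizer(), ToyStringIndexer()]).fit(train)
out = model.transform([{"text": "Fast Spark", "label": "neg"}])
```

Fitting happens once on the training rows, and the fitted pipeline can then be applied to new rows with the mapping learned at fit time, which is the property that makes the pattern convenient for cross-validation.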

About your instructor

Joseph Kambourakis is a data science instructor at Databricks. Joseph has more than 10 years of experience teaching, over five of them with data science and analytics. Previously, Joseph was an instructor at Cloudera and a technical sales engineer at IBM. He has taught in over a dozen countries around the world and been featured on Japanese television and in Saudi newspapers. He is a rabid Arsenal FC supporter and competitive Magic: The Gathering player. Joseph holds a BS in electrical and computer engineering from Worcester Polytechnic Institute and an MBA with a focus in analytics from Bentley University. He lives with his wife and daughter in Needham, MA.

Twitter: @mouthorjoe

Conference registration

Get the Platinum pass or the Training pass to add this course to your package.

Leave a Comment or Question

Help us make this conference the best it can be for you. Have questions you'd like this speaker to address? Suggestions for issues that deserve extra attention? Feedback that you'd like to share with the speaker and other attendees?



08/22/2017 7:54am EDT

I finally convinced my manager to let my entire data science team (three girls) attend the conference – precisely because of this training course. Unfortunately it seems to be full now – is there any way that we can still take this training? The other training sessions just aren’t as applicable to us and we really needed this particular training in order to make it worth the $ for my company to spend.