Presented By O’Reilly and Cloudera

Make Data Work

March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

In-Person Training
Apache Spark programming

Brooke Wenig (Databricks)

Monday, March 5 & Tuesday, March 6, 9:00am–5:00pm

Data science and machine learning
Location: 212 A-B

Participants should plan to attend both days of this 2-day training course. Platinum and Training passes do not include access to tutorials on Tuesday.

What you'll learn, and how you can apply it

Understand Spark’s fundamental mechanics and Spark internals
Learn how to use the core Spark APIs to operate on data; build data pipelines and query large datasets using Spark SQL and DataFrames; analyze Spark jobs using the administration UIs and logs inside Databricks; and create Structured Streaming and machine learning jobs
Be able to articulate and implement typical use cases for Spark

This training is for you because...

You're a software developer, data analyst, data engineer, or data scientist who wants to use Apache Spark for machine learning and data science.

Prerequisites:

Experience coding in Python or Scala and using Spark
A basic understanding of data science topics and terminology
Familiarity with DataFrames (useful but not required)

Hardware and/or installation requirements:

A laptop with an up-to-date version of Chrome or Firefox (Internet Explorer is not supported)

Brooke Wenig walks you through the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs. Join in to learn how to perform machine learning on Spark and explore the algorithms supported by the Spark MLlib APIs.

Each topic includes lecture content along with hands-on use of Spark through an elegant web-based notebook environment. Notebooks allow attendees to code jobs, data analysis queries, and visualizations using their own Spark cluster, accessed through a web browser. You can keep the notebooks and continue to use them with the free Databricks Community Edition offering. Alternatively, each notebook can be exported as source code and run within any Spark environment.

Outline

Spark overview

The DataFrames programming API
Spark SQL
The Catalyst query optimizer
The Tungsten in-memory data format
The Dataset API, encoders, and decoders
Use of the Spark UI to help understand DataFrame behavior and performance
Caching and storage levels

Spark internals

How Spark schedules and executes jobs and tasks
Shuffling, shuffle files, and performance
How various data sources are partitioned
How Spark handles data reads and writes
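
As a rough illustration of the shuffling topic above, the sketch below mimics hash partitioning in plain Python: each record goes to a partition chosen from a hash of its key modulo the partition count, so all values for a key end up together before they are aggregated. This is not course material and does not use the Spark API. The function names (`partition_for`, `shuffle`, `reduce_by_key`), the sample records, and the use of `zlib.crc32` as the hash are assumptions made for the example.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Pick a target partition from a stable hash of the key."""
    # crc32 is a stand-in hash chosen for this sketch; it is stable across
    # runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(records, num_partitions):
    """Group (key, value) records into buckets by target partition."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition_for(key, num_partitions)].append((key, value))
    return dict(buckets)


def reduce_by_key(bucket):
    """Sum the values per key inside a single partition."""
    totals = defaultdict(int)
    for key, value in bucket:
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical word-count style input.
    records = [("spark", 1), ("sql", 1), ("spark", 1), ("ml", 1), ("sql", 1)]
    buckets = shuffle(records, num_partitions=3)
    for pid in sorted(buckets):
        print(pid, reduce_by_key(buckets[pid]))
```

Because every occurrence of a key maps to the same partition, each partition can be aggregated independently. Moving the records between partitions is the expensive part of a real shuffle.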

Graph processing with GraphFrames

Spark ML’s Pipeline API for machine learning

Spark Structured Streaming

About your instructor

Brooke Wenig is an instructor and data science consultant for Databricks. Previously, she was a teaching associate at UCLA, where she taught graduate machine learning, senior software engineering, and introductory programming courses. Brooke also worked at Splunk and Under Armour as a KPCB fellow. She holds an MS in computer science from UCLA, with highest honors and a focus on distributed machine learning. Brooke speaks Mandarin Chinese fluently and enjoys cycling.

Conference registration

Get the Platinum pass or the Training pass to add this course to your package.

Comments

Manjusha Bolishetty | BUSINESS INTELLIGENCE ANALYST

01/29/2018 6:35am PST

Hi Brooke,

Will this session be held in Python or Java?

Brooke Wenig | INSTRUCTOR AND DATA SCIENCE CONSULTANT

01/29/2018 7:23am PST

It will be held in Python.
