Expand your data science and machine learning skills with Python, R, SQL, Spark, and TensorFlow (Day 2)

Ian Cook (Cloudera)

Location: 1A 18

Data Science, Machine Learning, & AI

Who is this presentation for?

You want to expand your data science and machine learning skills without getting overwhelmed.

Level

Beginner

Description

Python and R are the leading open source languages for data science and machine learning, but getting comfortable with both of these languages requires grappling with different syntaxes, conventions, and terminology. Pairs of ostensibly comparable packages from the Python Package Index (PyPI) and CRAN often have fundamentally different interfaces, and APIs connecting Python and R to the same external systems are often incongruous. Furthermore, when you attempt to scale workflows from smaller local datasets to larger distributed datasets, you have to contend with additional frameworks and interfaces with idiosyncrasies beyond those in the core Python and R ecosystems. But these differences belie a set of fundamental abstractions common to these systems.

Ian Cook explores the underlying commonalities of these systems through intuitive explanations and straightforward demonstrations.

Outline

The two-dimensional data structures familiar to data scientists (SQL tables, NumPy arrays, pandas DataFrames, R data frames, Spark DataFrames, and TensorFlow datasets) are all implementations of the same abstract concept with only a few important differences
Popular data manipulation interfaces (SQL, pandas, dplyr, and the Spark DataFrame API) are all based on the same set of relational operations
Machine learning workflows implemented using popular packages and frameworks (scikit-learn, the caret package for R, Spark MLlib, and the TensorFlow Estimator API) all follow the same fundamental steps: input training data, define features and labels, train model, evaluate model, and make predictions
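
To make the second point above concrete, here is a minimal sketch, not taken from the session materials, that runs the same relational operations (filter, group, aggregate, sort) once in SQL and once in plain Python. It assumes an in-memory SQLite database as the SQL engine and uses plain loops as a stand-in for a DataFrame-style interface; the sample rows, the `sales` table, and the function names `totals_sql` and `totals_python` are all hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (city, units) rows.
rows = [
    ("Raleigh", 3),
    ("Durham", 5),
    ("Raleigh", 7),
    ("Durham", 1),
    ("Cary", 4),
]


def totals_sql(rows, min_units):
    """Filter, group, aggregate, and sort with SQL (in-memory SQLite)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (city TEXT, units INTEGER)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT city, SUM(units) AS total FROM sales "
            "WHERE units >= ? GROUP BY city ORDER BY city",
            (min_units,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


def totals_python(rows, min_units):
    """The same relational steps written out by hand."""
    totals = defaultdict(int)
    for city, units in rows:
        if units >= min_units:      # WHERE units >= ?
            totals[city] += units   # GROUP BY city, SUM(units)
    return sorted(totals.items())   # ORDER BY city


if __name__ == "__main__":
    print(totals_sql(rows, 3))
    print(totals_python(rows, 3))
```

Both functions return the same list of `(city, total)` pairs. The interfaces differ, but the underlying operations are the same, which is the commonality the outline describes.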

By exploring and running Python and R code in Cloudera Data Science Workbench (CDSW), you’ll gain familiarity with these two languages and their ecosystems of data science tools, plus SQL, Spark, and TensorFlow. By practicing on sets of equivalent data science and machine learning workflows implemented using these different languages and frameworks, you’ll overcome the obstacles to getting started using these tools.

Prerequisite knowledge

A basic knowledge of Python or R

What you'll learn

Understand the fundamental abstractions common to data science and machine learning systems
Learn how to implement equivalent workflows using Python, R, SQL, Spark, and TensorFlow and overcome the obstacles to getting started

Ian Cook

Cloudera

Ian Cook is a data scientist at Cloudera and the author of several R packages, including implyr. Previously, he was a data scientist at TIBCO and a statistical software developer at AMD. Ian is a cofounder of Research Triangle Analysts, the largest data science meetup group in the Raleigh, North Carolina, area, where he lives with his wife and two young children. He holds an MS in statistics from Lehigh University and a BS in applied mathematics from Stony Brook University.