Presented By O'Reilly and Cloudera
December 5–6, 2016: Training
December 6–8, 2016: Tutorials & Conference

Guerrilla guide to Python and Apache Hadoop

Juliet Hougland (Cloudera), Sean Owen (Cloudera)
1:30pm–5:00pm Tuesday, December 6, 2016
Data science and advanced analytics
Location: 321/322
Level: Intermediate
Tags: pydata
Average rating: 3.00 (2 ratings)

Prerequisite Knowledge

  • Python development skills

Materials or downloads needed in advance

  • A laptop

What you'll learn

  • How to do full Python development on the Hadoop stack, at Hadoop scale


Sean Owen and Juliet Hougland offer a practical overview of the basics of using Python data tools with a Hadoop cluster, in an interactive demo format with accompanying online materials and data. Sean and Juliet cover HDFS connectivity and dealing with raw data files; running SQL queries with a SQL-on-Hadoop system like Apache Hive or Apache Impala (incubating); and the basics of accessing data with Spark and creating new Spark DataFrames. They also implement the two most common modeling workflows: (1) fitting a model on a single node using scikit-learn, saving the model, and applying it in an embarrassingly parallel fashion; and (2) fitting a model to distributed data using Spark MLlib.
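
The first modeling workflow (fit on one node, save the model, apply it in parallel) can be sketched roughly as follows. This is only an illustration and not the session's material: a hand-rolled least-squares line stands in for a scikit-learn estimator, a thread pool stands in for Spark executors, and the names `fit_line` and `score_partition`, along with the toy data, are hypothetical.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    Stand-in (assumption) for fitting a scikit-learn model on a single node.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": y_bar - slope * x_bar}


def score_partition(model_bytes, partition):
    """Deserialize the saved model and score one partition independently."""
    model = pickle.loads(model_bytes)
    return [model["slope"] * x + model["intercept"] for x in partition]


# 1. Fit on a single node, using a small sample that fits in memory.
model = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

# 2. Save the model; the serialized bytes are what gets shipped to workers.
model_bytes = pickle.dumps(model)

# 3. Apply the model to each partition in parallel. No partition depends on
#    any other, which is what makes the step embarrassingly parallel.
partitions = [[4.0, 5.0], [6.0], [7.0, 8.0, 9.0]]
with ThreadPoolExecutor(max_workers=3) as pool:
    scored = list(pool.map(lambda part: score_partition(model_bytes, part), partitions))

print(scored)
```

On a real cluster the partitions would be pieces of a distributed dataset and the scoring function would run on the workers, but the shape of the workflow is the same: train once, serialize, score everywhere.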

Topics include:

  • Connecting to HDFS and reading and writing raw data files
  • Connecting to Impala and querying new datasets in HDFS using Ibis or raw SQL
  • Creating partitioned Impala or Hive tables in the Hive metastore
  • Using Python’s data visualization tools as part of an exploratory data analysis
  • Building complex analytic models

Juliet Hougland


Juliet Hougland is a data scientist at Cloudera and contributor/committer/maintainer for the Sparkling Pandas project. Her commercial applications of data science include developing predictive maintenance models for oil and gas pipelines at Deep Signal and designing and building a platform for real-time model application, data storage, and model building at WibiData. Juliet was the technical editor for Learning Spark by Karau et al. and Advanced Analytics with Spark by Ryza et al. She holds an MS in applied mathematics from the University of Colorado, Boulder and graduated Phi Beta Kappa from Reed College with a BA in math-physics.


Sean Owen


Sean Owen is director of data science at Cloudera in London. Before Cloudera, he founded Myrrix Ltd. (now the Oryx project) to commercialize large-scale real-time recommender systems on Hadoop. He is an Apache Spark committer, was a committer and VP for Apache Mahout, and is the coauthor of Advanced Analytics with Spark and Mahout in Action. Previously, Sean was a senior engineer at Google.