Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Guerrilla guide to Python and Apache Hadoop

Juliet Hougland (Cloudera), Sean Owen (Cloudera)
1:30pm–5:00pm Tuesday, 09/27/2016
Data science & advanced analytics
Location: 1 E 09
Level: Intermediate
Tags: pydata
Average rating: 3.67 (3 ratings)

Prerequisite knowledge

  • Python development skills

What you'll learn

  • How to do full Python development on the Hadoop stack, at Hadoop scale

Description

    Juliet Hougland and Sean Owen offer a practical overview of the basics of using Python data tools with a Hadoop cluster, presented as an interactive demo with accompanying online materials and data. Juliet and Sean cover HDFS connectivity, working with raw data files, and running SQL queries with a SQL-on-Hadoop system such as Apache Hive or Apache Impala (incubating). They also cover the basics of accessing data with Spark, creating new Spark DataFrames, and implementing the two most common modeling workflows: (1) fitting a model on a single node using scikit-learn, saving the model, and applying it in an embarrassingly parallel fashion; and (2) fitting a model to distributed data using Spark MLlib.
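
The first modeling workflow (fit on one node, save the model, apply it in parallel) can be sketched without a cluster. The example below is only an illustration under stated assumptions, not the tutorial's material: a hand-written least-squares fit stands in for scikit-learn, `pickle` stands in for whatever model persistence the session used, and a thread pool stands in for Spark executors working on partitions. The names `fit_line` and `score_partition` and the sample numbers are hypothetical.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def score_partition(model_bytes, partition):
    """Deserialize the model once per partition, then score every value in it."""
    model = pickle.loads(model_bytes)
    return [model["slope"] * x + model["intercept"] for x in partition]


# 1. Fit on a single node, using a small in-memory sample (made-up data).
model = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

# 2. Save the model. Bytes are used here; a real job would write to a file.
model_bytes = pickle.dumps(model)

# 3. Apply the model to each partition independently. No partition needs
#    data from any other, which is what makes the step embarrassingly parallel.
partitions = [[4.0, 5.0], [6.0], [7.0, 8.0, 9.0]]
with ThreadPoolExecutor(max_workers=3) as pool:
    scored = list(pool.map(partial(score_partition, model_bytes), partitions))
```

The key design point is that the serialized model is small and read-only, so every worker can load its own copy and score its partition with no coordination.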

    Topics include:

    • Connecting to HDFS and reading and writing raw data files
    • Connecting to Impala and querying new datasets in HDFS using Ibis or raw SQL
    • Creating partitioned Impala or Hive tables in the Hive metastore
    • Using Python’s data visualization tools as part of an exploratory data analysis
    • Building complex analytic models
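
Partitioned Hive and Impala tables are typically stored as one directory per partition value, named `column=value`. The sketch below reproduces that layout on the local filesystem so the idea can be tried without a cluster. It is an assumption-laden illustration rather than the session's code: the helper `write_partitioned`, the file name `part-00000.csv`, and the sample rows are all hypothetical, and a real table would also need to be registered in the Hive metastore.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def write_partitioned(rows, root, partition_col):
    """Write rows as CSV files under root/<partition_col>=<value>/ directories."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_col]].append(row)

    written = []
    for value, group in sorted(groups.items()):
        part_dir = Path(root) / f"{partition_col}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        out_path = part_dir / "part-00000.csv"
        # The partition column is encoded in the directory name, not the file.
        fields = [name for name in group[0] if name != partition_col]
        with open(out_path, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fields, extrasaction="ignore")
            writer.writerows(group)
        written.append(out_path)
    return written


# Made-up sample rows, partitioned by day.
rows = [
    {"day": "2016-09-26", "user": "a", "clicks": 3},
    {"day": "2016-09-27", "user": "b", "clicks": 5},
    {"day": "2016-09-27", "user": "c", "clicks": 1},
]
root = tempfile.mkdtemp()
paths = write_partitioned(rows, root, "day")
layout = [p.relative_to(root).as_posix() for p in paths]
```

Because each partition value maps to its own directory, a query that filters on the partition column only has to read the matching directories.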

    Juliet Hougland


    Juliet Hougland answers complex business problems using statistics to tame multiterabyte datasets. She succeeds in applying and explaining the results of mathematical models across a variety of industries including software, industrial energy, retail, and consumer packaged goods. Juliet is currently the head of data science, engineering at Cloudera, where she focuses on using data to help engineering build high-quality products. Juliet’s been sought after by Cloudera’s customers as a field-facing data scientist advising on which tools to use, teaching how to use them, recommending the best approach to bring together the right data to answer the business problem at hand, and building production machine-learning models. For many years, Juliet has been a contributor in the open source community working on projects such as Apache Spark, Scalding, and Kiji. Juliet holds an MS in applied mathematics from the University of Colorado, Boulder and graduated Phi Beta Kappa from Reed College with a BA in math-physics.

    Sean Owen


    Sean Owen is director of data science at Cloudera in London. Before Cloudera, he founded Myrrix Ltd. (now the Oryx project) to commercialize large-scale real-time recommender systems on Hadoop. He is an Apache Spark committer, was a committer and VP for Apache Mahout, and is the coauthor of Advanced Analytics with Spark and Mahout in Action. Previously, Sean was a senior engineer at Google.