Hadoop Data Warehousing with Hive

Dean Wampler (Anyscale), Jason Rutherglen (Datastax)
Data Science, Ballroom H
Please note: to attend, your registration must include Tutorials.

This is a hands-on tutorial. Please download the slides and exercise materials (28MB) in advance. The download also contains instructions for installing Hive, so you’ll be ready to go Tuesday morning. If you have any problems or questions, email training AT thinkbiganalytics DOT com before the tutorial.

In this hands-on tutorial, you’ll learn how to install and use Hive for Hadoop-based data warehousing. You’ll also learn some tricks of the trade and how to handle known issues.

  • Using the Hive Tutorial Tools

We’ll email instructions to you before the tutorial so you can arrive with the necessary tools installed and ready to go. This preparation lets us spend the whole tutorial on Hive’s query language and other important topics. At the beginning of the tutorial we’ll show you how to use these tools.

  • Writing Hive Queries

We’ll spend most of the tutorial using a series of hands-on exercises with actual Hive queries, so you can learn by doing. We’ll go over all the main features of Hive’s query language, HiveQL, and how Hive works with data in Hadoop.

  • Advanced Techniques

Hive is very flexible about the formats of data files, the “schema” of records, and so forth. We’ll discuss options for customizing these and other aspects of your Hive setup and data cluster. We’ll briefly examine how you can write Java user-defined functions (UDFs), as well as other plugins that extend Hive to handle data formats it doesn’t support natively.
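
To give a rough idea of what a Java UDF involves, here is a minimal sketch, not taken from the tutorial materials. The class name and the host-normalizing logic are hypothetical. It is written as plain Java so it runs without Hive on the classpath; in a classic Hive UDF, the class would additionally extend org.apache.hadoop.hive.ql.exec.UDF, and Hive would find the evaluate method by reflection.

```java
import java.util.Locale;

// Hypothetical sketch of the core of a Hive UDF.
// Assumption: in a real deployment this class would extend
// org.apache.hadoop.hive.ql.exec.UDF; that is omitted here so the
// example compiles and runs with only the JDK.
class NormalizeHostSketch {

    // Trims and lower-cases a host name.
    // Returns null for null input, since Hive columns can hold NULLs.
    public String evaluate(String host) {
        if (host == null) {
            return null;
        }
        return host.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        NormalizeHostSketch udf = new NormalizeHostSketch();
        System.out.println(udf.evaluate("  WWW.Example.COM "));
    }
}
```

Once such a class is packaged in a jar, it is typically made available to a Hive session with ADD JAR and registered under a function name with CREATE TEMPORARY FUNCTION, after which it can be called in queries like a built-in function.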

  • Hive in the Hadoop Ecosystem

We’ll conclude with a discussion of Hive’s place in the Hadoop ecosystem, including how it compares to other available tools. We’ll discuss installation and configuration choices that affect performance and ease of use in a real production cluster. In particular, we’ll discuss how to set up Hive’s separate “metadata” store in a traditional relational database, such as MySQL. We’ll also offer tips on data formats and layouts that improve performance in various scenarios.

Dean Wampler


Dean Wampler is Principal Consultant at Think Big Analytics, specialists in Big Data, Machine Learning, and the Hadoop ecosystem. He speaks frequently at conferences on various topics, such as the effective use of different programming languages and modularity paradigms: functional, object-oriented, and aspect-oriented programming.

Dean is the author of Functional Programming for Java Developers (O’Reilly, 2011) and the co-author, with Alex Payne, of Programming Scala (O’Reilly, 2009).

Jason Rutherglen


Jason is a Sr. Architect at Think Big Analytics. He has many years of experience writing Java application software, most recently for Hadoop-based applications.


