Presented By O'Reilly and Cloudera
Make Data Work
Feb 17–20, 2015 • San Jose, CA

Making HBase Accessible to Scientists

Spencer Herath (Accenture), Aaron Benz (Accenture)
1:30pm–2:10pm Thursday, 02/19/2015
Hadoop Platform
Location: 210 B/F

Our case study covers two main topics:
1. Using HBase as a storage solution for hierarchical time series data
2. Using R and Python to make HBase data readily accessible

The data presented in our case study is hierarchical time series data, in which individual time series datasets are organized according to a consistent hierarchy. Such a data model is useful in applications ranging from stock price data to sensor data to meteorological data. For example, daily stock price datasets could be organized according to a hierarchy such as exchange/stock/day (a stock exchange has many stocks listed under it, and each of those stocks has a stock price time series for each day it was traded on the open market). Each individual stock price dataset would consist of two fields: time and price. There are other time series variables we might also wish to measure, such as trade volume.

Hierarchical time series data may not lend itself well to a classic relational database solution. Forcing such a data model into a classic RDBMS might require a large amount of replicated data, long query times, and a complex data model. On the other hand, hierarchical data fits well in a Google Bigtable-inspired NoSQL solution such as HBase. The data modeler simply designs a rowkey structure that follows the hierarchy. Instead of storing individual observations in HBase cells, we opted to store each entire dataset as a “blob” of data in a single cell. This “blob” storage approach, along with an intuitive rowkey design, provided us with very fast lookups of data.
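As a rough illustration of the rowkey and “blob” ideas above, the following Python sketch builds a rowkey from the exchange/stock/day hierarchy and packs one day's series into a single cell value. The abstract does not state which delimiter or serialization format was used, so the "/" separator, the JSON encoding, and the function names here are all assumptions made for the example.

```python
import json

SEP = "/"  # assumed delimiter; the abstract does not specify one


def make_rowkey(exchange, stock, day):
    """Build a rowkey that mirrors the exchange/stock/day hierarchy."""
    return SEP.join([exchange, stock, day]).encode("utf-8")


def encode_blob(times, prices):
    """Serialize one entire time series into a single cell value.

    JSON is an assumption for illustration; any compact format would do.
    """
    payload = {"time": list(times), "price": list(prices)}
    return json.dumps(payload).encode("utf-8")


def decode_blob(blob):
    """Recover the (times, prices) lists from a stored cell value."""
    payload = json.loads(blob.decode("utf-8"))
    return payload["time"], payload["price"]
```

Because HBase keeps rows sorted by rowkey, keys built this way place all days for one stock next to each other, so a single key lookup or a short prefix scan returns whole datasets rather than individual observations.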

A data storage solution is effective only when coupled with a system that allows easy retrieval of the data by those who need to work with it. Therefore, we show how to work with HBase using two languages that should be in every data scientist’s toolbox: R and Python. R’s rhbase and Python’s happybase give the user basic HBase functionality without the need to know Java or to work in the HBase shell. Specifically, we will talk about how we opted to use happybase for building the HBase table and rhbase for data consumption and analysis. We will also demo a simple front-end web application developed with the help of RStudio’s new Shiny framework.
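To give a sense of what working with HBase from Python can look like, here is a minimal happybase sketch for writing and reading blob cells. It is not the presenters' code: the host, the table name "timeseries", the column "d:blob", and the helper names are hypothetical, and reading is shown in Python only to keep the example in one language (the talk used rhbase in R for consumption). It assumes an HBase Thrift server is running and that the table already exists.

```python
COLUMN = b"d:blob"  # hypothetical column family "d" with qualifier "blob"


def open_table(host="localhost", table_name="timeseries"):
    """Connect via the HBase Thrift gateway (host and table name are assumed)."""
    import happybase  # imported lazily so the helpers below work without it

    connection = happybase.Connection(host)
    return connection.table(table_name)


def store_series(table, rowkey, blob):
    """Write one serialized dataset into a single cell."""
    table.put(rowkey, {COLUMN: blob})


def fetch_series(table, rowkey):
    """Return the blob for one rowkey, or None if the row is absent."""
    row = table.row(rowkey)  # happybase returns an empty dict for a missing row
    return row.get(COLUMN)


def fetch_by_prefix(table, prefix):
    """Return {rowkey: blob} for every row whose key starts with the prefix."""
    return {key: data[COLUMN] for key, data in table.scan(row_prefix=prefix)}
```

With a rowkey such as `b"NYSE/ACN/2015-02-19"`, `fetch_series` retrieves one day's dataset, while `fetch_by_prefix(table, b"NYSE/ACN/")` would retrieve every stored day for that stock.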

Spencer Herath

Accenture

Spencer is a Data Scientist with Accenture.

Aaron Benz

Accenture

Aaron is a Data Scientist with Accenture.