HBase and Hive at StumbleUpon


We deployed Hive at StumbleUpon early this year as a tool for mining our HBase production datasets. It has been quite a success with both engineering and our analysts; engineers no longer have to write the analysts’ reports and the analysts don’t have to deal with cranky engineers.

In this presentation, we will first cover why someone would use Hive with HBase instead of using HDFS files directly, and which goals can be accomplished that way. We will then review how the Hive-HBase integration works, to better understand the state and drawbacks of the current implementation.

The second part will cover how we deployed Hive internally at StumbleUpon and how the data is fed into the system. This will include how we replicate the data live from our MySQL and real-time HBase clusters into an analytical Hadoop/HBase cluster in an ETL fashion. We will also present some of our use cases and how they translate into the Hive query language.

The presentation will end with our lessons learned and how we expect to grow our Hive usage as the company does. At the time of writing we are signing up more than 600,000 new users per month and we just passed 15M total users.

Jean-Daniel Cryans


Jean-Daniel is a Database Engineer at StumbleUpon. When he’s not developing HBase or supporting its usage inside the company, he’s helping others with the Hadoop stack. Jean-Daniel has been a committer on the Apache HBase project since 2008.



Sheeri K. Cabral
09/04/2011 10:45pm PDT

Video for this talk can be found online at www.youtube.com/watch?v=WpQ...