Make Data Work
Oct 15–17, 2014 • New York, NY

Bulk Loading Your Big Data into Apache HBase, a Full Walkthrough

Jean-Daniel Cryans (Cloudera)
4:15pm–4:55pm Friday, 10/17/2014
Hadoop Platform
Location: Hall A 23/24

Apache HBase is a database designed to store your big data and to serve random-access queries against it. One of its most compelling features is the ability to write user code that generates files in HBase’s own format, which can then be handed to the region servers, bypassing the write path with minimal effect on latency. This presentation will walk you through the whole HBase bulk loading process.

First, we will explain how the write path works in HBase, that is, how data goes from the client to being persisted in a file. Understanding this path is important because it motivates using bulk loading instead of writing data directly into HBase via a MapReduce job that uses the TableOutputFormat.

Second, we will describe the main concepts related to bulk loading: the total order partitioner, the HFileOutputFormat, the different reducers that HBase provides, and the LoadIncrementalHFiles tool. We assume that audience members have basic knowledge of MapReduce.
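
To make the idea behind the total order partitioner concrete, here is a minimal sketch in plain Java. It is not HBase's or Hadoop's actual code: the class name TotalOrderSketch and both method names are hypothetical, and the split points are made-up values. It assumes only that row keys sort as unsigned bytes in lexicographic order and that each region covers a range with an inclusive start key and an exclusive end key, so every key maps to exactly one partition and each reducer can write sorted files for a single region.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Toy illustration of total-order partitioning; hypothetical names, not HBase code. */
final class TotalOrderSketch {

    /** Compares two row keys lexicographically as unsigned bytes (requires Java 9+). */
    static int compareKeys(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    /**
     * Returns the index of the partition whose key range contains rowKey.
     * splitPoints must be sorted; N split points define N + 1 partitions.
     * A key equal to a split point belongs to the partition that starts there.
     */
    static int partitionFor(byte[] rowKey, byte[][] splitPoints) {
        int lo = 0;
        int hi = splitPoints.length;
        // Binary search for the first split point strictly greater than rowKey.
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compareKeys(rowKey, splitPoints[mid]) >= 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    private static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Made-up region boundaries: four regions split at "g", "n" and "t".
        byte[][] splits = { bytes("g"), bytes("n"), bytes("t") };
        for (String key : new String[] { "apple", "g", "melon", "zebra" }) {
            System.out.println(key + " -> partition " + partitionFor(bytes(key), splits));
        }
    }
}
```

Because the partitions follow the same order as the keys, concatenating the reducers' outputs in partition order yields one globally sorted data set, which is the property a bulk load relies on.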

Third, we will present an example that goes through a complete Extract, Transform, Load (ETL) process from an external data source into an HBase table.

Finally, we will explore a few issues commonly experienced by first-time users of this feature. For example, how do you make sure you don’t overrun the cluster with compactions? How do you configure the bulk load job to create files with the correct settings, such as compression? What about file system permissions when files are created by different users?

At the end of this presentation, new HBase users should have a good idea of how to get their data into HBase efficiently, while current users will have learned how to run continuous data ingestion without affecting their production system.

Jean-Daniel Cryans

Cloudera

Jean-Daniel Cryans works as a software engineer at Cloudera on the Storage team, where he spends his days making Apache HBase better. Before that, he worked on HBase at StumbleUpon while maintaining its production deployment. Jean-Daniel enjoys teaching HBase to newcomers and old-timers alike, both in the open source community and through presentations at Big Data and Apache Hadoop-related conferences and meetups. He became a committer and PMC member on the project in 2008, when he was still an undergraduate student at ETS Montreal. Jean-Daniel now lives in San Francisco with his wife.