Presented By O'Reilly and Cloudera
Make Data Work
Feb 17–20, 2015 • San Jose, CA

One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP

Randy Guck (Dell Software)
4:50pm–5:30pm Thursday, 02/19/2015
Hadoop & Beyond
Location: 230 C

Many good software packages today provide big data analytics on distributed clusters. However, some applications need fast aggregate queries on large data sets but cannot justify the cost or complexity of a large database cluster. If an application’s requirements meet certain constraints, such as the ability to partition its data into time-based shards, it can benefit from Doradus OLAP and may be able to fit all of its data on a single node.
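As a rough illustration of what partitioning data into time-based shards can look like, here is a minimal Python sketch. It is not Doradus code: the function names `shard_name` and `partition_by_shard`, the event layout, and the day/month naming scheme are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime


def shard_name(timestamp: datetime, granularity: str = "day") -> str:
    """Map a timestamp to a shard name (hypothetical naming scheme)."""
    if granularity == "day":
        return timestamp.strftime("%Y-%m-%d")
    if granularity == "month":
        return timestamp.strftime("%Y-%m")
    raise ValueError(f"unsupported granularity: {granularity}")


def partition_by_shard(events, granularity: str = "day"):
    """Group events into shards keyed by the time period they fall in."""
    shards = defaultdict(list)
    for event in events:
        shards[shard_name(event["timestamp"], granularity)].append(event)
    return dict(shards)


# Illustrative events; the field names are assumptions.
events = [
    {"timestamp": datetime(2015, 2, 17, 9, 0), "message": "login"},
    {"timestamp": datetime(2015, 2, 17, 18, 30), "message": "logout"},
    {"timestamp": datetime(2015, 2, 18, 8, 15), "message": "login"},
]

for name, members in sorted(partition_by_shard(events).items()):
    print(name, len(members))
```

Because each shard covers a fixed time period, new data only touches the most recent shard, and older shards can be queried or dropped as whole units.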

Doradus is an open source storage and query engine that runs on top of Cassandra. It offers a high-level data model, full text searching, and aggregate queries such as AVERAGE() with multi-level grouping. Doradus offers two storage services that use specialized storage and query execution techniques for different application domains. The Doradus OLAP service, described in this session, borrows techniques from Online Analytical Processing, columnar storage, compression, and sharding to yield extremely compact databases. Though many applications may be able to use a single node, the underlying Cassandra NoSQL database allows data to be distributed and replicated over a cluster when needed.
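To give a feel for why columnar storage combined with compression can yield compact data, here is a minimal Python sketch of run-length encoding applied to a single column. This is a generic technique chosen for illustration; the abstract does not say which encodings Doradus OLAP actually uses, and the function names are assumptions.

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def run_length_decode(runs):
    """Expand (value, count) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


# One column of an event table, stored separately from the other fields.
# The values are illustrative only.
level_column = ["info"] * 6 + ["warn"] * 2 + ["info"] * 4

encoded = run_length_encode(level_column)
print(encoded)
print(len(level_column), "values stored as", len(encoded), "runs")
```

Storing each field as its own column puts similar values next to each other, which is what makes simple encodings like this effective.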

The session will describe features including:

  • Doradus OLAP uses as little as 1% of the disk space used by other systems.
  • OLAP cubes, called shards, can be updated and rebuilt very quickly, typically in a few seconds.
  • The REST API supports JSON and XML for maximum accessibility.
  • A simple but flexible data model supports multi-valued scalar fields and bi-directional links, which provide inter-object relationships with full referential integrity.
  • All data fields are searchable using a Lucene-compatible query language that supports terms, phrases, wildcards, ranges, inequalities, etc.
  • The query language uses link paths, which are much simpler than joins. They offer rich query features such as quantifiers, path filters, transitive searches, and more.
  • Aggregate queries perform fast, complex metric computations with multi-level grouping—without indexes!
  • Though many databases will fit on a single node, the Cassandra persistence layer allows the database to be distributed on a multi-node cluster for scalability, replication, and failover.
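The idea of an aggregate query with multi-level grouping that needs no indexes can be sketched in a few lines of Python. This is an assumption-laden illustration, not the Doradus query engine: the function `average_by`, the column layout (one list per field), and the field names are all invented for the example.

```python
from collections import defaultdict


def average_by(columns, metric, group_fields):
    """Compute the average of `metric` for each combination of group values.

    `columns` maps a field name to a list of values, one entry per object.
    The computation is a single scan over the relevant columns, with no index.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    group_columns = [columns[field] for field in group_fields]
    for value, *key in zip(columns[metric], *group_columns):
        group = tuple(key)
        sums[group] += value
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}


# Illustrative columnar data; the field names are assumptions.
columns = {
    "size": [10, 20, 30, 40],
    "host": ["a", "a", "b", "b"],
    "level": ["info", "info", "info", "error"],
}

# Two-level grouping: first by host, then by level.
for group, avg in average_by(columns, "size", ["host", "level"]).items():
    print(group, avg)
```

Only the columns named in the query are read, so the cost of the scan depends on the fields involved rather than on the full width of each object.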

A case study based on time-stamped event data demonstrates these features and shows how one billion objects require only 2GB of disk space.
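The headline figure works out to roughly two bytes of disk per object. The short Python sketch below does the arithmetic; whether "2GB" means 2 x 10^9 bytes or 2 x 2^30 bytes is not stated in the abstract, so both readings are shown as assumptions.

```python
def bytes_per_object(total_bytes: int, object_count: int) -> float:
    """Average on-disk bytes per stored object."""
    if object_count <= 0:
        raise ValueError("object_count must be positive")
    return total_bytes / object_count


OBJECTS = 1_000_000_000   # one billion objects, from the abstract
DECIMAL_2GB = 2 * 10**9   # assumption: decimal gigabytes
BINARY_2GB = 2 * 2**30    # assumption: binary gigabytes (GiB)

print(bytes_per_object(DECIMAL_2GB, OBJECTS))            # 2.0
print(round(bytes_per_object(BINARY_2GB, OBJECTS), 3))   # 2.147
```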

Randy Guck

Dell Software

Randy Guck is a Principal Engineer at Dell Software, focused on big data solutions for commercial software applications. He has developed software for over 30 years, building specialized databases including semantic, hybrid object/relational, and NoSQL. Randy currently heads the Doradus NoSQL database project, which leverages Cassandra to address specific application query and storage problems. Randy holds patents in the fields of eCommerce, multimedia management, and object databases.