Most monitoring systems use a time series database to store historical data. RRD and traditional relational databases such as MySQL are among the most common storage backends used in popular monitoring systems such as MRTG, Cacti, Ganglia, Munin, Nagios, and Opsview. With the advent of the “NoSQL” movement, scalable and distributed data stores have become readily available on large clusters of commodity machines. This presentation introduces OpenTSDB, an open-source, horizontally scalable, general-purpose time series database built on top of HBase. We show how its design can be used to monitor large clusters at an unprecedented level of granularity. With such a system, it becomes possible to track orders of magnitude more time series from thousands of hosts and applications, at a resolution of a few seconds, providing accurate real-time monitoring as well as long-term trending.
When dealing with increasingly complex distributed systems and applications,
engineers are faced with the growing challenge of understanding the complex
state of the systems they run. All modern network equipment, operating
systems, and applications export a wealth of metrics about their state and
interactions with other services. In a large cluster, collecting, indexing
and storing all the monitoring data becomes a daunting task due to the sheer
volume of information and high rate of change. Metrics are typically
collected by running an agent on the hosts. Data points are then persisted in
a chronological fashion in a time series database. Being able to plot the data
is of utmost importance, and staying on top of the trends is critical for
capacity planning and performance monitoring. Being able to correlate
different time series is tremendously helpful when trying to understand the
behavior of a service or conduct postmortem analyses.
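To make the idea of persisting data points chronologically concrete, here is a minimal in-memory sketch. It is not how OpenTSDB stores data; the class name `TimeSeriesStore`, its methods, and the metric name `cpu.user` are hypothetical and chosen only for illustration. It keeps each series sorted by timestamp so that a time-range query reduces to two binary searches.

```python
import bisect
from collections import defaultdict


class TimeSeriesStore:
    """Toy time series store (hypothetical, for illustration only)."""

    def __init__(self):
        # metric name -> list of (timestamp, value), kept sorted by timestamp
        self._series = defaultdict(list)

    def add(self, metric, timestamp, value):
        # insort keeps the list in chronological order even when
        # agents deliver data points out of order.
        bisect.insort(self._series[metric], (timestamp, value))

    def query(self, metric, start, end):
        # Return all points with start <= timestamp <= end.
        points = self._series.get(metric, [])
        lo = bisect.bisect_left(points, (start, float("-inf")))
        hi = bisect.bisect_right(points, (end, float("inf")))
        return points[lo:hi]


store = TimeSeriesStore()
store.add("cpu.user", 30, 0.5)
store.add("cpu.user", 10, 0.1)
store.add("cpu.user", 20, 0.3)
recent = store.query("cpu.user", 10, 20)
```

Because every series shares the same timestamp axis, correlating two metrics amounts to running the same range query against each and plotting the results together.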
OpenTSDB is a master-less, horizontally scalable system that uses HBase to
store time series data. HBase is an open-source, distributed, non-relational
database modeled after Google’s Bigtable. It features
low-latency, high throughput, consistent operations that are atomic at the row
level, fault tolerance, and load balancing. Thanks to those key features, it
becomes possible to easily store significant amounts of time series data.
By choosing an appropriate schema and using efficient algorithms, millions of
data points from arbitrary time series can be retrieved and graphed quickly.
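As a rough illustration of why the schema matters in a store that sorts rows by key, the sketch below builds a row key from a numeric metric identifier followed by a timestamp rounded down to a fixed span. The layout, the one-hour span, and the function names are assumptions made for this example and should not be read as OpenTSDB's actual on-disk format. The point is only that when keys sort by metric and then by time, all points for one series over a time range sit in adjacent rows and can be read with a single sequential scan.

```python
import struct

# Assumed for this sketch: each row holds one hour of data for one metric.
ROW_SPAN_SECONDS = 3600


def row_key(metric_id, timestamp):
    """Build a sortable row key: 4-byte metric id + 4-byte base timestamp."""
    base = timestamp - (timestamp % ROW_SPAN_SECONDS)
    # Big-endian packing makes byte-wise ordering match numeric ordering.
    return struct.pack(">II", metric_id, base)


def column_offset(timestamp):
    """Offset of the data point within its row, in seconds."""
    return timestamp % ROW_SPAN_SECONDS


# Two points in the same hour share a row; a later hour sorts after it.
same_row = row_key(1, 7200) == row_key(1, 7265)
ordered = row_key(1, 7200) < row_key(1, 10800)
```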
OpenTSDB offers a simple yet powerful query interface that allows custom
graphs to be generated over arbitrary time periods and with an unprecedented
OpenTSDB has been in use at StumbleUpon for almost a year and has played a key role in helping operations and engineering teams understand the behavior and performance of our systems, troubleshoot production issues, provide significant supporting material for postmortems, and do capacity planning and trend analysis. We constantly collect many hundreds of metrics and hundreds to thousands of data points per second.
Benoit Sigoure is a software engineer with a strong UNIX/Linux background. His domains of interest include (but are not limited to):
Prior to managing StumbleUpon’s infrastructure, Benoit was part of the site reliability team running Google’s planetary-scale ad serving systems (for both AdWords and AdSense).