In this talk, we describe using Redis, an open source, in-memory key/value store, to capture a high volume of continuous data from numerous remote environmental sensors while consistently querying the same database for real-time monitoring and analytics in a production environment. NoSQL data stores have gained widespread attention in recent years, in part due to the flexibility and efficiency they offer when working with high volumes of data without the overhead of traditional structured database systems. As these technologies mature, their potential application to big data collection and analytics continues to grow.
The two biggest I/O bottlenecks in distributed applications are network I/O and filesystem I/O. Our particular use case required large numbers of remote client deployments in which we had no control over network infrastructure, so we were always at the mercy of network latency. However, we found we were able to avoid the filesystem I/O bottleneck by using an in-memory database for incoming data, enabling us to scale data collection rates to meet requirements. Our use case required not only that large volumes of data be collected continually, but also that the data be collected in small 300-byte chunks, resulting in a proportionally large number of inserts per second. We chose Redis, a popular open source, in-memory key/value store, to collect all incoming data from our various remote deployments. We found that Redis was not only capable of handling data collection at a high rate, but was also able to serve real-time analytics queries simultaneously, a task that traditional databases proved incapable of when tested within our system.
In implementing such a system, there are some important factors to consider, e.g.:
We will walk through our system architecture, highlighting design choices made based on the above considerations, with a specific focus on considerations that may be at odds with each other, such as designing a data model to meet both collection efficiency and real-time analytics needs. We will also present lessons learned through our production deployments and provide an introspective view of our solutions, along with proposed enhancements for future iterations and divergent requirements.
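As a rough illustration of the tension between collection efficiency and real-time analytics mentioned above, the following minimal sketch builds the Redis commands one might issue for each small reading. It is not the schema used in the talk: the key names (`sensor:<id>:readings`, `sensor:<id>:latest`), the helper functions, and the choice of a sorted set plus a hash are all assumptions made for illustration. The commands are returned as plain tuples rather than sent to a server, so the sketch runs without a Redis instance.

```python
import time


def ingest_commands(sensor_id, payload, ts=None):
    """Return Redis commands for storing one small sensor reading.

    Hypothetical data model, not taken from the talk.
    """
    ts = time.time() if ts is None else ts
    # Prefix the member with the timestamp so identical payloads
    # received at different times remain distinct set members.
    member = f"{ts}:{payload}"
    return [
        # Sorted set scored by timestamp: supports time-window queries.
        ("ZADD", f"sensor:{sensor_id}:readings", ts, member),
        # Hash holding only the newest reading: cheap monitoring lookups.
        ("HSET", f"sensor:{sensor_id}:latest", "ts", ts, "payload", payload),
    ]


def window_query(sensor_id, start, end):
    """Return the command that fetches readings with start <= ts <= end."""
    return ("ZRANGEBYSCORE", f"sensor:{sensor_id}:readings", start, end)
```

Writing two keys per reading doubles the write cost but lets a monitoring view read the latest value without scanning the history, which is one way such a trade-off could be resolved. In a real deployment the two writes would likely be batched into a single round trip.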
Aaron is a software engineer currently located in Pittsburgh, PA. He received his Ph.D. in 2007, developing algorithms and software for 3D medical image analysis. He currently leads a software development team at Carnegie Mellon University, focusing on web application development and cloud systems.
Aaron is a polyglot programmer, with a keen interest in open source technologies. Some favorite technologies at the moment include Node.js, Python/Django, MongoDB, and Redis.
Tim celebrates software development using many languages and frameworks, paying less heed to past experience when choosing technologies. Spring MVC, Hibernate, Rails, .NET MVC, Django, and the variety of languages that come with them are in his L1 cache. Among other endeavors to keep him sharp, he currently provides coded solutions for the Software Engineering Institute at CMU.
Tim received a B.S. in Computer Engineering in 2003 and resides in Pittsburgh, PA.