Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK
Please log in

Infinite retention using storage offloading with Apache Pulsar

Karthik Ramasamy (Streamlio), Ivan Kelly (Streamlio)
17:2518:05 Wednesday, 1 May 2019
Data Engineering and Architecture
Location: Capital Suite 4

Messaging systems are an essential part of any real time analytics engine. A common pattern is to feed a user event stream into a processing engine, show the result to the user, capture feedback from the user, push the feedback back into the event stream, and so on. The quality of the result shown to the user is often a function of the amount of data in the event stream, so the more your event stream scales, the better you can serve your users.

Messaging systems have recently started to push into the field of long term data storage and event stores, where you cannot compromise on retention. If data is written to the system, it must stay there.

Infinite retention can be challenging for a messaging system. As data grows for a single topic, you need to start storing different parts of the backlog on different sets of machines without losing consistency.

In this talk I will describe how Pulsar uses segment oriented architecture. It provides a unit of consensus called a ledger. Pulsar strings together a number of ledgers to build the complete topic backlog. Each ledger in the topic backlog is independent of all previous ledgers with regards to location. This allows us to scale the size of the topic retention simply by adding more machines. When the storage node is added to a Pulsar cluster, the brokers will detect it, and gradually start writing new data to the new node. There’s no disruptive rebalancing operation necessary.

Of course adding more machines will eventually get very expensive. This is where tiered storage comes in. With tiered storage, parts of the topic backlog can be moved to cheaper storage such as Amazon S3 or Google Cloud Storage. We will discuss the architecture of tiered storage, and how it is a natural continuation of Pulsar’s segment oriented architecture.

Finally, if you start storing data for a long time in Pulsar, you may want a means to query it. We will introduce our SQL implementation, based on the Presto query engine, which allows users to easily query topic backlog data, without having do read the whole thing.

Photo of Karthik Ramasamy

Karthik Ramasamy

Streamlio

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); worked briefly on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper. He’s the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin–Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Photo of Ivan Kelly

Ivan Kelly

Streamlio

Ivan Kelly is a software engineer at Streamlio, a startup dedicated to providing a next-generation integrated real-time stream processing solution, based on Heron, Apache Pulsar (incubating), and Apache BookKeeper. Ivan has been active in Apache BookKeeper since its very early days as a project in Yahoo! Research Barcelona. Specializing in replicated logging and transaction processing, he is currently focused on Streamlio’s storage layer.