Skip to main content

Making Data Move: Stream Processing in Go

Matvey Arye (Princeton University/Cloudflare), Albert Strasheim (CloudFlare)
Hadoop and Beyond
GA Ballroom J
Average rating: ****.
(4.00, 1 rating)

Big-data is evolving. The state of the art has gone from running large
batch queries over static data sets updated rarely to handling
high-velocity data with low processing latency. We present a new data
processing framework that is geared at processing data with a very
high update frequency such as clickstreams or server web request logs.
Predefined standing queries are answered quickly as data arrives. The
framework is written in the Go language and thus can take advantage of
Go’s advanced concurrency primitives and extensibility. We concentrate
on programmability and provide an API that is easy to use and develop
on. We also integrate several storage mechanisms, one of which allows
the system to efficiently maintain aggregates of stream data in a
cluster of PostgreSQL servers.

Details:

Several stream processing systems have emerged in the last few years:
Twitter’s Storm and Rainbird, Google’s MillWheel, etc. These systems
reflect a real need to process high-velocity data in a way that
provides low-latency results. We provide a new streaming system Go
that concentrates on efficiency and scalability:

Parallelization:

- Operators are defined as function closures that the system can
execute in parallel using data parallelism for tasks that are
processing intensive
- The system can parallelize execution even when tuple order needs to
be preserved

Scalability:

- We include operators to distribute data to other nodes in a consistent manner.
- Abstractions can be integrated with cluster management software such
as Zookeeper.

Also setting this system apart from others is the storage integration.
We have created a data cube abstraction that closely mirrors the cube
abstraction used in online analytical processing (OLAP). This
abstraction allows users to define aggregate and group-by clauses and
have the system maintain aggregates over their streaming data. Unlike
traditional OLAP, our data cubes are maintained incrementally as data
arrives, allowing for low latency results.

In our talk, we will describe the features of our system, explain our
system interface, and show some example data processing pipelines.

Matvey Arye

Graduate Student, Princeton University/Cloudflare

Mat is currently a Ph.D. student at Princeton University under the supervision of Michael J. Freedman. Previously, Mat completed his undergraduate degree in general engineering at The Cooper Union in 2005. His research interests are in building scalable distributed systems for analyzing high-velocity data, programmability, security, and networking.

Photo of Albert Strasheim

Albert Strasheim

Systems Gopher at Large, CloudFlare

Albert Strasheim is a Systems Engineer at CloudFlare. His passion is big data processing and he’s excited to be working on log processing and analytics at CloudFlare. He has been using Go since its initial public release, but has also done a lot of work with Java, Python and C++. As a new resident of San Francisco, he’s looking forward to surfing up and down the California coast, which is really the reason he came to San Francisco in the first place.

Prior to CloudFlare, Albert studied at the University of Stellenbosch in South Africa, where he focused on Computer Science and Electronic Engineering. He did post-graduate research on speaker recognition systems and worked on them for Agnitio in Madrid, Spain. In South Africa, Albert worked at a variety of companies, including image processing, storage systems, databases and telecommunications.