Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Embeddable data transformation for real-time streams

Joey Echeverria (Rocana)
11:50am–12:30pm Thursday, 03/31/2016
IoT and Real-time

Location: 210 D/H
Tags: real-time
Average rating: *****
(5.00, 2 ratings)

Prerequisite knowledge

Participants should have a basic understanding of real-time data systems such as Apache Kafka, Apache Spark, Apache Flink, and Apache Storm.


Real-time stream analysis starts with ingesting raw data and extracting structured records. While stream-processing frameworks such as Apache Spark and Apache Storm provide primitives for processing individual records, processing windows of records, and grouping/joining records, common actions such as filtering, applying regular expressions to extract data, and converting records from one schema to another are left to developers writing business logic.

Joey Echeverria presents an alternative approach based on a reusable library that provides configuration-based data transformation. This allows users to write common data-transformation rules once and reuse them in multiple contexts. A common pattern is to consume a single, raw stream and transform it using the same rules before storing it in different repositories such as Apache Solr for search and Apache Hadoop HDFS for deep storage.
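As a rough illustration of what configuration-based transformation can look like, the following Python sketch applies one rule list to a raw stream and reuses it for two destinations. It is not the library presented in the session: the rule format, the action names (filter, extract, rename), and the names RULES, apply_rules, and transform_stream are all hypothetical, and the sample records are made up.

```python
import re

# Hypothetical rule set; in practice this might be loaded from a config file.
RULES = [
    # Keep only records whose message mentions sshd.
    {"action": "filter", "field": "message", "contains": "sshd"},
    # Pull structured fields out of the raw message with a regular expression.
    {"action": "extract", "field": "message",
     "pattern": r"Failed password for (?P<user>\S+) from (?P<ip>\S+)"},
    # Convert field names from the source schema to the target schema.
    {"action": "rename", "mapping": {"ts": "timestamp", "ip": "source_ip"}},
]


def apply_rules(record, rules):
    """Apply each rule in order; return the new record, or None if filtered out."""
    out = dict(record)  # work on a copy so the input record is not mutated
    for rule in rules:
        action = rule["action"]
        if action == "filter":
            if rule["contains"] not in out.get(rule["field"], ""):
                return None
        elif action == "extract":
            match = re.search(rule["pattern"], out.get(rule["field"], ""))
            if match:
                out.update(match.groupdict())
        elif action == "rename":
            for old, new in rule["mapping"].items():
                if old in out:
                    out[new] = out.pop(old)
        else:
            raise ValueError(f"unknown action: {action}")
    return out


def transform_stream(records, rules):
    """Lazily transform a stream of records, skipping filtered ones."""
    for record in records:
        result = apply_rules(record, rules)
        if result is not None:
            yield result


# Made-up sample input standing in for a raw stream.
raw = [
    {"ts": 1459450200,
     "message": "sshd[812]: Failed password for root from 10.0.0.5 port 22"},
    {"ts": 1459450201, "message": "cron[77]: job started"},
]

# The same rules feed two hypothetical sinks (e.g. a search index and an archive).
for_search = list(transform_stream(raw, RULES))
for_archive = list(transform_stream(raw, RULES))
```

Because the rules are plain data rather than code, the same list can be handed to any job that consumes the raw stream, which is the reuse the abstract describes.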

Topics include:

  • An overview of the stream-processing landscape
  • The common analysis phases in stream-processing applications (filter, extract, transform, group, aggregate, and join)
  • Existing solutions for data transformation in a streaming context
  • The solution for common data filtering, transformation, and extraction
  • The extensibility of this solution with new, custom transformation actions that can be driven by configuration
  • How to use the transformation library for log analytics and IT operations events
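One way to picture configuration-driven extensibility is a registry that maps action names to functions, so a new action becomes available to configuration as soon as it is registered. The Python sketch below is an assumption for illustration only, not the API of the library discussed in the session; ACTIONS, action, run, lowercase, and mask_ip are hypothetical names.

```python
# Hypothetical registry mapping action names (as used in config) to functions.
ACTIONS = {}


def action(name):
    """Decorator that registers a transformation function under a config name."""
    def register(func):
        ACTIONS[name] = func
        return func
    return register


@action("lowercase")
def lowercase(record, field):
    """Built-in style action: lowercase the value of one field."""
    out = dict(record)
    if field in out:
        out[field] = str(out[field]).lower()
    return out


@action("mask_ip")
def mask_ip(record, field):
    """Custom action: replace the last octet of a dotted IPv4 string with 'x'."""
    out = dict(record)
    if field in out:
        parts = out[field].split(".")
        if len(parts) == 4:
            out[field] = ".".join(parts[:3] + ["x"])
    return out


def run(record, config):
    """Apply each configured step in order, looking actions up by name."""
    for step in config:
        params = {k: v for k, v in step.items() if k != "action"}
        record = ACTIONS[step["action"]](record, **params)
    return record


# Configuration refers to actions only by name, including the custom one.
CONFIG = [
    {"action": "lowercase", "field": "host"},
    {"action": "mask_ip", "field": "source_ip"},
]

event = {"host": "WEB-01", "source_ip": "10.0.0.5"}
result = run(event, CONFIG)
```

In this sketch, adding a transformation means writing one decorated function; existing configurations and the run loop stay unchanged.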

Joey Echeverria


Joey Echeverria is the director of engineering at Rocana, where he builds applications for scaling IT operations on the Apache Hadoop platform. Joey is a committer on the Kite SDK, an Apache-licensed data API for the Hadoop ecosystem. Joey was previously a software engineer at Cloudera, where he contributed to several ASF projects including Apache Flume, Apache Sqoop, Apache Hadoop, and Apache HBase. Joey is also a coauthor of Hadoop Security, published by O’Reilly.