Real-time stream analysis starts with ingesting raw data and extracting structured records. While stream-processing frameworks such as Apache Spark and Apache Storm provide primitives for processing individual records, processing windows of records, and grouping and joining records, common tasks such as filtering, applying regular expressions to extract data, and converting records from one schema to another are left to developers writing business logic.
Joey Echeverria presents an alternative approach based on a reusable library that provides configuration-based data transformation. This allows users to write common data-transformation rules once and reuse them in multiple contexts. A common pattern is to consume a single raw stream and transform it using the same rules before storing it in different repositories, such as Apache Solr for search and Apache Hadoop HDFS for deep storage.
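To make the idea of configuration-based transformation concrete, here is a minimal Python sketch. It is not the library presented in the session: the rule format, the `apply_rule` and `transform` helpers, and the sample log lines are all hypothetical, chosen only to show rules written once as data and reused for two destinations.

```python
import re
from typing import Dict, Iterable, Iterator, List, Optional

Record = Dict[str, str]

# Hypothetical rule set, written once as plain data rather than code.
RULES: List[dict] = [
    # Keep only records whose body mentions ERROR.
    {"op": "filter", "field": "body", "contains": "ERROR"},
    # Pull structured fields out of the raw line with a regular expression.
    {"op": "extract", "field": "body",
     "pattern": r"(?P<ts>\S+) (?P<level>[A-Z]+) (?P<message>.*)"},
    # Convert to the target schema by renaming fields.
    {"op": "rename", "mapping": {"ts": "timestamp", "message": "msg"}},
]


def apply_rule(rule: dict, record: Record) -> Optional[Record]:
    """Apply one rule; return None when the record should be dropped."""
    op = rule["op"]
    if op == "filter":
        value = record.get(rule["field"], "")
        return record if rule["contains"] in value else None
    if op == "extract":
        match = re.match(rule["pattern"], record.get(rule["field"], ""))
        if match is None:
            return None
        return {**record, **match.groupdict()}
    if op == "rename":
        mapping = rule["mapping"]
        return {mapping.get(key, key): value for key, value in record.items()}
    raise ValueError(f"unknown rule op: {op!r}")


def transform(records: Iterable[Record], rules: List[dict]) -> Iterator[Record]:
    """Run every record through the rules in order, skipping dropped records."""
    for record in records:
        current: Optional[Record] = record
        for rule in rules:
            current = apply_rule(rule, current)
            if current is None:
                break
        if current is not None:
            yield current


# Made-up sample input standing in for a single raw stream.
raw_stream = [
    {"body": "2016-03-29T10:00:00Z ERROR disk full"},
    {"body": "2016-03-29T10:00:01Z INFO heartbeat"},
]

# The same rules feed two destinations (plain lists stand in for
# a search index and a deep-storage writer).
search_index = list(transform(raw_stream, RULES))
deep_storage = list(transform(raw_stream, RULES))
```

Because the rules are data, they could in principle be loaded from a shared configuration file and applied unchanged in each consumer, which is the reuse the abstract describes.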
Joey Echeverria is the director of engineering at Rocana, where he builds applications for scaling IT operations built on the Apache Hadoop platform. Joey is a committer on the Kite SDK, an Apache-licensed data API for the Hadoop ecosystem. Joey was previously a software engineer at Cloudera, where he contributed to several ASF projects, including Apache Flume, Apache Sqoop, Apache Hadoop, and Apache HBase. Joey is also a coauthor of Hadoop Security, published by O’Reilly.
©2016, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.