Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Developing streaming applications with Apache Apex

David Yan (DataTorrent, Inc.)
2:40pm–3:20pm Thursday, March 16, 2017
Stream processing and analytics
Location: LL21 E/F
Level: Intermediate
Secondary topics: Streaming

Who is this presentation for?

  • Big data application engineers

Prerequisite knowledge

  • Basic familiarity with big data use cases, application development using frameworks such as MapReduce, Spark, and Storm, and the Hadoop ecosystem in general

What you'll learn

  • How to develop fault-tolerant, high-throughput, low-latency streaming applications using Apache Apex


David Yan offers an overview of Apache Apex, a stream processing engine used in production by several large companies for real-time data analytics.

Apache Apex uses a programming paradigm based on a directed acyclic graph (DAG). Each node in the DAG represents an operator, which can be data input, data output, or data transformation. Each directed edge in the DAG represents a stream, which is the flow of data from one operator to another.
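
To make the operator-and-stream model concrete, here is a minimal sketch in plain Python. It is not the Apex API (Apex applications are written in Java against Apex's own interfaces); the ToyDag class, its methods, and the operator names are all hypothetical and exist only to illustrate nodes as operators and directed edges as streams.

```python
from collections import defaultdict, deque


class ToyDag:
    """Toy DAG: nodes are operators, directed edges are streams (not the Apex API)."""

    def __init__(self):
        self.operators = {}               # name -> callable(item) -> list of outputs
        self.streams = defaultdict(list)  # upstream operator -> downstream operators

    def add_operator(self, name, fn):
        self.operators[name] = fn

    def add_stream(self, source, sink):
        self.streams[source].append(sink)

    def run(self, input_name, tuples):
        """Push each tuple through the graph, one operator at a time."""
        pending = deque((input_name, t) for t in tuples)
        while pending:
            name, item = pending.popleft()
            for out in self.operators[name](item):
                for sink in self.streams[name]:
                    pending.append((sink, out))


collected = []

dag = ToyDag()
dag.add_operator("lines", lambda line: [line])                        # data input
dag.add_operator("split", lambda line: line.split())                  # transformation
dag.add_operator("upper", lambda word: [word.upper()])                # transformation
dag.add_operator("sink", lambda word: collected.append(word) or [])   # data output
dag.add_stream("lines", "split")
dag.add_stream("split", "upper")
dag.add_stream("upper", "sink")

dag.run("lines", ["apex streams data", "one tuple at a time"])
print(collected)
```

The sketch runs in a single process and ignores everything a real engine adds on top of the graph, such as distribution, checkpointing, and fault tolerance.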

As part of Apex, the Malhar library provides a suite of connector operators so that Apex applications can read from or write to various data sources. It also includes utility operators that are commonly used in streaming applications, such as parsers, deduplicators, and joins, as well as generic building blocks that facilitate scalable state management and checkpointing.

In addition to processing based on ingression time and processing time, Apex supports event-time windows and session windows. It also supports the windowing concepts detailed by Apache Beam (watermarks, allowed lateness, accumulation mode, triggering, and retraction), feedback loops in the DAG for iterative processing, and both at-least-once and end-to-end exactly-once processing guarantees. Apex provides various ways to fine-tune applications, such as operator partitioning, locality, and affinity.
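
As a rough illustration of how event-time windows, watermarks, and allowed lateness interact, the following plain-Python sketch counts events per tumbling window. It is a simplified model, not Apex or Beam code: the window size, the lateness value, the function names, and the exact boundary comparison are assumptions chosen for the example, and triggers, accumulation modes, and retraction are left out.

```python
from collections import defaultdict

WINDOW_SIZE = 10       # seconds per tumbling event-time window (assumed value)
ALLOWED_LATENESS = 5   # seconds a window accepts data past its end (assumed value)


def window_start(event_time):
    """Assign an event to the tumbling window that contains its timestamp."""
    return event_time - (event_time % WINDOW_SIZE)


def count_by_event_time(events):
    """Count events per event-time window.

    `events` is a list of (kind, value) pairs in arrival order, where kind is
    "data" (value = event timestamp) or "watermark" (value = watermark time).
    Returns (counts per window start, timestamps dropped as too late).
    """
    counts = defaultdict(int)
    dropped = []
    watermark = float("-inf")
    for kind, value in events:
        if kind == "watermark":
            watermark = max(watermark, value)  # watermarks only move forward
            continue
        start = window_start(value)
        if start + WINDOW_SIZE + ALLOWED_LATENESS <= watermark:
            dropped.append(value)   # window end plus lateness has already passed
        else:
            counts[start] += 1      # on time, or late but within allowed lateness
    return dict(counts), dropped


events = [
    ("data", 1), ("data", 7), ("data", 12),
    ("watermark", 12),
    ("data", 4),          # late for window [0, 10) but within allowed lateness
    ("watermark", 16),
    ("data", 9),          # too late: window [0, 10) expired at watermark 15
    ("data", 18),
]
counts, dropped = count_by_event_time(events)
print(counts, dropped)
```

Here the event with timestamp 4 arrives after the watermark has passed the end of its window yet is still counted, while the event with timestamp 9 arrives after the allowed lateness has run out and is dropped.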

Apex is integrated with several open source projects, including Apache Beam, Apache Samoa (distributed machine learning), and Apache Calcite (SQL-based application specification). Users can choose Apex as the backend engine when running their application model based on these projects.

David explains how to develop fault-tolerant streaming applications with low latency and high throughput using Apex, presenting the programming model with examples and demonstrating how custom business logic can be integrated using both the declarative high-level API and the compositional DAG-level API.

David Yan

DataTorrent, Inc.

David Yan is an Apache Apex PMC member and an architect at DataTorrent. Previously, David worked on the Ad Systems, Yahoo Finance, and groups at Yahoo and the Artificial Intelligence group at the Jet Propulsion Laboratory. David holds an MS in computer science from Stanford University and a BS in electrical engineering and computer science from the University of California, Berkeley.