Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Stream analytics with SQL on Apache Flink

Fabian Hueske (data Artisans)
4:35pm5:15pm Wednesday, September 27, 2017
Data Engineering & Architecture, Stream processing and analytics
Location: 1E 07/08 Level: Intermediate
Secondary topics:  Streaming
Average rating: ****.
(4.00, 1 rating)

Who is this presentation for?

  • Data and software engineers

Prerequisite knowledge

  • A basic undertanding of SQL and the fundamentals of stream processing (windows, event processing time, etc.)

What you'll learn

  • Explore Apache Flink's relational APIs and their semantics
  • Learn how the APIs allows users to run the same query on batch and streaming data and the use cases Flink's relational APIs address

Description

SQL is undoubtedly the most widely used language for data analytics. It is declarative; many database systems and query processors feature advanced query optimizers and highly efficient execution engines; and last but not least, it is the standard that everybody knows and uses. With stream processing technology becoming mainstream a question arises: “Why isn’t SQL widely supported by open source stream processors?” One answer is that SQL’s semantics and syntax have not been designed with the characteristics of streaming data in mind. Consequently, systems that want to provide support for SQL on data streams have to overcome a conceptual gap.

Apache Flink is a distributed stream processing system. Due to its support for event-time processing, exactly once state semantics, and its high-throughput capabilities, Flink is very well suited for streaming analytics. Flink features two relational APIs for unified stream and batch processing: the Table API and SQL. The Table API is a language-integrated relational API, and the SQL interface is compliant with standard SQL. Both APIs are semantically compatible and share the same optimization and execution path based on Apache Calcite. A core principle of both APIs is to provide the same semantics for batch and streaming data sources, meaning that a query should compute the same result regardless whether it was executed on a static dataset, such as a file, or on a data stream, like a Kafka topic.

Fabian Hueske offers an overview of the semantics of Apache Flink’s relational APIs for stream analytics. Both APIs are centered around the concept of dynamic tables, which are defined on data streams and behave like regular database tables. Queries on dynamic tables produce new dynamic tables, which are similar to materialized views as known from relational database systems. Fabian demonstrates how dynamic tables are defined on streams, how dynamic tables are queried, and how the results are converted back into changelog streams or written as materialized views to external systems, such as Apache Kafka or Apache Cassandra, and updated in place with low latency.

Photo of Fabian Hueske

Fabian Hueske

data Artisans

Fabian Hueske is a committer and PMC member of the Apache Flink project. He was one of the three original authors of the Stratosphere research system, from which Apache Flink was forked in 2014. Fabian is a cofounder of data Artisans, a Berlin-based startup devoted to fostering Flink, where he works as a software engineer and contributes to Apache Flink. He holds a PhD in computer science from TU Berlin and is currently spending a lot of his time writing a book, Stream Processing with Apache Flink.