SQL is undoubtedly the most widely used language for data analytics. It is declarative; many database systems and query processors feature advanced query optimizers and highly efficient execution engines; and last but not least, it is the standard that everybody knows and uses. With stream processing technology becoming mainstream a question arises: “Why isn’t SQL widely supported by open source stream processors?” One answer is that SQL’s semantics and syntax have not been designed with the characteristics of streaming data in mind. Consequently, systems that want to provide support for SQL on data streams have to overcome a conceptual gap.
Apache Flink is a distributed stream processing system. Due to its support for event-time processing, exactly once state semantics, and its high-throughput capabilities, Flink is very well suited for streaming analytics. Flink features two relational APIs for unified stream and batch processing: the Table API and SQL. The Table API is a language-integrated relational API, and the SQL interface is compliant with standard SQL. Both APIs are semantically compatible and share the same optimization and execution path based on Apache Calcite. A core principle of both APIs is to provide the same semantics for batch and streaming data sources, meaning that a query should compute the same result regardless whether it was executed on a static dataset, such as a file, or on a data stream, like a Kafka topic.
Fabian Hueske offers an overview of the semantics of Apache Flink’s relational APIs for stream analytics. Both APIs are centered around the concept of dynamic tables, which are defined on data streams and behave like regular database tables. Queries on dynamic tables produce new dynamic tables, which are similar to materialized views as known from relational database systems. Fabian demonstrates how dynamic tables are defined on streams, how dynamic tables are queried, and how the results are converted back into changelog streams or written as materialized views to external systems, such as Apache Kafka or Apache Cassandra, and updated in place with low latency.
Fabian Hueske is a committer and PMC member of the Apache Flink project. He was one of the three original authors of the Stratosphere research system, from which Apache Flink was forked in 2014. Fabian is a cofounder of data Artisans, a Berlin-based startup devoted to fostering Flink, where he works as a software engineer and contributes to Apache Flink. He holds a PhD in computer science from TU Berlin and is currently spending a lot of his time writing a book, Stream Processing with Apache Flink.
©2017, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com