Spark Structured Streaming and Streaming Queries¶
Spark Structured Streaming (Structured Streaming or Spark Streams) is the module of Apache Spark for stream processing using streaming queries.
Streaming queries can be expressed using a high-level declarative streaming API (Dataset API) or good ol' SQL (SQL over stream / streaming SQL). The declarative streaming Dataset API and SQL are executed on the underlying highly-optimized Spark SQL engine.
The semantics of the Structured Streaming model is as follows (see the article Structured Streaming In Apache Spark):
At any time, the output of a continuous application is equivalent to executing a batch job on a prefix of the data.
Stream Execution Engines¶
Spark Structured Streaming comes with two stream execution engines for executing streaming queries:
Features¶
- Streaming Aggregation
- Streaming Join
- Streaming Watermark
- Arbitrary Stateful Streaming Aggregation
- Stateful Stream Processing
- Many more
Streaming Datasets¶
Structured Streaming introduces the concept of Streaming Datasets that are infinite datasets with one or more SparkDataStreams.
A Dataset
is streaming when its logical plan is streaming.
val batchQuery = spark.
read. // <-- batch non-streaming query
csv("sales")
assert(batchQuery.isStreaming == false)
val streamingQuery = spark.
readStream. // <-- streaming query
format("rate").
load
assert(streamingQuery.isStreaming)
Structured Streaming models a stream of data as an infinite (and hence continuous) table that could be changed every streaming batch.
Output Modes¶
You can specify output mode of a streaming dataset which is what gets written to a streaming sink (i.e. the infinite result table) when there is a new data available.
References¶
Articles¶
- SPARK-8360 Structured Streaming (aka Streaming DataFrames)
- The official Structured Streaming Programming Guide
- Structured Streaming In Apache Spark
- What Spark's Structured Streaming really means
Videos¶
- The Future of Real Time in Spark from Spark Summit East 2016 in which Reynold Xin presents the concept of Streaming DataFrames
- Structuring Spark: DataFrames, Datasets, and Streaming
- A Deep Dive Into Structured Streaming by Tathagata "TD" Das from Spark Summit 2016
- Arbitrary Stateful Aggregations in Structured Streaming in Apache Spark by Burak Yavuz