Spark Structured Streaming and Streaming Queries

Spark Structured Streaming (Structured Streaming or Spark Streams) is the module of Apache Spark for stream processing using streaming queries.

Streaming queries can be expressed using a high-level declarative streaming API (Dataset API) or good ol' SQL (SQL over stream / streaming SQL). The declarative streaming Dataset API and SQL are executed on the underlying highly-optimized Spark SQL engine.
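
For example, the same stream can be queried in either style. The following is a minimal sketch (assuming a spark-shell session with spark available, as in the other examples on this page; the events view name and the filter are for illustration only):

// Declarative Dataset API over a stream
val rates = spark.
  readStream.
  format("rate").
  load

// Good ol' SQL over the very same stream
rates.createOrReplaceTempView("events")
val evenValues = spark.sql("SELECT value FROM events WHERE value % 2 = 0")

assert(evenValues.isStreaming)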

The semantics of the Structured Streaming model is as follows (see the article Structured Streaming In Apache Spark):

At any time, the output of a continuous application is equivalent to executing a batch job on a prefix of the data.

Stream Execution Engines

Spark Structured Streaming comes with two stream execution engines for executing streaming queries:

MicroBatchExecution for Micro-Batch Stream Processing (the default)
ContinuousExecution for Continuous Stream Processing
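
The engine of a streaming query is selected by the query's trigger. A minimal sketch of both (the rate source and the console sink are for illustration only):

import org.apache.spark.sql.streaming.Trigger

// MicroBatchExecution (the default engine), triggered every 5 seconds
val microBatch = spark.
  readStream.
  format("rate").
  load.
  writeStream.
  format("console").
  trigger(Trigger.ProcessingTime("5 seconds")).
  start

// ContinuousExecution, with a 1-second checkpoint interval
val continuous = spark.
  readStream.
  format("rate").
  load.
  writeStream.
  format("console").
  trigger(Trigger.Continuous("1 second")).
  start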

Features

Streaming Datasets

Structured Streaming introduces the concept of Streaming Datasets that are infinite datasets with one or more SparkDataStreams.

A Dataset is streaming when its logical plan is streaming.

val batchQuery = spark.
  read. // <-- batch non-streaming query
  csv("sales")

assert(batchQuery.isStreaming == false)

val streamingQuery = spark.
  readStream. // <-- streaming query
  format("rate").
  load

assert(streamingQuery.isStreaming)
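
The isStreaming flag is simply a property of the query's logical plan, which can be inspected directly (reusing the queries from the example above):

// Dataset.isStreaming merely asks the logical plan
assert(streamingQuery.queryExecution.logical.isStreaming)
assert(batchQuery.queryExecution.logical.isStreaming == false)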

Structured Streaming models a stream of data as an infinite (and hence continuous) table that can change with every streaming batch.

Output Modes

You can specify the output mode of a streaming Dataset, which defines what part of the infinite result table gets written out to a streaming sink when there is new data available.
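
The supported output modes are Append, Complete and Update. A minimal sketch of a streaming aggregation with the complete output mode (the rate source, the console sink and the bucketing expression are for illustration only):

import spark.implicits._

// A streaming aggregation: counts per bucket over the rate stream
val counts = spark.
  readStream.
  format("rate").
  load.
  groupBy($"value" % 10 as "bucket").
  count

// Complete mode writes out the whole result table on every trigger
val sq = counts.
  writeStream.
  format("console").
  outputMode("complete").
  start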

References

Articles

Structured Streaming In Apache Spark (Databricks Engineering Blog)