Spark Structured Streaming and Streaming Queries

Spark Structured Streaming (Structured Streaming or Spark Streams) is the module of Apache Spark for stream processing using streaming queries.

Streaming queries can be expressed using a high-level declarative streaming API (Dataset API) or good ol' SQL (SQL over stream / streaming SQL). The declarative streaming Dataset API and SQL are executed on the underlying highly-optimized Spark SQL engine.
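
For example, the same stream can be queried in either style. The following is a minimal sketch (assuming a spark-shell session with spark available, as in the other examples on this page; the events view name and the filter are for illustration only):

// Declarative Dataset API over a stream
val rates = spark.
  readStream.
  format("rate").
  load

// Good ol' SQL over the very same stream
rates.createOrReplaceTempView("events")
val evenValues = spark.sql("SELECT value FROM events WHERE value % 2 = 0")

assert(evenValues.isStreaming)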

The semantics of the Structured Streaming model is as follows (see the article Structured Streaming In Apache Spark):

At any time, the output of a continuous application is equivalent to executing a batch job on a prefix of the data.

Stream Execution Engines

Spark Structured Streaming comes with two stream execution engines for executing streaming queries:

MicroBatchExecution for Micro-Batch Stream Processing (the default)
ContinuousExecution for Continuous Stream Processing
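
The engine of a streaming query is selected by the query's trigger. A minimal sketch of both (the rate source and the console sink are for illustration only):

import org.apache.spark.sql.streaming.Trigger

// MicroBatchExecution (the default engine), triggered every 5 seconds
val microBatch = spark.
  readStream.
  format("rate").
  load.
  writeStream.
  format("console").
  trigger(Trigger.ProcessingTime("5 seconds")).
  start

// ContinuousExecution, with a 1-second checkpoint interval
val continuous = spark.
  readStream.
  format("rate").
  load.
  writeStream.
  format("console").
  trigger(Trigger.Continuous("1 second")).
  start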

Features

Streaming Datasets

Structured Streaming introduces the concept of Streaming Datasets that are infinite datasets with one or more SparkDataStreams.

A Dataset is streaming when its logical plan is streaming.

val batchQuery = spark.
  read. // <-- batch non-streaming query
  csv("sales")

assert(batchQuery.isStreaming == false)

val streamingQuery = spark.
  readStream. // <-- streaming query
  format("rate").
  load

assert(streamingQuery.isStreaming)
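
The isStreaming flag is simply a property of the query's logical plan, which can be inspected directly (reusing the queries from the example above):

// Dataset.isStreaming merely asks the logical plan
assert(streamingQuery.queryExecution.logical.isStreaming)
assert(batchQuery.queryExecution.logical.isStreaming == false)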

Structured Streaming models a stream of data as an infinite (and hence continuous) table that can change with every streaming batch.

Output Modes

You can specify the output mode of a streaming Dataset, which defines what part of the infinite result table gets written out to a streaming sink when there is new data available.
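
The supported output modes are Append, Complete and Update. A minimal sketch of a streaming aggregation with the complete output mode (the rate source, the console sink and the bucketing expression are for illustration only):

import spark.implicits._

// A streaming aggregation: counts per bucket over the rate stream
val counts = spark.
  readStream.
  format("rate").
  load.
  groupBy($"value" % 10 as "bucket").
  count

// Complete mode writes out the whole result table on every trigger
val sq = counts.
  writeStream.
  format("console").
  outputMode("complete").
  start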

References

Articles

Structured Streaming In Apache Spark (Databricks Engineering Blog)