Skip to content

Streaming Join

In Spark Structured Streaming, a streaming join is a structured query that is described (built) using the high-level streaming operators:

Streaming joins can be stateless or stateful:

Stream-Stream Joins

Spark Structured Streaming supports stream-stream joins with the following:

Stream-stream equi-joins are planned as StreamingSymmetricHashJoinExec physical operators of two ShuffleExchangeExec physical operators (per Required Partition Requirements).

Learn more in Demo: Stream-Stream Inner Join.

Join State Watermark

Stream-stream joins may have an optional Join State Watermark defined for state removal (cf. Watermark Predicates for State Removal).

A join state watermark can be specified on the following:

  1. Join keys (key state)

  2. Columns of the left and right sides (value state)

A join state watermark can be specified on key state, value state or both.

IncrementalExecution

Under the covers, the high-level operators create a logical query plan with one or more Join logical operators.

Spark Structured Streaming uses IncrementalExecution for planning streaming queries for execution.

At query planning, IncrementalExecution uses the StreamingJoinStrategy execution planning strategy for planning stream-stream joins as StreamingSymmetricHashJoinExec physical operators.

Further Reading Or Watching