Streaming Join¶
In Spark Structured Streaming, a streaming join is a structured query that is described (built) using the high-level streaming operators:
- SQL's JOIN clause
Streaming joins can be stateless or stateful:
- Joins of a streaming query and a batch query (stream-static joins) are stateless and require no state management
- Joins of two streaming queries (stream-stream joins) are stateful and require streaming state (with an optional join state watermark for state removal)
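The difference between the two kinds of joins can be sketched with a toy model in plain Python. This is not Spark code: `join_batch`, the sample rows, and the idea of representing a micro-batch as a list of `(key, value)` pairs are all assumptions made for illustration. The sketch joins each micro-batch in isolation, which is enough when one side is static, but misses matches when both sides are streams and matching rows arrive in different micro-batches.

```python
def join_batch(left_rows, right_rows):
    """Inner equi-join of two lists of (key, value) pairs.

    Hypothetical helper: it remembers nothing between calls (stateless).
    """
    right_by_key = {}
    for key, value in right_rows:
        right_by_key.setdefault(key, []).append(value)
    return [
        (key, left_value, right_value)
        for key, left_value in left_rows
        for right_value in right_by_key.get(key, [])
    ]


# Stream-static: the static side is fully available for every micro-batch,
# so joining each micro-batch on its own loses nothing.
static_side = [(1, "one"), (2, "two")]
stream_batches = [[(1, "a")], [(2, "b")]]
stream_static = [join_batch(batch, static_side) for batch in stream_batches]

# Stream-stream without state: the left row arrives in the first micro-batch,
# its matching right row in the second, so they never meet.
left_batches = [[(1, "a")], []]
right_batches = [[], [(1, "x")]]
stateless_stream_stream = [
    join_batch(left, right) for left, right in zip(left_batches, right_batches)
]

print(stream_static)
print(stateless_stream_stream)
```

The second result is empty in every micro-batch, which is why a stream-stream join has to buffer rows from earlier micro-batches as streaming state.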
Stream-Stream Joins¶
Spark Structured Streaming supports stream-stream joins with the following:
- Equality predicate (equi-joins that use only equality comparisons in the join predicate)
- Inner, LeftOuter, and RightOuter join types only
Stream-stream equi-joins are planned as StreamingSymmetricHashJoinExec physical operators of two ShuffleExchangeExec physical operators (per Required Partition Requirements).
Learn more in Demo: Stream-Stream Inner Join.
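The general idea behind a symmetric hash join can be sketched as a toy model in plain Python. It is only an illustration of the technique and not how StreamingSymmetricHashJoinExec is implemented: the class `ToySymmetricHashJoin`, its `process` method, and the in-memory dictionaries standing in for state stores are all hypothetical. Every incoming row is stored in the state of its own side and then matched against the state of the opposite side.

```python
from collections import defaultdict


class ToySymmetricHashJoin:
    """Toy inner equi-join of two streams (a sketch, not Spark's operator)."""

    def __init__(self):
        # One buffer per side: join key -> values seen so far on that side.
        self.state = {"left": defaultdict(list), "right": defaultdict(list)}

    def process(self, side, key, value):
        """Buffer the row on its own side and return matches from the other side."""
        other = "right" if side == "left" else "left"
        self.state[side][key].append(value)
        matches = self.state[other].get(key, [])
        if side == "left":
            return [(key, value, match) for match in matches]
        return [(key, match, value) for match in matches]


join = ToySymmetricHashJoin()
out1 = join.process("left", 1, "a")   # nothing on the right side yet
out2 = join.process("right", 1, "x")  # matches the buffered left row
out3 = join.process("left", 1, "b")   # matches the buffered right row
out4 = join.process("right", 2, "y")  # no left row with key 2

print(out1, out2, out3, out4)
```

In this sketch the buffers only ever grow, since nothing is removed from either side.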
Join State Watermark¶
Stream-stream joins may have an optional Join State Watermark defined for state removal (cf. Watermark Predicates for State Removal).
A join state watermark can be specified on the following:
- Join keys (key state)
- Columns of the left and right sides (value state)
A join state watermark can be specified on key state, value state or both.
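The two flavors of state removal can be pictured with a toy model in plain Python. Everything here is an assumption made for illustration: the function names, the dictionary layout of the buffered state, the integer event times, and the choice to drop entries at or below the watermark. How Spark derives the actual watermark predicates from the join condition is not shown.

```python
def evict_by_value_watermark(state, watermark):
    """Drop buffered rows whose event time (kept in the value) is at or below the watermark.

    `state` maps a join key to a list of (event_time, payload) rows.
    """
    kept = {}
    for key, rows in state.items():
        fresh = [(event_time, payload) for event_time, payload in rows
                 if event_time > watermark]
        if fresh:
            kept[key] = fresh
    return kept


def evict_by_key_watermark(state, watermark):
    """Drop whole keys when the join key itself is an event time (e.g. a window start)."""
    return {key: rows for key, rows in state.items() if key > watermark}


# Value state: event time lives in the buffered rows.
value_state = {
    "ad-1": [(10, "impression"), (25, "impression")],
    "ad-2": [(5, "impression")],
}
after_value_eviction = evict_by_value_watermark(value_state, watermark=10)

# Key state: the join key is an event-time value.
key_state = {0: ["a"], 10: ["b"], 20: ["c"]}
after_key_eviction = evict_by_key_watermark(key_state, watermark=10)

print(after_value_eviction)
print(after_key_eviction)
```

With a key watermark a whole key and all its rows can be dropped at once, while a value watermark requires inspecting the individual rows buffered under each key.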
IncrementalExecution¶
Under the covers, the high-level operators create a logical query plan with one or more Join logical operators.
Spark Structured Streaming uses IncrementalExecution for planning streaming queries for execution.
At query planning, IncrementalExecution uses the StreamingJoinStrategy execution planning strategy for planning stream-stream joins as StreamingSymmetricHashJoinExec physical operators.
Further Reading Or Watching¶
- Stream-stream Joins in the official documentation of Apache Spark for Structured Streaming
- Introducing Stream-Stream Joins in Apache Spark 2.3 by Databricks
- (video) Deep Dive into Stateful Stream Processing in Structured Streaming by Tathagata Das