Streaming Join¶
[[operators]] In Spark Structured Streaming, a streaming join is a streaming query that was described (build) using the high-level streaming operators:
-
SQL's
JOIN
clause
Streaming joins can be stateless or <
-
Joins of a streaming query and a batch query (stream-static joins) are stateless and no state management is required
-
Joins of two streaming queries (<
>) are stateful and require streaming state (with an optional < >).
Stream-Stream Joins¶
Spark Structured Streaming supports stream-stream joins with the following:
-
Equality predicate (equi-joins that use only equality comparisons in the join predicate)
-
Inner
,LeftOuter
, andRightOuter
<>
Stream-stream equi-joins are planned as <ShuffleExchangeExec
physical operators (per <
=== [[join-state-watermark]] Join State Watermark for State Removal
Stream-stream joins may optionally define Join State Watermark for state removal (cf. <
A join state watermark can be specified on the following:
. <
. <
A join state watermark can be specified on key state, value state or both.
=== [[IncrementalExecution]] IncrementalExecution -- QueryExecution of Streaming Queries
Under the covers, the <Join
logical operators.
TIP: Read up on https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-LogicalPlan-Join.html[Join Logical Operator] in https://bit.ly/spark-sql-internals[The Internals of Spark SQL] online book.
In Spark Structured Streaming IncrementalExecution is responsible for planning streaming queries for execution.
At query planning, IncrementalExecution
uses the StreamingJoinStrategy execution planning strategy for planning stream-stream joins as StreamingSymmetricHashJoinExec physical operators.
Demos¶
Further Reading Or Watching¶
-
Stream-stream Joins in the official documentation of Apache Spark for Structured Streaming
-
Introducing Stream-Stream Joins in Apache Spark 2.3 by Databricks
-
(video) Deep Dive into Stateful Stream Processing in Structured Streaming by Tathagata Das