Skip to content

Storage-Partitioned Joins

Storage-Partitioned Join (SPJ) is a new type of join in Spark SQL that uses the existing storage layout for a partitioned join to avoid expensive shuffles (similarly to Bucketing).

Note

Storage-Partitioned Joins feature was added in Apache Spark 3.3.0 ([SPARK-37375] Umbrella: Storage Partitioned Join (SPJ)).

Storage-Partitioned Join is based on KeyGroupedPartitioning to determine partitions.

Out of the available built-in DataSourceV2ScanExecBase physical operators, only BatchScanExec supports storage-partitioned joins.

Storage-Partitioned Join is meant for Spark SQL connectors (yet there are none built-in at the moment).

Storage-Partitioned Join was proposed in this SPIP.

Note

It appears that SortMergeJoinExec and ShuffledHashJoinExec physical operator are the only candidates for Storage-Partitioned Joins.

Configuration Properties

Apache Iceberg

Storage-Partitioned Join is supported in Apache Iceberg 1.2.0:

Added support for storage partition joins to improve read and write performance (#6371)

Delta Lake

Storage-Partitioned Join is not supported in Delta Lake yet (as per this feature request).

Learn More

  1. What's new in Apache Spark 3.3 - joins by Bartosz Konieczny
  2. (video) Storage-Partitioned Join for Apache Spark
  3. (video) Eliminating Shuffles in Delete Update, and Merge