Storage-Partitioned Joins¶
Storage-Partitioned Join (SPJ) is a new type of join in Spark SQL that uses the existing storage layout for a partitioned join to avoid expensive shuffles (similarly to Bucketing).
Note
Storage-Partitioned Joins feature was added in Apache Spark 3.3.0 ([SPARK-37375] Umbrella: Storage Partitioned Join (SPJ)).
Storage-Partitioned Join is based on KeyGroupedPartitioning to determine partitions.
Out of the available built-in DataSourceV2ScanExecBase physical operators, only BatchScanExec supports storage-partitioned joins.
Storage-Partitioned Join is meant for Spark SQL connectors (yet there are none built-in at the moment).
Storage-Partitioned Join was proposed in this SPIP.
Note
It appears that SortMergeJoinExec and ShuffledHashJoinExec physical operator are the only candidates for Storage-Partitioned Joins.
Configuration Properties¶
- spark.sql.sources.v2.bucketing.enabled
- spark.sql.sources.v2.bucketing.pushPartValues.enabled
- spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled
Apache Iceberg¶
Storage-Partitioned Join is supported in Apache Iceberg 1.2.0:
Added support for storage partition joins to improve read and write performance (#6371)
Delta Lake¶
Storage-Partitioned Join is not supported in Delta Lake yet (as per this feature request).
Learn More¶
- What's new in Apache Spark 3.3 - joins by Bartosz Konieczny
- (video) Storage-Partitioned Join for Apache Spark
- (video) Eliminating Shuffles in Delete Update, and Merge