JoinSelection Execution Planning Strategy¶
JoinSelection is an execution planning strategy for Join logical operators.
JoinSelection is part of the strategies of the SparkPlanner.
Join Selection Priorities¶
Join Selection Requirements¶
The following sections are in the order of preference.
Danger
These sections have to be reviewed for correctness.
JoinSelection considers join physical operators per whether join keys are used or not:
- If used,
JoinSelectionconsiders BroadcastHashJoinExec, ShuffledHashJoinExec or SortMergeJoinExec operators - Otherwise,
JoinSelectionconsiders BroadcastNestedLoopJoinExec or CartesianProductExec
BroadcastHashJoinExec¶
JoinSelection plans a BroadcastHashJoinExec when there are join keys and one of the following holds:
-
Join type is CROSS, INNER, LEFT ANTI, LEFT OUTER, LEFT SEMI or ExistenceJoin
-
Join type is CROSS, INNER or RIGHT OUTER
BroadcastHashJoinExec is created for ExtractEquiJoinKeys-destructurable logical query plans (INNER, CROSS, LEFT OUTER, LEFT SEMI, LEFT ANTI) of which the right physical operator can be broadcast.
ShuffledHashJoinExec¶
JoinSelection plans a ShuffledHashJoinExec when there are join keys and one of the following holds:
-
spark.sql.join.preferSortMergeJoin is disabled, the join type is CROSS, INNER, LEFT ANTI, LEFT OUTER, LEFT SEMI or ExistenceJoin
-
spark.sql.join.preferSortMergeJoin is disabled, the join type is CROSS, INNER or RIGHT OUTER
-
Left join keys are not orderable
SortMergeJoinExec¶
JoinSelection plans a SortMergeJoinExec when the left join keys are orderable.
BroadcastNestedLoopJoinExec¶
JoinSelection plans a BroadcastNestedLoopJoinExec when there are no join keys and one of the following holds:
-
Join type is CROSS, INNER, LEFT ANTI, LEFT OUTER, LEFT SEMI or ExistenceJoin
-
Join type is CROSS, INNER or RIGHT OUTER
CartesianProductExec¶
JoinSelection plans a CartesianProductExec when there are no join keys and join type is CROSS or INNER
BroadcastNestedLoopJoinExec¶
JoinSelection plans a BroadcastNestedLoopJoinExec when no other join operator has matched already
Executing Rule¶
apply(
plan: LogicalPlan): Seq[SparkPlan]
apply is part of the GenericStrategy abstraction.
apply is made up of three parts (each with its own Scala extractor object to destructure the input LogicalPlan):
ExtractEquiJoinKeys¶
apply uses ExtractEquiJoinKeys to match on Join logical operators with EqualTo and EqualNullSafe condition predicate expressions.
apply does the following (in the order until a join physical operator has been determined):
- createBroadcastHashJoin (based on the hints only)
- With a hintToSortMergeJoin defined, createSortMergeJoin
- createShuffleHashJoin (based on the hints only)
- With a hintToShuffleReplicateNL defined, createCartesianProduct
- createJoinWithoutHint
ExtractSingleColumnNullAwareAntiJoin¶
apply uses ExtractSingleColumnNullAwareAntiJoin to match on Join logical operators.
For every Join operator, apply creates a BroadcastHashJoinExec physical operator with the following:
- LeftAnti join type
BuildRightbuild side- Undefined join condition expressions
- isNullAwareAntiJoin flag enabled (
true)
Other Joins¶
apply determines the desired build side. For InnerLike and FullOuter join types, apply getSmallerSide. For all other join types, apply canBuildBroadcastLeft and prefers BuildLeft over BuildRight.
In the end, apply does the following (in the order until a join physical operator has been determined):
- createBroadcastNLJoin (based on the hints only for the left and right side)
- With a hintToShuffleReplicateNL defined, createCartesianProduct
- createJoinWithoutHint
canBroadcastBySize¶
canBroadcastBySize(
plan: LogicalPlan,
conf: SQLConf): Boolean
canBroadcastBySize is enabled (true) when the size of the table (the given LogicalPlan) is small for a broadcast join (between 0 and the spark.sql.autoBroadcastJoinThreshold configuration property inclusive).
canBuildBroadcastLeft¶
canBuildBroadcastLeft(
joinType: JoinType): Boolean
canBuildBroadcastLeft is enabled (true) for InnerLike and RightOuter join types.
canBuildLocalHashMapBySize¶
canBuildLocalHashMapBySize(
plan: LogicalPlan,
conf: SQLConf): Boolean
canBuildLocalHashMapBySize is enabled (true) when the size of the table (the given LogicalPlan) is small for a shuffle hash join (below the spark.sql.autoBroadcastJoinThreshold configuration property multiplied by the configured number of shuffle partitions).
Creating BroadcastHashJoinExec¶
createBroadcastHashJoin(
onlyLookingAtHint: Boolean): Option[Seq[BroadcastHashJoinExec]]
createBroadcastHashJoin determines a BroadcastBuildSide and, if successful, creates a BroadcastHashJoinExec.
Creating BroadcastNestedLoopJoinExec¶
createBroadcastNLJoin(
buildLeft: Boolean,
buildRight: Boolean): Option[Seq[BroadcastNestedLoopJoinExec]]
createBroadcastNLJoin creates a BroadcastNestedLoopJoinExec when at least one of the buildLeft or buildRight flags are enabled.
Creating ShuffledHashJoinExec¶
createShuffleHashJoin(
onlyLookingAtHint: Boolean): Option[Seq[ShuffledHashJoinExec]]
createShuffleHashJoin tries to determine the BuildSide for a ShuffleHashJoinExec and, if successful, creates a ShuffledHashJoinExec.
createShuffleHashJoin is used when:
JoinSelectionis requested to createJoinWithoutHint (withonlyLookingAtHintdisabled) and execute (withonlyLookingAtHintenabled)
Creating SortMergeJoinExec¶
createSortMergeJoin(): Option[Seq[SortMergeJoinExec]]
createSortMergeJoin creates a SortMergeJoinExec if the left keys are orderable.
createCartesianProduct¶
createCartesianProduct(): Option[Seq[CartesianProductExec]]
createCartesianProduct creates a CartesianProductExec for InnerLike join type.
createJoinWithoutHint¶
createJoinWithoutHint(): Seq[BaseJoinExec]
createJoinWithoutHint...FIXME
Build Side¶
getBuildSide(
canBuildLeft: Boolean,
canBuildRight: Boolean,
left: LogicalPlan,
right: LogicalPlan): Option[BuildSide]
getBuildSide is the following (in the order):
- The smaller side of the left and right operators when the
canBuildLeftandcanBuildRightflags are both enabled (true) BuildLeftforcanBuildLeftflag enabledBuildRightforcanBuildRightflag enabled- Undefined (
None)
getBroadcastBuildSide¶
getBroadcastBuildSide(
left: LogicalPlan,
right: LogicalPlan,
joinType: JoinType,
hint: JoinHint,
hintOnly: Boolean,
conf: SQLConf): Option[BuildSide]
getBroadcastBuildSide determines if build on the left side (buildLeft). With hintOnly enabled (true), getBroadcastBuildSide hintToBroadcastLeft. Otherwise, getBroadcastBuildSide checks if canBroadcastBySize and not hintToNotBroadcastLeft.
getBroadcastBuildSide determines if build on the right side (buildRight). With hintOnly enabled (true), getBroadcastBuildSide hintToBroadcastRight. Otherwise, getBroadcastBuildSide checks if canBroadcastBySize and not hintToNotBroadcastRight.
In the end, getBroadcastBuildSide getBuildSide with the following:
- canBuildBroadcastLeft for the given
JoinTypeand thebuildLeftflag - canBuildBroadcastRight for the given
JoinTypeand thebuildRightflag - Left physical operator
- Right physical operator
BuildSide for ShuffleHashJoinExec¶
getShuffleHashJoinBuildSide(
left: LogicalPlan,
right: LogicalPlan,
joinType: JoinType,
hint: JoinHint,
hintOnly: Boolean,
conf: SQLConf): Option[BuildSide]
getShuffleHashJoinBuildSide determines if to build on the left side (buildLeft):
- With
hintOnlyenabled (true),getShuffleHashJoinBuildSidehintToShuffleHashJoinLeft - Otherwise,
getShuffleHashJoinBuildSidechecks if canBuildLocalHashMapBySize and the left operator is muchSmaller than the right
getShuffleHashJoinBuildSide determines if to build on the right side (buildRight):
- With
hintOnlyenabled (true),getShuffleHashJoinBuildSidehintToShuffleHashJoinRight - Otherwise,
getShuffleHashJoinBuildSidechecks if canBuildLocalHashMapBySize and the right operator is muchSmaller than the left
In the end, getShuffleHashJoinBuildSide tries to determine the BuildSide based on the following:
- Checks for BuildLeft for the given JoinType and the
buildLeftflag - canBuildShuffledHashJoinRight for the given
JoinTypeand thebuildRightflag - Left physical operator
- Right physical operator
Smaller Side¶
getSmallerSide(
left: LogicalPlan,
right: LogicalPlan): BuildSide
getSmallerSide is BuildLeft unless the size of the right table (the given right LogicalPlan) is not larger than the size of the left table (the given left LogicalPlan). Otherwise, getSmallerSide is BuildRight.
hintToBroadcastLeft¶
hintToBroadcastLeft(
hint: JoinHint): Boolean
hintToBroadcastLeft is enabled (true) when the given JoinHint has BROADCAST, BROADCASTJOIN or MAPJOIN hints associated with the left operator.
hintToBroadcastRight¶
hintToBroadcastRight(
hint: JoinHint): Boolean
hintToBroadcastRight is enabled (true) when the given JoinHint has BROADCAST, BROADCASTJOIN or MAPJOIN hints associated with the right operator.
hintToNotBroadcastLeft¶
hintToNotBroadcastLeft(
hint: JoinHint): Boolean
hintToNotBroadcastLeft is enabled (true) when the given JoinHint has the internal NO_BROADCAST_HASH hint associated with the left operator (to discourage broadcast hash join).
hintToNotBroadcastRight¶
hintToNotBroadcastRight(
hint: JoinHint): Boolean
hintToNotBroadcastRight is enabled (true) when the given JoinHint has the internal NO_BROADCAST_HASH hint associated with the right operator (to discourage broadcast hash join).
hintToShuffleHashJoinLeft¶
hintToShuffleHashJoinLeft(
hint: JoinHint): Boolean
hintToShuffleHashJoinLeft is enabled (true) when the given JoinHint has SHUFFLE_HASH hint associated with the left operator.
hintToShuffleHashJoinRight¶
hintToShuffleHashJoinRight(
hint: JoinHint): Boolean
hintToShuffleHashJoinRight is enabled (true) when the given JoinHint has SHUFFLE_HASH hint associated with the right operator.
hintToSortMergeJoin¶
hintToSortMergeJoin(
hint: JoinHint): Boolean
hintToSortMergeJoin...FIXME
hintToShuffleReplicateNL¶
hintToShuffleReplicateNL(
hint: JoinHint): Boolean
hintToShuffleReplicateNL...FIXME
muchSmaller¶
muchSmaller(
a: LogicalPlan,
b: LogicalPlan): Boolean
muchSmaller is enabled (true) when the size of the left table (the given a LogicalPlan) is at least 3 times smaller than the size of the right table (the given b LogicalPlan).
Checking Left BuildSide for ShuffledHashJoin¶
canBuildShuffledHashJoinLeft(
joinType: JoinType): Boolean
canBuildShuffledHashJoinLeft is enabled (true) for the given JoinType among the following:
- InnerLikes: Inner and Cross
- RightOuter
- FullOuter