JoinSelection Execution Planning Strategy¶
JoinSelection
is an execution planning strategy for Join logical operators.
JoinSelection
is part of the strategies of the SparkPlanner.
Join Selection Priorities¶
Join Selection Requirements¶
The following sections are in the order of preference.
Danger
These sections have to be reviewed for correctness.
JoinSelection
considers join physical operators per whether join keys are used or not:
- If used,
JoinSelection
considers BroadcastHashJoinExec, ShuffledHashJoinExec or SortMergeJoinExec operators - Otherwise,
JoinSelection
considers BroadcastNestedLoopJoinExec or CartesianProductExec
BroadcastHashJoinExec¶
JoinSelection
plans a BroadcastHashJoinExec when there are join keys and one of the following holds:
-
Join type is CROSS, INNER, LEFT ANTI, LEFT OUTER, LEFT SEMI or ExistenceJoin
-
Join type is CROSS, INNER or RIGHT OUTER
BroadcastHashJoinExec
is created for ExtractEquiJoinKeys-destructurable logical query plans (INNER, CROSS, LEFT OUTER, LEFT SEMI, LEFT ANTI) of which the right
physical operator can be broadcast.
ShuffledHashJoinExec¶
JoinSelection
plans a ShuffledHashJoinExec when there are join keys and one of the following holds:
-
spark.sql.join.preferSortMergeJoin is disabled, the join type is CROSS, INNER, LEFT ANTI, LEFT OUTER, LEFT SEMI or ExistenceJoin
-
spark.sql.join.preferSortMergeJoin is disabled, the join type is CROSS, INNER or RIGHT OUTER
-
Left join keys are not orderable
SortMergeJoinExec¶
JoinSelection
plans a SortMergeJoinExec when the left join keys are orderable.
BroadcastNestedLoopJoinExec¶
JoinSelection
plans a BroadcastNestedLoopJoinExec when there are no join keys and one of the following holds:
-
Join type is CROSS, INNER, LEFT ANTI, LEFT OUTER, LEFT SEMI or ExistenceJoin
-
Join type is CROSS, INNER or RIGHT OUTER
CartesianProductExec¶
JoinSelection
plans a CartesianProductExec
when there are no join keys and join type is CROSS or INNER
BroadcastNestedLoopJoinExec¶
JoinSelection
plans a BroadcastNestedLoopJoinExec when no other join operator has matched already
Executing Rule¶
apply(
plan: LogicalPlan): Seq[SparkPlan]
apply
is part of the GenericStrategy abstraction.
apply
is made up of three parts (each with its own Scala extractor object to destructure the input LogicalPlan):
ExtractEquiJoinKeys¶
apply
uses ExtractEquiJoinKeys to match on Join logical operators with EqualTo and EqualNullSafe condition predicate expressions.
apply
does the following (in the order until a join physical operator has been determined):
- createBroadcastHashJoin (based on the hints only)
- With a hintToSortMergeJoin defined, createSortMergeJoin
- createShuffleHashJoin (based on the hints only)
- With a hintToShuffleReplicateNL defined, createCartesianProduct
- createJoinWithoutHint
ExtractSingleColumnNullAwareAntiJoin¶
apply
uses ExtractSingleColumnNullAwareAntiJoin to match on Join logical operators.
For every Join
operator, apply
creates a BroadcastHashJoinExec physical operator with the following:
- LeftAnti join type
BuildRight
build side- Undefined join condition expressions
- isNullAwareAntiJoin flag enabled (
true
)
Other Joins¶
apply
determines the desired build side. For InnerLike
and FullOuter
join types, apply
getSmallerSide. For all other join types, apply
canBuildBroadcastLeft and prefers BuildLeft
over BuildRight
.
In the end, apply
does the following (in the order until a join physical operator has been determined):
- createBroadcastNLJoin (based on the hints only for the left and right side)
- With a hintToShuffleReplicateNL defined, createCartesianProduct
- createJoinWithoutHint
canBroadcastBySize¶
canBroadcastBySize(
plan: LogicalPlan,
conf: SQLConf): Boolean
canBroadcastBySize
is enabled (true
) when the size of the table (the given LogicalPlan) is small for a broadcast join (between 0
and the spark.sql.autoBroadcastJoinThreshold configuration property inclusive).
canBuildBroadcastLeft¶
canBuildBroadcastLeft(
joinType: JoinType): Boolean
canBuildBroadcastLeft
is enabled (true
) for InnerLike
and RightOuter
join types.
canBuildLocalHashMapBySize¶
canBuildLocalHashMapBySize(
plan: LogicalPlan,
conf: SQLConf): Boolean
canBuildLocalHashMapBySize
is enabled (true
) when the size of the table (the given LogicalPlan) is small for a shuffle hash join (below the spark.sql.autoBroadcastJoinThreshold configuration property multiplied by the configured number of shuffle partitions).
Creating BroadcastHashJoinExec¶
createBroadcastHashJoin(
onlyLookingAtHint: Boolean): Option[Seq[BroadcastHashJoinExec]]
createBroadcastHashJoin
determines a BroadcastBuildSide and, if successful, creates a BroadcastHashJoinExec.
Creating BroadcastNestedLoopJoinExec¶
createBroadcastNLJoin(
buildLeft: Boolean,
buildRight: Boolean): Option[Seq[BroadcastNestedLoopJoinExec]]
createBroadcastNLJoin
creates a BroadcastNestedLoopJoinExec when at least one of the buildLeft
or buildRight
flags are enabled.
Creating ShuffledHashJoinExec¶
createShuffleHashJoin(
onlyLookingAtHint: Boolean): Option[Seq[ShuffledHashJoinExec]]
createShuffleHashJoin
tries to determine the BuildSide for a ShuffleHashJoinExec and, if successful, creates a ShuffledHashJoinExec.
createShuffleHashJoin
is used when:
JoinSelection
is requested to createJoinWithoutHint (withonlyLookingAtHint
disabled) and execute (withonlyLookingAtHint
enabled)
Creating SortMergeJoinExec¶
createSortMergeJoin(): Option[Seq[SortMergeJoinExec]]
createSortMergeJoin
creates a SortMergeJoinExec if the left keys are orderable.
createCartesianProduct¶
createCartesianProduct(): Option[Seq[CartesianProductExec]]
createCartesianProduct
creates a CartesianProductExec
for InnerLike
join type.
createJoinWithoutHint¶
createJoinWithoutHint(): Seq[BaseJoinExec]
createJoinWithoutHint
...FIXME
Build Side¶
getBuildSide(
canBuildLeft: Boolean,
canBuildRight: Boolean,
left: LogicalPlan,
right: LogicalPlan): Option[BuildSide]
getBuildSide
is the following (in the order):
- The smaller side of the left and right operators when the
canBuildLeft
andcanBuildRight
flags are both enabled (true
) BuildLeft
forcanBuildLeft
flag enabledBuildRight
forcanBuildRight
flag enabled- Undefined (
None
)
getBroadcastBuildSide¶
getBroadcastBuildSide(
left: LogicalPlan,
right: LogicalPlan,
joinType: JoinType,
hint: JoinHint,
hintOnly: Boolean,
conf: SQLConf): Option[BuildSide]
getBroadcastBuildSide
determines if build on the left side (buildLeft
). With hintOnly
enabled (true
), getBroadcastBuildSide
hintToBroadcastLeft. Otherwise, getBroadcastBuildSide
checks if canBroadcastBySize and not hintToNotBroadcastLeft.
getBroadcastBuildSide
determines if build on the right side (buildRight
). With hintOnly
enabled (true
), getBroadcastBuildSide
hintToBroadcastRight. Otherwise, getBroadcastBuildSide
checks if canBroadcastBySize and not hintToNotBroadcastRight.
In the end, getBroadcastBuildSide
getBuildSide with the following:
- canBuildBroadcastLeft for the given
JoinType
and thebuildLeft
flag - canBuildBroadcastRight for the given
JoinType
and thebuildRight
flag - Left physical operator
- Right physical operator
BuildSide for ShuffleHashJoinExec¶
getShuffleHashJoinBuildSide(
left: LogicalPlan,
right: LogicalPlan,
joinType: JoinType,
hint: JoinHint,
hintOnly: Boolean,
conf: SQLConf): Option[BuildSide]
getShuffleHashJoinBuildSide
determines if to build on the left side (buildLeft
):
- With
hintOnly
enabled (true
),getShuffleHashJoinBuildSide
hintToShuffleHashJoinLeft - Otherwise,
getShuffleHashJoinBuildSide
checks if canBuildLocalHashMapBySize and the left operator is muchSmaller than the right
getShuffleHashJoinBuildSide
determines if to build on the right side (buildRight
):
- With
hintOnly
enabled (true
),getShuffleHashJoinBuildSide
hintToShuffleHashJoinRight - Otherwise,
getShuffleHashJoinBuildSide
checks if canBuildLocalHashMapBySize and the right operator is muchSmaller than the left
In the end, getShuffleHashJoinBuildSide
tries to determine the BuildSide based on the following:
- Checks for BuildLeft for the given JoinType and the
buildLeft
flag - canBuildShuffledHashJoinRight for the given
JoinType
and thebuildRight
flag - Left physical operator
- Right physical operator
Smaller Side¶
getSmallerSide(
left: LogicalPlan,
right: LogicalPlan): BuildSide
getSmallerSide
is BuildLeft
unless the size of the right table (the given right
LogicalPlan) is not larger than the size of the left table (the given left
LogicalPlan). Otherwise, getSmallerSide
is BuildRight
.
hintToBroadcastLeft¶
hintToBroadcastLeft(
hint: JoinHint): Boolean
hintToBroadcastLeft
is enabled (true
) when the given JoinHint
has BROADCAST
, BROADCASTJOIN
or MAPJOIN
hints associated with the left operator.
hintToBroadcastRight¶
hintToBroadcastRight(
hint: JoinHint): Boolean
hintToBroadcastRight
is enabled (true
) when the given JoinHint
has BROADCAST
, BROADCASTJOIN
or MAPJOIN
hints associated with the right operator.
hintToNotBroadcastLeft¶
hintToNotBroadcastLeft(
hint: JoinHint): Boolean
hintToNotBroadcastLeft
is enabled (true
) when the given JoinHint
has the internal NO_BROADCAST_HASH
hint associated with the left operator (to discourage broadcast hash join).
hintToNotBroadcastRight¶
hintToNotBroadcastRight(
hint: JoinHint): Boolean
hintToNotBroadcastRight
is enabled (true
) when the given JoinHint
has the internal NO_BROADCAST_HASH
hint associated with the right operator (to discourage broadcast hash join).
hintToShuffleHashJoinLeft¶
hintToShuffleHashJoinLeft(
hint: JoinHint): Boolean
hintToShuffleHashJoinLeft
is enabled (true
) when the given JoinHint
has SHUFFLE_HASH
hint associated with the left operator.
hintToShuffleHashJoinRight¶
hintToShuffleHashJoinRight(
hint: JoinHint): Boolean
hintToShuffleHashJoinRight
is enabled (true
) when the given JoinHint
has SHUFFLE_HASH
hint associated with the right operator.
hintToSortMergeJoin¶
hintToSortMergeJoin(
hint: JoinHint): Boolean
hintToSortMergeJoin
...FIXME
hintToShuffleReplicateNL¶
hintToShuffleReplicateNL(
hint: JoinHint): Boolean
hintToShuffleReplicateNL
...FIXME
muchSmaller¶
muchSmaller(
a: LogicalPlan,
b: LogicalPlan): Boolean
muchSmaller
is enabled (true
) when the size of the left table (the given a
LogicalPlan) is at least 3 times smaller than the size of the right table (the given b
LogicalPlan).
Checking Left BuildSide for ShuffledHashJoin¶
canBuildShuffledHashJoinLeft(
joinType: JoinType): Boolean
canBuildShuffledHashJoinLeft
is enabled (true
) for the given JoinType among the following:
- InnerLikes: Inner and Cross
- RightOuter
- FullOuter