Skip to content

JoinSelection Execution Planning Strategy

JoinSelection is an execution planning strategy for Join logical operators.

JoinSelection is part of the strategies of the SparkPlanner.

Join Selection Priorities

  1. Join Type
  2. Hints
  3. Size (Cost-Based Optimization, Statistics)

Join Selection Requirements

The following sections are in the order of preference.

Danger

These sections have to be reviewed for correctness.

JoinSelection considers join physical operators per whether join keys are used or not:

BroadcastHashJoinExec

JoinSelection plans a BroadcastHashJoinExec when there are join keys and one of the following holds:

BroadcastHashJoinExec is created for ExtractEquiJoinKeys-destructurable logical query plans (INNER, CROSS, LEFT OUTER, LEFT SEMI, LEFT ANTI) of which the right physical operator can be broadcast.

ShuffledHashJoinExec

JoinSelection plans a ShuffledHashJoinExec when there are join keys and one of the following holds:

SortMergeJoinExec

JoinSelection plans a SortMergeJoinExec when the left join keys are orderable.

BroadcastNestedLoopJoinExec

JoinSelection plans a BroadcastNestedLoopJoinExec when there are no join keys and one of the following holds:

CartesianProductExec

JoinSelection plans a CartesianProductExec when there are no join keys and join type is CROSS or INNER

BroadcastNestedLoopJoinExec

JoinSelection plans a BroadcastNestedLoopJoinExec when no other join operator has matched already

Executing Rule

apply(
  plan: LogicalPlan): Seq[SparkPlan]

apply is part of the GenericStrategy abstraction.

apply is made up of three parts (each with its own Scala extractor object to destructure the input LogicalPlan):

  1. ExtractEquiJoinKeys
  2. ExtractSingleColumnNullAwareAntiJoin
  3. Other Joins

ExtractEquiJoinKeys

apply uses ExtractEquiJoinKeys to match on Join logical operators with EqualTo and EqualNullSafe condition predicate expressions.

apply does the following (in the order until a join physical operator has been determined):

  1. createBroadcastHashJoin (based on the hints only)
  2. With a hintToSortMergeJoin defined, createSortMergeJoin
  3. createShuffleHashJoin (based on the hints only)
  4. With a hintToShuffleReplicateNL defined, createCartesianProduct
  5. createJoinWithoutHint

ExtractSingleColumnNullAwareAntiJoin

apply uses ExtractSingleColumnNullAwareAntiJoin to match on Join logical operators.

For every Join operator, apply creates a BroadcastHashJoinExec physical operator with the following:

Other Joins

apply determines the desired build side. For InnerLike and FullOuter join types, apply getSmallerSide. For all other join types, apply canBuildBroadcastLeft and prefers BuildLeft over BuildRight.

In the end, apply does the following (in the order until a join physical operator has been determined):

  1. createBroadcastNLJoin (based on the hints only for the left and right side)
  2. With a hintToShuffleReplicateNL defined, createCartesianProduct
  3. createJoinWithoutHint

canBroadcastBySize

canBroadcastBySize(
  plan: LogicalPlan,
  conf: SQLConf): Boolean

canBroadcastBySize is enabled (true) when the size of the table (the given LogicalPlan) is small for a broadcast join (between 0 and the spark.sql.autoBroadcastJoinThreshold configuration property inclusive).

canBuildBroadcastLeft

canBuildBroadcastLeft(
  joinType: JoinType): Boolean

canBuildBroadcastLeft is enabled (true) for InnerLike and RightOuter join types.

canBuildLocalHashMapBySize

canBuildLocalHashMapBySize(
  plan: LogicalPlan,
  conf: SQLConf): Boolean

canBuildLocalHashMapBySize is enabled (true) when the size of the table (the given LogicalPlan) is small for a shuffle hash join (below the spark.sql.autoBroadcastJoinThreshold configuration property multiplied by the configured number of shuffle partitions).

Creating BroadcastHashJoinExec

createBroadcastHashJoin(
  onlyLookingAtHint: Boolean): Option[Seq[BroadcastHashJoinExec]]

createBroadcastHashJoin determines a BroadcastBuildSide and, if successful, creates a BroadcastHashJoinExec.

Creating BroadcastNestedLoopJoinExec

createBroadcastNLJoin(
  buildLeft: Boolean,
  buildRight: Boolean): Option[Seq[BroadcastNestedLoopJoinExec]]

createBroadcastNLJoin creates a BroadcastNestedLoopJoinExec when at least one of the buildLeft or buildRight flags are enabled.

Creating ShuffledHashJoinExec

createShuffleHashJoin(
  onlyLookingAtHint: Boolean): Option[Seq[ShuffledHashJoinExec]]

createShuffleHashJoin tries to determine the BuildSide for a ShuffleHashJoinExec and, if successful, creates a ShuffledHashJoinExec.

createShuffleHashJoin is used when:

Creating SortMergeJoinExec

createSortMergeJoin(): Option[Seq[SortMergeJoinExec]]

createSortMergeJoin creates a SortMergeJoinExec if the left keys are orderable.

createCartesianProduct

createCartesianProduct(): Option[Seq[CartesianProductExec]]

createCartesianProduct creates a CartesianProductExec for InnerLike join type.

createJoinWithoutHint

createJoinWithoutHint(): Seq[BaseJoinExec]

createJoinWithoutHint...FIXME

Build Side

getBuildSide(
  canBuildLeft: Boolean,
  canBuildRight: Boolean,
  left: LogicalPlan,
  right: LogicalPlan): Option[BuildSide]

getBuildSide is the following (in the order):

  1. The smaller side of the left and right operators when the canBuildLeft and canBuildRight flags are both enabled (true)
  2. BuildLeft for canBuildLeft flag enabled
  3. BuildRight for canBuildRight flag enabled
  4. Undefined (None)

getBroadcastBuildSide

getBroadcastBuildSide(
  left: LogicalPlan,
  right: LogicalPlan,
  joinType: JoinType,
  hint: JoinHint,
  hintOnly: Boolean,
  conf: SQLConf): Option[BuildSide]

getBroadcastBuildSide determines if build on the left side (buildLeft). With hintOnly enabled (true), getBroadcastBuildSide hintToBroadcastLeft. Otherwise, getBroadcastBuildSide checks if canBroadcastBySize and not hintToNotBroadcastLeft.

getBroadcastBuildSide determines if build on the right side (buildRight). With hintOnly enabled (true), getBroadcastBuildSide hintToBroadcastRight. Otherwise, getBroadcastBuildSide checks if canBroadcastBySize and not hintToNotBroadcastRight.

In the end, getBroadcastBuildSide getBuildSide with the following:

BuildSide for ShuffleHashJoinExec

getShuffleHashJoinBuildSide(
  left: LogicalPlan,
  right: LogicalPlan,
  joinType: JoinType,
  hint: JoinHint,
  hintOnly: Boolean,
  conf: SQLConf): Option[BuildSide]

getShuffleHashJoinBuildSide determines if to build on the left side (buildLeft):

getShuffleHashJoinBuildSide determines if to build on the right side (buildRight):

In the end, getShuffleHashJoinBuildSide tries to determine the BuildSide based on the following:

Smaller Side

getSmallerSide(
  left: LogicalPlan,
  right: LogicalPlan): BuildSide

getSmallerSide is BuildLeft unless the size of the right table (the given right LogicalPlan) is not larger than the size of the left table (the given left LogicalPlan). Otherwise, getSmallerSide is BuildRight.

hintToBroadcastLeft

hintToBroadcastLeft(
  hint: JoinHint): Boolean

hintToBroadcastLeft is enabled (true) when the given JoinHint has BROADCAST, BROADCASTJOIN or MAPJOIN hints associated with the left operator.

hintToBroadcastRight

hintToBroadcastRight(
  hint: JoinHint): Boolean

hintToBroadcastRight is enabled (true) when the given JoinHint has BROADCAST, BROADCASTJOIN or MAPJOIN hints associated with the right operator.

hintToNotBroadcastLeft

hintToNotBroadcastLeft(
  hint: JoinHint): Boolean

hintToNotBroadcastLeft is enabled (true) when the given JoinHint has the internal NO_BROADCAST_HASH hint associated with the left operator (to discourage broadcast hash join).

hintToNotBroadcastRight

hintToNotBroadcastRight(
  hint: JoinHint): Boolean

hintToNotBroadcastRight is enabled (true) when the given JoinHint has the internal NO_BROADCAST_HASH hint associated with the right operator (to discourage broadcast hash join).

hintToShuffleHashJoinLeft

hintToShuffleHashJoinLeft(
  hint: JoinHint): Boolean

hintToShuffleHashJoinLeft is enabled (true) when the given JoinHint has SHUFFLE_HASH hint associated with the left operator.

hintToShuffleHashJoinRight

hintToShuffleHashJoinRight(
  hint: JoinHint): Boolean

hintToShuffleHashJoinRight is enabled (true) when the given JoinHint has SHUFFLE_HASH hint associated with the right operator.

hintToSortMergeJoin

hintToSortMergeJoin(
  hint: JoinHint): Boolean

hintToSortMergeJoin...FIXME

hintToShuffleReplicateNL

hintToShuffleReplicateNL(
  hint: JoinHint): Boolean

hintToShuffleReplicateNL...FIXME

muchSmaller

muchSmaller(
  a: LogicalPlan,
  b: LogicalPlan): Boolean

muchSmaller is enabled (true) when the size of the left table (the given a LogicalPlan) is at least 3 times smaller than the size of the right table (the given b LogicalPlan).

Checking Left BuildSide for ShuffledHashJoin

canBuildShuffledHashJoinLeft(
  joinType: JoinType): Boolean

canBuildShuffledHashJoinLeft is enabled (true) for the given JoinType among the following: