Skip to content

Distribution

Distribution is an abstraction of the data distribution requirements of physical operators.

Distribution is enforced by EnsureRequirements physical optimization.

Contract

Creating Partitioning

createPartitioning(
  numPartitions: Int): Partitioning

Creates a Partitioning for the given number of partitions

Used when:

Required Number of Partitions

requiredNumPartitions: Option[Int]

Required number of partitions for this data distribution

When defined, only Partitionings with the same number of partitions can satisfy the distribution requirement.

When undefined (None), indicates to use any number of partitions (possibly spark.sql.shuffle.partitions).

Used when:

Implementations

sealed abstract class

Distribution is a Scala sealed abstract class which means that all possible implementations (Distributions) are all in the same compilation unit (file).

Physical Operators' Distribution Requirements

Physical operators use Distributions to specify the required child distribution for every child operator.

The default Distributions are UnspecifiedDistributions for all the children.

Physical Operator Required Child Distribution
AdaptiveSparkPlanExec UnspecifiedDistribution or AQEUtils.getRequiredDistribution
BaseAggregateExec One of AllTuples, ClusteredDistribution and UnspecifiedDistribution
BroadcastHashJoinExec BroadcastDistribution with UnspecifiedDistribution or vice versa
BroadcastNestedLoopJoinExec BroadcastDistribution with UnspecifiedDistribution or vice versa
CoGroupExec HashClusteredDistributions
GlobalLimitExec AllTuples
ShuffledJoin UnspecifiedDistributions or HashClusteredDistributions
SortExec OrderedDistribution or UnspecifiedDistribution
others