Distribution¶
Distribution is an abstraction of the data distribution requirements of physical operators.
Distribution is enforced by the EnsureRequirements physical optimization.
Contract¶
Creating Partitioning¶
createPartitioning(
numPartitions: Int): Partitioning
Creates a Partitioning for the given number of partitions
Used when:
- EnsureRequirements physical optimization is executed
Required Number of Partitions¶
requiredNumPartitions: Option[Int]
Required number of partitions for this data distribution
When defined, only Partitionings with the same number of partitions can satisfy the distribution requirement.
When undefined (None), any number of partitions can be used (possibly the value of spark.sql.shuffle.partitions).
Used when:
- EnsureRequirements physical optimization is executed
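The two contract methods work together, which a small self-contained sketch may make easier to see. The types below are simplified stand-ins that only borrow Spark's names (keys are plain strings rather than Catalyst expressions), and partitioningFor is a hypothetical helper that approximates how a caller such as EnsureRequirements could pick the number of partitions. It is not Spark's actual code.

```scala
// Simplified stand-ins for Spark's Partitioning and Distribution (NOT the real classes).
object DistributionContractSketch {
  sealed trait Partitioning { def numPartitions: Int }
  final case class HashPartitioning(keys: Seq[String], numPartitions: Int) extends Partitioning
  case object SinglePartition extends Partitioning { val numPartitions = 1 }

  sealed trait Distribution {
    // None means "any number of partitions is acceptable"
    def requiredNumPartitions: Option[Int]
    // Builds a Partitioning that satisfies this distribution
    def createPartitioning(numPartitions: Int): Partitioning
  }

  final case class ClusteredDistribution(
      keys: Seq[String],
      requiredNumPartitions: Option[Int] = None) extends Distribution {
    def createPartitioning(numPartitions: Int): Partitioning = {
      // When a number is required, only that exact number is accepted
      require(
        requiredNumPartitions.forall(_ == numPartitions),
        s"Expected $requiredNumPartitions partitions, got $numPartitions")
      HashPartitioning(keys, numPartitions)
    }
  }

  case object AllTuples extends Distribution {
    // All rows have to end up in a single partition
    val requiredNumPartitions: Option[Int] = Some(1)
    def createPartitioning(numPartitions: Int): Partitioning = {
      require(numPartitions == 1, "AllTuples requires exactly one partition")
      SinglePartition
    }
  }

  // Hypothetical helper: use the required number of partitions when defined,
  // otherwise fall back to a default (think spark.sql.shuffle.partitions).
  def partitioningFor(d: Distribution, defaultShufflePartitions: Int): Partitioning = {
    val numPartitions = d.requiredNumPartitions.getOrElse(defaultShufflePartitions)
    d.createPartitioning(numPartitions)
  }
}
```

With a default of 200, a ClusteredDistribution without a required number of partitions yields a 200-partition HashPartitioning in this sketch, while one created with Some(8) yields 8 partitions regardless of the default.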
Implementations¶
sealed abstract class
Distribution is a Scala sealed abstract class, which means that all the direct implementations (Distributions) are in the same compilation unit (file).
- AllTuples
- BroadcastDistribution
- ClusteredDistribution
- HashClusteredDistribution
- OrderedDistribution
- UnspecifiedDistribution
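One practical consequence of a sealed hierarchy is that the Scala compiler can check pattern matches for exhaustiveness. The following sketch illustrates that with simplified stand-ins named after the implementations listed above. The fields (plain strings) and the describe function are assumptions made for illustration and do not mirror Spark's definitions.

```scala
// Simplified stand-ins (NOT Spark's classes); fields are reduced to plain strings.
object SealedDistributionSketch {
  sealed abstract class Distribution
  case object UnspecifiedDistribution extends Distribution
  case object AllTuples extends Distribution
  final case class BroadcastDistribution(mode: String) extends Distribution
  final case class ClusteredDistribution(keys: Seq[String]) extends Distribution
  final case class HashClusteredDistribution(keys: Seq[String]) extends Distribution
  final case class OrderedDistribution(ordering: Seq[String]) extends Distribution

  // Hypothetical function: because Distribution is sealed, the compiler
  // warns if one of the cases below is left out.
  def describe(d: Distribution): String = d match {
    case UnspecifiedDistribution         => "no requirement"
    case AllTuples                       => "all rows in a single partition"
    case BroadcastDistribution(mode)     => s"broadcast using $mode"
    case ClusteredDistribution(keys)     => s"clustered by ${keys.mkString(", ")}"
    case HashClusteredDistribution(keys) => s"hash-clustered by ${keys.mkString(", ")}"
    case OrderedDistribution(ordering)   => s"ordered by ${ordering.mkString(", ")}"
  }
}
```

Removing any case from describe makes the compiler report a non-exhaustive match, which is only possible because no implementation can be defined outside the file.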
Physical Operators' Distribution Requirements¶
Physical operators use Distributions to specify the required child distribution for every child operator.
The default Distributions are UnspecifiedDistributions for all the children.
Physical Operator | Required Child Distribution |
---|---|
AdaptiveSparkPlanExec | UnspecifiedDistribution or AQEUtils.getRequiredDistribution |
BaseAggregateExec | One of AllTuples, ClusteredDistribution and UnspecifiedDistribution |
BroadcastHashJoinExec | BroadcastDistribution with UnspecifiedDistribution or vice versa |
BroadcastNestedLoopJoinExec | BroadcastDistribution with UnspecifiedDistribution or vice versa |
CoGroupExec | HashClusteredDistributions |
GlobalLimitExec | AllTuples |
ShuffledJoin | UnspecifiedDistributions or HashClusteredDistributions |
SortExec | OrderedDistribution or UnspecifiedDistribution |
others |
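How an operator declares its requirements, and how the default applies to every child, can be sketched as follows. PhysicalOperator, Scan, Project, GlobalLimit and Sort are hypothetical, simplified stand-ins for Spark's physical operators, and the Distribution types are reduced models as well. The sketch mirrors the GlobalLimitExec and SortExec rows of the table above and assumes that a non-global sort places no requirement on its child.

```scala
// Simplified model (NOT Spark's classes) of operators declaring child distributions.
object RequiredChildDistributionSketch {
  sealed trait Distribution
  case object UnspecifiedDistribution extends Distribution
  case object AllTuples extends Distribution
  final case class OrderedDistribution(ordering: Seq[String]) extends Distribution

  trait PhysicalOperator {
    def children: Seq[PhysicalOperator]
    // Default: one UnspecifiedDistribution per child, i.e. no requirement
    def requiredChildDistribution: Seq[Distribution] =
      Seq.fill(children.size)(UnspecifiedDistribution)
  }

  // Leaf operator with no children
  case object Scan extends PhysicalOperator {
    val children: Seq[PhysicalOperator] = Seq.empty
  }

  // Keeps the default requirement
  final case class Project(child: PhysicalOperator) extends PhysicalOperator {
    val children: Seq[PhysicalOperator] = Seq(child)
  }

  // Needs all rows in a single partition to apply the limit globally
  final case class GlobalLimit(limit: Int, child: PhysicalOperator) extends PhysicalOperator {
    val children: Seq[PhysicalOperator] = Seq(child)
    override def requiredChildDistribution: Seq[Distribution] = Seq(AllTuples)
  }

  // Requires an ordered distribution only for a global sort (an assumption of this sketch)
  final case class Sort(ordering: Seq[String], global: Boolean, child: PhysicalOperator)
      extends PhysicalOperator {
    val children: Seq[PhysicalOperator] = Seq(child)
    override def requiredChildDistribution: Seq[Distribution] =
      if (global) Seq(OrderedDistribution(ordering)) else Seq(UnspecifiedDistribution)
  }
}
```

In this model the returned sequence always has one entry per child, so an optimization that walks the plan can pair each child with its required distribution.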