ClusteredDistribution¶

ClusteredDistribution is a Distribution that <> for the <> and a requested number of partitions.

ClusteredDistribution requires that the <> should not be empty (i.e. Nil).

ClusteredDistribution is <> when the following physical operators are requested for a required child distribution:

MapGroupsExec, HashAggregateExec, ObjectHashAggregateExec, SortAggregateExec, WindowExec
Spark Structured Streaming's FlatMapGroupsWithStateExec, StateStoreRestoreExec, StateStoreSaveExec, StreamingDeduplicateExec, StreamingSymmetricHashJoinExec, StreamingSymmetricHashJoinExec
SparkR's FlatMapGroupsInRExec
PySpark's FlatMapGroupsInPandasExec

ClusteredDistribution is used when:

DataSourcePartitioning, SinglePartition, HashPartitioning, and RangePartitioning are requested to satisfies
EnsureRequirements is executed for Adaptive Query Execution

=== [[createPartitioning]] createPartitioning Method

[source, scala]¶

createPartitioning(numPartitions: Int): Partitioning¶

createPartitioning creates a HashPartitioning for the <> and the input numPartitions.

createPartitioning reports an AssertionError when the <> is not the input numPartitions.

[options="wrap"]

This ClusteredDistribution requires [requiredNumPartitions] partitions, but the actual number of partitions is [numPartitions].

createPartitioning is part of the Distribution abstraction.

Creating Instance¶

ClusteredDistribution takes the following to be created:

[[clustering]] Clustering expressions
[[requiredNumPartitions]] Required number of partitions (default: None)

Note

None for the required number of partitions indicates to use any number of partitions (possibly spark.sql.shuffle.partitions configuration property).