Skip to content

ClusteredDistribution

ClusteredDistribution is a Distribution that <> for the <> and a requested number of partitions.

ClusteredDistribution requires that the <> should not be empty (i.e. Nil).

ClusteredDistribution is <> when the following physical operators are requested for a required child distribution:

  • MapGroupsExec, HashAggregateExec, ObjectHashAggregateExec, SortAggregateExec, WindowExec

  • Spark Structured Streaming's FlatMapGroupsWithStateExec, StateStoreRestoreExec, StateStoreSaveExec, StreamingDeduplicateExec, StreamingSymmetricHashJoinExec, StreamingSymmetricHashJoinExec

  • SparkR's FlatMapGroupsInRExec

  • PySpark's FlatMapGroupsInPandasExec

ClusteredDistribution is used when:

  • DataSourcePartitioning, SinglePartition, HashPartitioning, and RangePartitioning are requested to satisfies

  • EnsureRequirements is executed for Adaptive Query Execution

=== [[createPartitioning]] createPartitioning Method

[source, scala]

createPartitioning(numPartitions: Int): Partitioning

createPartitioning creates a HashPartitioning for the <> and the input numPartitions.

createPartitioning reports an AssertionError when the <> is not the input numPartitions.

[options="wrap"]

This ClusteredDistribution requires [requiredNumPartitions] partitions, but the actual number of partitions is [numPartitions].

createPartitioning is part of the Distribution abstraction.

Creating Instance

ClusteredDistribution takes the following to be created:

  • [[clustering]] Clustering expressions
  • [[requiredNumPartitions]] Required number of partitions (default: None)

Note

None for the required number of partitions indicates to use any number of partitions (possibly spark.sql.shuffle.partitions configuration property).