RangePartitionIdRewrite Optimization Rule
RangePartitionIdRewrite is an optimization rule (Rule[LogicalPlan]) that rewrites RangePartitionIds to PartitionerExprs.
Creating Instance
RangePartitionIdRewrite takes the following to be created:
- SparkSession (Spark SQL)
RangePartitionIdRewrite is created when:
- DeltaSparkSessionExtension is requested to register Delta extensions
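For context, a minimal sketch of how DeltaSparkSessionExtension is usually enabled for a session, which is when rules such as RangePartitionIdRewrite get registered. It assumes Delta Lake is on the classpath; the master URL and application name are placeholders for illustration only.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: the Delta Lake library is available on the classpath.
// The master URL and the app name are placeholders for a local demo.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("delta-extension-demo")
  // Registers the Delta extensions when the session is created
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .getOrCreate()
```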
sampleSizePerPartition Hint
RangePartitionIdRewrite uses the spark.sql.execution.rangeExchange.sampleSizePerPartition (Spark SQL) configuration property as the samplePointsPerPartitionHint of a RangePartitioner (Spark Core) when executed.
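As an illustration, the property can be set and read back at runtime through the session configuration. This is a sketch only: the value 200 is arbitrary, and the local session setup is an assumption made to keep the snippet self-contained.

```scala
import org.apache.spark.sql.SparkSession

// Placeholder local session for the demo
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sample-size-demo")
  .getOrCreate()

val key = "spark.sql.execution.rangeExchange.sampleSizePerPartition"

// 200 is an arbitrary illustrative value
spark.conf.set(key, 200L)

// Runtime configuration values are read back as strings
val hint = spark.conf.get(key).toInt
```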
Executing Rule
apply(
plan: LogicalPlan): LogicalPlan
apply is part of the Rule (Spark SQL) abstraction.
apply transforms UnaryNodes with RangePartitionId unary expressions in the given LogicalPlan (bottom-up, from the children first).
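The children-first traversal can be pictured with a toy model. The sketch below is not Spark's or Delta's code: Expr, Col, Add, RangeId and Rewritten are hypothetical stand-ins (RangeId plays the role of RangePartitionId, Rewritten the role of PartitionerExpr), and it covers only an expression tree, not the logical plan nodes that hold the expressions.

```scala
// Toy model only: hypothetical stand-ins, not Spark or Delta classes
sealed trait Expr
case class Col(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
case class RangeId(child: Expr, numPartitions: Int) extends Expr   // stands in for RangePartitionId
case class Rewritten(child: Expr, numPartitions: Int) extends Expr // stands in for PartitionerExpr

// Bottom-up rewrite: transform the children first,
// then try the rule on the node rebuilt from the new children
def transformUp(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
  val withNewChildren: Expr = e match {
    case Add(l, r)       => Add(transformUp(l)(rule), transformUp(r)(rule))
    case RangeId(c, n)   => RangeId(transformUp(c)(rule), n)
    case Rewritten(c, n) => Rewritten(transformUp(c)(rule), n)
    case leaf: Col       => leaf
  }
  // Nodes the rule does not match are returned unchanged
  rule.applyOrElse(withNewChildren, identity[Expr])
}

val before = Add(RangeId(Col("id"), 4), Col("x"))
val after = transformUp(before) { case RangeId(c, n) => Rewritten(c, n) }
```

Only the RangeId node is replaced; the surrounding Add and Col nodes are rebuilt as they were.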
For every RangePartitionId, apply creates a new logical query plan for sampling (based on the child expression of the RangePartitionId).
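import org.apache.spark.sql.catalyst.expressions.Alias
import org.apache.spark.sql.functions.lit

// A literal stands in for the child expression of a RangePartitionId
val expr = lit(5).expr
// Alias the child expression as __RPI_child_col__
val aliasedExpr = Alias(expr, "__RPI_child_col__")()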
apply changes the current call site to the following:
RangePartitionId([childExpr], [numPartitions]) sampling
Call Site
A call site is used in the web UI and SparkListeners for all jobs submitted from the current and child threads.
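A minimal sketch of setting a call site with the public SparkContext API. The description string only mimics the format shown above with made-up values (id, 4), and the local context is a placeholder; this is not how the rule itself is written.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder local context for the demo
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("call-site-demo"))

// Hypothetical description that follows the format shown above
sc.setCallSite("RangePartitionId(id, 4) sampling")

val counted = try {
  // Jobs submitted here are reported under the call site set above
  sc.parallelize(1 to 100).count()
} finally {
  // Restore the default call site for later jobs
  sc.clearCallSite()
}
```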
apply creates a RangePartitioner (Spark Core) (with the number of partitions of the RangePartitionId, the RDD of the query plan for sampling, the ascending flag enabled, and the sampleSizePerPartition hint).
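To make the arguments concrete, here is a sketch that builds a RangePartitioner directly with Spark Core. The input RDD, the number of partitions (4) and the hint value (100) are illustrative assumptions; the rule derives them from the RangePartitionId and the sampling plan instead.

```scala
import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}

// Placeholder local context for the demo
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("range-partitioner-demo"))

// RangePartitioner samples an RDD of key-value pairs; only the keys matter here
val pairs = sc.parallelize(1 to 1000).map(k => (k, ()))

// 4 partitions and a hint of 100 are illustrative values only
val partitioner = new RangePartitioner[Int, Unit](
  partitions = 4,
  rdd = pairs,
  ascending = true,
  samplePointsPerPartitionHint = 100)

// With ascending order, smaller keys map to lower partition IDs
val lowest = partitioner.getPartition(1)
val highest = partitioner.getPartition(1000)
```

The partitioner computes its range bounds by sampling the RDD when it is constructed, which is why a Spark job runs at this point.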
In the end, apply creates a PartitionerExpr with the child expression (of the RangePartitionId) and the RangePartitioner.