RangePartitionIdRewrite Optimization Rule

RangePartitionIdRewrite is an optimization rule (Rule[LogicalPlan]) that rewrites RangePartitionId expressions to PartitionerExpr expressions.

Creating Instance

RangePartitionIdRewrite takes the following to be created:

SparkSession (Spark SQL)

RangePartitionIdRewrite is created when:

DeltaSparkSessionExtension is requested to register delta extensions

sampleSizePerPartition Hint

When executed, RangePartitionIdRewrite uses the spark.sql.execution.rangeExchange.sampleSizePerPartition (Spark SQL) configuration property as the samplePointsPerPartitionHint of the RangePartitioner (Spark Core) it creates.
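The property can be changed at runtime like any other Spark SQL configuration property. The following is a minimal sketch, assuming a local SparkSession created only for the demo (e.g. in spark-shell); the value 50 is arbitrary and not a recommendation.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session used only for this demo
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sample-size-hint-demo")
  .getOrCreate()

val key = "spark.sql.execution.rangeExchange.sampleSizePerPartition"

// 50 is an arbitrary value chosen for illustration
spark.conf.set(key, 50)

// Configuration values are read back as strings
val hint = spark.conf.get(key)
```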

Executing Rule

apply(
  plan: LogicalPlan): LogicalPlan

apply is part of the Rule (Spark SQL) abstraction.

apply transforms UnaryNodes with RangePartitionId unary expressions in the given LogicalPlan bottom-up (children first).
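The bottom-up traversal can be illustrated with a toy rule. This is only a sketch of the general pattern (transformUp over the plan, then transformExpressionsUp over the expressions of every UnaryNode), not the code of RangePartitionIdRewrite. RewriteFives is a hypothetical rule that replaces the integer literal 5 with 6, and the local SparkSession is an assumption made for the demo.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UnaryNode}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.IntegerType

// Hypothetical rule, for illustration only
object RewriteFives extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    // Only unary operators (e.g. Project, Filter) are considered
    case node: UnaryNode => node transformExpressionsUp {
      // Replace the integer literal 5 with 6
      case Literal(5, IntegerType) => Literal(6, IntegerType)
    }
  }
}

// Assumption: a throwaway local session used only for this demo
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("transform-up-demo")
  .getOrCreate()

// Project (a UnaryNode) with the literal 5 over a Range
val plan = spark.range(1).select(lit(5)).queryExecution.analyzed
val rewritten = RewriteFives(plan)

// Literal values found in the expressions of the top-level operator
val literals = rewritten.expressions.flatMap(_.collect { case l: Literal => l.value })
```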

For every RangePartitionId, apply creates a new logical query plan for sampling (based on the child expression of the RangePartitionId).

import org.apache.spark.sql.functions.lit
val expr = lit(5).expr

import org.apache.spark.sql.catalyst.expressions.Alias
val aliasedExpr = Alias(expr, "__RPI_child_col__")()

apply changes the current call site to the following:

RangePartitionId([childExpr], [numPartitions]) sampling

Call Site

A call site is used in web UI and SparkListeners for all jobs submitted from the current and child threads.
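A minimal sketch of setting a call site with the public SparkContext API follows. How exactly apply changes the call site is not shown here; the description string is a hypothetical value that follows the pattern above, and the local SparkSession is an assumption made for the demo.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session used only for this demo
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("call-site-demo")
  .getOrCreate()
val sc = spark.sparkContext

// Hypothetical description that follows the pattern above
val desc = "RangePartitionId(id, 4) sampling"

sc.setCallSite(desc)
val n = try {
  // Jobs submitted here are reported under the description above
  sc.parallelize(1 to 10).count()
} finally {
  // Restore the default call site for later jobs
  sc.clearCallSite()
}
```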

apply creates a RangePartitioner (Spark Core) (with the number of partitions of the RangePartitionId, the RDD of the query plan for sampling, ascending flag enabled, and the sampleSizePerPartition hint).
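To show what such a RangePartitioner looks like, the following sketch creates one over a plain RDD of integer keys. The keys, the number of partitions (4) and the hint (100) are arbitrary values chosen for illustration, whereas apply uses the rows of the query plan for sampling. The local SparkSession is an assumption made for the demo.

```scala
import org.apache.spark.RangePartitioner
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session used only for this demo
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("range-partitioner-demo")
  .getOrCreate()
val sc = spark.sparkContext

// RangePartitioner expects an RDD of key-value pairs; only the keys are sampled
val pairs = sc.parallelize(1 to 1000).map(k => (k, ()))

val partitioner = new RangePartitioner(
  4,     // number of partitions
  pairs, // RDD to sample the range bounds from
  true,  // ascending
  100)   // samplePointsPerPartitionHint

// Partition IDs of the smallest and the largest keys
val lowId = partitioner.getPartition(1)
val highId = partitioner.getPartition(1000)
```

Creating a RangePartitioner runs a Spark job to sample the keys and compute the range bounds, which explains why the call site is changed beforehand.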

In the end, apply creates a PartitionerExpr with the child expression (of the RangePartitionId) and the RangePartitioner.