Optimizer — Generic Logical Query Plan Optimizer¶
Optimizer (Catalyst Optimizer) is an extension of the RuleExecutor abstraction for logical query plan optimizers.
Optimizer: Analyzed Logical Plan ==> Optimized Logical Plan
Implementations¶
Creating Instance¶
Optimizer takes the following to be created:
- CatalogManager
Abstract Class
Optimizer is an abstract class and cannot be created directly. It is created indirectly as one of the concrete Optimizers.
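As a rough illustration (plain Scala, no Spark classes; all names below are made up for the sketch), the pattern is an abstract base that ships default rule batches which concrete optimizers refine:

```scala
// Sketch only: models the Optimizer pattern, not Spark's actual classes.
// An abstract base defines default rule batches; concrete optimizers
// extend or override them.
abstract class BaseOptimizer {
  def defaultBatches: Seq[String] =
    Seq("Finish Analysis", "Operator Optimization", "LocalRelation")

  // Subclasses may refine the defaults (extend or exclude)
  def batches: Seq[String] = defaultBatches
}

// A concrete optimizer appending its own batch of extra rules
class SessionOptimizer extends BaseOptimizer {
  override def batches: Seq[String] =
    defaultBatches :+ "User Provided Optimizers"
}

val allBatches = new SessionOptimizer().batches
```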
Default Batches¶
Optimizer defines the default rule batches of logical optimizations that transform the query plan of a structured query to produce the optimized logical query plan.
The default rule batches can be further refined (extended or with rules excluded).
Eliminate Distinct¶
Rules:
- EliminateDistinct
Strategy: Once
Finish Analysis¶
Rules:
- EliminateResolvedHint
- EliminateSubqueryAliases
- EliminateView
- InlineCTE
- ReplaceExpressions
- RewriteNonCorrelatedExists
- PullOutGroupingExpressions
- ComputeCurrentTime
- ReplaceCurrentLike
Strategy: Once
Union¶
Rules:
- CombineUnions
Strategy: Once
OptimizeLimitZero¶
Rules:
- OptimizeLimitZero
Strategy: Once
LocalRelation early¶
Rules:
- ConvertToLocalRelation
- PropagateEmptyRelation
Strategy: fixedPoint
Pullup Correlated Expressions¶
Rules:
- PullupCorrelatedPredicates
Strategy: Once
Subquery¶
Rules:
- OptimizeSubqueries
Strategy: FixedPoint(1)
Replace Operators¶
Rules:
- RewriteExceptAll
- RewriteIntersectAll
- ReplaceIntersectWithSemiJoin
- ReplaceExceptWithFilter
- ReplaceExceptWithAntiJoin
- ReplaceDistinctWithAggregate
Strategy: fixedPoint
Aggregate¶
Rules:
- RemoveLiteralFromGroupExpressions
- RemoveRepetitionFromGroupExpressions
Strategy: fixedPoint
Operator Optimization before Inferring Filters¶
Rules:
- PushProjectionThroughUnion
- ReorderJoin
- EliminateOuterJoin
- PushDownPredicates
- PushDownLeftSemiAntiJoin
- PushLeftSemiLeftAntiThroughJoin
- LimitPushDown
- ColumnPruning
- CollapseRepartition
- CollapseProject
- CollapseWindow
- CombineFilters
- CombineLimits
- CombineUnions
- TransposeWindow
- NullPropagation
- ConstantPropagation
- FoldablePropagation
- OptimizeIn
- ConstantFolding
- ReorderAssociativeOperator
- LikeSimplification
- BooleanSimplification
- SimplifyConditionals
- RemoveDispensableExpressions
- SimplifyBinaryComparison
- ReplaceNullWithFalseInPredicate
- PruneFilters
- SimplifyCasts
- SimplifyCaseConversionExpressions
- RewriteCorrelatedScalarSubquery
- EliminateSerialization
- RemoveRedundantAliases
- RemoveNoopOperators
- SimplifyExtractValueOps
- CombineConcats
- extendedOperatorOptimizationRules
Strategy: fixedPoint
Infer Filters¶
Rules:
- InferFiltersFromConstraints
Strategy: Once
Operator Optimization after Inferring Filters¶
Rules:
- The same as in the Operator Optimization before Inferring Filters batch
Strategy: fixedPoint
Early Filter and Projection Push-Down¶
Rules:
- As defined by the earlyScanPushDownRules extension point
Strategy: Once
Update CTE Relation Stats¶
Rules:
- UpdateCTERelationStats
Strategy: Once
Join Reorder¶
Rules:
- CostBasedJoinReorder
Strategy: FixedPoint(1)
Eliminate Sorts¶
Rules:
- EliminateSorts
Strategy: Once
Decimal Optimizations¶
Rules:
- DecimalAggregates
Strategy: fixedPoint
Object Expressions Optimization¶
Rules:
- EliminateMapObjects
- CombineTypedFilters
- ObjectSerializerPruning
- ReassignLambdaVariableID
Strategy: fixedPoint
LocalRelation¶
Rules:
- ConvertToLocalRelation
- PropagateEmptyRelation
Strategy: fixedPoint
Check Cartesian Products¶
Rules:
- CheckCartesianProducts
Strategy: Once
RewriteSubquery¶
Rules:
- RewritePredicateSubquery
- ColumnPruning
- CollapseProject
- RemoveNoopOperators
Strategy: Once
NormalizeFloatingNumbers¶
Rules:
- NormalizeFloatingNumbers
Strategy: Once
Excluded Rules¶
Optimizer uses the spark.sql.optimizer.excludedRules configuration property to control which rules in the defaultBatches to exclude.
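For example, assuming a running SparkSession named `spark`, rules can be excluded at runtime by listing their fully-qualified class names (the rules chosen below are purely illustrative):

```scala
// Exclude two excludable optimization rules at runtime.
// spark.sql.optimizer.excludedRules takes a comma-separated list
// of fully-qualified rule class names.
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.CombineFilters," +
    "org.apache.spark.sql.catalyst.optimizer.PushDownPredicates")
```

Excluding a non-excludable rule this way has no effect (other than a WARN message in the logs).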
Non-Excludable Rules¶
nonExcludableRules: Seq[String]
nonExcludableRules is a collection of non-excludable optimization rules.
Non-Excludable Rules are so critical for query optimization that they can never be excluded (even using spark.sql.optimizer.excludedRules configuration property).
- FinishAnalysis
- RewriteDistinctAggregates
- ReplaceDeduplicateWithAggregate
- ReplaceIntersectWithSemiJoin
- ReplaceExceptWithFilter
- ReplaceExceptWithAntiJoin
- RewriteExceptAll
- RewriteIntersectAll
- ReplaceDistinctWithAggregate
- PullupCorrelatedPredicates
- RewriteCorrelatedScalarSubquery
- RewritePredicateSubquery
- NormalizeFloatingNumbers
- ReplaceUpdateFieldsExpression
- RewriteLateralSubquery
- OptimizeSubqueries
Accessing Optimizer¶
Optimizer is available as the optimizer property of a session-specific SessionState.
scala> :type spark.sessionState.optimizer
org.apache.spark.sql.catalyst.optimizer.Optimizer
You can access the optimized logical plan of a structured query (as a Dataset) using the Dataset.explain basic action (with the extended flag enabled) or SQL's EXPLAIN EXTENDED command.
// sample structured query
val inventory = spark
.range(5)
.withColumn("new_column", 'id + 5 as "plus5")
// Using explain operator (with extended flag enabled)
scala> inventory.explain(extended = true)
== Parsed Logical Plan ==
'Project [id#0L, ('id + 5) AS plus5#2 AS new_column#3]
+- AnalysisBarrier
+- Range (0, 5, step=1, splits=Some(8))
== Analyzed Logical Plan ==
id: bigint, new_column: bigint
Project [id#0L, (id#0L + cast(5 as bigint)) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))
== Optimized Logical Plan ==
Project [id#0L, (id#0L + 5) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))
== Physical Plan ==
*(1) Project [id#0L, (id#0L + 5) AS new_column#3L]
+- *(1) Range (0, 5, step=1, splits=8)
Alternatively, you can access the optimized logical plan using QueryExecution and its optimizedPlan property (which, together with the numberedTreeString method, is a very good "debugging" tool).
val optimizedPlan = inventory.queryExecution.optimizedPlan
scala> println(optimizedPlan.numberedTreeString)
00 Project [id#0L, (id#0L + 5) AS new_column#3L]
01 +- Range (0, 5, step=1, splits=Some(8))
FixedPoint Strategy¶
Optimizer uses a FixedPoint strategy with the maximum number of iterations defined by the spark.sql.optimizer.maxIterations configuration property.
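The mechanics of a fixed-point strategy can be sketched in plain Scala (strings stand in for logical plans here; Spark applies Rule[LogicalPlan]s to plan trees instead):

```scala
// Sketch of the FixedPoint execution strategy: apply a batch's rules
// repeatedly until the plan stops changing (a fixed point) or the
// iteration budget (spark.sql.optimizer.maxIterations) is exhausted.
def executeFixedPoint(
    plan: String,
    rules: Seq[String => String],
    maxIterations: Int): String = {
  var current = plan
  var iteration = 0
  var continue = true
  while (continue && iteration < maxIterations) {
    // Run every rule in the batch once, in order
    val next = rules.foldLeft(current)((p, rule) => rule(p))
    continue = next != current // fixed point reached when nothing changed
    current = next
    iteration += 1
  }
  current
}

// Toy rule: collapse two adjacent filters into one
val combineFilters: String => String = _.replace("filter+filter", "filter")

val optimized =
  executeFixedPoint("filter+filter+filter+scan", Seq(combineFilters), 100)
// One pass is not enough; the fixed point is reached after several passes
```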
Extended Operator Optimization Rules (Extension Point)¶
extendedOperatorOptimizationRules: Seq[Rule[LogicalPlan]] = Nil
extendedOperatorOptimizationRules extension point defines additional rules for the operator optimization rule batches.
The extendedOperatorOptimizationRules rules are executed at the end of the Operator Optimization before Inferring Filters and Operator Optimization after Inferring Filters batches.
earlyScanPushDownRules (Extension Point)¶
earlyScanPushDownRules: Seq[Rule[LogicalPlan]] = Nil
earlyScanPushDownRules extension point defines the rules of the Early Filter and Projection Push-Down batch (no rules by default).
blacklistedOnceBatches¶
blacklistedOnceBatches: Set[String]
blacklistedOnceBatches is a set of names of Once-strategy rule batches that are excluded from the idempotence check of the RuleExecutor (empty by default).
Batches¶
batches uses the spark.sql.optimizer.excludedRules configuration property for the rules to be excluded.
batches filters out non-excludable rules from the rules to be excluded. For any filtered-out rule, batches prints out the following WARN message to the logs:
Optimization rule '[ruleName]' was not excluded from the optimizer because this rule is a non-excludable rule.
batches filters out the excluded rules from all defaultBatches. In case a batch is left with no rules, batches prints out the following INFO message to the logs:
Optimization batch '[name]' is excluded from the optimizer as all enclosed rules have been excluded.
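The overall filtering can be sketched in plain Scala (simplified; the names and shapes below are illustrative, not Spark's actual code):

```scala
// Sketch: assembling the effective rule batches from the defaults,
// the excluded rules, and the non-excludable rules.
case class Batch(name: String, rules: Seq[String])

val nonExcludableRules =
  Set("ReplaceDistinctWithAggregate", "RewritePredicateSubquery")

def effectiveBatches(
    defaultBatches: Seq[Batch],
    excludedRules: Seq[String]): Seq[Batch] = {
  // Non-excludable rules are silently kept (Spark logs a WARN instead)
  val excludable = excludedRules.filterNot(nonExcludableRules.contains)
  defaultBatches
    .map(b => b.copy(rules = b.rules.filterNot(excludable.contains)))
    .filter(_.rules.nonEmpty) // empty batches are dropped (with an INFO message)
}

val defaults = Seq(
  Batch("Replace Operators",
    Seq("ReplaceDistinctWithAggregate", "RewriteExceptAll")),
  Batch("Eliminate Sorts", Seq("EliminateSorts")))

// Try to exclude one non-excludable and two excludable rules
val result = effectiveBatches(
  defaults,
  Seq("RewriteExceptAll", "ReplaceDistinctWithAggregate", "EliminateSorts"))
// The "Eliminate Sorts" batch disappears; the non-excludable rule survives
```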