Optimizer — Generic Logical Query Plan Optimizer¶
Optimizer (Catalyst Optimizer) is an extension of the RuleExecutor abstraction for logical query plan optimizers.
Optimizer: Analyzed Logical Plan ==> Optimized Logical Plan
Implementations¶
Creating Instance¶
Optimizer takes the following to be created:
- CatalogManager
Abstract Class
Optimizer is an abstract class and cannot be created directly. It is created indirectly as one of the concrete Optimizers.
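As a rough illustration (plain Scala, no Spark classes; all names below are made up for the sketch), the pattern is an abstract base that ships default rule batches which concrete optimizers refine:

```scala
// Sketch only: models the Optimizer pattern, not Spark's actual classes.
// An abstract base defines default rule batches; concrete optimizers
// extend or override them.
abstract class BaseOptimizer {
  def defaultBatches: Seq[String] =
    Seq("Finish Analysis", "Operator Optimization", "LocalRelation")

  // Subclasses may refine the defaults (extend or exclude)
  def batches: Seq[String] = defaultBatches
}

// A concrete optimizer appending its own batch of extra rules
class SessionOptimizer extends BaseOptimizer {
  override def batches: Seq[String] =
    defaultBatches :+ "User Provided Optimizers"
}

val allBatches = new SessionOptimizer().batches
```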
Default Batches¶
Optimizer defines the default rule batches of logical optimizations that transform the query plan of a structured query to produce the optimized logical query plan.
The default rule batches can be further refined (extended or with rules excluded).
Eliminate Distinct¶
Rules:
- EliminateDistinct
Strategy: Once
Finish Analysis¶
Rules:
- EliminateResolvedHint
- EliminateSubqueryAliases
- EliminateView
- InlineCTE
- ReplaceExpressions
- RewriteNonCorrelatedExists
- PullOutGroupingExpressions
- ComputeCurrentTime
- ReplaceCurrentLike
Strategy: Once
Union¶
Rules:
- CombineUnions
Strategy: Once
OptimizeLimitZero¶
Rules:
- OptimizeLimitZero
Strategy: Once
LocalRelation early¶
Rules:
- ConvertToLocalRelation
- PropagateEmptyRelation
Strategy: fixedPoint
Pullup Correlated Expressions¶
Rules:
- PullupCorrelatedPredicates
Strategy: Once
Subquery¶
Rules:
- OptimizeSubqueries
Strategy: FixedPoint(1)
Replace Operators¶
Rules:
- RewriteExceptAll
- RewriteIntersectAll
- ReplaceIntersectWithSemiJoin
- ReplaceExceptWithFilter
- ReplaceExceptWithAntiJoin
- ReplaceDistinctWithAggregate
Strategy: fixedPoint
Aggregate¶
Rules:
- RemoveLiteralFromGroupExpressions
- RemoveRepetitionFromGroupExpressions
Strategy: fixedPoint
Operator Optimization before Inferring Filters¶
Rules:
- PushProjectionThroughUnion
- ReorderJoin
- EliminateOuterJoin
- PushDownPredicates
- PushDownLeftSemiAntiJoin
- PushLeftSemiLeftAntiThroughJoin
- LimitPushDown
- ColumnPruning
- CollapseRepartition
- CollapseProject
- CollapseWindow
- CombineFilters
- CombineLimits
- CombineUnions
- TransposeWindow
- NullPropagation
- ConstantPropagation
- FoldablePropagation
- OptimizeIn
- ConstantFolding
- ReorderAssociativeOperator
- LikeSimplification
- BooleanSimplification
- SimplifyConditionals
- RemoveDispensableExpressions
- SimplifyBinaryComparison
- ReplaceNullWithFalseInPredicate
- PruneFilters
- SimplifyCasts
- SimplifyCaseConversionExpressions
- RewriteCorrelatedScalarSubquery
- EliminateSerialization
- RemoveRedundantAliases
- RemoveNoopOperators
- SimplifyExtractValueOps
- CombineConcats
- extendedOperatorOptimizationRules
Strategy: fixedPoint
Infer Filters¶
Rules:
- InferFiltersFromConstraints
Strategy: Once
Operator Optimization after Inferring Filters¶
Rules:
- The same as in the Operator Optimization before Inferring Filters batch
Strategy: fixedPoint
Early Filter and Projection Push-Down¶
Rules:
- As defined by the earlyScanPushDownRules extension point
Strategy: Once
Update CTE Relation Stats¶
Rules:
- UpdateCTERelationStats
Strategy: Once
Join Reorder¶
Rules:
- CostBasedJoinReorder
Strategy: FixedPoint(1)
Eliminate Sorts¶
Rules:
- EliminateSorts
Strategy: Once
Decimal Optimizations¶
Rules:
- DecimalAggregates
Strategy: fixedPoint
Object Expressions Optimization¶
Rules:
- EliminateMapObjects
- CombineTypedFilters
- ObjectSerializerPruning
- ReassignLambdaVariableID
Strategy: fixedPoint
LocalRelation¶
Rules:
- ConvertToLocalRelation
- PropagateEmptyRelation
Strategy: fixedPoint
Check Cartesian Products¶
Rules:
- CheckCartesianProducts
Strategy: Once
RewriteSubquery¶
Rules:
- RewritePredicateSubquery
- ColumnPruning
- CollapseProject
- RemoveNoopOperators
Strategy: Once
NormalizeFloatingNumbers¶
Rules:
- NormalizeFloatingNumbers
Strategy: Once
Excluded Rules¶
Optimizer uses the spark.sql.optimizer.excludedRules configuration property to control which rules in the defaultBatches to exclude.
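For example, assuming a running SparkSession named `spark`, rules can be excluded at runtime by listing their fully-qualified class names (the rules chosen below are purely illustrative):

```scala
// Exclude two excludable optimization rules at runtime.
// spark.sql.optimizer.excludedRules takes a comma-separated list
// of fully-qualified rule class names.
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.CombineFilters," +
    "org.apache.spark.sql.catalyst.optimizer.PushDownPredicates")
```

Excluding a non-excludable rule this way has no effect (other than a WARN message in the logs).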
Non-Excludable Rules¶
nonExcludableRules: Seq[String]
nonExcludableRules is a collection of non-excludable optimization rules.
Non-Excludable Rules are so critical for query optimization that they can never be excluded (even using spark.sql.optimizer.excludedRules configuration property).
- FinishAnalysis
- RewriteDistinctAggregates
- ReplaceDeduplicateWithAggregate
- ReplaceIntersectWithSemiJoin
- ReplaceExceptWithFilter
- ReplaceExceptWithAntiJoin
- RewriteExceptAll
- RewriteIntersectAll
- ReplaceDistinctWithAggregate
- PullupCorrelatedPredicates
- RewriteCorrelatedScalarSubquery
- RewritePredicateSubquery
- NormalizeFloatingNumbers
- ReplaceUpdateFieldsExpression
- RewriteLateralSubquery
- OptimizeSubqueries
Accessing Optimizer¶
Optimizer is available as the optimizer property of a session-specific SessionState.
scala> :type spark.sessionState.optimizer
org.apache.spark.sql.catalyst.optimizer.Optimizer
You can access the optimized logical plan of a structured query (as a Dataset) using the Dataset.explain basic action (with the extended flag enabled) or SQL's EXPLAIN EXTENDED command.
// sample structured query
val inventory = spark
.range(5)
.withColumn("new_column", 'id + 5 as "plus5")
// Using explain operator (with extended flag enabled)
scala> inventory.explain(extended = true)
== Parsed Logical Plan ==
'Project [id#0L, ('id + 5) AS plus5#2 AS new_column#3]
+- AnalysisBarrier
+- Range (0, 5, step=1, splits=Some(8))
== Analyzed Logical Plan ==
id: bigint, new_column: bigint
Project [id#0L, (id#0L + cast(5 as bigint)) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))
== Optimized Logical Plan ==
Project [id#0L, (id#0L + 5) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))
== Physical Plan ==
*(1) Project [id#0L, (id#0L + 5) AS new_column#3L]
+- *(1) Range (0, 5, step=1, splits=8)
Alternatively, you can access the optimized logical plan using QueryExecution and its optimizedPlan property (which, together with the numberedTreeString method, is a very good "debugging" tool).
val optimizedPlan = inventory.queryExecution.optimizedPlan
scala> println(optimizedPlan.numberedTreeString)
00 Project [id#0L, (id#0L + 5) AS new_column#3L]
01 +- Range (0, 5, step=1, splits=Some(8))
FixedPoint Strategy¶
Optimizer uses a FixedPoint strategy with the maximum number of iterations defined by the spark.sql.optimizer.maxIterations configuration property.
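The mechanics of a fixed-point strategy can be sketched in plain Scala (strings stand in for logical plans here; Spark applies Rule[LogicalPlan]s to plan trees instead):

```scala
// Sketch of the FixedPoint execution strategy: apply a batch's rules
// repeatedly until the plan stops changing (a fixed point) or the
// iteration budget (spark.sql.optimizer.maxIterations) is exhausted.
def executeFixedPoint(
    plan: String,
    rules: Seq[String => String],
    maxIterations: Int): String = {
  var current = plan
  var iteration = 0
  var continue = true
  while (continue && iteration < maxIterations) {
    // Run every rule in the batch once, in order
    val next = rules.foldLeft(current)((p, rule) => rule(p))
    continue = next != current // fixed point reached when nothing changed
    current = next
    iteration += 1
  }
  current
}

// Toy rule: collapse two adjacent filters into one
val combineFilters: String => String = _.replace("filter+filter", "filter")

val optimized =
  executeFixedPoint("filter+filter+filter+scan", Seq(combineFilters), 100)
// One pass is not enough; the fixed point is reached after several passes
```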
Extended Operator Optimization Rules (Extension Point)¶
extendedOperatorOptimizationRules: Seq[Rule[LogicalPlan]] = Nil
extendedOperatorOptimizationRules extension point defines additional rules for the operator optimization rule batches.
The extendedOperatorOptimizationRules rules are executed at the end of the Operator Optimization before Inferring Filters and Operator Optimization after Inferring Filters batches.
earlyScanPushDownRules (Extension Point)¶
earlyScanPushDownRules: Seq[Rule[LogicalPlan]] = Nil
earlyScanPushDownRules extension point defines the rules of the Early Filter and Projection Push-Down batch (no rules by default).
blacklistedOnceBatches¶
blacklistedOnceBatches: Set[String]
blacklistedOnceBatches is a set of names of Once-strategy rule batches that are excluded from the idempotence check of the RuleExecutor (empty by default).
Batches¶
batches uses the spark.sql.optimizer.excludedRules configuration property for the rules to be excluded.
batches filters out non-excludable rules from the rules to be excluded. For any filtered-out rule, batches prints out the following WARN message to the logs:
Optimization rule '[ruleName]' was not excluded from the optimizer because this rule is a non-excludable rule.
batches filters out the excluded rules from all defaultBatches. In case a batch is left with no rules, batches prints out the following INFO message to the logs:
Optimization batch '[name]' is excluded from the optimizer as all enclosed rules have been excluded.
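The overall filtering can be sketched in plain Scala (simplified; the names and shapes below are illustrative, not Spark's actual code):

```scala
// Sketch: assembling the effective rule batches from the defaults,
// the excluded rules, and the non-excludable rules.
case class Batch(name: String, rules: Seq[String])

val nonExcludableRules =
  Set("ReplaceDistinctWithAggregate", "RewritePredicateSubquery")

def effectiveBatches(
    defaultBatches: Seq[Batch],
    excludedRules: Seq[String]): Seq[Batch] = {
  // Non-excludable rules are silently kept (Spark logs a WARN instead)
  val excludable = excludedRules.filterNot(nonExcludableRules.contains)
  defaultBatches
    .map(b => b.copy(rules = b.rules.filterNot(excludable.contains)))
    .filter(_.rules.nonEmpty) // empty batches are dropped (with an INFO message)
}

val defaults = Seq(
  Batch("Replace Operators",
    Seq("ReplaceDistinctWithAggregate", "RewriteExceptAll")),
  Batch("Eliminate Sorts", Seq("EliminateSorts")))

// Try to exclude one non-excludable and two excludable rules
val result = effectiveBatches(
  defaults,
  Seq("RewriteExceptAll", "ReplaceDistinctWithAggregate", "EliminateSorts"))
// The "Eliminate Sorts" batch disappears; the non-excludable rule survives
```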