Skip to content

Optimizer — Generic Logical Query Plan Optimizer

Optimizer (Catalyst Optimizer) is an extension of the RuleExecutor abstraction for logical query plan optimizers.

Optimizer: Analyzed Logical Plan ==> Optimized Logical Plan

Implementations

Creating Instance

Optimizer takes the following to be created:

Abstract Class

Optimizer is an abstract class and cannot be created directly. It is created indirectly for the concrete Optimizers.

Default Batches

Optimizer defines the default rule batches of logical optimizations that transform the query plan of a structured query to produce the optimized logical query plan.

The default rule batches can be further refined (extended or with rules excluded).

Eliminate Distinct

Rules:

  • EliminateDistinct

Strategy: Once

Finish Analysis

Rules:

Strategy: Once

Union

Rules:

Strategy: Once

OptimizeLimitZero

Rules:

  • OptimizeLimitZero

Strategy: Once

LocalRelation early

Rules:

Strategy: fixedPoint

Pullup Correlated Expressions

Rules:

Strategy: Once

Subquery

Rules:

Strategy: FixedPoint(1)

Replace Operators

Rules:

  • RewriteExceptAll
  • RewriteIntersectAll
  • ReplaceIntersectWithSemiJoin
  • ReplaceExceptWithFilter
  • ReplaceExceptWithAntiJoin
  • ReplaceDistinctWithAggregate

Strategy: fixedPoint

Aggregate

Rules:

  • RemoveLiteralFromGroupExpressions
  • RemoveRepetitionFromGroupExpressions

Strategy: fixedPoint

Operator Optimization before Inferring Filters

Rules:

  • PushProjectionThroughUnion
  • ReorderJoin
  • EliminateOuterJoin
  • PushDownPredicates
  • PushDownLeftSemiAntiJoin
  • PushLeftSemiLeftAntiThroughJoin
  • LimitPushDown
  • ColumnPruning
  • CollapseRepartition
  • CollapseProject
  • CollapseWindow
  • CombineFilters
  • CombineLimits
  • CombineUnions
  • TransposeWindow
  • NullPropagation
  • ConstantPropagation
  • FoldablePropagation
  • OptimizeIn
  • ConstantFolding
  • ReorderAssociativeOperator
  • LikeSimplification
  • BooleanSimplification
  • SimplifyConditionals
  • RemoveDispensableExpressions
  • SimplifyBinaryComparison
  • ReplaceNullWithFalseInPredicate
  • PruneFilters
  • SimplifyCasts
  • SimplifyCaseConversionExpressions
  • RewriteCorrelatedScalarSubquery
  • EliminateSerialization
  • RemoveRedundantAliases
  • RemoveNoopOperators
  • SimplifyExtractValueOps
  • CombineConcats
  • extendedOperatorOptimizationRules

Strategy: fixedPoint

Infer Filters

Rules:

Strategy: Once

Operator Optimization after Inferring Filters

Rules:

Strategy: fixedPoint

Early Filter and Projection Push-Down

Rules:

Strategy: Once

Update CTE Relation Stats

Rules:

Strategy: Once

Join Reorder

Rules:

Strategy: FixedPoint(1)

Eliminate Sorts

Rules:

  • EliminateSorts

Strategy: Once

Decimal Optimizations

Rules:

Strategy: fixedPoint

Object Expressions Optimization

Rules:

  • EliminateMapObjects
  • CombineTypedFilters
  • ObjectSerializerPruning
  • ReassignLambdaVariableID

Strategy: fixedPoint

LocalRelation

Rules:

Strategy: fixedPoint

Check Cartesian Products

Rules:

  • CheckCartesianProducts

Strategy: Once

RewriteSubquery

Rules:

Strategy: Once

NormalizeFloatingNumbers

Rules:

  • NormalizeFloatingNumbers

Strategy: Once

Excluded Rules

Optimizer uses spark.sql.optimizer.excludedRules configuration property to control what rules in the defaultBatches to exclude.

Non-Excludable Rules

nonExcludableRules: Seq[String]

nonExcludableRules is a collection of non-excludable optimization rules.

Non-Excludable Rules are so critical for query optimization that they can never be excluded (even using spark.sql.optimizer.excludedRules configuration property).

Accessing Optimizer

Optimizer is available as the optimizer property of a session-specific SessionState.

scala> :type spark.sessionState.optimizer
org.apache.spark.sql.catalyst.optimizer.Optimizer

You can access the optimized logical plan of a structured query (as a Dataset) using Dataset.explain basic action (with extended flag enabled) or SQL's EXPLAIN EXTENDED SQL command.

// sample structured query
val inventory = spark
  .range(5)
  .withColumn("new_column", 'id + 5 as "plus5")

// Using explain operator (with extended flag enabled)
scala> inventory.explain(extended = true)
== Parsed Logical Plan ==
'Project [id#0L, ('id + 5) AS plus5#2 AS new_column#3]
+- AnalysisBarrier
      +- Range (0, 5, step=1, splits=Some(8))

== Analyzed Logical Plan ==
id: bigint, new_column: bigint
Project [id#0L, (id#0L + cast(5 as bigint)) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))

== Optimized Logical Plan ==
Project [id#0L, (id#0L + 5) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))

== Physical Plan ==
*(1) Project [id#0L, (id#0L + 5) AS new_column#3L]
+- *(1) Range (0, 5, step=1, splits=8)

Alternatively, you can access the analyzed logical plan using QueryExecution and its optimizedPlan property (that together with numberedTreeString method is a very good "debugging" tool).

val optimizedPlan = inventory.queryExecution.optimizedPlan
scala> println(optimizedPlan.numberedTreeString)
00 Project [id#0L, (id#0L + 5) AS new_column#3L]
01 +- Range (0, 5, step=1, splits=Some(8))

FixedPoint Strategy

FixedPoint strategy with the number of iterations as defined by spark.sql.optimizer.maxIterations configuration property.

Extended Operator Optimization Rules (Extension Point)

extendedOperatorOptimizationRules: Seq[Rule[LogicalPlan]] = Nil

extendedOperatorOptimizationRules extension point defines additional rules for the Operator Optimization rule batch.

extendedOperatorOptimizationRules rules are executed right after Operator Optimization before Inferring Filters and Operator Optimization after Inferring Filters.

earlyScanPushDownRules (Extension Point)

earlyScanPushDownRules: Seq[Rule[LogicalPlan]] = Nil

earlyScanPushDownRules extension point...FIXME

blacklistedOnceBatches

blacklistedOnceBatches: Set[String]

blacklistedOnceBatches...FIXME

Batches

Signature
batches: Seq[Batch]

batches is part of the RuleExecutor abstraction.

batches uses spark.sql.optimizer.excludedRules configuration property for the rules to be excluded.

batches filters out non-excludable rules from the rules to be excluded. For any filtered-out rule, batches prints out the following WARN message to the logs:

Optimization rule '[ruleName]' was not excluded from the optimizer because this rule is a non-excludable rule.

batches filters out the excluded rules from all defaultBatches. In case a batch is left with no rules, batches prints out the following INFO message to the logs:

Optimization batch '[name]' is excluded from the optimizer as all enclosed rules have been excluded.