Logical Query Plan Analyzer¶

Analyzer (Spark Analyzer or Query Analyzer) is the logical query plan analyzer that validates and transforms an unresolved logical plan to an analyzed logical plan.

Analyzer is a RuleExecutor to transform logical operators (RuleExecutor[LogicalPlan]).

Analyzer: Unresolved Logical Plan ==> Analyzed Logical Plan

Analyzer is used by QueryExecution to resolve the managed LogicalPlan and, as a follow-up, to assert that the structured query has been properly analyzed, i.e. that no failed, unresolved or otherwise broken logical plan operators or expressions remain.
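As a minimal sketch of that follow-up check, a query that refers to a column that does not exist should fail analysis with an AnalysisException before any job runs. The sketch assumes a classic (non-Connect) local SparkSession, and the column name no_such_column is made up for illustration.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import scala.util.{Failure, Success, Try}

// Assumption: a local session; in spark-shell getOrCreate returns the existing one
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("analysis-failure-sketch")
  .getOrCreate()

// `no_such_column` is a hypothetical name that range(5) does not produce
val result = Try(spark.range(5).select("no_such_column"))

result match {
  case Failure(e: AnalysisException) =>
    // The Analyzer could not resolve the column, so the query never reaches execution
    println(s"Analysis failed: ${e.getMessage}")
  case Failure(other) =>
    throw other
  case Success(df) =>
    println(s"Unexpectedly analyzed: ${df.schema}")
}
```

The exact error message differs between Spark versions, so the sketch only prints it rather than matching on its text.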

extendedResolutionRules Extension Point¶

extendedResolutionRules: Seq[Rule[LogicalPlan]] = Nil

extendedResolutionRules is an extension point for additional logical evaluation rules for the Resolution batch. The rules are added at the end of the batch.

Default: empty

Note

SessionState uses its own Analyzer with custom extendedResolutionRules, postHocResolutionRules, and extendedCheckRules extension methods.

postHocResolutionRules Extension Point¶

postHocResolutionRules: Seq[Rule[LogicalPlan]] = Nil

postHocResolutionRules is an extension point for the rules of the Post-Hoc Resolution batch. If defined, the rules are executed in one pass (i.e. with the Once strategy).

Default: empty
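One way to contribute rules to both extension points from user code is SparkSessionExtensions. The following is a sketch only: NoopResolutionRule is a hypothetical rule that changes nothing, and the session has to be created by this builder (an already running session, e.g. in spark-shell, would not pick up the extensions).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical no-op rule: returns the plan unchanged.
// A real rule would pattern-match on unresolved operators and resolve them.
case class NoopResolutionRule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("analyzer-extensions-sketch")
  .withExtensions { extensions =>
    // Contributes to the rules of the Resolution batch
    extensions.injectResolutionRule(session => NoopResolutionRule(session))
    // Contributes to the rules of the Post-Hoc Resolution batch
    extensions.injectPostHocResolutionRule(session => NoopResolutionRule(session))
  }
  .getOrCreate()

// The query analyzes as usual because the injected rules change nothing
val rowCount = spark.range(3).count()
println(rowCount)
```

The same extensions object also offers injectCheckRule for extendedCheckRules, which is not shown here.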

Batches¶

Hints¶

Rules:

Strategy: fixedPoint

Simple Sanity Check¶

Rules:

LookupFunctions

Strategy: Once

Substitution¶

Rules:

OptimizeUpdateFields
CTESubstitution
BindParameters
WindowsSubstitution
EliminateUnions
SubstituteUnresolvedOrdinals

Strategy: fixedPoint

Resolution¶

Rules:

Strategy: fixedPoint

Post-Hoc Resolution¶

Rules:

postHocResolutionRules

Strategy: Once

Normalize Alter Table¶

Rules:

ResolveAlterTableChanges

Strategy: Once

Remove Unresolved Hints¶

Rules:

RemoveAllHints

Strategy: Once

Nondeterministic¶

Rules:

PullOutNondeterministic

Strategy: Once

UDF¶

Rules:

HandleNullInputsForUDF

Strategy: Once

UpdateNullability¶

Rules:

UpdateAttributeNullability

Strategy: Once

Subquery¶

Rules:

UpdateOuterReferences

Strategy: Once

Cleanup¶

Rules:

CleanupAliases

Strategy: fixedPoint

Creating Instance¶

Analyzer takes the following to be created:

CatalogManager
SQLConf
Maximum number of iterations (of the FixedPoint rule batches)

Analyzer is created when SessionState is requested for the analyzer.

Figure: Creating Analyzer

Accessing Analyzer¶

Analyzer is available as the analyzer property of SessionState.

scala> :type spark
org.apache.spark.sql.SparkSession

scala> :type spark.sessionState.analyzer
org.apache.spark.sql.catalyst.analysis.Analyzer

You can access the analyzed logical plan of a structured query using the Dataset.explain basic action (with the extended flag enabled) or the EXPLAIN EXTENDED SQL command.

// sample structured query
val inventory = spark
  .range(5)
  .withColumn("new_column", 'id + 5 as "plus5")

// Using explain operator (with extended flag enabled)
scala> inventory.explain(extended = true)
== Parsed Logical Plan ==
'Project [id#0L, ('id + 5) AS plus5#2 AS new_column#3]
+- AnalysisBarrier
      +- Range (0, 5, step=1, splits=Some(8))

== Analyzed Logical Plan ==
id: bigint, new_column: bigint
Project [id#0L, (id#0L + cast(5 as bigint)) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))

== Optimized Logical Plan ==
Project [id#0L, (id#0L + 5) AS new_column#3L]
+- Range (0, 5, step=1, splits=Some(8))

== Physical Plan ==
*(1) Project [id#0L, (id#0L + 5) AS new_column#3L]
+- *(1) Range (0, 5, step=1, splits=8)

Alternatively, you can access the analyzed logical plan using QueryExecution and its analyzed property (which, together with the numberedTreeString method, is a very good debugging tool).

val analyzedPlan = inventory.queryExecution.analyzed
scala> println(analyzedPlan.numberedTreeString)
00 Project [id#0L, (id#0L + cast(5 as bigint)) AS new_column#3L]
01 +- Range (0, 5, step=1, splits=Some(8))
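To see the unresolved-to-analyzed transition directly, the parser and the analyzer can also be driven by hand. This is a sketch that assumes Spark 3.x with a classic local session, where sessionState exposes sqlParser and analyzer; the SQL text is an arbitrary example.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell getOrCreate returns the existing one
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("analyzer-sketch")
  .getOrCreate()

// Parsing alone produces an unresolved logical plan
val parsed = spark.sessionState.sqlParser
  .parsePlan("SELECT id + 5 AS new_column FROM range(5)")
println(s"resolved after parsing: ${parsed.resolved}")

// Running the Analyzer's rule batches resolves relations, attributes and types
val analyzed = spark.sessionState.analyzer.execute(parsed)
println(s"resolved after analysis: ${analyzed.resolved}")
println(analyzed.numberedTreeString)
```

Unlike QueryExecution, calling execute on its own does not run the follow-up check for leftover unresolved operators.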

FixedPoint¶

Analyzer uses the FixedPoint strategy with maxIterations for the Hints, Substitution, Resolution and Cleanup batches.
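The difference between the Once and FixedPoint strategies can be illustrated with a toy rule executor in plain Scala. Every type below (Strategy, Batch, ToyRuleExecutor) is a hypothetical stand-in written for this sketch, not Spark's RuleExecutor, and a "plan" is just a string.

```scala
// Toy model only: hypothetical stand-ins, not Spark's RuleExecutor API
sealed trait Strategy { def maxIterations: Int }
case object Once extends Strategy { val maxIterations = 1 }
final case class FixedPoint(maxIterations: Int) extends Strategy

final case class Batch[T](name: String, strategy: Strategy, rules: Seq[T => T])

object ToyRuleExecutor {
  // Runs the batches in order; within a batch, applies all rules repeatedly
  // until the plan stops changing or the iteration limit is reached
  def execute[T](plan: T, batches: Seq[Batch[T]]): T =
    batches.foldLeft(plan) { (current, batch) =>
      var result = current
      var iteration = 0
      var changed = true
      while (changed && iteration < batch.strategy.maxIterations) {
        val next = batch.rules.foldLeft(result)((p, rule) => rule(p))
        changed = next != result
        result = next
        iteration += 1
      }
      result
    }
}

// The rule removes one pair of redundant outer parentheses per pass
val stripParens: String => String = s =>
  if (s.startsWith("(") && s.endsWith(")")) s.substring(1, s.length - 1) else s

val once = ToyRuleExecutor.execute(
  "(((id)))", Seq(Batch("Once", Once, Seq(stripParens))))
val fixed = ToyRuleExecutor.execute(
  "(((id)))", Seq(Batch("FixedPoint", FixedPoint(100), Seq(stripParens))))

println(once)   // ((id))
println(fixed)  // id
```

A single pass leaves work undone when a rule only makes progress one step at a time, which is why batches whose rules can enable each other need the fixed-point strategy.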

expandRelationName¶

expandRelationName(
  nameParts: Seq[String]): Seq[String]

expandRelationName...FIXME

expandRelationName is used when ResolveTables and ResolveRelations logical analysis rules are executed.

Logging¶

Enable ALL logging level for the respective session-specific loggers to see what happens inside Analyzer:

org.apache.spark.sql.internal.SessionState$$anon$1
org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1 for Hive support

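Add the following lines to conf/log4j2.properties (Log4j 2 properties syntax; the logger ids analyzer and hiveAnalyzer are arbitrary labels):

# with no Hive support
logger.analyzer.name = org.apache.spark.sql.internal.SessionState$$anon$1
logger.analyzer.level = all

# with Hive support enabled
logger.hiveAnalyzer.name = org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1
logger.hiveAnalyzer.level = all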

Note

The reason for such weird-looking logger names is that the analyzer attribute is created as an anonymous subclass of the Analyzer class in the respective SessionStates.

Refer to Logging.