SparkPlanner — Spark Query Planner¶
SparkPlanner is a concrete Catalyst Query Planner that converts a logical plan to one or more physical plans using execution planning strategies with support for extension points:
- extra strategies (by means of ExperimentalMethods)
- extraPlanningStrategies
SparkPlanner is expected to plan (create) at least one physical plan for a given logical plan.
SparkPlanner extends SparkStrategies abstract class.
Execution Planning Strategies¶
- extraStrategies of the ExperimentalMethods
- extraPlanningStrategies
- LogicalQueryStageStrategy
- PythonEvals
- DataSourceV2Strategy
- FileSourceStrategy
- DataSourceStrategy
- SpecialLimits
- Aggregation
- Window
- JoinSelection
- InMemoryScans
- SparkScripts
- WithCTEStrategy
- BasicOperators
Creating Instance¶
SparkPlanner takes the following to be created:
SparkPlanner is created when the following are requested for one:
- BaseSessionStateBuilder
- HiveSessionStateBuilder
- Structured Streaming's
IncrementalExecution
Accessing SparkPlanner¶
SparkPlanner is available as planner of a SessionState.
val spark: SparkSession = ...
scala> :type spark.sessionState.planner
org.apache.spark.sql.execution.SparkPlanner
Extra Planning Strategies¶
extraPlanningStrategies: Seq[Strategy] = Nil
extraPlanningStrategies is an extension point to register extra planning strategies.
extraPlanningStrategies is used when SparkPlanner is requested for planning strategies.
extraPlanningStrategies is overriden in the SessionState builders (BaseSessionStateBuilder and HiveSessionStateBuilder).
Collecting PlanLater Physical Operators¶
collectPlaceholders(
plan: SparkPlan): Seq[(SparkPlan, LogicalPlan)]
collectPlaceholders collects all PlanLater physical operators in the given physical plan.
collectPlaceholders is part of QueryPlanner abstraction.
Filter and Project Pruning¶
pruneFilterProject(
projectList: Seq[NamedExpression],
filterPredicates: Seq[Expression],
prunePushedDownFilters: Seq[Expression] => Seq[Expression],
scanBuilder: Seq[Attribute] => SparkPlan): SparkPlan
Note
pruneFilterProject is almost like DataSourceStrategy.pruneFilterProjectRaw.
pruneFilterProject branches off per whether it is possible to use a column pruning only (to get the right projection) and the input projectList columns of this projection are enough to evaluate all input filterPredicates filter conditions.
If so, pruneFilterProject does the following:
-
Applies the input
scanBuilderfunction to the inputprojectListcolumns that creates a new physical operator -
If there are Catalyst predicate expressions in the input
prunePushedDownFiltersthat cannot be pushed down,pruneFilterProjectcreates a FilterExec unary physical operator (with the unhandled predicate expressions) -
Otherwise,
pruneFilterProjectsimply returns the physical operator
In this case no extra ProjectExec unary physical operator is created.
If not (i.e. it is neither possible to use a column pruning only nor evaluate filter conditions), pruneFilterProject does the following:
-
Applies the input
scanBuilderfunction to the projection and filtering columns that creates a new physical operator -
Creates a FilterExec unary physical operator (with the unhandled predicate expressions if available)
-
Creates a ProjectExec unary physical operator with the optional
FilterExecoperator (with the scan physical operator) or simply the scan physical operator alone
pruneFilterProject is used when HiveTableScans and InMemoryScans execution planning strategies are executed.