# LogicalRelation Leaf Logical Operator
LogicalRelation is a leaf logical operator that represents a BaseRelation in a logical query plan.
LogicalRelation is an ExposesMetadataColumns and can add extra metadata columns to the output columns.
LogicalRelation is a MultiInstanceRelation.
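For instance (a quick check to run in spark-shell; README.md is merely an example input), a file-based batch query shows up as a LogicalRelation leaf in the analyzed plan:

```scala
import org.apache.spark.sql.execution.datasources.LogicalRelation

// A file-based batch query is planned over a LogicalRelation leaf.
val df = spark.read.text("README.md")
df.queryExecution.analyzed.collect { case lr: LogicalRelation =>
  // the underlying BaseRelation (a HadoopFsRelation for file sources)
  println(lr.relation)
}
```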
## Creating Instance
LogicalRelation takes the following to be created:

- BaseRelation
- Output Schema (AttributeReferences)
- Optional CatalogTable
- isStreaming flag
LogicalRelation is created using the apply utility.
## Create LogicalRelation
```scala
apply(
  relation: BaseRelation,
  isStreaming: Boolean = false): LogicalRelation
apply(
  relation: BaseRelation,
  table: CatalogTable): LogicalRelation
```
apply wraps the given BaseRelation into a LogicalRelation (so it can be used in a logical query plan), with an optional CatalogTable and the isStreaming flag.
```scala
import org.apache.spark.sql.sources.BaseRelation

// A BaseRelation to wrap (left unimplemented in this snippet)
val baseRelation: BaseRelation = ???

// baseRelationToDataFrame wraps the BaseRelation into a LogicalRelation
val data = spark.baseRelationToDataFrame(baseRelation)
```
apply is used when:
- CreateTempViewUsing command is executed
- FallBackFileSourceV2 logical resolution rule is executed
- FileStreamSource (Spark Structured Streaming) is requested to getBatch
- HiveMetastoreCatalog is requested to convert a HiveTableRelation
- ResolveDataSource logical analysis rule is executed (to resolve a V1BatchSource)
- ResolveSQLOnFile and FindDataSourceTable logical evaluation rules are executed
- SparkSession is requested for a DataFrame for a BaseRelation
## Refresh (Files of HadoopFsRelation)
refresh requests the FileIndex (of the HadoopFsRelation) to refresh.
**HadoopFsRelation Supported Only**

refresh does the work for HadoopFsRelation relations only.
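In essence, refresh pattern-matches on the relation type. A condensed sketch of the behavior described above (not the verbatim sources):

```scala
// Sketch of refresh: only HadoopFsRelation carries a FileIndex
// (its location) to refresh; other relations are left untouched.
relation match {
  case hfs: HadoopFsRelation => hfs.location.refresh()
  case _ => // do nothing for non-file-based relations
}
```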
## Simple Text Representation
simpleString is made up of the output schema (truncated to maxFields) and the relation:
```text
Relation[[output]] [relation]
```
### Demo
```scala
val q = spark.read.text("README.md")
val logicalPlan = q.queryExecution.logical
```

```text
scala> println(logicalPlan.simpleString)
Relation[value#2] text
```
## Statistics
computeStats uses the optional CatalogTable. If available, computeStats requests the CatalogTable for the CatalogStatistics that, if available, is requested to toPlanStats (with the planStatsEnabled flag on when either spark.sql.cbo.enabled or spark.sql.cbo.planStats.enabled is enabled).

Otherwise, computeStats creates a Statistics with sizeInBytes only, set to the sizeInBytes of the BaseRelation.
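A condensed sketch of that logic (names mirror the description above; this is not the verbatim sources):

```scala
// Sketch of computeStats, per the description above.
// planStatsEnabled: CBO or plan-level statistics are enabled.
val planStatsEnabled = conf.cboEnabled || conf.planStatsEnabled
catalogTable
  .flatMap(_.stats.map(_.toPlanStats(output, planStatsEnabled)))
  .getOrElse(Statistics(sizeInBytes = relation.sizeInBytes))
```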
## Metadata Output Columns
```scala
metadataOutput: Seq[AttributeReference]
```

metadataOutput is part of the LogicalPlan abstraction.
metadataOutput checks whether this BaseRelation is a HadoopFsRelation. If so, metadataOutput requests the FileFormat (of the HadoopFsRelation) for the metadata columns.
Otherwise, metadataOutput returns no metadata columns (Nil).
**Lazy Value**

metadataOutput is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.

Learn more in the Scala Language Specification.
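As a usage note (assuming Spark 3.3+, where file-based sources expose the hidden _metadata column), the metadata columns only show up once selected explicitly:

```scala
import spark.implicits._

// _metadata is a hidden file-source metadata column (Spark 3.3+);
// it is not part of the default output and must be selected explicitly.
spark.read.text("README.md")
  .select($"value", $"_metadata.file_path", $"_metadata.file_size")
  .show(truncate = false)
```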
## Add Metadata Columns to Output Columns
```scala
withMetadataColumns(): LogicalRelation
```

withMetadataColumns is part of the ExposesMetadataColumns abstraction.
withMetadataColumns creates a new LogicalRelation with the extra metadata columns (the metadata output columns not among the output columns yet) added to the output columns.

With no extra metadata columns, withMetadataColumns returns this LogicalRelation unchanged.
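A condensed sketch of that behavior (the de-duplication against the current output is inferred from the description above, not copied from the sources):

```scala
// Sketch of withMetadataColumns, per the description above.
val newMetadata = metadataOutput.filterNot(outputSet.contains)
if (newMetadata.nonEmpty) {
  // a new LogicalRelation with the metadata columns appended to the output
  LogicalRelation(relation, output ++ newMetadata, catalogTable, isStreaming)
} else {
  this // nothing to add
}
```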
## Demo
The following are two logically-equivalent batch queries described using different Spark APIs: Scala and SQL.
```scala
val format = "csv"
val path = "../datasets/people.csv"

val loadQuery = spark
  .read
  .format(format)
  .option("header", true)
  .load(path)
```
```text
scala> println(loadQuery.queryExecution.logical.numberedTreeString)
00 UnresolvedDataSource format: csv, isStreaming: false, paths: 1 provided
```
```scala
val selectQuery = sql(s"select * from `$format`.`$path`")
```
```text
scala> println(selectQuery.queryExecution.optimizedPlan.numberedTreeString)
00 Relation [_c0#75,_c1#76] csv
```
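Both queries are expected to plan over a LogicalRelation in the end, which can be verified by collecting the LogicalRelation leaves of the optimized plans (a sketch, in the same session as above):

```scala
import org.apache.spark.sql.execution.datasources.LogicalRelation

// Both the DataFrame API and the SQL query should optimize down to
// a LogicalRelation over a CSV-backed BaseRelation.
Seq(loadQuery, selectQuery)
  .flatMap(_.queryExecution.optimizedPlan.collect { case lr: LogicalRelation => lr })
  .foreach(lr => println(lr.relation))
```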