
DataSourceV2ScanExecBase Leaf Physical Operators

DataSourceV2ScanExecBase is an extension of the LeafExecNode abstraction for leaf physical operators that track the number of output rows when executed (with or without support for columnar reads).

Contract

Input Partitions

partitions: Seq[InputPartition]

See:

Used when:

Input RDD

inputRDD: RDD[InternalRow]

See:

Custom Partitioning Expressions

keyGroupedPartitioning: Option[Seq[Expression]]

Optional partitioning expressions (provided by connectors using SupportsReportPartitioning)

See:

Spark Structured Streaming Not Supported

ContinuousScanExec and MicroBatchScanExec physical operators are not supported (and have keyGroupedPartitioning undefined (None)).

Used when:

PartitionReaderFactory

readerFactory: PartitionReaderFactory

PartitionReaderFactory for partition readers (of the input partitions)

See:

Used when:

  • BatchScanExec physical operator is requested for an input RDD
  • ContinuousScanExec and MicroBatchScanExec physical operators (Spark Structured Streaming) are requested for an inputRDD
  • DataSourceV2ScanExecBase physical operator is requested to outputPartitioning or supportsColumnar

Scan

scan: Scan

Scan
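The contract above can be sketched as a Scala trait. This is a minimal illustration, not the real Spark source: InputPartition, PartitionReaderFactory and Scan are simplified stand-ins here (the real keyGroupedPartitioning uses Catalyst Expression, and inputRDD returns an RDD[InternalRow]).

```scala
// Stand-in types, heavily simplified for illustration only
trait InputPartition
trait PartitionReaderFactory {
  def supportColumnarReads(partition: InputPartition): Boolean
}
trait Scan {
  def description(): String
}

// Sketch of the DataSourceV2ScanExecBase contract
trait DataSourceV2ScanExecBase {
  // Input partitions of this scan
  def partitions: Seq[InputPartition]
  // Factory of partition readers for the input partitions
  def readerFactory: PartitionReaderFactory
  // The Scan this physical operator represents
  def scan: Scan
  // Optional key-grouped partitioning expressions
  // (Expression simplified to String in this sketch)
  def keyGroupedPartitioning: Option[Seq[String]]
}
```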

Implementations

Executing Physical Operator

SparkPlan
doExecute(): RDD[InternalRow]

doExecute is part of the SparkPlan abstraction.

doExecute...FIXME

doExecuteColumnar

SparkPlan
doExecuteColumnar(): RDD[ColumnarBatch]

doExecuteColumnar is part of the SparkPlan abstraction.

doExecuteColumnar...FIXME

Performance Metrics

SparkPlan
metrics: Map[String, SQLMetric]

metrics is part of the SparkPlan abstraction.

metrics is the following SQLMetric together with the customMetrics:

Metric Name   | web UI
numOutputRows | number of output rows
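The composition of metrics can be sketched as follows. SQLMetric is a simplified stand-in here (the real one tracks accumulator state), and customMetrics is stubbed out; only the shape of the combination mirrors the description above.

```scala
// Simplified stand-in for SQLMetric (description shown in the web UI)
final case class SQLMetric(description: String, var value: Long = 0L)

// Stubbed customMetrics (in reality derived from the Scan's supportedCustomMetrics)
def customMetrics: Map[String, SQLMetric] = Map.empty

// metrics = the numOutputRows metric plus any custom metrics
def metrics: Map[String, SQLMetric] =
  Map("numOutputRows" -> SQLMetric("number of output rows")) ++ customMetrics
```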

Output Data Partitioning

SparkPlan
outputPartitioning: physical.Partitioning

outputPartitioning is part of the SparkPlan abstraction.

outputPartitioning...FIXME

Output Data Ordering

QueryPlan
outputOrdering: Seq[SortOrder]

outputOrdering is part of the QueryPlan abstraction.

outputOrdering...FIXME

Simple Node Description

TreeNode
simpleString(
    maxFields: Int): String

simpleString is part of the TreeNode abstraction.

simpleString...FIXME

supportsColumnar

SparkPlan
supportsColumnar: Boolean

supportsColumnar is part of the SparkPlan abstraction.

supportsColumnar is true if the PartitionReaderFactory can supportColumnarReads for all the input partitions. Otherwise, supportsColumnar is false.


supportsColumnar makes sure that either all the input partitions support columnar reads (supportColumnarReads) or none of them do, or throws an IllegalArgumentException:

Cannot mix row-based and columnar input partitions.
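The all-or-none rule can be sketched like this (stand-in types; the real operator consults its own readerFactory and partitions):

```scala
// Simplified stand-ins
trait InputPartition
trait PartitionReaderFactory {
  def supportColumnarReads(p: InputPartition): Boolean
}

// true only if every partition supports columnar reads;
// a mix of row-based and columnar partitions is rejected
def supportsColumnar(
    readerFactory: PartitionReaderFactory,
    partitions: Seq[InputPartition]): Boolean = {
  val results = partitions.map(readerFactory.supportColumnarReads)
  require(results.distinct.size <= 1,
    "Cannot mix row-based and columnar input partitions.")
  results.forall(identity)
}
```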

Custom Metrics

customMetrics: Map[String, SQLMetric]
Lazy Value

customMetrics is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.

Learn more in the Scala Language Specification.
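The lazy-val semantics can be demonstrated in plain Scala (the names below are purely illustrative):

```scala
// Counts how many times the initializer below actually runs
var initCount = 0

// The initializer executes once, on first access; the value is then cached
lazy val customMetricsDemo: Map[String, Long] = {
  initCount += 1
  Map("rowsRead" -> 0L)
}
```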

customMetrics requests the Scan for supportedCustomMetrics that are then converted to SQLMetrics.


customMetrics is used when:

  • DataSourceV2ScanExecBase is requested for the performance metrics
  • BatchScanExec is requested for the inputRDD
  • ContinuousScanExec is requested for the inputRDD
  • MicroBatchScanExec is requested for the inputRDD (that creates a DataSourceRDD)
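The conversion can be sketched as follows, with CustomMetric and SQLMetric as simplified stand-ins (the real conversion goes through SQLMetrics.createV2CustomMetric and carries accumulator plumbing):

```scala
// Simplified stand-ins for the connector-side and SQL-side metric types
trait CustomMetric {
  def name(): String
  def description(): String
}
final case class SQLMetric(description: String, var value: Long = 0L)

// customMetrics: one SQLMetric per supported custom metric, keyed by name
def customMetrics(supported: Seq[CustomMetric]): Map[String, SQLMetric] =
  supported.map(m => m.name() -> SQLMetric(m.description())).toMap
```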

verboseStringWithOperatorId

QueryPlan
verboseStringWithOperatorId(): String

verboseStringWithOperatorId is part of the QueryPlan abstraction.

verboseStringWithOperatorId requests the Scan for one of the following (metaDataStr):

In the end, verboseStringWithOperatorId is as follows (based on formattedNodeName and output):

[formattedNodeName]
Output: [output]
[metaDataStr]
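Assembling that description can be sketched as a plain string template (formattedNodeName, output and metaDataStr are taken as ready-made inputs here; in reality they come from the operator and the Scan):

```scala
// Builds the verbose node description from its three parts
def verboseStringWithOperatorId(
    formattedNodeName: String,
    output: Seq[String],
    metaDataStr: String): String =
  s"""$formattedNodeName
     |Output: ${output.mkString("[", ", ", "]")}
     |$metaDataStr
     |""".stripMargin
```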

Input Partitions

partitions: Seq[Seq[InputPartition]]

partitions...FIXME


partitions is used when:

groupedPartitions

groupedPartitions: Option[Seq[(InternalRow, Seq[InputPartition])]]
Lazy Value

groupedPartitions is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.

Learn more in the Scala Language Specification.

groupedPartitions takes the keyGroupedPartitioning, if specified, and groups the input partitions.
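The shape of that computation can be sketched as follows. This is a simplification: the real partition key is an InternalRow derived from the partitioning expressions, and the grouping is delegated to groupPartitions; here the key is modeled as a plain Int.

```scala
// Stand-in input partition carrying its (simplified) partition key
final case class InputPartition(key: Int)

// None when keyGroupedPartitioning is undefined; otherwise the input
// partitions grouped by partition key
def groupedPartitions(
    keyGroupedPartitioning: Option[Seq[String]],
    partitions: Seq[InputPartition]): Option[Seq[(Int, Seq[InputPartition])]] =
  keyGroupedPartitioning.map { _ =>
    partitions.groupBy(_.key).toSeq.sortBy(_._1)
  }
```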


groupedPartitions is used when:

groupPartitions

groupPartitions(
  inputPartitions: Seq[InputPartition],
  groupSplits: Boolean = !conf.v2BucketingPushPartValuesEnabled || !conf.v2BucketingPartiallyClusteredDistributionEnabled): Option[Seq[(InternalRow, Seq[InputPartition])]]

Noop

groupPartitions does nothing (and returns None) when called with spark.sql.sources.v2.bucketing.enabled disabled.
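The noop guard can be sketched like this, with the spark.sql.sources.v2.bucketing.enabled configuration modeled as a boolean flag and the partition key simplified to an Int (the real key is an InternalRow):

```scala
// Stand-in input partition carrying its (simplified) partition key
final case class InputPartition(key: Int)

// Returns None when v2 bucketing is disabled; otherwise groups the
// input partitions by partition key
def groupPartitions(
    v2BucketingEnabled: Boolean,
    inputPartitions: Seq[InputPartition]): Option[Seq[(Int, Seq[InputPartition])]] =
  if (!v2BucketingEnabled) None
  else Some(inputPartitions.groupBy(_.key).toSeq)
```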

groupPartitions...FIXME


groupPartitions is used when: