DataSourceV2ScanExecBase Leaf Physical Operators¶
DataSourceV2ScanExecBase is an extension of the LeafExecNode abstraction for leaf physical operators that track the number of output rows when executed (with or without support for columnar reads).
Contract¶
Input Partitions¶
```scala
partitions: Seq[InputPartition]
```
Used when:

- DataSourceV2ScanExecBase is requested for the partitions, groupedPartitions, supportsColumnar
Input RDD¶
```scala
inputRDD: RDD[InternalRow]
```
Custom Partitioning Expressions¶
```scala
keyGroupedPartitioning: Option[Seq[Expression]]
```
Optional partitioning expressions (provided by connectors using SupportsReportPartitioning)
Spark Structured Streaming Not Supported
ContinuousScanExec and MicroBatchScanExec physical operators are not supported (and have keyGroupedPartitioning undefined (None)).
Used when:
- DataSourceV2ScanExecBase is requested to groupedPartitions, groupPartitions, outputPartitioning
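For illustration, a hypothetical connector Scan that reports key-grouped partitioning via SupportsReportPartitioning (the class name, column names, and partition count are made up for this sketch):

```scala
import org.apache.spark.sql.connector.expressions.{Expression, Expressions}
import org.apache.spark.sql.connector.read.{Scan, SupportsReportPartitioning}
import org.apache.spark.sql.connector.read.partitioning.{KeyGroupedPartitioning, Partitioning}
import org.apache.spark.sql.types.StructType

// Reports that rows are clustered by the "key" column across 10 partitions,
// which surfaces as keyGroupedPartitioning of this physical operator.
class MyScan extends Scan with SupportsReportPartitioning {
  override def readSchema(): StructType =
    new StructType().add("key", "int").add("value", "string")
  override def outputPartitioning(): Partitioning =
    new KeyGroupedPartitioning(Array[Expression](Expressions.identity("key")), 10)
}
```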
PartitionReaderFactory¶
```scala
readerFactory: PartitionReaderFactory
```
PartitionReaderFactory for partition readers (of the input partitions)
Used when:
- BatchScanExec physical operator is requested for an input RDD
- ContinuousScanExec and MicroBatchScanExec physical operators (Spark Structured Streaming) are requested for an inputRDD
- DataSourceV2ScanExecBase physical operator is requested to outputPartitioning or supportsColumnar
Scan¶
```scala
scan: Scan
```
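Taken together, the contract can be pictured with the following abridged sketch (not the actual source; the members follow the sections above):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReaderFactory, Scan}
import org.apache.spark.sql.execution.LeafExecNode

// Abridged view of the contract members described above
trait DataSourceV2ScanExecBase extends LeafExecNode {
  def scan: Scan
  def readerFactory: PartitionReaderFactory
  def inputRDD: RDD[InternalRow]
  def partitions: Seq[InputPartition]
  def keyGroupedPartitioning: Option[Seq[Expression]]
}
```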
Implementations¶
- BatchScanExec
- ContinuousScanExec (Spark Structured Streaming)
- MicroBatchScanExec (Spark Structured Streaming)
Executing Physical Operator¶
```scala
doExecute(): RDD[InternalRow]
```

doExecute is part of the SparkPlan abstraction.

doExecute ...FIXME
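Until the FIXME above is filled in, a minimal sketch of what doExecute plausibly does, namely counting every row of the inputRDD against the numOutputRows metric:

```scala
override def doExecute(): RDD[InternalRow] = {
  val numOutputRows = longMetric("numOutputRows")
  inputRDD.map { row =>
    numOutputRows += 1  // one output row per InternalRow produced
    row
  }
}
```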
doExecuteColumnar¶
```scala
doExecuteColumnar(): RDD[ColumnarBatch]
```
doExecuteColumnar is part of the SparkPlan abstraction.
doExecuteColumnar ...FIXME
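By analogy with doExecute, a sketch of the columnar variant, assuming the inputRDD carries ColumnarBatches when supportsColumnar holds:

```scala
override def doExecuteColumnar(): RDD[ColumnarBatch] = {
  val numOutputRows = longMetric("numOutputRows")
  inputRDD.asInstanceOf[RDD[ColumnarBatch]].map { batch =>
    numOutputRows += batch.numRows()  // a batch contributes all its rows
    batch
  }
}
```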
Performance Metrics¶
metrics is the following SQLMetrics, together with the customMetrics:
| Metric Name | web UI |
|---|---|
| numOutputRows | number of output rows |
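In code, the metrics map could be sketched as follows (SQLMetrics.createMetric is Spark's standard helper for counter metrics):

```scala
override lazy val metrics: Map[String, SQLMetric] =
  Map("numOutputRows" ->
    SQLMetrics.createMetric(sparkContext, "number of output rows")) ++
    customMetrics
```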
Output Data Partitioning¶
```scala
outputPartitioning: physical.Partitioning
```
outputPartitioning is part of the SparkPlan abstraction.
outputPartitioning ...FIXME
Output Data Ordering¶
outputOrdering ...FIXME
Simple Node Description¶
simpleString ...FIXME
supportsColumnar¶
supportsColumnar is true if the PartitionReaderFactory can supportColumnarReads for all the input partitions. Otherwise, supportsColumnar is false.
supportsColumnar makes sure that either all the input partitions support columnar reads or none do, and throws an IllegalArgumentException otherwise:

```text
Cannot mix row-based and columnar input partitions.
```
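A sketch of that all-or-nothing check (flattening the grouped partitions and probing the PartitionReaderFactory for each; require throws an IllegalArgumentException on failure):

```scala
override def supportsColumnar: Boolean = {
  val inputPartitions = partitions.flatten
  // how many of the input partitions can be read in columnar fashion
  val columnar = inputPartitions.count(readerFactory.supportColumnarReads)
  require(
    columnar == 0 || columnar == inputPartitions.size,
    "Cannot mix row-based and columnar input partitions.")
  columnar == inputPartitions.size
}
```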
Custom Metrics¶
```scala
customMetrics: Map[String, SQLMetric]
```
Lazy Value
customMetrics is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.
Learn more in the Scala Language Specification.
customMetrics requests the Scan for the supportedCustomMetrics that are then converted to SQLMetrics.
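A sketch of the conversion, using SQLMetrics.createV2CustomMetric to wrap every connector-reported CustomMetric into a SQLMetric keyed by its name:

```scala
lazy val customMetrics: Map[String, SQLMetric] =
  scan.supportedCustomMetrics().map { customMetric =>
    customMetric.name() -> SQLMetrics.createV2CustomMetric(sparkContext, customMetric)
  }.toMap
```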
customMetrics is used when:

- DataSourceV2ScanExecBase is requested for the performance metrics
- BatchScanExec is requested for the inputRDD
- ContinuousScanExec is requested for the inputRDD
- MicroBatchScanExec is requested for the inputRDD (that creates a DataSourceRDD)
verboseStringWithOperatorId¶
```scala
verboseStringWithOperatorId(): String
```
verboseStringWithOperatorId is part of the QueryPlan abstraction.
verboseStringWithOperatorId requests the Scan for one of the following (metaDataStr):
- Metadata when SupportsMetadata
- Description, otherwise
In the end, verboseStringWithOperatorId is as follows (based on formattedNodeName and output):

```text
[formattedNodeName]
Output: [output]
[metaDataStr]
```
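A sketch of how the pieces fit together (the SupportsMetadata branch is a type match on the Scan; the redaction and empty-value filtering of real Spark are omitted):

```scala
override def verboseStringWithOperatorId(): String = {
  val metaDataStr = scan match {
    case s: SupportsMetadata =>
      s.getMetaData().toSeq.sorted
        .map { case (key, value) => s"$key: $value" }
        .mkString("\n")
    case _ =>
      scan.description()
  }
  s"""
     |$formattedNodeName
     |Output: ${output.mkString("[", ", ", "]")}
     |$metaDataStr
     |""".stripMargin
}
```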
Input Partitions¶
```scala
partitions: Seq[Seq[InputPartition]]
```
partitions ...FIXME
partitions is used when:

- BatchScanExec physical operator is requested to filteredPartitions
- ContinuousScanExec physical operator (Spark Structured Streaming) is requested for the inputRDD
- MicroBatchScanExec physical operator (Spark Structured Streaming) is requested for the inputRDD
groupedPartitions¶
```scala
groupedPartitions: Option[Seq[(InternalRow, Seq[InputPartition])]]
```
Lazy Value
groupedPartitions is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.
Learn more in the Scala Language Specification.
groupedPartitions takes the keyGroupedPartitioning, if specified, and groups the input partitions.
groupedPartitions is used when:

- DataSourceV2ScanExecBase physical operator is requested for the output data ordering, output data partitioning requirements, partitions
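As a sketch, the lazy value simply defers to groupPartitions (below) whenever keyGroupedPartitioning is defined (inputPartitions stands in for the Input Partitions contract above):

```scala
lazy val groupedPartitions: Option[Seq[(InternalRow, Seq[InputPartition])]] =
  keyGroupedPartitioning.flatMap { _ =>
    groupPartitions(inputPartitions)  // None unless grouping applies
  }
```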
groupPartitions¶
```scala
groupPartitions(
  inputPartitions: Seq[InputPartition],
  groupSplits: Boolean = !conf.v2BucketingPushPartValuesEnabled ||
    !conf.v2BucketingPartiallyClusteredDistributionEnabled): Option[Seq[(InternalRow, Seq[InputPartition])]]
```
Noop
groupPartitions does nothing (and returns None) when called with spark.sql.sources.v2.bucketing.enabled disabled.
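For example, to give partition grouping a chance to kick in (spark is an active SparkSession; the property defaults to false):

```scala
// enable V2 bucketing so groupPartitions is not a noop
spark.conf.set("spark.sql.sources.v2.bucketing.enabled", true)
```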
groupPartitions ...FIXME
groupPartitions is used when:

- BatchScanExec physical operator is requested for the filtered input partitions and the input RDD
- DataSourceV2ScanExecBase is requested for the groupedPartitions