DataSourceV2ScanExecBase Leaf Physical Operators¶
DataSourceV2ScanExecBase is an extension of the LeafExecNode abstraction for leaf physical operators that track the number of output rows when executed (with or without support for columnar reads).
Contract¶
Input Partitions¶
```scala
partitions: Seq[InputPartition]
```
Used when:

- DataSourceV2ScanExecBase is requested for the partitions, groupedPartitions, supportsColumnar
Input RDD¶
```scala
inputRDD: RDD[InternalRow]
```
Custom Partitioning Expressions¶
```scala
keyGroupedPartitioning: Option[Seq[Expression]]
```
Optional partitioning expressions (provided by connectors using SupportsReportPartitioning)
Spark Structured Streaming Not Supported
ContinuousScanExec and MicroBatchScanExec physical operators are not supported (and have keyGroupedPartitioning undefined (None)).
Used when:
- DataSourceV2ScanExecBase is requested to groupedPartitions, groupPartitions, outputPartitioning
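For illustration, a hypothetical connector Scan that reports key-grouped partitioning via SupportsReportPartitioning (the class name, column names, and partition count are made up for this sketch):

```scala
import org.apache.spark.sql.connector.expressions.{Expression, Expressions}
import org.apache.spark.sql.connector.read.{Scan, SupportsReportPartitioning}
import org.apache.spark.sql.connector.read.partitioning.{KeyGroupedPartitioning, Partitioning}
import org.apache.spark.sql.types.StructType

// Reports that rows are clustered by the "key" column across 10 partitions,
// which surfaces as keyGroupedPartitioning of this physical operator.
class MyScan extends Scan with SupportsReportPartitioning {
  override def readSchema(): StructType =
    new StructType().add("key", "int").add("value", "string")
  override def outputPartitioning(): Partitioning =
    new KeyGroupedPartitioning(Array[Expression](Expressions.identity("key")), 10)
}
```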
PartitionReaderFactory¶
```scala
readerFactory: PartitionReaderFactory
```
PartitionReaderFactory for partition readers (of the input partitions)
Used when:
- BatchScanExec physical operator is requested for an input RDD
- ContinuousScanExec and MicroBatchScanExec physical operators (Spark Structured Streaming) are requested for an inputRDD
- DataSourceV2ScanExecBase physical operator is requested to outputPartitioning or supportsColumnar
Scan¶
```scala
scan: Scan
```
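Taken together, the contract can be pictured with the following abridged sketch (not the actual source; the members follow the sections above):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReaderFactory, Scan}
import org.apache.spark.sql.execution.LeafExecNode

// Abridged view of the contract members described above
trait DataSourceV2ScanExecBase extends LeafExecNode {
  def scan: Scan
  def readerFactory: PartitionReaderFactory
  def inputRDD: RDD[InternalRow]
  def partitions: Seq[InputPartition]
  def keyGroupedPartitioning: Option[Seq[Expression]]
}
```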
Implementations¶
- BatchScanExec
- ContinuousScanExec (Spark Structured Streaming)
- MicroBatchScanExec (Spark Structured Streaming)
Executing Physical Operator¶
```scala
doExecute(): RDD[InternalRow]
```

doExecute is part of the SparkPlan abstraction.

doExecute ...FIXME
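Until the FIXME above is filled in, a minimal sketch of what doExecute plausibly does, namely counting every row of the inputRDD against the numOutputRows metric:

```scala
override def doExecute(): RDD[InternalRow] = {
  val numOutputRows = longMetric("numOutputRows")
  inputRDD.map { row =>
    numOutputRows += 1  // one output row per InternalRow produced
    row
  }
}
```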
doExecuteColumnar¶
```scala
doExecuteColumnar(): RDD[ColumnarBatch]
```
doExecuteColumnar is part of the SparkPlan abstraction.
doExecuteColumnar ...FIXME
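By analogy with doExecute, a sketch of the columnar variant, assuming the inputRDD carries ColumnarBatches when supportsColumnar holds:

```scala
override def doExecuteColumnar(): RDD[ColumnarBatch] = {
  val numOutputRows = longMetric("numOutputRows")
  inputRDD.asInstanceOf[RDD[ColumnarBatch]].map { batch =>
    numOutputRows += batch.numRows()  // a batch contributes all its rows
    batch
  }
}
```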
Performance Metrics¶
metrics is the following SQLMetrics, together with the customMetrics:
| Metric Name | web UI |
|---|---|
| numOutputRows | number of output rows |
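In code, the metrics map could be sketched as follows (SQLMetrics.createMetric is Spark's standard helper for counter metrics):

```scala
override lazy val metrics: Map[String, SQLMetric] =
  Map("numOutputRows" ->
    SQLMetrics.createMetric(sparkContext, "number of output rows")) ++
    customMetrics
```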
Output Data Partitioning¶
```scala
outputPartitioning: physical.Partitioning
```
outputPartitioning is part of the SparkPlan abstraction.
outputPartitioning ...FIXME
Output Data Ordering¶
outputOrdering ...FIXME
Simple Node Description¶
simpleString ...FIXME
supportsColumnar¶
supportsColumnar is true if the PartitionReaderFactory can supportColumnarReads for all the input partitions. Otherwise, supportsColumnar is false.
supportsColumnar makes sure that either all the input partitions support columnar reads or none do, and throws an IllegalArgumentException otherwise:

```text
Cannot mix row-based and columnar input partitions.
```
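A sketch of that all-or-nothing check (flattening the grouped partitions and probing the PartitionReaderFactory for each; require throws an IllegalArgumentException on failure):

```scala
override def supportsColumnar: Boolean = {
  val inputPartitions = partitions.flatten
  // how many of the input partitions can be read in columnar fashion
  val columnar = inputPartitions.count(readerFactory.supportColumnarReads)
  require(
    columnar == 0 || columnar == inputPartitions.size,
    "Cannot mix row-based and columnar input partitions.")
  columnar == inputPartitions.size
}
```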
Custom Metrics¶
```scala
customMetrics: Map[String, SQLMetric]
```
Lazy Value
customMetrics is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.
Learn more in the Scala Language Specification.
customMetrics requests the Scan for the supportedCustomMetrics that are then converted to SQLMetrics.
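A sketch of the conversion, using SQLMetrics.createV2CustomMetric to wrap every connector-reported CustomMetric into a SQLMetric keyed by its name:

```scala
lazy val customMetrics: Map[String, SQLMetric] =
  scan.supportedCustomMetrics().map { customMetric =>
    customMetric.name() -> SQLMetrics.createV2CustomMetric(sparkContext, customMetric)
  }.toMap
```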
customMetrics is used when:

- DataSourceV2ScanExecBase is requested for the performance metrics
- BatchScanExec is requested for the inputRDD
- ContinuousScanExec is requested for the inputRDD
- MicroBatchScanExec is requested for the inputRDD (that creates a DataSourceRDD)
verboseStringWithOperatorId¶
```scala
verboseStringWithOperatorId(): String
```
verboseStringWithOperatorId is part of the QueryPlan abstraction.
verboseStringWithOperatorId requests the Scan for one of the following (metaDataStr):
- Metadata when SupportsMetadata
- Description, otherwise
In the end, verboseStringWithOperatorId is as follows (based on formattedNodeName and output):

```text
[formattedNodeName]
Output: [output]
[metaDataStr]
```
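A sketch of how the pieces fit together (the SupportsMetadata branch is a type match on the Scan; the redaction and empty-value filtering of real Spark are omitted):

```scala
override def verboseStringWithOperatorId(): String = {
  val metaDataStr = scan match {
    case s: SupportsMetadata =>
      s.getMetaData().toSeq.sorted
        .map { case (key, value) => s"$key: $value" }
        .mkString("\n")
    case _ =>
      scan.description()
  }
  s"""
     |$formattedNodeName
     |Output: ${output.mkString("[", ", ", "]")}
     |$metaDataStr
     |""".stripMargin
}
```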
Input Partitions¶
```scala
partitions: Seq[Seq[InputPartition]]
```
partitions ...FIXME
partitions is used when:

- BatchScanExec physical operator is requested to filteredPartitions
- ContinuousScanExec physical operator (Spark Structured Streaming) is requested for the inputRDD
- MicroBatchScanExec physical operator (Spark Structured Streaming) is requested for the inputRDD
groupedPartitions¶
```scala
groupedPartitions: Option[Seq[(InternalRow, Seq[InputPartition])]]
```
Lazy Value
groupedPartitions is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.
Learn more in the Scala Language Specification.
groupedPartitions takes the keyGroupedPartitioning, if specified, and groups the input partitions.
groupedPartitions is used when:

- DataSourceV2ScanExecBase physical operator is requested for the output data ordering, output data partitioning requirements, partitions
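As a sketch, the lazy value simply defers to groupPartitions (below) whenever keyGroupedPartitioning is defined (inputPartitions stands in for the Input Partitions contract above):

```scala
lazy val groupedPartitions: Option[Seq[(InternalRow, Seq[InputPartition])]] =
  keyGroupedPartitioning.flatMap { _ =>
    groupPartitions(inputPartitions)  // None unless grouping applies
  }
```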
groupPartitions¶
```scala
groupPartitions(
  inputPartitions: Seq[InputPartition],
  groupSplits: Boolean = !conf.v2BucketingPushPartValuesEnabled ||
    !conf.v2BucketingPartiallyClusteredDistributionEnabled): Option[Seq[(InternalRow, Seq[InputPartition])]]
```
Noop
groupPartitions does nothing (and returns None) when called with spark.sql.sources.v2.bucketing.enabled disabled.
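For example, to give partition grouping a chance to kick in (spark is an active SparkSession; the property defaults to false):

```scala
// enable V2 bucketing so groupPartitions is not a noop
spark.conf.set("spark.sql.sources.v2.bucketing.enabled", true)
```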
groupPartitions ...FIXME
groupPartitions is used when:

- BatchScanExec physical operator is requested for the filtered input partitions and the input RDD
- DataSourceV2ScanExecBase is requested for the groupedPartitions