DataSourceV2ScanExecBase Leaf Physical Operators¶
DataSourceV2ScanExecBase is an extension of the LeafExecNode abstraction for leaf physical operators that track the number of output rows when executed (with or without support for columnar reads).
Contract¶
Input Partitions¶
inputPartitions: Seq[InputPartition]
Used when:

- DataSourceV2ScanExecBase is requested for the partitions, groupedPartitions, supportsColumnar
Input RDD¶
inputRDD: RDD[InternalRow]
Custom Partitioning Expressions¶
keyGroupedPartitioning: Option[Seq[Expression]]
Optional partitioning expressions (provided by connectors using SupportsReportPartitioning)
Spark Structured Streaming Not Supported

keyGroupedPartitioning is always undefined (None) for the ContinuousScanExec and MicroBatchScanExec physical operators (Spark Structured Streaming).
Used when:

- DataSourceV2ScanExecBase is requested to groupedPartitions, groupPartitions, outputPartitioning
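For illustration, a minimal sketch of a connector reporting key-grouped partitioning (the RegionGroupedScan class, the region column, and the partition count are made up):

```scala
import org.apache.spark.sql.connector.expressions.{Expression, Expressions}
import org.apache.spark.sql.connector.read.SupportsReportPartitioning
import org.apache.spark.sql.connector.read.partitioning.{KeyGroupedPartitioning, Partitioning}
import org.apache.spark.sql.types.StructType

// Hypothetical connector scan that reports its data as pre-grouped by a region column
class RegionGroupedScan(schema: StructType) extends SupportsReportPartitioning {
  override def readSchema(): StructType = schema

  // The reported KeyGroupedPartitioning can surface as the keyGroupedPartitioning
  // of the scan operator at planning time
  override def outputPartitioning(): Partitioning =
    new KeyGroupedPartitioning(Array[Expression](Expressions.identity("region")), 4)
}
```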
PartitionReaderFactory¶
readerFactory: PartitionReaderFactory
PartitionReaderFactory for partition readers (of the input partitions)
Used when:

- BatchScanExec physical operator is requested for an input RDD
- ContinuousScanExec and MicroBatchScanExec physical operators (Spark Structured Streaming) are requested for an inputRDD
- DataSourceV2ScanExecBase physical operator is requested to outputPartitioning or supportsColumnar
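A minimal sketch of a row-based factory (the MyRowReaderFactory class and its empty row iterator are hypothetical):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory}

// Hypothetical row-based factory; supportColumnarReads is left at its default (false),
// so supportsColumnar of the scan operator will be false for all input partitions
class MyRowReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
    new PartitionReader[InternalRow] {
      private val rows: Iterator[InternalRow] = Iterator.empty // connector-specific rows
      override def next(): Boolean = rows.hasNext
      override def get(): InternalRow = rows.next()
      override def close(): Unit = ()
    }
}
```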
Scan¶
scan: Scan
Implementations¶
- BatchScanExec
- ContinuousScanExec (Spark Structured Streaming)
- MicroBatchScanExec (Spark Structured Streaming)
Executing Physical Operator¶
SparkPlan
doExecute(): RDD[InternalRow]
doExecute is part of the SparkPlan abstraction.
doExecute...FIXME
doExecuteColumnar¶
SparkPlan
doExecuteColumnar(): RDD[ColumnarBatch]
doExecuteColumnar is part of the SparkPlan abstraction.
doExecuteColumnar...FIXME
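Pending the FIXME, a plausible sketch of the method body (assuming it forwards the inputRDD batches while updating the numOutputRows metric):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.vectorized.ColumnarBatch

// Sketch: pass batches through, counting their rows into the numOutputRows metric
override def doExecuteColumnar(): RDD[ColumnarBatch] = {
  val numOutputRows = longMetric("numOutputRows")
  // In columnar mode the input RDD carries ColumnarBatches, hence the cast
  inputRDD.asInstanceOf[RDD[ColumnarBatch]].map { batch =>
    numOutputRows += batch.numRows()
    batch
  }
}
```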
Performance Metrics¶
metrics is the following SQLMetrics, combined with the customMetrics:
| Metric Name | web UI |
|---|---|
| numOutputRows | number of output rows |
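A sketch of how the metrics map could be assembled (assuming SQLMetrics.createMetric for the built-in counter):

```scala
import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}

// Sketch: the built-in row counter plus the connector's custom metrics
override lazy val metrics: Map[String, SQLMetric] = Map(
  "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows")
) ++ customMetrics
```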
Output Data Partitioning¶
SparkPlan
outputPartitioning: physical.Partitioning
outputPartitioning is part of the SparkPlan abstraction.
outputPartitioning...FIXME
Output Data Ordering¶
outputOrdering...FIXME
Simple Node Description¶
simpleString...FIXME
supportsColumnar¶
supportsColumnar is true if the PartitionReaderFactory can supportColumnarReads for all the input partitions. Otherwise, supportsColumnar is false.
supportsColumnar makes sure that either all or none of the input partitions support columnar reads (supportColumnarReads); otherwise, it throws an IllegalArgumentException:
Cannot mix row-based and columnar input partitions.
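A sketch of the all-or-nothing check (assuming the inputPartitions and readerFactory of this physical operator are in scope):

```scala
// Sketch: columnar only when every input partition supports columnar reads;
// require throws an IllegalArgumentException on a mixed set of partitions
override def supportsColumnar: Boolean = {
  require(
    inputPartitions.forall(readerFactory.supportColumnarReads) ||
      !inputPartitions.exists(readerFactory.supportColumnarReads),
    "Cannot mix row-based and columnar input partitions.")
  inputPartitions.exists(readerFactory.supportColumnarReads)
}
```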
Custom Metrics¶
customMetrics: Map[String, SQLMetric]
Lazy Value
customMetrics is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.
Learn more in the Scala Language Specification.
customMetrics requests the Scan for supportedCustomMetrics that are then converted to SQLMetrics.
customMetrics is used when:

- DataSourceV2ScanExecBase is requested for the performance metrics
- BatchScanExec is requested for the inputRDD
- ContinuousScanExec is requested for the inputRDD
- MicroBatchScanExec is requested for the inputRDD (that creates a DataSourceRDD)
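A sketch of the conversion (assuming SQLMetrics.createV2CustomMetric):

```scala
import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}

// Sketch: one SQLMetric per CustomMetric the Scan reports
lazy val customMetrics: Map[String, SQLMetric] =
  scan.supportedCustomMetrics().map { metric =>
    metric.name() -> SQLMetrics.createV2CustomMetric(sparkContext, metric)
  }.toMap
```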
verboseStringWithOperatorId¶
QueryPlan
verboseStringWithOperatorId(): String
verboseStringWithOperatorId is part of the QueryPlan abstraction.
verboseStringWithOperatorId requests the Scan for one of the following (metaDataStr):
- Metadata when SupportsMetadata
- Description, otherwise
In the end, verboseStringWithOperatorId is as follows (based on formattedNodeName and output):
[formattedNodeName]
Output: [output]
[metaDataStr]
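A sketch of the metaDataStr selection (SupportsMetadata is assumed to be Spark's internal trait with getMetaData(): Map[String, String]; the key-value formatting is simplified):

```scala
// Sketch: prefer the Scan's metadata when available, else its description
val metaDataStr = scan match {
  case s: SupportsMetadata =>
    s.getMetaData().toSeq.sorted.map { case (key, value) => s"$key: $value" }.mkString("\n")
  case _ =>
    scan.description()
}
```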
Input Partitions¶
partitions: Seq[Seq[InputPartition]]
partitions...FIXME
partitions is used when:

- BatchScanExec physical operator is requested to filteredPartitions
- ContinuousScanExec physical operator (Spark Structured Streaming) is requested for the inputRDD
- MicroBatchScanExec physical operator (Spark Structured Streaming) is requested for the inputRDD
groupedPartitions¶
groupedPartitions: Option[Seq[(InternalRow, Seq[InputPartition])]]
Lazy Value
groupedPartitions is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.
Learn more in the Scala Language Specification.
groupedPartitions takes the keyGroupedPartitioning, if specified, and groups the input partitions (groupPartitions).
groupedPartitions is used when:

- DataSourceV2ScanExecBase physical operator is requested for the output data ordering, output data partitioning requirements, partitions
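A sketch consistent with the description above (the actual implementation may differ):

```scala
// Sketch: group the input partitions only when keyGroupedPartitioning is defined
lazy val groupedPartitions: Option[Seq[(InternalRow, Seq[InputPartition])]] =
  keyGroupedPartitioning.flatMap(_ => groupPartitions(inputPartitions))
```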
groupPartitions¶
groupPartitions(
inputPartitions: Seq[InputPartition],
groupSplits: Boolean = !conf.v2BucketingPushPartValuesEnabled || !conf.v2BucketingPartiallyClusteredDistributionEnabled): Option[Seq[(InternalRow, Seq[InputPartition])]]
Noop
groupPartitions does nothing (and returns None) when called with spark.sql.sources.v2.bucketing.enabled disabled.
groupPartitions...FIXME
groupPartitions is used when:

- BatchScanExec physical operator is requested for the filtered input partitions and input RDD
- DataSourceV2ScanExecBase is requested for the groupedPartitions
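A sketch of the noop guard described above (assuming SQLConf's v2BucketingEnabled getter for spark.sql.sources.v2.bucketing.enabled; the grouping itself is elided, per the FIXME in the main text):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.InputPartition

def groupPartitions(
    inputPartitions: Seq[InputPartition],
    groupSplits: Boolean): Option[Seq[(InternalRow, Seq[InputPartition])]] = {
  if (!conf.v2BucketingEnabled) {
    return None // noop when spark.sql.sources.v2.bucketing.enabled is disabled
  }
  keyGroupedPartitioning.map { _ =>
    ??? // group inputPartitions by their partition key values (elided)
  }
}
```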