MicroBatchScanExec Physical Operator¶
MicroBatchScanExec is a DataSourceV2ScanExecBase (Spark SQL) physical operator that represents a StreamingDataSourceV2Relation logical operator at execution time.

MicroBatchScanExec is a very thin wrapper over a MicroBatchStream and delegates all the execution-specific preparation to it.
Creating Instance¶
MicroBatchScanExec takes the following to be created:
- Output Attributes (Spark SQL)
- Scan (Spark SQL)
- MicroBatchStream
- Start Offset
- End Offset
- Key-Grouped Partitioning
MicroBatchScanExec is created when:

- DataSourceV2Strategy (Spark SQL) execution planning strategy is requested to plan a logical query plan with a StreamingDataSourceV2Relation
Note
All the properties used to create a MicroBatchScanExec are copied directly from the StreamingDataSourceV2Relation.
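The following is a minimal sketch of the constructor properties listed above (not the exact Spark source: the class name is changed, the SparkPlan machinery is omitted, and the Key-Grouped Partitioning is assumed to be an optional sequence of Catalyst expressions):

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
import org.apache.spark.sql.connector.read.Scan
import org.apache.spark.sql.connector.read.streaming.{MicroBatchStream, Offset}

// Simplified sketch only: the real operator extends DataSourceV2ScanExecBase
// and adds execution logic on top of these properties.
case class MicroBatchScanExecSketch(
    output: Seq[Attribute],                                  // Output Attributes
    scan: Scan,                                              // Scan
    stream: MicroBatchStream,                                // MicroBatchStream
    start: Offset,                                           // Start Offset
    end: Offset,                                             // End Offset
    keyGroupedPartitioning: Option[Seq[Expression]] = None)  // Key-Grouped Partitioning
```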
MicroBatchStream¶
MicroBatchScanExec is given a MicroBatchStream when created.

The MicroBatchStream is the SparkDataStream of the StreamingDataSourceV2Relation logical operator that this physical operator was created from.
MicroBatchScanExec uses the MicroBatchStream to handle inputPartitions, readerFactory and inputRDD (the methods that MicroBatchScanExec overrides).
Input Partitions¶
inputPartitions: Seq[InputPartition]
inputPartitions is part of the DataSourceV2ScanExecBase (Spark SQL) abstraction.

inputPartitions requests the MicroBatchStream to planInputPartitions for the start and end offsets.
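The delegation can be pictured with a hypothetical helper (planPartitions is made up for illustration) that asks the MicroBatchStream to plan the InputPartitions for the given offsets:

```scala
import org.apache.spark.sql.connector.read.InputPartition
import org.apache.spark.sql.connector.read.streaming.{MicroBatchStream, Offset}

// Delegate partition planning to the MicroBatchStream for the given start and end offsets.
def planPartitions(stream: MicroBatchStream, start: Offset, end: Offset): Seq[InputPartition] =
  stream.planInputPartitions(start, end).toSeq
```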
PartitionReaderFactory¶
readerFactory: PartitionReaderFactory
readerFactory is part of the DataSourceV2ScanExecBase (Spark SQL) abstraction.

readerFactory requests the MicroBatchStream to createReaderFactory.
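Again as a hypothetical helper (readerFactoryOf is made up for illustration), the factory comes straight from the MicroBatchStream:

```scala
import org.apache.spark.sql.connector.read.PartitionReaderFactory
import org.apache.spark.sql.connector.read.streaming.MicroBatchStream

// The PartitionReaderFactory is created by the MicroBatchStream itself.
def readerFactoryOf(stream: MicroBatchStream): PartitionReaderFactory =
  stream.createReaderFactory()
```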
Input RDD¶
inputRDD: RDD[InternalRow]
inputRDD is part of the DataSourceV2ScanExecBase (Spark SQL) abstraction.

inputRDD creates a DataSourceRDD (Spark SQL) with the input partitions and the PartitionReaderFactory (among other arguments).