MicroBatchScanExec Physical Operator¶
MicroBatchScanExec is a DataSourceV2ScanExecBase (Spark SQL) physical operator that represents a StreamingDataSourceV2Relation logical operator at execution time.

MicroBatchScanExec is a very thin wrapper over a MicroBatchStream and delegates all the execution-specific preparation to it.
Creating Instance¶
MicroBatchScanExec takes the following to be created:
- Output Attributes (Spark SQL)
- Scan (Spark SQL)
- MicroBatchStream
- Start Offset
- End Offset
- Key-Grouped Partitioning
MicroBatchScanExec is created when:

- DataSourceV2Strategy (Spark SQL) execution planning strategy is requested to plan a logical query plan with a StreamingDataSourceV2Relation
Note
All the properties used to create a MicroBatchScanExec are copied directly from the StreamingDataSourceV2Relation.
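The following is a minimal sketch of the constructor properties listed above (not the exact Spark source: the class name is changed, the SparkPlan machinery is omitted, and the Key-Grouped Partitioning is assumed to be an optional sequence of Catalyst expressions):

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
import org.apache.spark.sql.connector.read.Scan
import org.apache.spark.sql.connector.read.streaming.{MicroBatchStream, Offset}

// Simplified sketch only: the real operator extends DataSourceV2ScanExecBase
// and adds execution logic on top of these properties.
case class MicroBatchScanExecSketch(
    output: Seq[Attribute],                                  // Output Attributes
    scan: Scan,                                              // Scan
    stream: MicroBatchStream,                                // MicroBatchStream
    start: Offset,                                           // Start Offset
    end: Offset,                                             // End Offset
    keyGroupedPartitioning: Option[Seq[Expression]] = None)  // Key-Grouped Partitioning
```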
MicroBatchStream¶
MicroBatchScanExec is given a MicroBatchStream when created.

The MicroBatchStream is the SparkDataStream of the StreamingDataSourceV2Relation logical operator that this physical operator was created from.
MicroBatchScanExec uses the MicroBatchStream to handle inputPartitions, readerFactory and inputRDD (the methods that MicroBatchScanExec overrides).
Input Partitions¶
inputPartitions: Seq[InputPartition]
inputPartitions is part of the DataSourceV2ScanExecBase (Spark SQL) abstraction.

inputPartitions requests the MicroBatchStream to planInputPartitions for the start and end offsets.
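The delegation can be pictured with a hypothetical helper (planPartitions is made up for illustration) that asks the MicroBatchStream to plan the InputPartitions for the given offsets:

```scala
import org.apache.spark.sql.connector.read.InputPartition
import org.apache.spark.sql.connector.read.streaming.{MicroBatchStream, Offset}

// Delegate partition planning to the MicroBatchStream for the given start and end offsets.
def planPartitions(stream: MicroBatchStream, start: Offset, end: Offset): Seq[InputPartition] =
  stream.planInputPartitions(start, end).toSeq
```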
PartitionReaderFactory¶
readerFactory: PartitionReaderFactory
readerFactory is part of the DataSourceV2ScanExecBase (Spark SQL) abstraction.

readerFactory requests the MicroBatchStream to createReaderFactory.
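Again as a hypothetical helper (readerFactoryOf is made up for illustration), the factory comes straight from the MicroBatchStream:

```scala
import org.apache.spark.sql.connector.read.PartitionReaderFactory
import org.apache.spark.sql.connector.read.streaming.MicroBatchStream

// The PartitionReaderFactory is created by the MicroBatchStream itself.
def readerFactoryOf(stream: MicroBatchStream): PartitionReaderFactory =
  stream.createReaderFactory()
```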
Input RDD¶
inputRDD: RDD[InternalRow]
inputRDD is part of the DataSourceV2ScanExecBase (Spark SQL) abstraction.

inputRDD creates a DataSourceRDD (Spark SQL) with the input partitions and the PartitionReaderFactory (among other arguments).