DataSourceRDD¶

DataSourceRDD is an RDD of InternalRows (RDD[InternalRow]) that acts as a thin adapter between Spark SQL's DataSource V2 and Spark Core's RDD API.
DataSourceRDD is used as an input RDD of the following physical operators:

- BatchScanExec
- MicroBatchScanExec (Spark Structured Streaming)
DataSourceRDD uses DataSourceRDDPartitions for the RDD partitions (each is a mere wrapper of InputPartitions).
Creating Instance¶

DataSourceRDD takes the following to be created:

- SparkContext (Spark Core)
- InputPartitions
- PartitionReaderFactory
- columnarReads flag
- Custom SQLMetrics
DataSourceRDD is created when:

- BatchScanExec physical operator is requested for an input RDD
- MicroBatchScanExec (Spark Structured Streaming) physical operator is requested for an input RDD
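
The following is a minimal sketch (not Spark's source) of how a physical operator could wire the five constructor arguments listed above into a DataSourceRDD. DataSourceRDD is an internal class, so the exact constructor signature may differ across Spark versions, and the parameter names here are illustrative.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReaderFactory}
import org.apache.spark.sql.execution.datasources.v2.DataSourceRDD
import org.apache.spark.sql.execution.metric.SQLMetric

// Hedged sketch: assembling the five constructor arguments.
def buildInputRDD(
    sc: SparkContext,                              // SparkContext (Spark Core)
    filteredPartitions: Seq[Seq[InputPartition]],  // InputPartitions
    readerFactory: PartitionReaderFactory,         // PartitionReaderFactory
    supportsColumnar: Boolean,                     // columnarReads flag
    customMetrics: Map[String, SQLMetric]          // custom SQLMetrics
  ): DataSourceRDD = {
  new DataSourceRDD(sc, filteredPartitions, readerFactory, supportsColumnar, customMetrics)
}
```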
InputPartitions¶

inputPartitions: Seq[Seq[InputPartition]]

DataSourceRDD is given a collection of InputPartitions when created.

The InputPartitions are used to create RDD partitions (one for every element in the inputPartitions collection).

Note

Number of RDD partitions is exactly the number of elements in the inputPartitions collection.

The InputPartitions are the filtered partitions in BatchScanExec.
columnarReads¶

DataSourceRDD is given a columnarReads flag when created.

columnarReads is used to determine the type of scan (row-based or columnar) when computing a partition.

columnarReads is enabled (using supportsColumnar) when the PartitionReaderFactory can support columnar scans.
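
As a rough illustration (not Spark's source), the flag would typically be on only when the PartitionReaderFactory reports columnar support for every input partition; the helper below is hypothetical, while supportsColumnarReads is part of the PartitionReaderFactory contract.

```scala
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReaderFactory}

// Hedged sketch: derive the columnarReads flag from the reader factory.
def columnarReads(
    readerFactory: PartitionReaderFactory,
    inputPartitions: Seq[InputPartition]): Boolean =
  inputPartitions.forall(readerFactory.supportsColumnarReads)
```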
RDD Partitions¶

getPartitions: Array[Partition]

getPartitions is part of the RDD (Spark Core) abstraction.

getPartitions creates a DataSourceRDDPartition for every element of the inputPartitions collection.
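
A minimal sketch of what that amounts to, with illustrative names (InputPartitionsWrapper stands in for DataSourceRDDPartition): wrap each element of the inputPartitions collection in its own RDD partition.

```scala
import org.apache.spark.Partition
import org.apache.spark.sql.connector.read.InputPartition

// Hedged sketch: an RDD partition that merely wraps a group of InputPartitions.
final case class InputPartitionsWrapper(
    index: Int,
    inputPartitions: Seq[InputPartition]) extends Partition

// One RDD partition per element of the inputPartitions collection.
def getPartitionsSketch(inputPartitions: Seq[Seq[InputPartition]]): Array[Partition] =
  inputPartitions.zipWithIndex.map { case (ps, idx) =>
    InputPartitionsWrapper(idx, ps): Partition
  }.toArray
```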
Preferred Locations For Partition¶

getPreferredLocations(
  split: Partition): Seq[String]

getPreferredLocations is part of the RDD (Spark Core) abstraction.

getPreferredLocations assumes that the given split partition is a DataSourceRDDPartition.
getPreferredLocations requests the given DataSourceRDDPartition for its InputPartitions, which are then requested for their preferred locations.
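
A rough sketch of that lookup (the helper name is illustrative; preferredLocations is part of the InputPartition contract):

```scala
import org.apache.spark.sql.connector.read.InputPartition

// Hedged sketch: the preferred locations are simply those reported by the
// InputPartitions wrapped by the RDD partition.
def preferredLocationsSketch(inputPartitions: Seq[InputPartition]): Seq[String] =
  inputPartitions.flatMap(_.preferredLocations())
```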
Computing Partition¶

compute(
  split: Partition,
  context: TaskContext): Iterator[T]

compute is part of the RDD (Spark Core) abstraction.

compute ...FIXME
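
Pending the FIXME above, here is a hedged sketch of what the row-based path boils down to (assumed from the columnarReads description earlier): create a PartitionReader for an InputPartition, expose it as an Iterator, and close it on task completion. The real compute also covers the columnar path (createColumnarReader) and updates SQL metrics; all names below are illustrative.

```scala
import org.apache.spark.TaskContext
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory}

// Hedged sketch of the row-based scan of a single InputPartition.
def computeRowsSketch(
    inputPartition: InputPartition,
    readerFactory: PartitionReaderFactory,
    context: TaskContext): Iterator[InternalRow] = {
  // Row-based reader; a columnar scan would use createColumnarReader instead.
  val reader: PartitionReader[InternalRow] = readerFactory.createReader(inputPartition)
  // Make sure the reader is closed when the task finishes.
  context.addTaskCompletionListener[Unit](_ => reader.close())
  new Iterator[InternalRow] {
    private var valuePrepared = false
    override def hasNext: Boolean = {
      if (!valuePrepared) {
        valuePrepared = reader.next()
      }
      valuePrepared
    }
    override def next(): InternalRow = {
      if (!hasNext) {
        throw new NoSuchElementException("End of stream")
      }
      valuePrepared = false
      reader.get()
    }
  }
}
```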