# DataSourceRDD
DataSourceRDD is an RDD[InternalRow] that acts as a thin adapter between Spark SQL's DataSource V2 and Spark Core's RDD API.

DataSourceRDD uses DataSourceRDDPartition for the partitions (which is a mere wrapper of the InputPartitions).
## Creating Instance
DataSourceRDD takes the following to be created:

- SparkContext
- InputPartitions
- PartitionReaderFactory
- columnarReads flag
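
A minimal sketch of what the class signature could look like, based on the inputs above (parameter names, parent-class arguments and the Spark 3.x connector packages are assumptions, not the actual source):

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReaderFactory}

// A sketch only; parameter names are assumptions.
class DataSourceRDD(
    sc: SparkContext,
    @transient private val inputPartitions: Seq[InputPartition],
    partitionReaderFactory: PartitionReaderFactory,
    columnarReads: Boolean)
  extends RDD[InternalRow](sc, Nil) {

  // Stubs only; the overrides are described in the sections below.
  override protected def getPartitions: Array[Partition] = ???

  override def compute(split: Partition, context: TaskContext): Iterator[InternalRow] = ???
}
```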
DataSourceRDD is created when:

- BatchScanExec physical operator is requested for an input RDD
- MicroBatchScanExec (Spark Structured Streaming) physical operator is requested for an input RDD
## columnarReads Flag
DataSourceRDD is given a columnarReads flag when created.

columnarReads is used to determine the type of scan (row-based or columnar) when computing a partition.

columnarReads is enabled (using supportsColumnar) when the PartitionReaderFactory can support columnar scans.
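
As a rough illustration, columnar support is something a PartitionReaderFactory declares per input partition through supportColumnarReads. The factory below is hypothetical and only shows where that decision lives (assuming the Spark 3.x connector packages):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory}
import org.apache.spark.sql.vectorized.ColumnarBatch

// Hypothetical factory; only supportColumnarReads matters for this illustration.
class MyReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] = ???

  override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = ???

  // columnarReads is expected to be enabled only when this returns true
  // for the partitions to scan.
  override def supportColumnarReads(partition: InputPartition): Boolean = true
}
```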
## Preferred Locations For Partition
```scala
getPreferredLocations(
  split: Partition): Seq[String]
```
getPreferredLocations simply requests the given DataSourceRDDPartition (split) for its InputPartition, which in turn is requested for the preferred locations.
getPreferredLocations is part of Spark Core's RDD abstraction.
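
A minimal sketch of such an override, assuming DataSourceRDDPartition exposes the wrapped InputPartition as inputPartition:

```scala
override protected def getPreferredLocations(split: Partition): Seq[String] = {
  // Delegate to the InputPartition wrapped by the given DataSourceRDDPartition.
  split.asInstanceOf[DataSourceRDDPartition].inputPartition.preferredLocations().toSeq
}
```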
## RDD Partitions
```scala
getPartitions: Array[Partition]
```
getPartitions simply creates a DataSourceRDDPartition for every InputPartition.
getPartitions is part of Spark Core's RDD abstraction.
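
A sketch of the override, assuming inputPartitions is the constructor argument and DataSourceRDDPartition takes a partition index and an InputPartition:

```scala
override protected def getPartitions: Array[Partition] = {
  inputPartitions.zipWithIndex.map { case (inputPartition, index) =>
    // One DataSourceRDDPartition per InputPartition, with its position as the partition index.
    new DataSourceRDDPartition(index, inputPartition)
  }.toArray
}
```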
## Computing Partition (in TaskContext)
```scala
compute(
  split: Partition,
  context: TaskContext): Iterator[T]
```
compute...FIXME
compute is part of Spark Core's RDD abstraction.
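
A simplified sketch of the row-based path (an illustration only, not the actual implementation; with columnarReads enabled, createColumnarReader would be used instead and the ColumnarBatch rows exposed as InternalRows):

```scala
override def compute(split: Partition, context: TaskContext): Iterator[InternalRow] = {
  val inputPartition = split.asInstanceOf[DataSourceRDDPartition].inputPartition
  // Row-based scan; a columnar scan would use partitionReaderFactory.createColumnarReader.
  val reader = partitionReaderFactory.createReader(inputPartition)
  // Close the reader when the task completes.
  context.addTaskCompletionListener[Unit](_ => reader.close())
  new Iterator[InternalRow] {
    private var valuePrepared = false
    override def hasNext: Boolean = {
      if (!valuePrepared) {
        valuePrepared = reader.next()
      }
      valuePrepared
    }
    override def next(): InternalRow = {
      if (!hasNext) {
        throw new NoSuchElementException("No more rows")
      }
      valuePrepared = false
      reader.get()
    }
  }
}
```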