DataSourceRDD¶

DataSourceRDD is an RDD of InternalRows (RDD[InternalRow]) that acts as a thin adapter between Spark SQL's DataSource V2 and Spark Core's RDD API.

DataSourceRDD is used as an input RDD of the following physical operators:

BatchScanExec
MicroBatchScanExec (Spark Structured Streaming)

DataSourceRDD uses DataSourceRDDPartition for the partitions (that is a mere wrapper of the InputPartitions).

Creating Instance¶

DataSourceRDD takes the following to be created:

DataSourceRDD is created when:

BatchScanExec physical operator is requested for an input RDD
MicroBatchScanExec (Spark Structured Streaming) physical operator is requested for an inputRDD

InputPartitions¶

inputPartitions: Seq[Seq[InputPartition]]

DataSourceRDD is given a collection of InputPartitions when created.

The InputPartitions are used to create RDD partitions (one for every collection of InputPartitions in the inputPartitions collection)

Note

Number of RDD partitions is exactly the number of elements in the inputPartitions collection.

The InputPartitions are the filtered partitions in BatchScanExec.

columnarReads¶

DataSourceRDD is given columnarReads flag when created.

columnarReads is used to determine the type of scan (row-based or columnar) when computing a partition.

columnarReads is enabled (using supportsColumnar) when the PartitionReaderFactory can support columnar scans.

RDD Partitions¶

Signature

getPartitions: Array[Partition]

getPartitions is part of RDD (Spark Core) abstraction.

getPartitions creates one DataSourceRDDPartition for every collection of InputPartitions in the given inputPartitions.

Preferred Locations For Partition¶

Signature

getPreferredLocations(
    split: Partition): Seq[String]

getPreferredLocations is part of RDD (Spark Core) abstraction.

getPreferredLocations assumes that the given split partition is a DataSourceRDDPartition.

getPreferredLocations requests the given DataSourceRDDPartition for the InputPartition that is then requested for the preferred locations.

Computing Partition¶

Signature

compute(
  split: Partition,
  context: TaskContext): Iterator[T]

compute is part of RDD (Spark Core) abstraction.

compute assumes that the given Partition is a DataSourceRDDPartition (or throws a SparkException).

DataSourceRDDPartition and InputPartitions

DataSourceRDDPartition can have many inputPartitions.

compute requests the PartitionReaderFactory to createColumnarReader or createReader based on columnarReads flag.