DataSourceRDD is an RDD[InternalRow] that acts as a thin adapter between Spark SQL's DataSource V2 and Spark Core's RDD API.

DataSourceRDD uses DataSourceRDDPartition for the partitions (which is a mere wrapper around an InputPartition).
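Conceptually, the partition wrapper can be sketched as follows. This is a simplified stand-in, not Spark's actual source: the `Partition` and `InputPartition` traits below are minimal mock-ups of the Spark Core and DataSource V2 interfaces.

```scala
// Simplified stand-ins for Spark Core's Partition and
// DataSource V2's InputPartition (not Spark's actual definitions).
trait Partition extends Serializable { def index: Int }
trait InputPartition extends Serializable {
  def preferredLocations(): Array[String] = Array.empty
}

// DataSourceRDDPartition is a mere wrapper: it pairs an RDD partition
// index with the InputPartition it adapts.
case class DataSourceRDDPartition(index: Int, inputPartition: InputPartition)
  extends Partition
```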

Creating Instance

DataSourceRDD takes the following to be created:

* SparkContext
* InputPartitions (Seq[InputPartition])
* PartitionReaderFactory
* columnarReads flag

DataSourceRDD is created when DataSourceV2ScanExecBase physical operators (e.g. BatchScanExec, MicroBatchScanExec) are requested for an input RDD.

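As a sketch, the constructor shape looks roughly like the following. The argument list is inferred from the properties described on this page; the stand-in types and the exact signature are illustrative, not copied from Spark's source.

```scala
// Simplified stand-ins (not Spark's actual classes).
class SparkContext
trait InputPartition
trait PartitionReaderFactory

// Argument list inferred from this page: a SparkContext, the input
// partitions to scan, the reader factory, and the columnarReads flag.
class DataSourceRDD(
    val sc: SparkContext,
    val inputPartitions: Seq[InputPartition],
    val partitionReaderFactory: PartitionReaderFactory,
    val columnarReads: Boolean)
```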
columnarReads Flag

DataSourceRDD is given the columnarReads flag when created.

columnarReads is used to determine the type of scan (row-based or columnar) when computing a partition.

columnarReads is enabled (using supportsColumnar) when the PartitionReaderFactory can support columnar scans.
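The check can be sketched as follows: the flag is enabled only when the factory supports columnar reads for every input partition. The traits below are simplified stand-ins (the real PartitionReaderFactory.supportColumnarReads does take an InputPartition, but the rest is a mock-up).

```scala
// Simplified stand-ins (not Spark's actual classes).
trait InputPartition
trait PartitionReaderFactory {
  def supportColumnarReads(partition: InputPartition): Boolean
}

// columnarReads can only be enabled when the factory supports
// columnar reads for every input partition of the scan.
def supportsColumnar(
    factory: PartitionReaderFactory,
    partitions: Seq[InputPartition]): Boolean =
  partitions.forall(factory.supportColumnarReads)
```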

Preferred Locations For Partition

getPreferredLocations(
    split: Partition): Seq[String]

getPreferredLocations simply requests the given DataSourceRDDPartition (split) for its InputPartition, which in turn is requested for the preferred locations.

getPreferredLocations is part of Spark Core's RDD abstraction.
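The delegation can be sketched like this, using simplified stand-in types (not Spark's source):

```scala
// Simplified stand-ins for the types involved (not Spark's source).
trait Partition { def index: Int }
trait InputPartition { def preferredLocations(): Array[String] }
case class DataSourceRDDPartition(index: Int, inputPartition: InputPartition)
  extends Partition

// The split is known to be a DataSourceRDDPartition; delegate to the
// wrapped InputPartition for the preferred locations.
def getPreferredLocations(split: Partition): Seq[String] =
  split
    .asInstanceOf[DataSourceRDDPartition]
    .inputPartition
    .preferredLocations()
    .toSeq
```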

RDD Partitions

getPartitions: Array[Partition]

getPartitions simply creates a DataSourceRDDPartition for every InputPartition.

getPartitions is part of Spark Core's RDD abstraction.
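A minimal sketch of that mapping, with simplified stand-in types (not Spark's source): each InputPartition is wrapped in a DataSourceRDDPartition whose index is the partition's position.

```scala
// Simplified stand-ins (not Spark's source).
trait Partition { def index: Int }
trait InputPartition
case class DataSourceRDDPartition(index: Int, inputPartition: InputPartition)
  extends Partition

// One DataSourceRDDPartition per InputPartition, indexed by position.
def getPartitions(inputPartitions: Seq[InputPartition]): Array[Partition] =
  inputPartitions.zipWithIndex.map { case (ip, i) =>
    DataSourceRDDPartition(i, ip): Partition
  }.toArray
```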

Computing Partition (in TaskContext)

compute(
    split: Partition,
    context: TaskContext): Iterator[T]

compute creates a PartitionReader for the InputPartition of the given DataSourceRDDPartition, requesting the PartitionReaderFactory for a columnar reader (when columnarReads is enabled) or a row-based reader, and returns an iterator over the rows read.

compute is part of Spark Core's RDD abstraction.
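The branching on the columnarReads flag can be sketched as follows. The types are simplified stand-ins (not Spark's source), and the row type here is a plain String, whereas Spark uses InternalRow (or ColumnarBatch for columnar scans).

```scala
// Simplified stand-ins (not Spark's source).
trait InputPartition
trait PartitionReader[T] { def rows: Iterator[T] }
trait PartitionReaderFactory {
  def createReader(p: InputPartition): PartitionReader[String]
  def createColumnarReader(p: InputPartition): PartitionReader[String]
}

// compute chooses the reader based on the columnarReads flag and
// iterates over whatever the reader produces.
def compute(
    factory: PartitionReaderFactory,
    partition: InputPartition,
    columnarReads: Boolean): Iterator[String] = {
  val reader =
    if (columnarReads) factory.createColumnarReader(partition)
    else factory.createReader(partition)
  reader.rows
}
```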