Skip to content

ContinuousDataSourceRDD

ContinuousDataSourceRDD is a specialized RDD (RDD[InternalRow]) that is used exclusively for the only input RDD (with the input rows) of DataSourceV2ScanExec leaf physical operator with a ContinuousReader.

ContinuousDataSourceRDD is <> exclusively when DataSourceV2ScanExec leaf physical operator is requested for the input RDDs (which there is only one actually).

[[spark.sql.streaming.continuous.executorQueueSize]] ContinuousDataSourceRDD uses spark.sql.streaming.continuous.executorQueueSize configuration property for the <>.

[[spark.sql.streaming.continuous.executorPollIntervalMs]] ContinuousDataSourceRDD uses spark.sql.streaming.continuous.executorPollIntervalMs configuration property for the <>.

[[creating-instance]] ContinuousDataSourceRDD takes the following to be created:

  • [[sc]] SparkContext
  • [[dataQueueSize]] Size of the data queue
  • [[epochPollIntervalMs]] epochPollIntervalMs
  • [[readerInputPartitions]] InputPartition[InternalRow]s

[[getPreferredLocations]] ContinuousDataSourceRDD uses InputPartition (of a ContinuousDataSourceRDDPartition) for preferred host locations (where the input partition reader can run faster).

=== [[compute]] Computing Partition -- compute Method

[source, scala]

compute( split: Partition, context: TaskContext): Iterator[InternalRow]


NOTE: compute is part of the RDD Contract to compute a given partition.

compute...FIXME

=== [[getPartitions]] getPartitions Method

[source, scala]

getPartitions: Array[Partition]

NOTE: getPartitions is part of the RDD Contract to specify the partitions to <>.

getPartitions...FIXME