== WriteToContinuousDataSourceExec Physical Operator

WriteToContinuousDataSourceExec is a unary physical operator (Spark SQL) that writes the result of the child physical operator to a data source continuously (using a ContinuousWriteRDD).

WriteToContinuousDataSourceExec is created exclusively when the DataSourceV2Strategy (Spark SQL) execution planning strategy is requested to plan a WriteToContinuousDataSource unary logical operator.
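
For reference, the following is a minimal sketch of how DataSourceV2Strategy could plan the logical operator (modelled after the Spark 2.4 sources, so the exact shape may differ between versions):

[source, scala]
----
// Sketch only: a fragment of DataSourceV2Strategy's pattern match (not standalone code).
// The WriteToContinuousDataSource logical operator carries the StreamWriter and
// the logical plan whose physical counterpart becomes the child of this operator.
case WriteToContinuousDataSource(writer, query) =>
  WriteToContinuousDataSourceExec(writer, planLater(query)) :: Nil
----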

[[creating-instance]] WriteToContinuousDataSourceExec takes the following to be created:

  • [[query]][[child]] Child physical operator (SparkPlan)

[[output]] WriteToContinuousDataSourceExec uses an empty output schema (in other words, no output is expected from this operator).
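
The following is a minimal sketch of the operator's shape (modelled after the Spark 2.4 sources; the StreamWriter constructor argument and the exact class layout are assumptions that may differ between versions):

[source, scala]
----
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.sources.v2.writer.streaming.StreamWriter

// Sketch only: a unary physical operator with the child physical operator (query)
// and the continuous StreamWriter it writes to
case class WriteToContinuousDataSourceExec(writer: StreamWriter, query: SparkPlan)
  extends SparkPlan {

  override def children: Seq[SparkPlan] = Seq(query)

  // Empty output schema -- the operator never produces any rows
  override def output: Seq[Attribute] = Nil

  // See the doExecute section below
  override protected def doExecute(): RDD[InternalRow] = ???
}
----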

[[logging]]
[TIP]
====
Enable ALL logging level for org.apache.spark.sql.execution.streaming.continuous.WriteToContinuousDataSourceExec logger to see what happens inside.

Add the following line to conf/log4j.properties:

[source]
----
log4j.logger.org.apache.spark.sql.execution.streaming.continuous.WriteToContinuousDataSourceExec=ALL
----

Refer to Logging.
====

=== [[doExecute]] Executing Physical Operator (Generating RDD[InternalRow]) -- doExecute Method

[source, scala]
----
doExecute(): RDD[InternalRow]
----

NOTE: doExecute is part of the SparkPlan Contract to generate the runtime representation of a physical operator as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow]).

doExecute requests the StreamWriter to create a DataWriterFactory.

doExecute then requests the child physical operator to execute (which gives an RDD[InternalRow]) and uses the RDD[InternalRow] and the DataWriterFactory to create a ContinuousWriteRDD.

doExecute prints out the following INFO message to the logs:

Start processing data source writer: [writer]. The input RDD has [partitions] partitions.
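
The following sketch shows these first steps of doExecute (modelled after the Spark 2.4 sources; treat it as an illustration, not the exact implementation):

[source, scala]
----
// Inside doExecute() -- sketch only
// Ask the StreamWriter for a factory of data writers (one writer per partition task)
val writerFactory = writer.createWriterFactory()

// Execute the child physical operator and pair its RDD[InternalRow]
// with the DataWriterFactory in a ContinuousWriteRDD
val rdd = new ContinuousWriteRDD(query.execute(), writerFactory)

logInfo(s"Start processing data source writer: $writer. " +
  s"The input RDD has ${rdd.partitions.length} partitions.")
----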

doExecute requests the EpochCoordinatorRef helper for a remote reference to the EpochCoordinator RPC endpoint (using the __epoch_coordinator_id local property).

NOTE: The EpochCoordinator RPC endpoint runs on the driver as the single point to coordinate epochs across partition tasks.

doExecute requests the EpochCoordinator RPC endpoint reference to send out a SetWriterPartitions message synchronously (with the number of partitions of the ContinuousWriteRDD).
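
A sketch of the epoch-coordination step (the helper, message and property names below come from the Spark 2.4 sources and may differ in other versions):

[source, scala]
----
// Inside doExecute() -- sketch only
// Look up the remote reference to the EpochCoordinator RPC endpoint (on the driver)
val epochCoordinator = EpochCoordinatorRef.get(
  sparkContext.getLocalProperty(ContinuousExecution.EPOCH_COORDINATOR_ID_KEY),
  sparkContext.env)

// Tell the coordinator how many partitions (and so write tasks) to expect
epochCoordinator.askSync[Unit](SetWriterPartitions(rdd.getNumPartitions))
----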

In the end, doExecute requests the ContinuousWriteRDD to collect (which simply runs a Spark job on all the partitions of the RDD and returns the results in an array).

NOTE: Requesting the ContinuousWriteRDD to collect is how a Spark job is run that in turn runs tasks (one per partition) that are described by the ContinuousWriteRDD.compute method. Since executing collect is meant to run a Spark job (with tasks on executors), it is at the discretion of the tasks themselves to decide when to finish (so if they want to run indefinitely, so be it). What a clever trick!
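
A sketch of that final step (based on the Spark 2.4 sources, where the call is additionally wrapped in exception handling for interruptions, which is how continuous queries are stopped):

[source, scala]
----
// Inside doExecute() -- sketch only
// collect() forces a Spark job with one long-running task per partition.
// Since the operator has an empty output schema, nothing useful comes back
// to the driver; the call only keeps the continuous write tasks running.
rdd.collect()
----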