FileBatchWrite
FileBatchWrite is a BatchWrite that uses the given FileCommitProtocol to coordinate a write job (and commit or abort it).
Creating Instance
FileBatchWrite takes the following to be created:
- Hadoop Job
- WriteJobDescription
- FileCommitProtocol (Spark Core)
FileBatchWrite is created when:
FileWrite is requested for a BatchWrite
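For reference, a sketch of the class declaration based on the Spark source (package org.apache.spark.sql.execution.datasources.v2; exact members may vary across Spark versions):

```scala
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.internal.Logging
import org.apache.spark.internal.io.FileCommitProtocol
import org.apache.spark.sql.connector.write.BatchWrite
import org.apache.spark.sql.execution.datasources.WriteJobDescription

class FileBatchWrite(
    job: Job,                          // Hadoop Job
    description: WriteJobDescription,
    committer: FileCommitProtocol)     // Spark Core
  extends BatchWrite with Logging
```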
Aborting Write Job
abort(
  messages: Array[WriterCommitMessage]): Unit
abort requests the FileCommitProtocol to abort the Job.
abort is part of the BatchWrite abstraction.
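The whole body is a delegation to the committer; a sketch based on the Spark source:

```scala
override def abort(messages: Array[WriterCommitMessage]): Unit = {
  // Delegate cleanup of the (Hadoop) Job to the FileCommitProtocol
  committer.abortJob(job)
}
```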
Committing Write Job
commit(
  messages: Array[WriterCommitMessage]): Unit
commit prints out the following INFO message to the logs:
Start to commit write Job [uuid].
commit requests the FileCommitProtocol to commit the Job (with the WriteTaskResults extracted from the given WriterCommitMessages) and measures how long the commit takes.
commit prints out the following INFO message to the logs:
Write Job [uuid] committed. Elapsed time: [duration] ms.
commit then processes the statistics of this write job (with the statistics trackers of the WriteJobDescription).
In the end, commit prints out the following INFO message to the logs:
Finished processing stats for write job [uuid].
commit is part of the BatchWrite abstraction.
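Put together, commit closely follows this sketch (based on the Spark source; Utils.timeTakenMs is org.apache.spark.util.Utils and processStats comes from FileFormatWriter):

```scala
override def commit(messages: Array[WriterCommitMessage]): Unit = {
  val results = messages.map(_.asInstanceOf[WriteTaskResult])
  logInfo(s"Start to commit write Job ${description.uuid}.")
  // Commit the Hadoop job and measure the elapsed time (ms)
  val (_, duration) = Utils.timeTakenMs {
    committer.commitJob(job, results.map(_.commitMsg))
  }
  logInfo(s"Write Job ${description.uuid} committed. Elapsed time: $duration ms.")
  // Let the WriteJobStatsTrackers process the per-task write statistics
  processStats(description.statsTrackers, results.map(_.summary.stats), duration)
  logInfo(s"Finished processing stats for write job ${description.uuid}.")
}
```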
Creating Batch DataWriterFactory
createBatchWriterFactory(
  info: PhysicalWriteInfo): DataWriterFactory
createBatchWriterFactory creates a new FileWriterFactory.
createBatchWriterFactory is part of the BatchWrite abstraction.
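A sketch based on the Spark source (note that the given PhysicalWriteInfo is not used):

```scala
override def createBatchWriterFactory(info: PhysicalWriteInfo): DataWriterFactory = {
  // The PhysicalWriteInfo is ignored; the factory only needs the
  // WriteJobDescription and the FileCommitProtocol
  FileWriterFactory(description, committer)
}
```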
useCommitCoordinator
useCommitCoordinator(): Boolean
FileBatchWrite does not require a Commit Coordinator (and returns false).
useCommitCoordinator is part of the BatchWrite abstraction.
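In code, that is a one-liner (a sketch based on the Spark source):

```scala
// The FileCommitProtocol coordinates task commits on its own, so the
// driver-side OutputCommitCoordinator is not needed
override def useCommitCoordinator(): Boolean = false
```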
Logging
Enable ALL logging level for org.apache.spark.sql.execution.datasources.v2.FileBatchWrite logger to see what happens inside.
Add the following lines to conf/log4j2.properties:
logger.FileBatchWrite.name = org.apache.spark.sql.execution.datasources.v2.FileBatchWrite
logger.FileBatchWrite.level = all
Refer to Logging.