Skip to content

BasicWriteTaskStatsTracker

BasicWriteTaskStatsTracker is a WriteTaskStatsTracker.

Creating Instance

BasicWriteTaskStatsTracker takes the following to be created:

BasicWriteTaskStatsTracker is created when:

submittedFiles

BasicWriteTaskStatsTracker uses submittedFiles registry of the file paths added (written out to).

The submittedFiles registry is used to updateFileStats when getFinalStats.

A file path is removed when BasicWriteTaskStatsTracker is requested to closeFile.

All the file paths are removed in getFinalStats.

updateFileStats

updateFileStats(
  filePath: String): Unit

updateFileStats gets the size of the given filePath.

If the file length is found, it is added to the numBytes registry with the numFiles incremented.

Processing New File Notification

newFile(
  filePath: String): Unit

newFile adds the given filePath to the submittedFiles registry and increments the numSubmittedFiles counter.

newFile is part of the WriteTaskStatsTracker abstraction.

Final WriteTaskStats

getFinalStats(
  taskCommitTime: Long): WriteTaskStats

getFinalStats updateFileStats for every submittedFiles that are then cleared up.

getFinalStats sets the output metrics (of the current Spark task) as follows:

getFinalStats prints out the following INFO message when the numSubmittedFiles is different from the numFiles:

Expected [numSubmittedFiles] files, but only saw $numFiles.
This could be due to the output format not writing empty files,
or files being not immediately visible in the filesystem.

getFinalStats adds the given taskCommitTime to the taskCommitTimeMetric if defined.

In the end, creates a new BasicWriteTaskStats with the partitions, numFiles, numBytes, and numRows.


getFinalStats is part of the WriteTaskStatsTracker abstraction.

Logging

Enable ALL logging level for org.apache.spark.sql.execution.datasources.BasicWriteTaskStatsTracker logger to see what happens inside.

Add the following line to conf/log4j2.properties:

log4j.logger.org.apache.spark.sql.execution.datasources.BasicWriteTaskStatsTracker=ALL

Refer to Logging.