BasicWriteTaskStatsTracker¶
BasicWriteTaskStatsTracker is a WriteTaskStatsTracker.
Creating Instance¶
BasicWriteTaskStatsTracker takes the following to be created:
- Hadoop Configuration
- Task Commit Time SQLMetric
BasicWriteTaskStatsTracker is created when:
BasicWriteJobStatsTrackeris requested for a new WriteTaskStatsTracker instance
submittedFiles¶
BasicWriteTaskStatsTracker uses submittedFiles registry of the file paths added (written out to).
The submittedFiles registry is used to updateFileStats when getFinalStats.
A file path is removed when BasicWriteTaskStatsTracker is requested to closeFile.
All the file paths are removed in getFinalStats.
updateFileStats¶
updateFileStats(
filePath: String): Unit
updateFileStats gets the size of the given filePath.
If the file length is found, it is added to the numBytes registry with the numFiles incremented.
Processing New File Notification¶
newFile(
filePath: String): Unit
newFile adds the given filePath to the submittedFiles registry and increments the numSubmittedFiles counter.
newFile is part of the WriteTaskStatsTracker abstraction.
Final WriteTaskStats¶
getFinalStats(
taskCommitTime: Long): WriteTaskStats
getFinalStats updateFileStats for every submittedFiles that are then cleared up.
getFinalStats sets the output metrics (of the current Spark task) as follows:
getFinalStats prints out the following INFO message when the numSubmittedFiles is different from the numFiles:
Expected [numSubmittedFiles] files, but only saw $numFiles.
This could be due to the output format not writing empty files,
or files being not immediately visible in the filesystem.
getFinalStats adds the given taskCommitTime to the taskCommitTimeMetric if defined.
In the end, creates a new BasicWriteTaskStats with the partitions, numFiles, numBytes, and numRows.
getFinalStats is part of the WriteTaskStatsTracker abstraction.
Logging¶
Enable ALL logging level for org.apache.spark.sql.execution.datasources.BasicWriteTaskStatsTracker logger to see what happens inside.
Add the following line to conf/log4j2.properties:
log4j.logger.org.apache.spark.sql.execution.datasources.BasicWriteTaskStatsTracker=ALL
Refer to Logging.