DeltaJobStatisticsTracker¶
DeltaJobStatisticsTracker is a WriteJobStatsTracker (Spark SQL) that collects per-file statistics (when the spark.databricks.delta.stats.collect configuration property is enabled).
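As a rough intuition for what "per-file statistics" means here, the tracker produces one statistics string (JSON) per data file written, e.g. the number of records and min/max values of tracked columns. The sketch below is illustrative only (the `Row` type, the `id` column and `statsFor` are assumptions, not Delta Lake's actual code):

```scala
// A minimal sketch of per-file statistics, assuming a single tracked
// column `id` (names here are illustrative, not Delta Lake's internals).
object PerFileStatsSketch {
  final case class Row(id: Int)

  // Compute a stats JSON string for the rows written to one data file
  def statsFor(rows: Seq[Row]): String = {
    val numRecords = rows.size
    val minId = rows.map(_.id).min
    val maxId = rows.map(_.id).max
    s"""{"numRecords":$numRecords,"minValues":{"id":$minId},"maxValues":{"id":$maxId}}"""
  }

  def main(args: Array[String]): Unit =
    println(statsFor(Seq(Row(1), Row(5), Row(3))))
}
```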
Creating Instance¶
DeltaJobStatisticsTracker takes the following to be created:
- Hadoop Configuration
- Hadoop Path (to a delta table's data directory)
- Non-partitioned data columns (Attributes (Spark SQL))
- Statistics Column (Expression (Spark SQL))
DeltaJobStatisticsTracker is created when:
TransactionalWrite is requested to write data out
Recorded Per-File Statistics¶
recordedStats: Map[String, String]
recordedStats is the per-file statistics (keyed by file path) recorded while processing per-job write task statistics.
recordedStats is used when:
TransactionalWrite is requested to write data out
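A plausible way to picture how recordedStats is consumed when writing data out: the statistics of each written file are looked up by file path and attached to the corresponding file entry. The `FileEntry` type and `withStats` helper below are hypothetical stand-ins, a sketch under the assumption that each written file carries an optional stats string:

```scala
// Sketch only: a simplified file entry with an optional stats JSON string
// (assumption: Delta attaches per-file stats to its file actions; the
// names FileEntry and withStats are illustrative, not Delta Lake's API).
object AddStatsSketch {
  final case class FileEntry(path: String, stats: Option[String])

  // Attach recorded per-file statistics (path -> stats JSON) to the
  // files written out; files without recorded stats get None
  def withStats(files: Seq[String], recordedStats: Map[String, String]): Seq[FileEntry] =
    files.map(p => FileEntry(p, recordedStats.get(p)))
}
```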
Processing Per-Job Write Task Statistics¶
processStats(
stats: Seq[WriteTaskStats],
jobCommitTime: Long): Unit
processStats extracts the DeltaFileStatistics (from the given WriteTaskStats) and records their collected per-file statistics (in the recordedStats registry).
processStats is part of the WriteJobStatsTracker (Spark SQL) abstraction.
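The extraction step can be sketched as follows: among the WriteTaskStats reported by the write tasks, only the DeltaFileStatistics carry per-file statistics, and their maps are merged into a single job-level registry. The types below are simplified stand-ins for the Spark SQL hierarchy, not the real classes:

```scala
// Sketch of processStats, with simplified stand-ins for Spark SQL's
// WriteTaskStats hierarchy (assumption: DeltaFileStatistics holds a
// file-path -> stats-JSON map, as suggested by recordedStats' type).
object ProcessStatsSketch {
  sealed trait WriteTaskStats
  final case class DeltaFileStatistics(stats: Map[String, String]) extends WriteTaskStats
  final case class OtherStats(info: String) extends WriteTaskStats

  // Keep only the DeltaFileStatistics and merge their per-file maps
  def processStats(stats: Seq[WriteTaskStats]): Map[String, String] =
    stats
      .collect { case DeltaFileStatistics(s) => s }
      .foldLeft(Map.empty[String, String])(_ ++ _)
}
```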