DeltaJobStatisticsTracker¶
DeltaJobStatisticsTracker is a WriteJobStatsTracker (Spark SQL) for per-file statistics collection (when the spark.databricks.delta.stats.collect configuration property is enabled).
Creating Instance¶
DeltaJobStatisticsTracker takes the following to be created:
- Hadoop Configuration
- Hadoop Path (to a delta table's data directory)
- Data (non-partitioned) column Attributes (Spark SQL)
- Statistics Column Expression (Spark SQL)
DeltaJobStatisticsTracker is created when TransactionalWrite is requested to write data out.
Recorded Per-File Statistics¶
recordedStats: Map[String, String]
recordedStats is a collection of recorded per-file statistics (that are collected while processing per-job write task statistics).
recordedStats is used when TransactionalWrite is requested to write data out.
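As a rough illustration (not taken from the Delta Lake sources), recordedStats can be pictured as a map from a written data file's path to its statistics encoded as a JSON string; the file name, column names and values below are all hypothetical.

```scala
object RecordedStatsSketch {
  // Hypothetical snapshot of recordedStats after a write job:
  // one entry per written data file, keyed by the file path,
  // with the per-file statistics encoded as a JSON string.
  val recordedStats: Map[String, String] = Map(
    "part-00000-c000.snappy.parquet" ->
      """{"numRecords":3,"minValues":{"id":1},"maxValues":{"id":3},"nullCount":{"id":0}}"""
  )
}
```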
Processing Per-Job Write Task Statistics¶
processStats(
  stats: Seq[WriteTaskStats],
  jobCommitTime: Long): Unit
processStats extracts the DeltaFileStatistics (from the given WriteTaskStats) to access the collected per-file statistics (and records them as the recordedStats).
processStats is part of the WriteJobStatsTracker (Spark SQL) abstraction.
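The flow above can be sketched with a minimal, self-contained example. DeltaFileStatistics and WriteTaskStats below are simplified stand-ins for the real Spark SQL / Delta Lake types; only the merging of per-task statistics into recordedStats is illustrated, not the actual implementation.

```scala
// Simplified stand-ins for the Spark SQL / Delta Lake types
sealed trait WriteTaskStats
final case class DeltaFileStatistics(stats: Map[String, String]) extends WriteTaskStats

object StatsTrackerSketch {
  // file path -> JSON-encoded per-file statistics
  var recordedStats: Map[String, String] = Map.empty

  // Fold every DeltaFileStatistics found among the task stats
  // into the recordedStats registry
  def processStats(stats: Seq[WriteTaskStats], jobCommitTime: Long): Unit = {
    recordedStats = stats
      .collect { case DeltaFileStatistics(perFile) => perFile }
      .foldLeft(recordedStats)(_ ++ _)
  }
}
```

For example, processing a single task's statistics with one file entry leaves that entry available in recordedStats for the subsequent write-out step.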