StatisticsCollection¶
StatisticsCollection is an abstraction of statistics collectors.
Contract¶
Data Schema¶
Schema (StructType) of the data files
Used when:
StatisticsCollection is requested for the statCollectionSchema and the statsSchema
Maximum Number of Indexed Columns¶
Maximum number of leaf columns to collect stats on
Used when:
StatisticsCollection is requested for the statCollectionSchema and to collectStats
SparkSession¶
Used when:
StatisticsCollection is requested for the statsCollector and the statsSchema
Implementations¶
statsSchema¶
statsSchema...FIXME
statsSchema is used when:
DataSkippingReaderBase is requested for getStatsColumnOpt and withStatsInternal0
statsCollector Column¶
statsCollector takes the value of the spark.databricks.io.skipping.stringPrefixLength configuration property.
statsCollector creates a Column named stats that is a struct of the following:
count(*) as numRecords
collectStats as minValues
collectStats as maxValues
collectStats as nullCount
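To make the shape of the stats struct concrete, the following is a minimal plain-Scala sketch (no Spark involved) that computes the same four fields over an in-memory collection of rows. StatsSketch, Row, FileStats and collectFileStats are hypothetical names used for illustration only, and the sketch assumes every column holds nullable Long values. The actual statsCollector is a Spark Column expression, not a function over Scala collections.

```scala
object StatsSketch {
  // Hypothetical model: a row maps column names to nullable Long values.
  type Row = Map[String, Option[Long]]

  // Mirrors the four fields of the stats struct.
  final case class FileStats(
      numRecords: Long,
      minValues: Map[String, Long],
      maxValues: Map[String, Long],
      nullCount: Map[String, Long])

  def collectFileStats(columns: Seq[String], rows: Seq[Row]): FileStats = {
    // Non-null values per column (a missing key is treated as null).
    val nonNull: Map[String, Seq[Long]] =
      columns.map(c => c -> rows.flatMap(r => r.getOrElse(c, None))).toMap

    FileStats(
      numRecords = rows.size.toLong,
      // Columns with only nulls have no min or max and are left out.
      minValues = nonNull.collect { case (c, vs) if vs.nonEmpty => c -> vs.min },
      maxValues = nonNull.collect { case (c, vs) if vs.nonEmpty => c -> vs.max },
      nullCount = nonNull.map { case (c, vs) => c -> (rows.size - vs.size).toLong })
  }
}
```

In this sketch, a column that contains only nulls contributes to nullCount but has no entry in minValues or maxValues. Whether the real collector behaves the same way is not stated on this page.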
Lazy Value
statsCollector is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.
Learn more in the Scala Language Specification.
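The once-only initialization of a lazy value can be observed with a small standalone sketch. LazyValueSketch, initCount and collector are hypothetical names; collector merely stands in for an expensive definition such as statsCollector.

```scala
object LazyValueSketch {
  // Counts how many times the initializer runs (for demonstration only).
  var initCount = 0

  // Hypothetical stand-in for an expensive definition.
  // The block runs on first access, and its result is cached afterwards.
  lazy val collector: String = {
    initCount += 1
    "stats"
  }
}
```

Before the first access, initCount is 0. After any number of accesses, it is 1 and every access yields the same value.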
statsCollector is used when:
OptimisticTransactionImpl is requested for the stats collector column (of the table at the snapshot within this transaction)
TransactionalWrite is requested to writeFiles
StatisticsCollection is requested for statsSchema
collectStats¶
collectStats(
  name: String,
  schema: StructType)(
  function: PartialFunction[(Column, StructField), Column]): Column
collectStats...FIXME
statCollectionSchema¶
When the number of leaf columns to collect stats on is greater than or equal to 0, statCollectionSchema truncates the dataSchema. Otherwise, statCollectionSchema returns the dataSchema intact.
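The branching can be sketched in plain Scala. This is an illustration under stated assumptions, not the actual implementation: DataType, Leaf, Struct and Field are hypothetical, simplified stand-ins for Spark's schema types, and truncate assumes that "truncating" means keeping the first leaf columns in depth-first order up to the limit.

```scala
object SchemaTruncationSketch {
  // Hypothetical, simplified stand-ins for Spark's StructType and StructField.
  sealed trait DataType
  case object Leaf extends DataType
  final case class Struct(fields: List[Field]) extends DataType
  final case class Field(name: String, dataType: DataType)

  // Assumption: keep at most `limit` leaf columns, visiting fields depth-first.
  // Returns the truncated struct and the number of leaf columns kept.
  def truncate(schema: Struct, limit: Int): (Struct, Int) = {
    val (kept, used) = schema.fields.foldLeft((List.empty[Field], 0)) {
      // Limit reached: drop every remaining field.
      case ((acc, count), _) if count >= limit => (acc, count)
      // Nested struct: recurse with the remaining budget.
      case ((acc, count), Field(name, nested: Struct)) =>
        val (sub, n) = truncate(nested, limit - count)
        (Field(name, sub) :: acc, count + n)
      // Leaf column: keep it and use up one slot.
      case ((acc, count), leaf) => (leaf :: acc, count + 1)
    }
    (Struct(kept.reverse), used)
  }

  // A non-negative limit truncates the schema; a negative one leaves it intact.
  def statCollectionSchema(dataSchema: Struct, numIndexedCols: Int): Struct =
    if (numIndexedCols >= 0) truncate(dataSchema, numIndexedCols)._1
    else dataSchema
}
```

With a schema of a, b (a struct of c and d) and e, a limit of 2 keeps a and b.c only, while a negative limit returns the schema unchanged.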
Lazy Value
statCollectionSchema is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.
Learn more in the Scala Language Specification.
truncateSchema¶
truncateSchema...FIXME
Recomputing Statistics¶
recompute(
  spark: SparkSession,
  deltaLog: DeltaLog,
  predicates: Seq[Expression] = Seq(Literal(true)),
  fileFilter: AddFile => Boolean = af => true): Unit
recompute...FIXME
recompute seems unused.