StatisticsCollection¶
StatisticsCollection is an abstraction of statistics collectors.
Contract¶
Data Schema¶
dataSchema: StructType
Schema (StructType) of the data files
Used when:

- StatisticsCollection is requested for the statCollectionSchema and the statsSchema
Maximum Number of Indexed Columns¶
numIndexedCols: Int
Maximum number of leaf columns to collect stats on
Used when:

- StatisticsCollection is requested for statCollectionSchema and to collectStats
SparkSession¶
spark: SparkSession
Used when:

- StatisticsCollection is requested for statsCollector and statsSchema
Implementations¶
statsSchema¶
statsSchema: StructType
statsSchema...FIXME

statsSchema is used when:

- DataSkippingReaderBase is requested for getStatsColumnOpt and withStatsInternal0
statsCollector Column¶
statsCollector: Column
statsCollector takes the value of the spark.databricks.io.skipping.stringPrefixLength configuration property.
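The property caps how many characters of a string value end up in min/max statistics. A minimal plain-Scala sketch of why truncation has to treat minimums and maximums differently (the helper names and the padding character are assumptions for illustration, not Delta Lake's actual implementation):

```scala
object StringPrefixStats {
  // Assumed tie-breaker character: compares greater than ordinary text,
  // so a padded prefix stays a valid upper bound.
  private val MaxChar = '\uFFFD'

  // A plain prefix is still a correct lower bound for min stats.
  def truncateMin(s: String, prefixLength: Int): String =
    s.take(prefixLength)

  // For max stats the prefix alone could compare LESS than the original
  // value, so pad it to keep it a valid upper bound.
  def truncateMax(s: String, prefixLength: Int): String =
    if (s.length <= prefixLength) s
    else s.take(prefixLength) + MaxChar
}
```

Without the padding, a file containing `"delta lake"` with a 5-character max stat of `"delta"` could be wrongly skipped for a filter like `col >= "delta l"`.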
statsCollector creates a Column with the stats name to be a struct of the following:

- count(*) as numRecords
- collectStats as minValues
- collectStats as maxValues
- collectStats as nullCount
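To make the shape of that struct concrete, here is a plain-Scala illustration (deliberately not Spark code) of the four fields computed over a single nullable `id` column; the case class and helper are made up for this sketch:

```scala
// Illustrative stand-in for the per-file stats struct (numRecords,
// minValues, maxValues, nullCount) that statsCollector produces.
case class FileStats(
    numRecords: Long,
    minValues: Map[String, Any],
    maxValues: Map[String, Any],
    nullCount: Map[String, Long])

def statsFor(ids: Seq[Option[Int]]): FileStats = {
  val present = ids.flatten
  FileStats(
    numRecords = ids.size,                                 // count(*)
    minValues  = Map("id" -> present.min),                 // per-column minimum
    maxValues  = Map("id" -> present.max),                 // per-column maximum
    nullCount  = Map("id" -> ids.count(_.isEmpty).toLong)) // nulls per column
}
```

Data skipping can then compare a query predicate against `minValues`/`maxValues` to decide whether a file can possibly contain matching rows.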
Lazy Value
statsCollector
is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.
Learn more in the Scala Language Specification.
statsCollector is used when:

- OptimisticTransactionImpl is requested for the stats collector column (of the table at the snapshot within this transaction)
- TransactionalWrite is requested to writeFiles
- StatisticsCollection is requested for statsSchema
collectStats¶
collectStats(
name: String,
schema: StructType)(
function: PartialFunction[(Column, StructField), Column]): Column
collectStats
...FIXME
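The signature suggests a depth-first walk over the schema that applies the partial function at every leaf column it is defined at, collecting the results under a named struct. A simplified sketch under that assumption (MiniType/MiniStruct are stand-ins, not Spark's StructType/Column):

```scala
// Stand-in schema types for illustration only.
sealed trait MiniType
case object IntType extends MiniType
case class MiniStruct(fields: Seq[(String, MiniType)]) extends MiniType

// collectStats-style walk: visit every leaf column (by dotted path),
// apply `function` where it is defined, and gather results under `name`.
def collectStats(
    name: String,
    schema: MiniStruct)(
    function: PartialFunction[(String, MiniType), String]): (String, Seq[String]) = {
  def walk(prefix: String, tpe: MiniType): Seq[String] = tpe match {
    case MiniStruct(fields) =>
      fields.flatMap { case (n, t) =>
        walk(if (prefix.isEmpty) n else s"$prefix.$n", t)
      }
    case leaf =>
      // Partial function: leaves it is not defined at are simply skipped.
      if (function.isDefinedAt((prefix, leaf))) Seq(function((prefix, leaf)))
      else Nil
  }
  (name, walk("", schema))
}
```

The same traversal can then be reused with different partial functions to build the minValues, maxValues, and nullCount branches of the stats struct.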
statCollectionSchema¶
statCollectionSchema: StructType
When the maximum number of leaf columns to collect stats on (numIndexedCols) is greater than or equal to 0, statCollectionSchema truncates the dataSchema (truncateSchema). Otherwise, statCollectionSchema returns the dataSchema intact.
Lazy Value
statCollectionSchema
is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.
Learn more in the Scala Language Specification.
truncateSchema¶
truncateSchema(
schema: StructType,
indexedCols: Int): (StructType, Int)
truncateSchema
...FIXME
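Given the `(StructType, Int)` return type, a plausible reading is that truncateSchema keeps at most `indexedCols` leaf columns in depth-first order and reports how many it kept. A sketch under that assumption, again with stand-in types rather than Spark's StructType:

```scala
// Stand-in schema types for illustration only.
sealed trait MiniType
case object LeafType extends MiniType
case class MiniStruct(fields: Seq[(String, MiniType)]) extends MiniType

// truncateSchema-style: keep at most `indexedCols` leaf columns
// (depth-first), returning the truncated schema and the count kept.
def truncateSchema(schema: MiniStruct, indexedCols: Int): (MiniStruct, Int) = {
  var kept = 0
  def walk(s: MiniStruct): MiniStruct = MiniStruct(
    s.fields.flatMap {
      case (n, nested: MiniStruct) =>
        val truncated = walk(nested)
        // Drop struct fields whose leaves were all cut off.
        if (truncated.fields.nonEmpty) Some(n -> truncated) else None
      case (n, leaf) if kept < indexedCols =>
        kept += 1
        Some(n -> leaf)
      case _ => None
    })
  (walk(schema), kept)
}
```

With this shape, statCollectionSchema can simply discard the count and use the truncated schema when numIndexedCols is non-negative.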
Recomputing Statistics¶
recompute(
spark: SparkSession,
deltaLog: DeltaLog,
predicates: Seq[Expression] = Seq(Literal(true)),
fileFilter: AddFile => Boolean = af => true): Unit
recompute
...FIXME
recompute seems unused.