
StatisticsCollection

StatisticsCollection is an abstraction of statistics collectors.

Contract

Data Schema

dataSchema: StructType

Schema (StructType) of the data files

Used when:

Maximum Number of Indexed Columns

numIndexedCols: Int

Maximum number of leaf columns to collect stats on

Used when:

SparkSession

spark: SparkSession

Used when:
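
The following toy trait merely mirrors the documented contract (it is not the real Delta Lake trait) to show the shape of what an implementation has to provide:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

trait StatsContract {
  def dataSchema: StructType   // schema of the data files
  def numIndexedCols: Int      // maximum number of leaf columns to collect stats on
  def spark: SparkSession      // the SparkSession in use
}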

Implementations

statsSchema

statsSchema: StructType

statsSchema...FIXME

statsSchema is used when:

statsCollector Column

statsCollector: Column

statsCollector takes the value of the spark.databricks.io.skipping.stringPrefixLength configuration property.

statsCollector creates a Column (named stats) that is a struct of the following (a simplified sketch follows the list):

  1. count(*) as numRecords
  2. collectStats as minValues
  3. collectStats as maxValues
  4. collectStats as nullCount
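
The following is a simplified sketch of the shape of such a column (not the actual implementation), for a flat table with an id (long) and a name (string) column; the fallback value 32 for the prefix length is an assumption for illustration only:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.active
val prefixLength =
  spark.conf.get("spark.databricks.io.skipping.stringPrefixLength", "32").toInt

val statsCollector = struct(
  count(col("*")).as("numRecords"),
  struct(
    min(col("id")).as("id"),
    min(substring(col("name"), 0, prefixLength)).as("name")).as("minValues"),
  struct(
    max(col("id")).as("id"),
    max(substring(col("name"), 0, prefixLength)).as("name")).as("maxValues"),
  struct(
    sum(when(col("id").isNull, 1).otherwise(0)).as("id"),
    sum(when(col("name").isNull, 1).otherwise(0)).as("name")).as("nullCount")
).as("stats")

// e.g. df.select(statsCollector) produces one row with a single stats struct
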
Lazy Value

statsCollector is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.

Learn more in the Scala Language Specification.

statsCollector is used when:

  • OptimisticTransactionImpl is requested for the stats collector column (of the table at the snapshot within this transaction)
  • TransactionalWrite is requested to writeFiles
  • StatisticsCollection is requested for statsSchema

collectStats

collectStats(
  name: String,
  schema: StructType)(
  function: PartialFunction[(Column, StructField), Column]): Column

collectStats...FIXME
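
Although the internals are not covered here, the following hypothetical snippet illustrates the kind of partial function collectStats is given: it maps a (column, field) pair to an aggregate and is only defined for the field types stats should be collected on (the name minValuesFor and the choice of types are made up for illustration):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.min
import org.apache.spark.sql.types.{NumericType, StringType, StructField}

val minValuesFor: PartialFunction[(Column, StructField), Column] = {
  case (c, StructField(_, _: NumericType, _, _)) => min(c)
  case (c, StructField(_, StringType, _, _))     => min(c)
}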

statCollectionSchema

statCollectionSchema: StructType

When the maximum number of leaf columns to collect stats on (numIndexedCols) is greater than or equal to 0, statCollectionSchema truncates the dataSchema (using truncateSchema). Otherwise, statCollectionSchema returns the dataSchema intact.
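
As a minimal sketch (an assumption about the shape of the code, not the actual implementation), the behaviour can be pictured as follows, relying on the dataSchema and numIndexedCols contract methods and the truncateSchema helper described below:

lazy val statCollectionSchema: StructType =
  if (numIndexedCols >= 0) truncateSchema(dataSchema, numIndexedCols)._1
  else dataSchema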

Lazy Value

statCollectionSchema is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.

Learn more in the Scala Language Specification.

truncateSchema

truncateSchema(
  schema: StructType,
  indexedCols: Int): (StructType, Int)

truncateSchema...FIXME
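
As a hedged sketch of the idea (not the actual implementation), truncating a schema to a budget of leaf columns can be pictured as keeping fields in order, recursing into nested structs, and returning the number of leaf columns kept alongside the truncated schema:

import org.apache.spark.sql.types.{StructField, StructType}
import scala.collection.mutable.ArrayBuffer

def truncateSchema(schema: StructType, indexedCols: Int): (StructType, Int) = {
  var remaining = indexedCols
  val kept = ArrayBuffer.empty[StructField]
  schema.fields.foreach { field =>
    if (remaining > 0) {
      field.dataType match {
        case st: StructType =>
          // spend part of the remaining budget on the nested struct's leaves
          val (truncated, used) = truncateSchema(st, remaining)
          remaining -= used
          if (truncated.nonEmpty) kept += field.copy(dataType = truncated)
        case _ =>
          // a leaf column costs one unit of the budget
          remaining -= 1
          kept += field
      }
    }
  }
  (StructType(kept.toSeq), indexedCols - remaining)
}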

Recomputing Statistics

recompute(
  spark: SparkSession,
  deltaLog: DeltaLog,
  predicates: Seq[Expression] = Seq(Literal(true)),
  fileFilter: AddFile => Boolean = af => true): Unit

recompute...FIXME


recompute seems unused.
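
Were it to be used, a hypothetical invocation could look as follows (assuming recompute is reachable on the StatisticsCollection companion object; the table path is made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.delta.DeltaLog
import org.apache.spark.sql.delta.stats.StatisticsCollection

val spark = SparkSession.active
val deltaLog = DeltaLog.forTable(spark, "/tmp/delta/users")

// recompute per-file statistics for every data file of the table
StatisticsCollection.recompute(spark, deltaLog)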